## 3.4 Rectangular data structures

Rectangular data structures generally store data in a *two-dimensional* (2D) format (i.e., a grid containing rows and columns).
When all rows and all columns have the same length, the resulting structure is *rectangular*.

As Table 3.1 has already shown, we distinguish between two main types of 2D-data structures in R:

- *matrices* are *homogeneous* with respect to their data (i.e., are atomic vectors that contain only a single data type)
- *rectangular tables* (called *data frames* or *tibbles* in R) allow for *heterogeneous* data: They can contain different data types in different columns.

The confusion regarding different 2D-data structures is clarified by distinguishing between their form and content: Matrices and rectangular tables both have a rectangular shape (i.e., rows and columns). However, matrices and rectangular tables differ with respect to their contents: Whereas a matrix is just an atomic vector (i.e., data of a single type) with some additional shape attributes, a data frame or tibble is a more complex data structure that combines multiple vectors (i.e., allowing for different data types).
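The difference in content can be illustrated with a minimal sketch. The objects `num`, `chr`, `mx`, and `dfx` are hypothetical names introduced only for this illustration, and `stringsAsFactors = FALSE` is set explicitly so that the result does not depend on the R version:

```r
# Hypothetical example vectors (not part of the original text):
num <- c(1, 2, 3)
chr <- c("a", "b", "c")

# Binding them into a matrix coerces all values to a single type:
mx <- cbind(num, chr)
typeof(mx)           # "character"
is.atomic(mx)        # TRUE

# Combining them in a data frame preserves the type of each column:
dfx <- data.frame(num, chr, stringsAsFactors = FALSE)
sapply(dfx, typeof)  # num: "double", chr: "character"
is.list(dfx)         # TRUE
```

Both objects have three rows and two columns, but only the data frame keeps the numbers as numbers.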

#### Beware of “tables”

We use the clumsy term “rectangular data structure,” as the shorter term “table” is vague and confusing. Whenever speaking of tables, we need to distinguish between the term’s meanings in general language and its specific instantiations as a data structure in R.

Even when only considering rectangular tables, R still distinguishes between those that are of type “data.frame” and tables of type “tibble.” But as tibbles are actually another (simpler) type of data frame, we can ignore this distinction here (and will reconsider it when introducing the **tibble** package in Chapter 5).

A confusing aspect is that the term *table* is sometimes used informally as a super-category for any rectangular data structure (i.e., including data frames and matrices, e.g., in the title of this section).

As tables can extend in more than two dimensions, another term for multi-dimensional tables is an *array*.
In R, however, objects of type “array” are essentially vectors (i.e., atomic) with additional shape attributes (i.e., the array’s dimensions).

In R, the flexibility or vagueness of the term *table* is aggravated further, as R uses the data type “table” to denote a particular type of array (i.e., a multi-dimensional data structure that expresses frequency counts in a contingency table).
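As a brief sketch of these two meanings, the following code creates an array and a contingency table. The objects `arr`, `answers`, and `tab` are hypothetical and only serve as an illustration:

```r
# A 3-dimensional array is an atomic vector with a `dim` attribute:
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)         # 2 3 4
is.atomic(arr)   # TRUE

# table() counts how often each value occurs and
# returns an object of class "table":
answers <- c("yes", "no", "yes", "yes")  # hypothetical data
tab <- table(answers)
class(tab)       # "table"
tab[["yes"]]     # 3
is.array(tab)    # TRUE
```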

Overall, we see that the term *table* is used in many different ways and for different kinds of objects.
However, it makes sense to distinguish between matrices and data frames, which is why we will discuss these two types of tables next.

### 3.4.1 Matrices

When a rectangle of data contains data of the same type in all cells (i.e., all rows and columns), we have a *matrix* of data.
In R, a matrix is an atomic vector with additional attributes that determine its shape and the names of its rows or columns.
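To see that the shape of a matrix is merely an attribute, we can add dimensions to an ordinary vector. This is a minimal sketch; the row and column names are made up for the illustration:

```r
v <- 1:6                  # an atomic (integer) vector
attributes(v)             # NULL: a plain vector has no attributes

dim(v) <- c(2, 3)         # add a shape attribute (2 rows, 3 columns)
is.matrix(v)              # TRUE

# Hypothetical row and column names:
dimnames(v) <- list(c("r1", "r2"), c("a", "b", "c"))
v["r2", "b"]              # 4 (values are filled column by column)
names(attributes(v))      # the attributes "dim" and "dimnames"
```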

#### Creating matrices

A way of creating a matrix from an atomic vector is provided by the `matrix()` function.
It contains arguments for `data`, for the number of rows `nrow`, the number of columns `ncol`, and a logical argument `byrow` that arranges `data` in a by-row vs. by-column fashion:

```
# Reshaping an atomic vector into a rectangular matrix:
(m1 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE))
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20
(m2 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = FALSE))
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20
```

Matrices can also be created from multiple atomic vectors (of the same data type) by binding them together:

- the `rbind()` function treats each vector as a row;
- the `cbind()` function treats each vector as a column:

```
# Creating 3 vectors:
x <- 1:3
y <- 4:6
z <- 7:9
# Combining vectors (of the same length): ----
(m3 <- rbind(x, y, z))  # combine as rows
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9
(m4 <- cbind(x, y, z))  # combine as columns
#>      x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
```

When the vectors to be shaped into a matrix do not match each other or the size arguments, R recycles the shorter vector or truncates the data to the dimensions provided. Note that the following commands all create Warning messages, as the lengths of their arguments do not fit together as a matrix (of the required size):

```
m <- 1:2
n <- 3:5
rbind(m, n)  # recycling m
#>   [,1] [,2] [,3]
#> m    1    2    1
#> n    3    4    5
cbind(m, n)  # recycling m
#>      m n
#> [1,] 1 3
#> [2,] 2 4
#> [3,] 1 5
matrix(data = 1:10, nrow = 3, ncol = 4)  # recycling data
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    4    7   10
#> [2,]    2    5    8    1
#> [3,]    3    6    9    2
matrix(data = 1:10, nrow = 3, ncol = 3)  # truncating data
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9
```

The matrices `m1` to `m4` all contained numeric data.
However, data of type “logical” or “character” can also be stored in matrix form:

```
# A matrix of logical values:
(m5 <- matrix(data = 1:18 %% 4 == 0, nrow = 3, ncol = 6, byrow = TRUE))
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
#> [1,] FALSE FALSE FALSE  TRUE FALSE FALSE
#> [2,] FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE FALSE FALSE
# A matrix of character values:
(m6 <- matrix(sample(letters, size = 16), nrow = 4, ncol = 4, byrow = FALSE))
#>      [,1] [,2] [,3] [,4]
#> [1,] "u"  "d"  "s"  "a"
#> [2,] "j"  "c"  "z"  "f"
#> [3,] "m"  "b"  "e"  "w"
#> [4,] "n"  "t"  "h"  "y"
```

#### Indexing matrices

Retrieving values from a matrix `m` works similarly to indexing vectors.
First, we will consider *numeric* indexing.
Due to the two-dimensional nature of a matrix, we now need to specify *two* indices in square brackets:
the number of the desired row, and the number of the desired column, separated by a comma.
Thus, to get or change the value of row `r` and column `c` of a matrix `m` we need to evaluate `m[r, c]`.
Just as with vectors, providing multiple numeric indices selects the corresponding rows or columns.
When the value of `r` or `c` is left unspecified, *all* rows or columns are selected.

```
# Selecting cells, rows, or columns of matrices: ----
m1[2, 3]  # in m1: select row 2, column 3
#> [1] 7
m2[3, 1]  # in m2: select row 3, column 1
#> [1] 3
m1[2, ]  # in m1: select row 2, all columns
#> [1] 5 6 7 8
m2[ , 1]  # in m2: select column 1, all rows
#> [1] 1 2 3 4 5
m3
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9
m3[2, 2:3]  # in m3: select row 2, columns 2 to 3
#> [1] 5 6
m3[1:3, 2]  # in m3: select rows 1 to 3, column 2
#> x y z
#> 2 5 8
m4[]  # in m4: select all rows and all columns (i.e., all of m4)
#>      x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
```
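The same index notation also allows changing values by assignment. The following sketch uses a small hypothetical matrix `mm` (so that `m1` to `m4` remain unchanged) and also shows the `drop` argument of the indexing operator, which controls whether a single row or column is simplified into a vector:

```r
mm <- matrix(1:6, nrow = 2)   # hypothetical 2 x 3 matrix (filled by column)

mm[2, 3] <- 0L                # change a single cell (row 2, column 3)
mm[1, ]  <- c(10L, 20L, 30L)  # change all cells of row 1
mm
#>      [,1] [,2] [,3]
#> [1,]   10   20   30
#> [2,]    2    4    0

mm[ , 2]                      # selecting one column returns a vector
#> [1] 20  4
mm[ , 2, drop = FALSE]        # drop = FALSE keeps the matrix shape
#>      [,1]
#> [1,]   20
#> [2,]    4
```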

Similarly, we can extend the notion of logical indexing to matrices:

```
m4 > 10  # returns a matrix of logical values
#>          x     y     z
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE
typeof(m4 > 10)
#> [1] "logical"
m4[m4 > 10]  # indexing of matrices
#> integer(0)
```
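As no value of `m4` exceeds 10, the result above is empty. A condition that is met by some values may be more instructive. The following sketch rebuilds the values of `m4` under the hypothetical name `mat`:

```r
mat <- cbind(x = 1:3, y = 4:6, z = 7:9)  # same values as m4 above

big <- mat[mat > 5]       # values satisfying the condition (as a vector)
big
#> [1] 6 7 8 9
n_big <- sum(mat > 5)     # number of TRUE values
n_big
#> [1] 4

which(mat > 5, arr.ind = TRUE)  # row and column positions of TRUE values

mat[mat > 5] <- NA        # logical indexing can also replace values
```

Note that logical indexing with a matrix returns a plain vector, in which the values are ordered column by column.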

Just as with vectors, we can apply functions to matrices. Typical examples include:

```
# Applying functions to matrices: ----
is.matrix(m1)
#> [1] TRUE
typeof(m2)
#> [1] "integer"
# Note the difference between:
is.numeric(m3) # type of m3? (1 value)
#> [1] TRUE
is.na(m3)  # NA values in m3? (many values)
#>    [,1]  [,2]  [,3]
#> x FALSE FALSE FALSE
#> y FALSE FALSE FALSE
#> z FALSE FALSE FALSE
# Computations with matrices:
sum(m1)
#> [1] 210
max(m2)
#> [1] 20
mean(m3)
#> [1] 5
colSums(m3)  # column sums of m3
#> [1] 12 15 18
rowSums(m4)  # row sums of m4
#> [1] 12 15 18
```

Just as `length()` provides crucial information about a vector, some functions are specifically designed to provide the dimensions of rectangular data structures:

```
ncol(m4) # number of columns
#> [1] 3
nrow(m4) # number of rows
#> [1] 3
dim(m4) # dimensions as vector c(rows, columns)
#> [1] 3 3
```

A typical function in the context of matrices is `t()` for transposing (i.e., swapping the rows and columns of) a matrix:

```
t(m4)
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9
t(m5)
#>       [,1]  [,2]  [,3]
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE  TRUE FALSE
#> [3,] FALSE FALSE FALSE
#> [4,]  TRUE FALSE  TRUE
#> [5,] FALSE FALSE FALSE
#> [6,] FALSE  TRUE FALSE
```

### 3.4.2 Data frames

Table 2.1 was rectangular in containing three rows (values for the variables `name`, `gender`, and `age`) and five columns (one for each person, plus an initial column indicating the variable name of each row).
This is a perfectly valid table, but not the type of table typically used in R.

Typical tables of data in R also combine several vectors into a larger data structure, but use the individual vectors as columns, rather than rows. Such a combination of several vectors (as columns) is shown in Table 3.2:

Dimensions | Homogeneous data types | Heterogeneous data types |
---|---|---|
1D | atomic vector | list |
2D | matrix | table (data frame/tibble) |
nD | array | |

Importantly, Table 3.2 provides exactly the same information as Table 2.1 and as the three individual vectors (`name`, `gender`, and `age`) above, but in the shape of a table that uses our previous vectors as its *columns*, rather than as its rows.

As all elements of an (atomic) vector in R need to have the same data type (e.g., `name` contains character data, whereas `age` contains numeric data), the information on each person, which spans multiple data types, cannot be stored as a vector.
Instead, we represent each person as a *row* (aka. an *observation*) of the table.

#### Creating data frames

To create a data frame, we first need the data to be framed in some other form.
The most typical scenario is that we have the data as a set of vectors.
If these vectors have the same length, creating a data frame from vectors can be achieved by the `data.frame()` function.
For instance, we can define a data frame `df` from our `name`, `gender`, and `age` vectors (from above) by assigning the result of `data.frame(name, gender, age)` to it:

```
(df <- data.frame(name, gender, age))
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
```

A remarkable fact about data frame `df` is that it is an object that combines multiple data types. Internally, R represents data frames as a list of atomic vectors. The vectors form the columns of the data frame `df`, rather than its rows.
More trivially, we could have called our object `some_data` or `five_people`, but the name `df` is often used as a short and convenient name for a data frame that can be poked and probed later.
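The list nature of a data frame can be verified directly. In the following sketch, the three vectors are reconstructed from the printed output of `df` (as their original definitions appear in an earlier chapter), and `stringsAsFactors = FALSE` is set explicitly to obtain the same result in all R versions:

```r
# Vectors reconstructed from the printed output of df above:
name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)
df <- data.frame(name, gender, age, stringsAsFactors = FALSE)

typeof(df)         # internal type
#> [1] "list"
class(df)          # class attribute
#> [1] "data.frame"
sapply(df, class)  # class of each column
#>        name      gender         age
#> "character" "character"   "numeric"
```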

The `data.frame()` function is quite powerful and coerces a variety of objects into data frames.
For instance, we can use it to turn individual vectors or matrices into data frames:

```
# Creating data frames from a vector or matrix:
v <- 1:9
m <- matrix(v, nrow = 3)
data.frame(v)  # from vector
#>   v
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
data.frame(m)  # from matrix
#>   X1 X2 X3
#> 1  1  4  7
#> 2  2  5  8
#> 3  3  6  9
```

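Due to the rectangular shape of a data frame, its columns all need to have the same length. If the vectors used to create a data frame do not have the same length, the shorter one(s) are recycled to the length of the longest one, provided that the longest length is a whole-number multiple of the shorter length(s). Otherwise, `data.frame()` stops with an error.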

```
# From vectors of different length:
abc <- letters[1:3]
data.frame(v, abc)
#>   v abc
#> 1 1   a
#> 2 2   b
#> 3 3   c
#> 4 4   a
#> 5 5   b
#> 6 6   c
#> 7 7   a
#> 8 8   b
#> 9 9   c
```
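Recycling in `data.frame()` only works when the longer length is a whole-number multiple of the shorter one. The following sketch contrasts both cases with hypothetical vectors and uses `tryCatch()` to capture the error message instead of interrupting the script:

```r
# Lengths 4 and 2: the shorter vector is recycled twice
ok <- data.frame(a = 1:4, b = c("x", "y"))
nrow(ok)
#> [1] 4

# Lengths 3 and 2: recycling does not fit, so data.frame() fails
res <- tryCatch(data.frame(a = 1:3, b = c("x", "y")),
                error = function(e) conditionMessage(e))
res  # an error message about differing numbers of rows
```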

As we learned in Chapter 2 that each object is characterized by its shape, we can ask: What is the shape of the data frame `df`?
For linear data structures (like vectors or lists), basic shape information is provided by their `length()`.
For rectangular data structures (like matrices or data frames), we can still ask about their `length()`, but the more informative function `dim()` provides their dimensions:

```
length(df) # number of vectors/columns
#> [1] 3
dim(df) # dimensions
#> [1] 5 3
nrow(df) # dim[1]
#> [1] 5
ncol(df) # dim[2]
#> [1] 3
```

As data frames are the most common way of storing data in R, there is a special form of indexing that allows accessing the variables of a data frame (i.e., the columns of a data frame) as vectors.

#### Name-based indexing of data frames

When a table `tb` has column names (e.g., a column called `name`), we can retrieve the corresponding vector by *name-based indexing* (aka. *name indexing*). This is the most convenient and most frequent way of accessing variables (i.e., columns) of tables (e.g., data frames).
To use this form of indexing, we use a special dollar sign notation: Adding `$` and the name of the desired variable `name` to the table’s object name `tb` yields its column `name` as a vector. This sounds complicated, but is actually very easy:

`tb$name`

In case of our data frame `df`, we can access its 1st and 2nd columns by their respective names:

```
names(df)  # prints the (column) names
#> [1] "name"   "gender" "age"
df$name
#> [1] "Adam"   "Ben"    "Cecily" "David"  "Evelyn"
df$gender
#> [1] "male"   "male"   "female" "male"   "misc"
```
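As the `$` notation yields an ordinary vector, it can be passed to any function that works on vectors, and it can also be used on the left side of an assignment. The following sketch reconstructs `df` from its printed output; the new column `adult` is a hypothetical addition for illustration:

```r
# Data reconstructed from the printed output of df above:
df <- data.frame(name   = c("Adam", "Ben", "Cecily", "David", "Evelyn"),
                 gender = c("male", "male", "female", "male", "misc"),
                 age    = c(21, 19, 20, 48, 45),
                 stringsAsFactors = FALSE)

mean(df$age)              # use a column like any other vector
#> [1] 30.6

df$adult <- df$age >= 18  # add a new (hypothetical) column
names(df)
#> [1] "name"   "gender" "age"    "adult"
```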

#### Indexing data frames

Note that everything we have learned about numeric and logical indexing of vectors and matrices (above) also applies to data frames.
Thus, we can also use numerical indexing on a data frame, just as with matrices (above).
For instance, to get all rows of the first column, we can specify the data frame’s name, followed by `[ , 1]`:

```
df[ , 1]  # get (all rows and) the 1st column of df
#> [1] "Adam"   "Ben"    "Cecily" "David"  "Evelyn"
df[ , 2]  # get (all rows and) the 2nd column of df
#> [1] "male"   "male"   "female" "male"   "misc"
```

Thus, these two expressions retrieve the 1st and 2nd column of the data frame `df` (as vectors), respectively.
As retrieving columns is a very common task in R, name-based indexing (e.g., `df$name`, see above) is usually the easier way of accessing the variables (columns) of a data frame.

Logical indexing on data frames is particularly powerful in allowing us to select particular rows (based on conditions specified on columns of the same data frame):

```
df[df$gender == "male", ]
#>    name gender age
#> 1  Adam   male  21
#> 2   Ben   male  19
#> 4 David   male  48
df[df$age < 21, ]
#>     name gender age
#> 2    Ben   male  19
#> 3 Cecily female  20
```
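Conditions on several columns can be combined with the logical operators `&` (and) and `|` (or). The following sketch reconstructs `df` from its printed output; the result names `young_men` and `others` are hypothetical:

```r
# Data reconstructed from the printed output of df above:
df <- data.frame(name   = c("Adam", "Ben", "Cecily", "David", "Evelyn"),
                 gender = c("male", "male", "female", "male", "misc"),
                 age    = c(21, 19, 20, 48, 45),
                 stringsAsFactors = FALSE)

# Rows for which both conditions hold:
young_men <- df[df$gender == "male" & df$age < 30, ]
young_men
#>   name gender age
#> 1 Adam   male  21
#> 2  Ben   male  19

# Names of rows for which at least one condition holds:
others <- df[df$gender != "male" | df$age > 40, "name"]
others
#> [1] "Cecily" "David"  "Evelyn"
```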

Note that the different types of indexing can be flexibly combined. For instance, the following command uses

- logical indexing (to select rows of `df` with an age value below 30),
- numerical indexing (to select only columns 1 and 2),
- name indexing (to get the variable `name`, as a vector), and
- numerical indexing (to select the 3rd element of this vector):

```
df[df$age < 30, c(1, 2)]$name[3]
#> [1] "Cecily"
```

In practice, such complex combinations are rarely necessary or useful. For instance, the following expressions retrieve the exact same result as the complex one, but have much simpler semantics:

```
df[3, 1]
#> [1] "Cecily"
df$name[3]
#> [1] "Cecily"
```

As data frames are lists, we can also access their columns as we access list elements (e.g., with double square brackets, as in `df[[1]]` or `df[["name"]]`).
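A minimal sketch of list-style access, using a data frame `df` reconstructed from the printed output above:

```r
# Data reconstructed from the printed output of df above:
df <- data.frame(name   = c("Adam", "Ben", "Cecily", "David", "Evelyn"),
                 gender = c("male", "male", "female", "male", "misc"),
                 age    = c(21, 19, 20, 48, 45),
                 stringsAsFactors = FALSE)

df[[1]]          # 1st list element (i.e., column), as a vector
#> [1] "Adam"   "Ben"    "Cecily" "David"  "Evelyn"
df[["age"]]      # list element by name, as a vector
#> [1] 21 19 20 48 45
class(df["age"]) # single brackets return a data frame with one column
#> [1] "data.frame"
identical(df$age, df[["age"]])  # both notations yield the same vector
#> [1] TRUE
```

Double brackets extract the content of a column, whereas single brackets with one index return a smaller data frame.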

#### Using lists or data frames?

Knowing that data frames *are* lists may suggest that it does not matter whether we use a data frame or a list to store data.
This impression is false. Although it is possible to store many datasets either as a list or as a rectangular table, it is typically better to opt for the simpler format that is supported by more tools.

In principle, data frames and lists can store the same data (and even are variants of the same R data structure, i.e., a `list`).
However, pragmatic reasons tip the balance in favor of data frames in most use cases:
Whenever a set of vectors to be combined all have the same length, their combination would create a rectangular shape.
As many R functions assume or are optimized for rectangular data structures, using data frames is typically the better choice.

The lesson to be learned here is that we should aim for the simplest data structure that matches the properties of our data. Although lists are more flexible than data frames, they are rarely needed in applied contexts. As a general rule, simpler structures are to be preferred to more complex ones:

- For linear sequences of homogeneous data, vectors are preferable to lists.
- For rectangular shapes of heterogeneous data, data frames are preferable to lists.

Thus, as long as data fits into the simple and regular shapes of vectors and data frames, there is no need for using lists. Vectors and data frames are typically easier to create and use than corresponding lists. Additionally, many R functions are written and optimized for vectors and data frames, rather than lists. As a consequence, lists should only be used when data requires mixing both different types and shapes of data, or when data objects get so complex, irregular, or unpredictable that they do not fit into a rectangular table.
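The following sketch contrasts both cases with hypothetical data: vectors of equal length fit into a data frame, whereas elements of different types and lengths require a list:

```r
# Vectors of equal length fit into a rectangle:
people <- data.frame(name = c("Adam", "Ben"), age = c(21, 19))
dim(people)
#> [1] 2 2

# Elements of different types and lengths (hypothetical data) need a list:
misc <- list(name = "Adam", scores = c(3, 5, 4), passed = TRUE)
lengths(misc)  # length of each list element
#>   name scores passed
#>      1      3      1
```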

#### Strings as factors

Note that the `data.frame()` function has an argument `stringsAsFactors`.
This argument determines whether so-called *string* variables (i.e., of data type “character”) are converted into *factors* (i.e., categorical variables, which are internally represented as integer values with text labels) when generating a data frame.
To the chagrin of generations of R users, the default of this argument used to be `TRUE` for several decades, which essentially meant that any character variable in a data frame was converted into a factor unless the user had specified `stringsAsFactors = FALSE`. As this caused much confusion, the default has been changed with the release of R version 4.0.0 (on 2020-04-24) to `stringsAsFactors = FALSE`. This shows that the R gods at https://cran.r-project.org/ are responding to user feedback. However, as any such changes are unlikely to happen quickly, it is safer to explicitly set the arguments of a function.
To see the difference between both settings, consider the following example:

```
df_1 <- data.frame(name, gender, age,
                   stringsAsFactors = FALSE)  # new default (since R 4.0.0+)
df_2 <- data.frame(name, gender, age,
                   stringsAsFactors = TRUE)   # old default (up to R 3.6.3)
# Both data frames look identical:
df_1
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
df_2
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
```

Printing the two data frames `df_1` and `df_2` shows us no difference between them.
However, as the first two variables (i.e., `name` and `gender`) were string variables (i.e., of type “character”), they remained character variables in `df_1`, but are represented as factors in `df_2`.

Let’s retrieve the first column of each data frame (as a vector).
Using name-based indexing, we can easily retrieve and print the first column (i.e., with a name of `name`) of either data frame:

```
df_1$name
#> [1] "Adam"   "Ben"    "Cecily" "David"  "Evelyn"
df_2$name
#> [1] Adam   Ben    Cecily David  Evelyn
#> Levels: Adam Ben Cecily David Evelyn
```

Note the differences in the printed outputs.
The output of `df_1$name` looks just like any other character vector (with five elements, each consisting of a name).
By contrast, the output of `df_2$name` prints the same names, but without the characteristic double quotes around each name, and with a second line starting with “Levels:” before seeming to repeat the names of the first line.
Before clarifying what this means, check the other variable in both `df_1` and `df_2` that used to be a character vector (`gender`):

```
df_1$gender
#> [1] "male"   "male"   "female" "male"   "misc"
df_2$gender
#> [1] male   male   female male   misc
#> Levels: female male misc
```

Again, `df_1$gender` appears to be a character vector, but `df_2$gender` has been converted into something else.
This time, the line beginning with “Levels:” only contains each of the gender labels once, and in alphabetical order.

In case you’re not confused yet, compare the outputs of the following commands:

```
typeof(df_1$name)
#> [1] "character"
typeof(df_2$name)
#> [1] "integer"
```

Whereas `df_1$name` was expected to be of type *character*, it should come as a surprise to see that `df_2$name` is of type *integer*.
Given that `df_2$name` contains *integers*, we might be tempted to try out arithmetic functions like:

```
max(df_2$name)
sum(df_2$name)
mean(df_2$name)
```

If we try to evaluate these expressions, we get either Warnings or Error messages. How can we make sense of all this?

The magic word here is *factor*.
As the argument `stringsAsFactors = TRUE` suggests, the character strings of the `name` and `gender` vectors have been converted into *factors* when defining `df_2`.
Factors are categorical variables that only care about whether two values belong to the same or to different groups.
Actually, R internally encodes them as numeric values (integers) for each factor level. But as we never want to calculate with these numeric values (as they have no meaning beyond being either the same or different), they are also assigned a label, which is shown when printing the values of a factor.
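The relation between the integer codes and the labels of a factor can be inspected directly. The following sketch uses a hypothetical factor `g` that contains the same values as the `gender` variable:

```r
g <- factor(c("male", "male", "female", "male", "misc"))

levels(g)        # the labels (sorted alphabetically by default)
#> [1] "female" "male"   "misc"
as.integer(g)    # the internal integer codes
#> [1] 2 2 1 2 3
as.character(g)  # the labels of each value, as a character vector
#> [1] "male"   "male"   "female" "male"   "misc"

# The labels result from looking up each code in the levels:
identical(levels(g)[as.integer(g)], as.character(g))
#> [1] TRUE
```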

A quick way of checking that we’re dealing with a factor is the `is.factor()` function:

```
is.factor(df_1$name)
#> [1] FALSE
is.factor(df_2$name)
#> [1] TRUE
is.factor(df_1$gender)
#> [1] FALSE
is.factor(df_2$gender)
#> [1] TRUE
```

Factor variables are often useful (e.g., for distinguishing between groups in statistical designs).
But it is premature to assume that any character variable should be a factor when including the variable in a data frame.
Thus, it is a good thing that the default argument in the `data.frame()` function has been changed to `stringsAsFactors = FALSE` in R v4.0.0.

Whoever wants factors can still get and use them — but novice users no longer need to deal with them all the time.
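One way of getting factors only where they are wanted is to convert individual columns with the `factor()` function. The following sketch assumes that only `gender` should be categorical; the data frame name `df_3` is hypothetical, and the vectors are reconstructed from the printed output above:

```r
# Vectors reconstructed from the printed output above:
name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)

df_3 <- data.frame(name, gender, age, stringsAsFactors = FALSE)
df_3$gender <- factor(df_3$gender)  # convert only this column

is.factor(df_3$name)
#> [1] FALSE
is.factor(df_3$gender)
#> [1] TRUE
table(df_3$gender)  # frequency counts per factor level
```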

### 3.4.3 Practice

The following practice exercises allow you to check your understanding of this section.

#### Accessing and evaluating matrices

Assuming the definitions of the matrices `m5` and `m6` from above, i.e.,

```
m5
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
#> [1,] FALSE FALSE FALSE  TRUE FALSE FALSE
#> [2,] FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE FALSE FALSE
m6
#>      [,1] [,2] [,3] [,4]
#> [1,] "u"  "d"  "s"  "a"
#> [2,] "j"  "c"  "z"  "f"
#> [3,] "m"  "b"  "e"  "w"
#> [4,] "n"  "t"  "h"  "y"
```

- predict, evaluate, and explain the result of the following R expressions:

```
m5[2, 6]
m5[2, ]
m5 == FALSE
sum(m5)
t(t(m5))
m6[2, 3]
m6[ , 4]
m6[nrow(m6), (ncol(m6) - 1)]
m6 == "e"
toupper(m6[4, ])
```

#### Numeric indexing of data frames

Assuming the data frame `df_2` (from above),

```
df_2
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
```

- predict, evaluate and explain what happens in the following commands (in terms of *numeric* indexing):

```
df_2[]
df_2[ , 1]
df_2[1:nrow(df_2), c(1)]
df_2[nrow(df_2):1, c(1)]
df_2[rep(1, 3), c(1, 2)]
df_2$name[3]
# compare:
df_2[1:nrow(df_2), 1:ncol(df_2)]
df_2[1:nrow(df_2), ncol(df_2):1]
df_2[nrow(df_2):1, ncol(df_2):1]
```

#### Logical indexing of data frames

Assuming the data frame `df_1` (from above),

```
df_1
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
```

- predict, evaluate and explain what happens in the following commands (in terms of *logical* indexing):

```
df_1[ , 3] > 30
df_1[df_1$age > 30, ]
df_1[df_1$gender != "male", c(1, 3, 2)]
df_1$name[df_1$gender == "male"]
sum(df_1$age[df_1$gender == "male"])
```

#### Data frames with factors

- Given that our definition of `df_2` used `stringsAsFactors = TRUE` (see above), predict, evaluate and explain what happens in the following commands:

```
nchar(as.character(df_2$name[3]))
as.numeric(df_2$name[3]) + 1
mean(as.numeric(df_2$name))
```

- Why would the following commands (which are simpler variants of the last three expressions) yield errors or warnings?

```
nchar(df_2$name[3])
df_2$name[3] + 1
mean(df_2$name)
```

- What would happen if the same commands were used on `df_1` (from above)?

```
nchar(df_1$name[3])
df_1$name[3] + 1
mean(df_1$name)
```

Additional details:

- data frames: `data.frame()` vs. `as.data.frame()`
- tibbles: `as_tibble()` (of the **tibble** package) converts a data frame into a tibble

Ways of accessing and manipulating tables

Applying functions to tables:

- Checking for `NA` values (in vectors or tables) by using the `is.na()` function.
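As a brief sketch of checking for missing values, the following code applies `is.na()` to a column and to an entire data frame. The data frame `d` and its values are hypothetical:

```r
d <- data.frame(id = 1:3, score = c(2.5, NA, 4.0))  # hypothetical data

is.na(d$score)               # check a single column (vector)
#> [1] FALSE  TRUE FALSE
sum(is.na(d))                # total number of NA values in the table
#> [1] 1
colSums(is.na(d))            # number of NA values per column
#>    id score
#>     0     1
mean(d$score, na.rm = TRUE)  # many functions can ignore NA values
#> [1] 3.25
```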