3.4 Rectangular data structures

Rectangular data structures generally store data in a two-dimensional (2D) format (i.e., a grid containing rows and columns). When all rows and all columns have the same length, the resulting structure is rectangular.

As Table 3.1 has already shown, we distinguish between two main types of 2D-data structures in R:

  1. matrices are homogeneous with respect to their data (i.e., are atomic vectors that contain only a single data type)

  2. rectangular tables (called data frames or tibbles in R) allow for heterogeneous data: They can contain different data types in different columns.

The confusion regarding different 2D-data structures is clarified by distinguishing between their form and content: Matrices and rectangular tables both have a rectangular shape (i.e., rows and columns). However, matrices and rectangular tables differ with respect to their contents: Whereas a matrix is just an atomic vector (i.e., data of a single type) with some additional shape attributes, a data frame or tibble is a more complex data structure that combines multiple vectors (i.e., allowing for different data types).
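The difference between form and content can be checked directly in R. The following minimal sketch uses two small objects (mat and tab are illustrative names and values, not objects defined elsewhere in this chapter):

```r
# Two small rectangular objects (illustrative names and values):
mat <- matrix(1:6, nrow = 2)                  # atomic vector plus shape
tab <- data.frame(id = 1:2, label = c("a", "b"),
                  stringsAsFactors = FALSE)   # list of column vectors

is.atomic(mat)       # TRUE: a single data type throughout
typeof(mat)          # "integer"
attributes(mat)      # only a dim attribute (2 rows, 3 columns)

is.list(tab)         # TRUE: a data frame is built on a list
sapply(tab, typeof)  # one type per column: "integer" and "character"
```

Both objects print as grids, but only tab can hold different data types in different columns.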

Beware of “tables”

We use the clumsy term “rectangular data structure,” as the shorter term “table” is vague and confusing. Whenever speaking of tables, we need to distinguish between the term’s meanings in general language and its specific instantiations as a data structure in R.

Even when only considering rectangular tables, R still distinguishes between those that are of type “data.frame” and tables of type “tibble.” But as tibbles are actually another (simpler) type of data frame, we can ignore this distinction here (and will reconsider it when introducing the tibble package in Chapter 5).

A confusing aspect is that the term table is sometimes used informally as a super-category for any rectangular data structure (i.e., including data frames and matrices, e.g., in the title of this section).

As tables can extend in more than two dimensions, another term for multi-dimensional tables is an array. In R, however, objects of type “array” are essentially vectors (i.e., atomic) with additional shape attributes (i.e., the array’s dimensions).

In R, the flexibility or vagueness of the term table is aggravated further, as R uses the class “table” to denote a particular type of array (i.e., a multi-dimensional data structure that expresses frequency counts in a contingency table).
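As a brief illustration of this last meaning, the following sketch counts the values of a small, made-up vector (answers and counts are hypothetical names):

```r
# Counting values with table() (illustrative data):
answers <- c("yes", "no", "yes", "yes")
counts  <- table(answers)

class(counts)     # "table"
typeof(counts)    # "integer": the counts are stored in an atomic vector
is.array(counts)  # TRUE: here a 1-dimensional array with named cells
counts[["yes"]]   # 3
```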

Overall, we see that the term table is used in many different ways and for different kinds of objects. However, it makes sense to distinguish between matrices and data frames, which is why we will discuss these two types of tables next.

3.4.1 Matrices

When a rectangle of data contains data of the same type in all cells (i.e., all rows and columns), we have a matrix of data. In R, a matrix is an atomic vector with additional attributes that determine its shape and the names of its rows or columns.
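That a matrix is merely an atomic vector with shape attributes can be seen by adding these attributes by hand. This is only a sketch for illustration (v and its dimension names are made up); in practice, the functions described next are more convenient:

```r
v <- 1:6
is.matrix(v)       # FALSE: a plain atomic vector

dim(v) <- c(2, 3)  # adding a dim attribute reshapes v (filled by column)
is.matrix(v)       # TRUE
v[5]               # 5: the underlying vector can still be indexed

# Optional names for rows and columns are stored as a further attribute:
dimnames(v) <- list(c("r1", "r2"), c("a", "b", "c"))
v["r1", "c"]       # 5
```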

Creating matrices

A way of creating a matrix from an atomic vector is provided by the matrix() function. It contains arguments for data, for the number of rows nrow, the number of columns ncol, and a logical argument byrow that arranges data in a by-row vs. by-column fashion:

# Reshaping an atomic vector into a rectangular matrix:
(m1 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE))
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20

(m2 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = FALSE))
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20

Matrices can also be created from multiple atomic vectors (of the same data type) by binding them together:

  • the rbind() function treats each vector as a row;

  • the cbind() function treats each vector as a column:

# Creating 3 vectors: 
x <- 1:3
y <- 4:6
z <- 7:9

# Combining vectors (of the same length): ---- 
(m3 <- rbind(x, y, z))  # combine as rows
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9

(m4 <- cbind(x, y, z))  # combine as columns
#>      x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
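The restriction to vectors of the same data type matters because a matrix can only hold one type. As a hedged illustration (nums and chars are made-up vectors), binding vectors of different types does not fail, but coerces all values to the most general type:

```r
nums  <- 1:3
chars <- c("a", "b", "c")

mixed <- rbind(nums, chars)  # binding vectors of different types
typeof(mixed)                # "character": the numbers were coerced
mixed["nums", ]              # "1" "2" "3"
```

If the numbers are still needed for calculations, a data frame (see below) is usually the better container.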

When the vectors to be combined differ in length, or when the data does not match the size arguments, R recycles the shorter vectors or truncates the data to fit the dimensions provided. Note that the following commands all issue warnings, as the number of elements does not fit a matrix of the required size:

m <- 1:2
n <- 3:5

rbind(m, n)  # recycling m
#>   [,1] [,2] [,3]
#> m    1    2    1
#> n    3    4    5
cbind(m, n)  # recycling m
#>      m n
#> [1,] 1 3
#> [2,] 2 4
#> [3,] 1 5

matrix(data = 1:10, nrow = 3, ncol = 4)  # recycling data
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    4    7   10
#> [2,]    2    5    8    1
#> [3,]    3    6    9    2
matrix(data = 1:10, nrow = 3, ncol = 3)  # truncating data
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9
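As such warnings are easily overlooked, it can be useful to check for them explicitly. The following sketch uses tryCatch() to turn the warning into a value (msg and fits are hypothetical names; the exact wording of the warning may differ between R versions):

```r
# Capture the warning message instead of the (recycled) matrix:
msg <- tryCatch(
  matrix(data = 1:10, nrow = 3, ncol = 4),
  warning = function(w) conditionMessage(w)
)
is.character(msg)  # TRUE: we obtained a message, not a matrix

# No warning occurs when the data length equals nrow * ncol:
fits <- matrix(data = 1:12, nrow = 3, ncol = 4)
dim(fits)          # 3 4
```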

The matrices m1 to m4 all contained numeric data. However, data of type “logical” or “character” can also be stored in matrix form:

# A matrix of logical values:
(m5 <- matrix(data = 1:18 %% 4 == 0, nrow = 3, ncol = 6, byrow = TRUE))
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
#> [1,] FALSE FALSE FALSE  TRUE FALSE FALSE
#> [2,] FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE FALSE FALSE

# A matrix of character values:
(m6 <- matrix(sample(letters, size = 16), nrow = 4, ncol = 4, byrow = FALSE))
#>      [,1] [,2] [,3] [,4]
#> [1,] "u"  "d"  "s"  "a" 
#> [2,] "j"  "c"  "z"  "f" 
#> [3,] "m"  "b"  "e"  "w" 
#> [4,] "n"  "t"  "h"  "y"

Indexing matrices

Retrieving values from a matrix m works similarly to indexing vectors. First, we will consider numeric indexing. Due to the two-dimensional nature of a matrix, we now need to specify two indices in square brackets: the number of the desired row, and the number of the desired column, separated by a comma. Thus, to get or change the value of row r and column c of a matrix m we need to evaluate m[r, c]. Just as with vectors, providing multiple numeric indices selects the corresponding rows or columns. When the value of r or c is left unspecified, all rows or columns are selected.

# Selecting cells, rows, or columns of matrices: ---- 
m3[2, 3]  # in m3: select row 2, column 3
#> y 
#> 6
m4[3, 1]  # in m4: select row 3, column 1
#> x 
#> 3

m3[2,  ]  # in m3: select row 2, all columns
#> [1] 4 5 6
m4[ , 1]  # in m4: select column 1, all rows
#> [1] 1 2 3

m1
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20
m1[2, 3:4] # in m1: select row 2, columns 3 to 4
#> [1] 7 8
m1[3:5, 2] # in m1: select rows 3 to 5, column 2
#> [1] 10 14 18

m2[]  # in m2: select all rows and all columns (i.e., all of m2)
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20
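Indexing can also be used to change values, to exclude rows or columns, and to keep the matrix shape of a selection. The following sketch works on a copy with the same values as m1 (mx is a hypothetical name, used to leave the matrices above unchanged):

```r
mx <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE)

mx[2, 3] <- 0L    # change the value in row 2, column 3
mx[2, ]           # 5 6 0 8

mx[-1, 1]         # negative index: column 1 without row 1 (5 9 13 17)

# By default, selecting a single row yields a vector;
# drop = FALSE keeps the result as a matrix (here: 1 row, 4 columns):
dim(mx[2, , drop = FALSE])  # 1 4
```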

Similarly, we can extend the notion of logical indexing to matrices:

m2 > 10  # returns a matrix of logical values
#>       [,1]  [,2] [,3] [,4]
#> [1,] FALSE FALSE TRUE TRUE
#> [2,] FALSE FALSE TRUE TRUE
#> [3,] FALSE FALSE TRUE TRUE
#> [4,] FALSE FALSE TRUE TRUE
#> [5,] FALSE FALSE TRUE TRUE
typeof(m2 > 10)
#> [1] "logical"
m2[m2 > 10]  # indexing of matrices
#>  [1] 11 12 13 14 15 16 17 18 19 20
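A logical matrix can also be used to count, locate, or replace values. The following is a minimal sketch on a copy with the same values as the column-wise 5-by-4 matrix above (mz and pos are hypothetical names):

```r
mz <- matrix(data = 1:20, nrow = 5, ncol = 4)

sum(mz > 10)    # 10: counts the TRUE values

# Locate matching cells as (row, col) pairs:
pos <- which(mz > 10, arr.ind = TRUE)
pos[1, ]        # first match is in row 1, column 3

# Replace all matching cells at once:
mz[mz > 10] <- NA
sum(is.na(mz))  # 10
```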

Just as with vectors, we can apply functions to matrices. Typical examples include:

# Applying functions to matrices: ---- 
is.matrix(m1)
#> [1] TRUE
typeof(m2)
#> [1] "integer"

# Note the difference between: 
is.numeric(m1) # type of m1? (1 value)
#> [1] TRUE
is.na(m1)      # NA values in m1? (many values)
#>       [,1]  [,2]  [,3]  [,4]
#> [1,] FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE
#> [5,] FALSE FALSE FALSE FALSE

# Computations with matrices: 
sum(m3)
#> [1] 45
max(m4)
#> [1] 9
mean(m1)
#> [1] 10.5
colSums(m1)  # column sums of m1
#> [1] 45 50 55 60
rowSums(m2)  # row sums of m2
#> [1] 34 38 42 46 50
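Row-wise or column-wise summaries other than sums can be computed with the base R function apply(). The following sketch uses a copy with the same values as the row-wise 5-by-4 matrix above (ma is a hypothetical name):

```r
ma <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE)

apply(ma, MARGIN = 1, FUN = max)  # per row:    4 8 12 16 20
apply(ma, MARGIN = 2, FUN = min)  # per column: 1 2 3 4

# apply() with sum yields the same values as colSums():
all(apply(ma, MARGIN = 2, FUN = sum) == colSums(ma))  # TRUE
```

Here, MARGIN = 1 applies the function to each row and MARGIN = 2 to each column.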

Just as length() provides crucial information about a vector, some functions are specifically designed to provide the dimensions of rectangular data structures:

ncol(m2)  # number of columns 
#> [1] 4
nrow(m2)  # number of rows
#> [1] 5
dim(m2)   # dimensions as vector c(rows, columns)
#> [1] 5 4

A typical function in the context of matrices is t() for transposing (i.e., swapping the rows and columns of) a matrix:

t(m2)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    3    4    5
#> [2,]    6    7    8    9   10
#> [3,]   11   12   13   14   15
#> [4,]   16   17   18   19   20
t(m5)
#>       [,1]  [,2]  [,3]
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE  TRUE FALSE
#> [3,] FALSE FALSE FALSE
#> [4,]  TRUE FALSE  TRUE
#> [5,] FALSE FALSE FALSE
#> [6,] FALSE  TRUE FALSE

3.4.2 Data frames

Table ?? was rectangular in containing three rows (values for the variables name, gender, and age) and one column for each of the five people (plus an initial column indicating the variable name in each row). This is a perfectly valid table, but not the type of table typically used in R.

Typical tables of data in R also combine several vectors into a larger data structure, but use the individual vectors as columns, rather than rows. Such a combination of several vectors (as columns) is shown in Table 3.2:

Table 3.2: The same facts about five people (as a table with 5 rows and 3 columns).
name gender age
Adam male 21
Ben male 19
Cecily female 20
David male 48
Evelyn misc 45

Importantly, Table 3.2 provides exactly the same information as Table ?? and as the three individual vectors (name, gender, and age) above, but in the shape of a table that uses our previous vectors as its columns, rather than as its rows.
As all elements of an (atomic) vector in R need to have the same data type (e.g., name contains character data, whereas age contains numeric data), the information on each person, which involves multiple data types, cannot be stored as an atomic vector. Instead, we represent each person as a row (aka. an observation) of the table.
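To see why a person cannot be stored as an atomic vector, consider what happens when we try. This is a small sketch using the first row of Table 3.2 (adam and adam_list are hypothetical names):

```r
# Combining mixed data in an atomic vector coerces all values:
adam <- c("Adam", "male", 21)
typeof(adam)  # "character"
adam[3]       # "21": the age has become text

# A list keeps the type of each element:
adam_list <- list(name = "Adam", gender = "male", age = 21)
adam_list$age + 1  # 22
```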

Creating data frames

To create a data frame, we first need the data to be framed in some other form. The most typical scenario is that we have the data as a set of vectors. If these vectors have the same length, creating a data frame from vectors can be achieved by the data.frame() function. For instance, we can define a data frame df from our name, gender, and age vectors (from above) by assigning the result of data.frame(name, gender, age) to df:

(df <- data.frame(name, gender, age))
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45

A remarkable fact about data frame df is that it is an object that combines multiple data types. Internally, R represents data frames as a list of atomic vectors. The vectors form the columns of the data frame df, rather than its rows. More trivially, we could have called our object some_data or five_people, but the name df is often used as a short and convenient name for a data frame that can be poked and probed later.
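The list-like nature of a data frame can be verified as follows. As the original vectors are defined earlier in the chapter, this sketch reconstructs them from Table 3.2 (their exact definitions, e.g., whether age is stored as integer or double, are assumptions):

```r
# Vectors reconstructed from Table 3.2 (assumed definitions):
name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)

df <- data.frame(name, gender, age, stringsAsFactors = FALSE)

typeof(df)          # "list": the internal representation
class(df)           # "data.frame": how R treats and prints it
sapply(df, typeof)  # each column keeps its own type
str(df)             # compact overview of rows, columns, and types
```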

The data.frame() function is quite powerful and coerces a variety of objects into data frames. For instance, we can use it to turn individual vectors or matrices into data frames:

# Creating data frames from a vector or matrix: 
v <- 1:9
m <- matrix(v, nrow = 3)

data.frame(v)  # from vector
#>   v
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
data.frame(m)  # from matrix
#>   X1 X2 X3
#> 1  1  4  7
#> 2  2  5  8
#> 3  3  6  9

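Due to the rectangular shape of a data frame, its columns all need to have the same length. If the vectors used to create a data frame do not have the same length, the shorter one(s) are recycled to the length of the longest one, provided that the longest length is a whole multiple of each shorter length (otherwise, data.frame() stops with an error):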

# From vectors of different length: 
abc <- letters[1:3]
data.frame(v, abc)
#>   v abc
#> 1 1   a
#> 2 2   b
#> 3 3   c
#> 4 4   a
#> 5 5   b
#> 6 6   c
#> 7 7   a
#> 8 8   b
#> 9 9   c
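Unlike matrix(), the data.frame() function does not recycle vectors whose lengths do not divide evenly. The following sketch illustrates this with a vector of length 2 (res, fixed, and the column name ab are hypothetical):

```r
v <- 1:9

# 9 is not a multiple of 2, so data.frame() stops with an error:
res <- tryCatch(
  data.frame(v, ab = letters[1:2]),
  error = function(e) "error"
)
res  # "error"

# If recycling is really intended, it can be made explicit:
fixed <- data.frame(v, ab = rep_len(letters[1:2], length(v)))
dim(fixed)  # 9 2
```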

In Chapter 2, we learned that each object is characterized by its shape. So what is the shape of the data frame df? For linear data structures (like vectors or lists), basic shape information is provided by length(). For rectangular data structures (like matrices or data frames), we can still ask about their length(), but the dim() function is more informative, as it provides both dimensions:

length(df) # number of vectors/columns  
#> [1] 3
dim(df)    # dimensions 
#> [1] 5 3

nrow(df)   # dim[1]
#> [1] 5
ncol(df)   # dim[2]
#> [1] 3

As data frames are the most common way of storing data in R, there is a special form of indexing that allows accessing the variables of a data frame (i.e., the columns of a data frame) as vectors.

Name-based indexing of data frames

When a table tb has column names (e.g., a column called name), we can retrieve the corresponding vector by name-based indexing (aka. name indexing). This is the most convenient and most frequent way of accessing variables (i.e., columns) of tables (e.g., data frames). This form of indexing uses a special dollar sign notation: Adding $ and the name of the desired variable to the table’s object name tb yields the corresponding column as a vector. This sounds complicated, but is actually very easy:

tb$name

In case of our data frame df, we can access its 1st and 2nd columns by their respective names:

names(df)  # prints the (column) names
#> [1] "name"   "gender" "age"

df$name
#> [1] "Adam"   "Ben"    "Cecily" "David"  "Evelyn"
df$gender
#> [1] "male"   "male"   "female" "male"   "misc"
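The dollar sign notation also works on the left side of an assignment, which allows changing existing columns or adding new ones. This is a minimal sketch on a small, made-up data frame (df_demo and the column adult are hypothetical):

```r
df_demo <- data.frame(name = c("Adam", "Ben"),
                      age  = c(21, 19),
                      stringsAsFactors = FALSE)

df_demo$age   <- df_demo$age + 1    # change an existing column
df_demo$adult <- df_demo$age >= 21  # add a new (logical) column

names(df_demo)  # "name" "age" "adult"
df_demo$adult   # TRUE FALSE
```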

Indexing data frames

Note that everything we have learned about numeric and logical indexing of vectors and matrices (above) also applies to data frames. Thus, we can also use numerical indexing on a data frame, just as with matrices (above). For instance, to get all rows of the first column, we can specify the data frame’s name, followed by [ , 1]:

df[ , 1]  # get (all rows and) the 1st column of df
#> [1] "Adam"   "Ben"    "Cecily" "David"  "Evelyn"
df[ , 2]  # get (all rows and) the 2nd column of df 
#> [1] "male"   "male"   "female" "male"   "misc"

Thus, these two expressions retrieve the 1st and 2nd column of the data frame df (as vectors), respectively. As retrieving columns is a very common task in R, the name-based indexing introduced above (e.g., df$name) is usually the more convenient way of accessing the variables (columns) of a data frame.

Logical indexing on data frames is particularly powerful in allowing us to select particular rows (based on conditions specified on columns of the same data frame):

df[df$gender == "male", ]
#>    name gender age
#> 1  Adam   male  21
#> 2   Ben   male  19
#> 4 David   male  48
df[df$age < 21, ]
#>     name gender age
#> 2    Ben   male  19
#> 3 Cecily female  20
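Several conditions can be combined with the logical operators & (and) and | (or). The following sketch rebuilds df from Table 3.2 (the vector definitions are assumptions) and also shows the base R function subset() as an alternative notation:

```r
# Data frame reconstructed from Table 3.2 (assumed definitions):
df <- data.frame(
  name   = c("Adam", "Ben", "Cecily", "David", "Evelyn"),
  gender = c("male", "male", "female", "male", "misc"),
  age    = c(21, 19, 20, 48, 45),
  stringsAsFactors = FALSE
)

# Both conditions must hold (rows of Adam and Ben):
df[df$gender == "male" & df$age < 30, ]

# At least one condition must hold, keeping only the name column:
df[df$gender == "female" | df$age > 40, "name"]  # "Cecily" "David" "Evelyn"

# subset() evaluates conditions within the data frame:
subset(df, age < 21, select = c(name, age))
```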

Note that the different types of indexing can be flexibly combined. For instance, the following command uses

  • logical indexing (to select rows of df with an age value below 30)
  • numerical indexing (to select only columns 1 and 2)
  • name indexing (to get the variable name, as a vector), and
  • numerical indexing (to select the 3rd element of this vector):
df[df$age < 30, c(1, 2)]$name[3]
#> [1] "Cecily"

In practice, such complex combinations are rarely necessary or useful. For instance, the following expressions retrieve the exact same result as the complex one, but have much simpler semantics:

df[3, 1]
#> [1] "Cecily"
df$name[3]
#> [1] "Cecily"

As data frames are lists, we can also access their elements as we access list elements (e.g., with double square brackets, as in df[[1]] or df[["name"]]).
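A short sketch of list-style access, using a data frame rebuilt from Table 3.2 (the vector definitions are assumptions):

```r
# Data frame reconstructed from Table 3.2 (assumed definitions):
df <- data.frame(
  name   = c("Adam", "Ben", "Cecily", "David", "Evelyn"),
  gender = c("male", "male", "female", "male", "misc"),
  age    = c(21, 19, 20, 48, 45),
  stringsAsFactors = FALSE
)

df[[1]]       # 1st element (column) as a vector
df[["age"]]   # element named "age" as a vector

# Single brackets return a sub-list, i.e., a smaller data frame:
class(df["age"])              # "data.frame"

identical(df[[1]], df$name)   # TRUE
```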

Using lists or data frames?

Knowing that data frames are lists may suggest that it does not matter whether we use a data frame or a list to store data. This impression is false. Although it is possible to store many datasets either as a list or as a rectangular table, it is typically better to opt for the simpler format that is supported by more tools.

In principle, data frames and lists can store the same data (and even are variants of the same R data structure, i.e., a list). However, pragmatic reasons tip the balance in favor of data frames in most use cases: Whenever a set of vectors to be combined all have the same length, their combination would create a rectangular shape. As many R functions assume or are optimized for rectangular data structures, using data frames is typically the better choice.

The lesson to be learned here is that we should aim for the simplest data structure that matches the properties of our data. Although lists are more flexible than data frames, they are rarely needed in applied contexts. As a general rule, simpler structures are to be preferred to more complex ones:

  • For linear sequences of homogeneous data, vectors are preferable to lists.

  • For rectangular shapes of heterogeneous data, data frames are preferable to lists.

Thus, as long as data fits into the simple and regular shapes of vectors and data frames, there is no need for using lists. Vectors and data frames are typically easier to create and use than corresponding lists. Additionally, many R functions are written and optimized for vectors and data frames, rather than lists. As a consequence, lists should only be used when data requires mixing both different types and shapes of data, or when data objects get so complex, irregular, or unpredictable that they do not fit into a rectangular table.
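As an illustration of data that does not fit into a rectangle, consider a record whose elements differ in both type and length (person and its elements are made up for this sketch):

```r
# Elements of different types and lengths require a list:
person <- list(name   = "Adam",
               scores = c(12, 15, 9),
               passed = TRUE)

lengths(person)       # name: 1, scores: 3, passed: 1
mean(person$scores)   # 12
```

Forcing such data into a data frame would require repeating or discarding values, so a list is the more faithful representation here.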

Strings as factors

Note that the data.frame() function has an argument stringsAsFactors. This argument determines whether so-called string variables (i.e., of data type “character”) are converted into factors (i.e., categorical variables, which are internally represented as integer values with text labels) when generating a data frame. To the chagrin of generations of R users, the default of this argument used to be TRUE for several decades — which essentially meant that any character variable in a data frame was converted into a factor unless the user had specified stringsAsFactors = FALSE. As this caused much confusion, the default has been changed with the release of R version 4.0.0 (on 2020-04-24) to stringsAsFactors = FALSE. This shows that the R gods at https://cran.r-project.org/ are responding to user feedback. However, as any such changes are unlikely to happen quickly, it is safer to explicitly set the arguments of a function. To see the difference between both settings, consider the following example:

df_1 <- data.frame(name, gender, age, 
                   stringsAsFactors = FALSE)  # new default (since R 4.0.0)
df_2 <- data.frame(name, gender, age, 
                   stringsAsFactors = TRUE)   # old default (up to R 3.6.3)

# Both data frames look identical:
df_1  
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
df_2
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45

Printing the two data frames df_1 and df_2 shows us no difference between them. However, as the first two variables (i.e., name and gender) were string variables (i.e., of type “character”), they remained character variables in df_1, but are represented as factors in df_2.

Let’s retrieve the first column of each data frame (as a vector). Using name-based indexing, we can easily retrieve and print the first column (i.e., with a name of name) of either data frame:

df_1$name
#> [1] "Adam"   "Ben"    "Cecily" "David"  "Evelyn"
df_2$name
#> [1] Adam   Ben    Cecily David  Evelyn
#> Levels: Adam Ben Cecily David Evelyn

Note the differences in the printed outputs. The output of df_1$name looks just like any other character vector (with five elements, each consisting of a name). By contrast, the output of df_2$name also prints the same names, but without the characteristic double quotes around each name, and with a second line starting with “Levels:” before seeming to repeat the names of the first line. Before clarifying what this means, check the other variable in both df_1 and df_2 that used to be a character vector (i.e., gender):

df_1$gender
#> [1] "male"   "male"   "female" "male"   "misc"
df_2$gender
#> [1] male   male   female male   misc  
#> Levels: female male misc

Again, df_1$gender appears to be a character vector, but df_2$gender has been converted into something else. This time, the line beginning with “Levels:” only contains each of the gender labels once, and in alphabetical order.

In case you’re not confused yet, compare the outputs of the following commands:

typeof(df_1$name)
#> [1] "character"
typeof(df_2$name)
#> [1] "integer"

Whereas df_1$name was to be expected to be of type character, it should come as a surprise to see that df_2$name is of type integer. Given that df_2$name contains integers, we might be tempted to try out arithmetic functions like:

max(df_2$name)
sum(df_2$name)
mean(df_2$name)

If we try to evaluate these expressions, max() and sum() stop with an error message, whereas mean() issues a warning and returns NA. How can we make sense of all this?

The magic word here is factor. As the argument stringsAsFactors = TRUE suggests, the character strings of the name and gender vectors have been converted into factors when defining df_2. Factors are categorical variables that only care about whether two values belong to the same or to different groups. Actually, R internally encodes them as numeric values (integers) for each factor level. But as we never want to calculate with these numeric values (as they have no meaning beyond being either the same or different), they are also assigned a label, which is shown when printing the values of a factor.
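The relation between integer codes and labels can be inspected on a small factor. The following sketch uses made-up data (sizes, f, and f2 are hypothetical names) and the base R functions factor(), levels(), and as.integer():

```r
sizes <- c("small", "large", "small", "medium")
f <- factor(sizes)

levels(f)        # "large" "medium" "small" (sorted alphabetically)
as.integer(f)    # 3 1 3 2: position of each value in levels(f)
as.character(f)  # back to the original labels

# The order of levels can be set explicitly:
f2 <- factor(sizes, levels = c("small", "medium", "large"))
as.integer(f2)   # 1 3 1 2
```

The integer codes thus depend on the order of the levels, which is one reason why calculating with them is rarely meaningful.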

A quick way of checking that we’re dealing with a factor is the is.factor() function:

is.factor(df_1$name)
#> [1] FALSE
is.factor(df_2$name)
#> [1] TRUE

is.factor(df_1$gender)
#> [1] FALSE
is.factor(df_2$gender)
#> [1] TRUE

Factor variables are often useful (e.g., for distinguishing between groups in statistical designs). But it is premature to assume that any character variable should be a factor when including the variable in a data frame. Thus, it is a good thing that the default argument in the data.frame() function has been changed to stringsAsFactors = FALSE in R version 4.0.0.
Whoever wants factors can still get and use them — but novice users no longer need to deal with them all the time.

3.4.3 Practice

The following practice exercises allow you to check your understanding of this section.

Accessing and evaluating matrices

Assuming the definitions of the matrices m5 and m6 from above, i.e.,

m5
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
#> [1,] FALSE FALSE FALSE  TRUE FALSE FALSE
#> [2,] FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE FALSE FALSE
m6
#>      [,1] [,2] [,3] [,4]
#> [1,] "u"  "d"  "s"  "a" 
#> [2,] "j"  "c"  "z"  "f" 
#> [3,] "m"  "b"  "e"  "w" 
#> [4,] "n"  "t"  "h"  "y"
  • predict, evaluate, and explain the result of the following R expressions:
m5[2, 6]
m5[2, ]
m5 == FALSE
sum(m5)
t(t(m5))

m6[2, 3]
m6[ , 4]
m6[nrow(m6), (ncol(m6) - 1)]
m6 == "e"
toupper(m6[4, ])

Numeric indexing of data frames

Assuming the data frame df_2 (from above),

df_2
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
  • predict, evaluate and explain what happens in the following commands (in terms of numeric indexing):
df_2[]
df_2[ , 1]
df_2[1:nrow(df_2), c(1)]
df_2[nrow(df_2):1, c(1)]
df_2[rep(1, 3), c(1, 2)]
df_2$name[3]
# compare: 
df_2[1:nrow(df_2), 1:ncol(df_2)]
df_2[1:nrow(df_2), ncol(df_2):1]
df_2[nrow(df_2):1, ncol(df_2):1]

Logical indexing of data frames

Assuming the data frame df_1 (from above),

df_1
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
  • predict, evaluate and explain what happens in the following commands (in terms of logical indexing):
df_1[ , 3] > 30
df_1[df_1$age > 30, ]
df_1[df_1$gender != "male", c(1, 3, 2)]
df_1$name[df_1$gender == "male"]
sum(df_1$age[df_1$gender == "male"])

Data frames with factors

  • Given that our definition of df_2 used stringsAsFactors = TRUE (see above), predict, evaluate and explain what happens in the following commands:
nchar(as.character(df_2$name[3]))
as.numeric(df_2$name[3]) + 1
mean(as.numeric(df_2$name))
  • Why would the following commands (which are simpler variants of the last three expressions) yield errors or warnings?
nchar(df_2$name[3])
df_2$name[3] + 1
mean(df_2$name)
  • What would happen, if the same commands were used on df_1 (from above)?
nchar(df_1$name[3])
df_1$name[3] + 1
mean(df_1$name)

Details:

  • data frames: data.frame() vs. as.data.frame()
  • tibbles: as_tibble() (of tibble package) converts a data frame into a tibble
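One visible difference between data.frame() and as.data.frame() concerns the default column names when converting a matrix without column names. This is a small sketch (mat is a hypothetical name), not a full comparison of both functions:

```r
mat <- matrix(1:6, nrow = 2)

names(data.frame(mat))     # "X1" "X2" "X3"
names(as.data.frame(mat))  # "V1" "V2" "V3"
```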

Ways of accessing and manipulating tables

Applying functions to tables:

  • Checking for NA values (in vectors or tables) by using the is.na() function.
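As a hedged illustration of this last point, the following sketch checks a small, made-up data frame for missing values (d and its columns are hypothetical):

```r
d <- data.frame(x = c(1, NA, 3),
                y = c("a", "b", NA),
                stringsAsFactors = FALSE)

is.na(d)           # logical matrix with the same shape as d
colSums(is.na(d))  # number of NA values per column (x: 1, y: 1)
anyNA(d)           # TRUE: at least one value is missing
complete.cases(d)  # TRUE FALSE FALSE: rows without any NA
```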