2.3 Data structures

So far, we used scalar objects (i.e., objects with a length() of 1). To combine multiple scalars in one object, we need to construct larger data structures. In R, we distinguish between the following structures of data objects (aka. data shapes):

  1. scalars (i.e., individual objects, vectors of length 1)
  2. vectors (one dimension, i.e., 1D)
  3. tables (two dimensions, i.e., 2D)
  4. arrays (\(n\) dimensions, i.e., \(n\)D)
  5. non-rectangular or unstructured data

The key data structures covered in this book are scalars, vectors, and tables (1.–3.). In R, data structures are further distinguished by whether they contain a single data type or multiple data types. Thus, Table 2.2 distinguishes between data structures for “homogeneous” vs. “heterogeneous” data types.

Table 2.2: Overview of R data structures (i.e., data shapes, based on data types).
Dimensions   Homogeneous data types   Heterogeneous data types
1D           atomic vector            list
2D           matrix                   table (data frame/tibble)
nD           array

Although Table 2.2 contains five different data structures, two of them are by far the most important for our purposes:

  • vectors are linear (1-dimensional) data structures. So-called atomic vectors only contain a single data type and have a length of 1 or more elements.

  • tables are rectangular (2-dimensional) data structures that can contain data of different types (in different columns). The terms data frames and tibbles denote two slightly different types of tables.

A good question is: Where are scalar objects in Table 2.2? The answer is: R is a vector-based language. Thus, even scalar objects are represented as (atomic, i.e., homogeneous) vectors (of length \(1\)).
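
A quick check illustrates this point. The following is a minimal sketch, and the object name a_scalar is chosen here purely for illustration:

```r
# A "scalar" in R is simply an atomic vector of length 1:
a_scalar <- 42

is.vector(a_scalar)         # TRUE: a single number is a vector
length(a_scalar)            # 1
identical(a_scalar, c(42))  # TRUE: combining one element changes nothing
```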

2.3.1 Vectors

Vectors are by far the most common and most important data structure in R. Essentially, a vector is an ordered sequence of elements with three common properties:

  1. its type of elements (tested by typeof());
  2. its length (tested by length());
  3. optional attributes or meta-data (tested by attributes()).

More specifically, there are two types of vectors:

  • in atomic vectors, all elements are of the same type
  • in lists, elements can have different types

The vast majority of vectors we will encounter are atomic vectors (i.e., all elements of the same type), but lists are often used in R for storing a variety of data types in a common object (e.g., in statistical analyses).

Atomic vectors can contain elements of any basic data type, as long as all elements are of the same type. The simplest way of creating a vector is by using the c() function (think chain, combine, or concatenate) on a number of objects:

# Create vectors:
v_lg <- c(TRUE, FALSE)   # logical vector
v_n1 <- c(1, pi, 4.5)    # numeric vector (double)
v_n2 <- c(2L, 3L, 5L)    # numeric vector (integer)
v_cr <- c("hi", "Hallo", "salut")  # character vector

Whenever we encounter a new vector, the first thing to do is to check its type and length:

# type:
typeof(v_n1)
#> [1] "double"
typeof(v_cr)
#> [1] "character"
# length: 
length(v_lg)
#> [1] 2
length(v_n2)
#> [1] 3
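
The same two checks also work for lists. The following minimal sketch contrasts an atomic vector with a list (the object names v_atomic and v_list are chosen here for illustration and are not used elsewhere in this chapter):

```r
# An atomic vector holds elements of one type only:
v_atomic <- c(1, 2, 3)
typeof(v_atomic)        # "double"
length(v_atomic)        # 3

# A list can hold elements of different types (and lengths):
v_list <- list(1, "two", c(TRUE, FALSE))
typeof(v_list)          # "list"
length(v_list)          # 3 (one per list element, not per value)
sapply(v_list, typeof)  # type of each element: "double" "character" "logical"
```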

Beyond these elementary functions, the majority of functions in R can be applied to vectors. However, most functions require a particular data type to work properly. For instance, a common operation on vectors is sorting, which is achieved by the sort() function. Note that sort() returns a sorted copy and leaves the original vector unchanged, unless the result is assigned back to it. Its argument decreasing is set to FALSE by default, but can be set to TRUE if sorting in decreasing order is desired:

x <- c(4, 6, 2)

sort(x)
#> [1] 2 4 6
sort(x, decreasing = TRUE)
#> [1] 6 4 2

What happens when we apply sort() to other data types?

y <- c(TRUE, FALSE, TRUE, FALSE)
sort(y)
#> [1] FALSE FALSE  TRUE  TRUE

z <- c("A", "N", "T")
sort(z, decreasing = TRUE)
#> [1] "T" "N" "A"

This shows that generic R functions like sort() often work with multiple data types. However, many functions simply require specific data types and would not work with others. For instance, as most mathematical functions require numeric objects to work, the following would create an error:

sum("A", "B", "C")  # would yield an error

However, remember that vectors of logical values can be interpreted as numbers (FALSE as 0 and TRUE as 1):

v_lg2 <- c(FALSE, TRUE, FALSE)
v_nm2 <- c(4, 5)

c(v_lg2, v_nm2)
#> [1] 0 1 0 4 5
mean(v_lg2)
#> [1] 0.3333333
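
This interpretation is what makes counting with logical tests possible. As a minimal sketch (the vector ages and its values are hypothetical and only serve as an illustration), sum() counts how many elements pass a test and mean() yields their proportion:

```r
# Hypothetical example values:
ages <- c(21, 19, 20, 48, 45)

over_20 <- ages > 20   # logical vector: TRUE FALSE FALSE TRUE TRUE
sum(over_20)           # 3   (number of TRUE values)
mean(over_20)          # 0.6 (proportion of TRUE values)
```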

As attributes are optional, most vectors have no attributes:

v_n2
#> [1] 2 3 5
attributes(v_n2)
#> NULL

The most common attribute of a vector \(v\) is the names of its elements, which can be set or retrieved by names(v):

# Setting names:
names(v_n2) <- c("A", "B", "C")
names(v_cr) <- c("en", "de", "fr")

# Getting names:
names(v_n2)
#> [1] "A" "B" "C"

Other attributes can be defined as name-value pairs using attr(v, name) <- value and inspected by attributes(), str() or structure():

# Adding attributes:
attr(v_cr, "my_dictionary") <- "Words to greet people"

# Viewing attributes:
attributes(v_n2)
#> $names
#> [1] "A" "B" "C"
attributes(v_cr)
#> $names
#> [1] "en" "de" "fr"
#> 
#> $my_dictionary
#> [1] "Words to greet people"

# Inspecting a vector's structure: 
str(v_cr)
#>  Named chr [1:3] "hi" "Hallo" "salut"
#>  - attr(*, "names")= chr [1:3] "en" "de" "fr"
#>  - attr(*, "my_dictionary")= chr "Words to greet people"
structure(v_cr)
#>      en      de      fr 
#>    "hi" "Hallo" "salut" 
#> attr(,"my_dictionary")
#> [1] "Words to greet people"

There exists an is.vector() function in R, but it does not merely test whether an object is a vector. Instead, it returns TRUE only if the object is a vector with no attributes other than names.

To test if an object v actually is a vector, we can use is.atomic(v) | is.list(v) (i.e., test if it is an atomic vector or a list) or use an auxiliary is_vector() function of various packages (e.g., purrr):

# (1) A vector with only names:
is.vector(v_n2)  
#> [1] TRUE

# (2) A vector with other attributes:
is.vector(v_cr)
#> [1] FALSE
is.atomic(v_cr)
#> [1] TRUE
purrr::is_vector(v_cr)
#> [1] TRUE

Combining vectors

The c() function used to combine objects into vectors can also be used to combine scalars and vectors, or multiple vectors:

# Combining scalar objects and vectors (defined above): 
v1 <- 1
v2 <- c(2, 3)
v3 <- c(4, 5)

v4 <- c(v1, v2, v3)     # but the result is only 1 vector, not 2 or 3: 
v4
#> [1] 1 2 3 4 5

Note that the new vector v4 is still a flat vector, rather than a vector containing other vectors (i.e., c() flattens hierarchical vector structures into simple vectors).

Coercion of data types

When combining different data types, they are coerced into a single data type. The result is either a numeric vector (when mixing truth values and numeric objects) or a character vector (when mixing anything with characters):

# Combining different data types:
x <- c(TRUE, 2L, 3.0)  # logical, integer, double
x
#> [1] 1 2 3
typeof(x)
#> [1] "double"

y <- c(TRUE, "two")  # logical, character
y
#> [1] "TRUE" "two"
typeof(y)
#> [1] "character"

z <- c(TRUE, 2, "three")  # logical, numeric, character
z
#> [1] "TRUE"  "2"     "three"
typeof(z)
#> [1] "character"
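
Coercion does not only happen implicitly. As a brief sketch that goes slightly beyond the text above, the base R functions as.numeric(), as.character(), and as.logical() request a conversion explicitly:

```r
# Explicit coercion with as.*() functions:
as.numeric(c(TRUE, FALSE))   # 1 0
as.character(c(1, 2.5))      # "1" "2.5"
as.numeric(c("1", "2.5"))    # 1.0 2.5
as.logical(c(0, 1, 2))       # FALSE TRUE TRUE (0 is FALSE, other numbers are TRUE)

# A conversion that cannot succeed yields NA (and a warning, suppressed here):
not_a_number <- suppressWarnings(as.numeric("three"))
not_a_number                 # NA
```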

Vector creation functions

The c() function is used for combining existing vectors. However, for creating vectors that contain more than just a few elements (i.e., vectors with larger length() values), using the c() function and then typing all vector elements becomes impractical. Useful functions and shortcuts to generate continuous or regular sequences are the colon operator :, and the functions seq() and rep():

  • m:n generates a numeric sequence (in steps of \(1\) or \(-1\)) from m to n:
# Colon operator (with by = 1):
s1 <- 0:10
s1
#>  [1]  0  1  2  3  4  5  6  7  8  9 10
s2 <- 10:0
all.equal(s1, rev(s2))
#> [1] TRUE
  • seq() generates numeric sequences from an initial number from to a final number to and allows either setting the step-width by or the length of the sequence length.out:
# Sequences with seq():
s3 <- seq(0, 10, 1)  # is short for: 
s3
#>  [1]  0  1  2  3  4  5  6  7  8  9 10
s4 <- seq(from = 0, to = 10, by = 1)
all.equal(s3, s4)
#> [1] TRUE
all.equal(s1, s3)
#> [1] TRUE

# Note: seq() is more flexible:
s5 <- seq(0, 10, by = 2.5)        # set step size
s5
#> [1]  0.0  2.5  5.0  7.5 10.0
s6 <- seq(0, 10, length.out = 5)  # set output length
all.equal(s5, s6)
#> [1] TRUE
  • rep() replicates the values provided in its first argument x either times times or each element each times:
# Replicating vectors (with rep):
s7 <- rep(c(0, 1), 3)  # is short for:
s7
#> [1] 0 1 0 1 0 1
s8 <- rep(x = c(0, 1), times = 3)
all.equal(s7, s8)
#> [1] TRUE

# but differs from:
s9 <- rep(x = c(0, 1), each = 3)
s9
#> [1] 0 0 0 1 1 1

Whereas : and seq() create numeric vectors, rep() can be used with other data types:

rep(c(TRUE, FALSE), times = 2)
#> [1]  TRUE FALSE  TRUE FALSE
rep(c("A", "B"), each = 2)
#> [1] "A" "A" "B" "B"
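
Two related base R helpers, seq_len() and seq_along(), are not covered above but are often used to generate index sequences. The following sketch (with hypothetical objects v and empty) shows why they can be safer than the colon operator when a vector may be empty:

```r
v <- c("a", "b", "c")
seq_along(v)        # 1 2 3 (one index per element of v)
seq_len(5)          # 1 2 3 4 5

# Difference to the colon operator for empty input:
empty <- character(0)
1:length(empty)     # 1 0 (counts down from 1 to 0, usually not intended)
seq_along(empty)    # integer(0) (an empty sequence)
```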

Random sampling from a population

A frequent situation when working with R is that we want a sequence of elements (i.e., a vector) that are randomly drawn from a given population. The sample() function allows drawing a sample of size size from a population x. A logical argument replace specifies whether the sample is to be drawn with or without replacement. Not surprisingly, the population x is provided as a vector of elements and the result of sample() is another vector of length size:

# Sampling vector elements (with sample):
sample(x = 1:3, size = 10, replace = TRUE)
#>  [1] 1 2 1 2 3 1 2 2 2 2
# Note:
# sample(1:3, 10)  
# would yield an error (as replace = FALSE by default). 

# Note:
one_to_ten <- 1:10
sample(one_to_ten, size = 10, replace = FALSE)  # drawing without replacement
#>  [1]  3  5  8  1  6  7  2 10  4  9
sample(one_to_ten, size = 10, replace = TRUE)   # drawing with replacement
#>  [1]  2  4  3  7 10  2  5  3  1  3

As the x argument of sample() accepts non-numeric vectors, we can use the function to generate sequences of random events. For instance, we can use character vectors to sample sequences of letters or words (which can be used to represent random events):

# Random letter/word sequences:
sample(x = c("A", "B", "C"), size = 10, replace = TRUE)
#>  [1] "C" "A" "B" "C" "C" "C" "C" "B" "A" "B"
sample(x = c("he", "she", "is", "good", "lucky", "sad"), size = 5, replace = TRUE)
#> [1] "lucky" "she"   "good"  "is"    "good"

# Binary sample (coin flip): 
coin <- c("H", "T")    # 2 events: Heads or Tails
sample(coin, 5, TRUE)  # is short for: 
#> [1] "T" "H" "T" "T" "H"
sample(x = coin, size = 5, replace = TRUE)    # flip coin 5 times
#> [1] "H" "H" "T" "T" "H"

# Flipping 10,000 coins:
coins_10000 <- sample(x = coin, size = 10000, replace = TRUE)  # flip coin 10,000 times
table(coins_10000)  # overview of 10,000 flips
#> coins_10000
#>    H    T 
#> 5049 4951
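
As sample() draws randomly, the outputs shown here will differ on every evaluation. When a result needs to be reproducible, the random number generator can be initialized with set.seed(). The following is a minimal sketch, in which the seed value 2024 and the object names draw_1 and draw_2 are arbitrary choices made for illustration:

```r
set.seed(2024)  # arbitrary seed value (any integer works)
draw_1 <- sample(x = 1:6, size = 5, replace = TRUE)

set.seed(2024)  # resetting the same seed ...
draw_2 <- sample(x = 1:6, size = 5, replace = TRUE)

identical(draw_1, draw_2)  # TRUE: ... reproduces the same sample
```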

Accessing and changing vectors

Having found various ways of storing R objects in vectors, we need to ask:

  • How can we access, test for, or replace individual vector elements?

These tasks are summarily known as indexing or subsetting. As these are extremely common and important tasks, there are many ways of accessing and changing vector elements. We will only cover the two most important ones here (but Chapter 4 Subsetting of Wickham (2019a) lists six different ways):

  1. Numerical indexing/subsetting provides a numeric (vector of) value(s) denoting the position(s) of the desired elements in a vector in square brackets []. Given a character vector ABC (of length 5):
ABC <- c("Anna", "Ben", "Cecily", "David", "Eve")
ABC
#> [1] "Anna"   "Ben"    "Cecily" "David"  "Eve"

here are two ways of accessing particular elements of this vector:

ABC[3]
#> [1] "Cecily"
ABC[c(2, 4)]
#> [1] "Ben"   "David"

Rather than merely accessing these elements, we can also change these elements by assigning new values to them:

ABC[1] <- "Annabelle"
ABC[c(2, 3)] <- c("Benjamin", "Cecilia")
ABC
#> [1] "Annabelle" "Benjamin"  "Cecilia"   "David"     "Eve"

Providing negative indices yields all elements of a vector except for the ones at the specified positions:

ABC[-1]
#> [1] "Benjamin" "Cecilia"  "David"    "Eve"
ABC[c(-2, -4, -5)]
#> [1] "Annabelle" "Cecilia"

Even providing non-existent or missing (NA) indices yields sensible results:

ABC[99]  # accessing a non-existent position, vs. 
#> [1] NA
ABC[NA]  # accessing a missing (NA) position
#> [1] NA NA NA NA NA

Note that missing values are contagious in R: As NA is a logical value by default, it is recycled to the length of the indexed vector. Thus, asking for the NA-th element of a vector yields a vector of the same length with only NA values (and names).
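
The behavior of an NA index depends on its type. The following sketch uses a short hypothetical vector v_abc to contrast the default (logical) NA with an integer NA:

```r
v_abc <- c("Anna", "Ben", "Cecily")  # hypothetical example vector

typeof(NA)            # "logical"
v_abc[NA]             # NA NA NA (a logical NA is recycled to length 3)
v_abc[NA_integer_]    # NA (an integer NA denotes one unknown position)
v_abc[c(1, NA, 3)]    # "Anna" NA "Cecily"
```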

  2. Logical indexing/subsetting provides a logical (vector of) value(s) in square brackets []. The provided vector of TRUE or FALSE values is typically of the same length as the indexed vector v.

For instance, assuming a numeric vector one_to_ten:

one_to_ten <- 1:10 
one_to_ten
#>  [1]  1  2  3  4  5  6  7  8  9 10

we could select its elements in the first and third position by:

one_to_ten[c(TRUE, FALSE, TRUE, FALSE, FALSE, 
             FALSE, FALSE, FALSE, FALSE, FALSE)]
#> [1] 1 3

The same can be achieved in two steps by defining a vector of logical indices and then using it as an index to our numeric vector one_to_ten:

my_ix_v <- c(TRUE, FALSE, TRUE, FALSE, FALSE, 
             FALSE, FALSE, FALSE, FALSE, FALSE)
one_to_ten[my_ix_v]
#> [1] 1 3

Explicitly defining a vector of logical values quickly becomes impractical, especially for longer vectors. However, the same can be achieved implicitly by using a logical test of the vector v as the logical index values of vector v:

my_ix_v <- (one_to_ten > 5)
one_to_ten[my_ix_v]
#> [1]  6  7  8  9 10

Using a test on the same vector to generate the indices to a vector is a very powerful tool for getting subsets of a vector (which is why indexing is also referred to as subsetting). Essentially, the R expression within the square brackets [] asks a question about a vector and the logical indexing construct returns the elements for which this question is answered in the affirmative (i.e., the indexing vector yields TRUE). Here are some examples:

one_to_ten[one_to_ten < 3 | one_to_ten > 8]
#> [1]  1  2  9 10
one_to_ten[one_to_ten %% 2 == 0]
#> [1]  2  4  6  8 10
one_to_ten[!is.na(one_to_ten)]
#>  [1]  1  2  3  4  5  6  7  8  9 10

ABC[ABC != "Eve"]
#> [1] "Annabelle" "Benjamin"  "Cecilia"   "David"
ABC[nchar(ABC) == 5]
#> [1] "David"
ABC[substr(ABC, 3, 3) == "n"]
#> [1] "Annabelle" "Benjamin"

The which() function provides a bridge from logical to numerical indexing: Given a logical vector (typically the result of a test on a vector v), it returns the numeric indices of those elements for which the test is TRUE:

which(one_to_ten > 8)
#> [1]  9 10
which(nchar(ABC) > 7)
#> [1] 1 2

Thus, the following expressions use both types of indexing to yield identical results:

one_to_ten[which(one_to_ten > 8)]  # numerical indexing
#> [1]  9 10
one_to_ten[one_to_ten > 8]         # logical indexing
#> [1]  9 10

ABC[which(nchar(ABC) > 7)]  # numerical indexing 
#> [1] "Annabelle" "Benjamin"
ABC[nchar(ABC) > 7]         # logical indexing
#> [1] "Annabelle" "Benjamin"

Note that both numerical and logical indexing use square brackets [] directly following the name of the object to be indexed. By contrast, functions always provide their arguments in round parentheses ().
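
The two types of indexing only yield identical results as long as the test returns no missing values. As a hedged sketch (using a hypothetical vector x_na that is not part of the examples above), logical indexing keeps an NA for every NA in the test, whereas which() silently drops these positions:

```r
x_na <- c(5, NA, 9, 10)   # hypothetical vector with a missing value

x_na > 8                  # FALSE NA TRUE TRUE
x_na[x_na > 8]            # NA 9 10 (logical indexing keeps the NA)
x_na[which(x_na > 8)]     # 9 10    (which() returns only positions 3 and 4)
```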

Example

Suppose we know the following facts about five people:

Table 2.3: Some facts about five people (as 3 vectors, each of which has 5 elements).
         p_1    p_2    p_3      p_4     p_5
name     Adam   Ben    Cecily   David   Evelyn
gender   male   male   female   male    misc
age      21     19     20       48      45

How would we encode this information in R?

Note that we know the same three facts about each person and the leftmost column in Table 2.3 specifies this type of information (i.e., a variable). A straightforward way of representing these facts in R would consist in defining a vector for each variable.

name <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age <- c(21, 19, 20, 48, 45)

In this solution, we encode the two vectors name and gender as character data, whereas the vector age encodes numeric data. Note that gender is often encoded as numeric values (e.g., as 0 vs. 1) or as a logical value (e.g., female?: TRUE vs. FALSE), but this creates problems (or rather incomplete accounts) when there are more than two gender values to consider.

Equipped with these three vectors, we can now employ numeric and logical indexing to ask and answer a wide range of questions about these people. Note that the three vectors have the same length (as they describe the same set of people). If we assume that a particular position in a vector always refers to the same person, we can use one of the vectors to index the same or any other vector. This is a very common and immensely powerful idea to select vector elements (or here: properties of people) based on their values on other variables.

As an exercise, try predicting the results of the following expressions and describe what we are asking for in each case in your own words (including the type of indexing). Then evaluate each expression to check your prediction.

name[c(-1)]
name[gender != "male"]
name[age >= 21]

gender[3:5]
gender[nchar(name) > 5]
gender[age > 30]

age[c(1, 3, 5)]
age[(name != "Ben") & (name != "Cecily")]
age[gender == "female"]

Here are the results:

name[c(-1)]              # get names of all non-first people
#> [1] "Ben"    "Cecily" "David"  "Evelyn"
name[gender != "male"]   # get names of non-male people
#> [1] "Cecily" "Evelyn"
name[age >= 21]          # get names of people with an age of 21 or older
#> [1] "Adam"   "David"  "Evelyn"

gender[3:5]              # get 3rd to 5th gender values
#> [1] "female" "male"   "misc"
gender[nchar(name) > 5]  # get gender of people with a name of more than 5 letters
#> [1] "female" "misc"
gender[age > 30]         # get gender of people over 30 
#> [1] "male" "misc"

age[c(1, 3, 5)]          # get age values of certain positions
#> [1] 21 20 45
age[(name != "Ben") & (name != "Cecily")]  # get age of people whose name is not "Ben" and not "Cecily"
#> [1] 21 48 45
age[gender == "female"]  # get age values of all people with "female" gender values
#> [1] 20

The first command in each triple used numerical indexing, whereas the other two commands in each triple used logical indexing.
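
Tests on several vectors can also be combined, and the selected elements can be passed on to other functions. The following sketch repeats the three vectors defined above, so that it can be evaluated on its own:

```r
name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)

name[gender == "male" & age < 30]  # names of males under 30: "Adam" "Ben"
mean(age[gender == "male"])        # mean age of males: 29.33333
name[age == max(age)]              # name of the oldest person: "David"
sum(gender != "male")              # number of non-male people: 2
```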

From this example, it is only a small step to study tables.

2.3.2 Tables

Tables generally store data in a two-dimensional (2D) format (i.e., a grid containing rows and columns). When all rows and all columns have the same length, the resulting structure is rectangular.

As Table 2.2 has already shown, we distinguish between two main types of 2D-data structures in R:

  1. matrices are homogeneous with respect to their data (i.e., contain only a single data type)

  2. tables (called data frames or tibbles in R) are heterogeneous: They can contain different data types in different columns.

R distinguishes between tables that are data frames and tables that are tibbles. But as tibbles are actually another (simpler) type of data frame, we can ignore this distinction here (and will reconsider it when introducing the tibble package in Chapter 4).

Another confusing aspect is that the term “table” is sometimes used as a super-category for any rectangular data structure (i.e., including data frames and matrices, e.g., in the title of this section). In R, the flexibility or vagueness of the term “table” is made possible as R defines no corresponding object type. However, it makes sense to distinguish between matrices and data frames, which is why we will discuss these two types of tables next.

Matrices

When a rectangle of data contains data of the same type in all cells (i.e., all rows and columns), we get a matrix of data. Matrices can be created from vectors (of the same data type) by binding them together. The rbind() function treats each vector as a row; the cbind() function treats each vector as a column:

# Creating 3 vectors: 
x <- 1:3
y <- 4:6
z <- 7:9

# Combining vectors (of the same length): ---- 
m1 <- rbind(x, y, z)  # combine as rows
m1
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9

m2 <- cbind(x, y, z)  # combine as columns
m2
#>      x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9

A more direct way of creating matrices is the matrix() function. It takes arguments for data, for the number of rows nrow, the number of columns ncol, and a logical argument byrow that arranges data in a by-row vs. by-column fashion:

# Putting a vector into a rectangular matrix:
m3 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE)
m3
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20

m4 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = FALSE)
m4
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20

Note that the following commands all create Warning messages, as the number of elements provided does not fit neatly into a matrix (of the required size):

m <- 1:2
n <- 3:5

rbind(m, n)
#>   [,1] [,2] [,3]
#> m    1    2    1
#> n    3    4    5
cbind(m, n)
#>      m n
#> [1,] 1 3
#> [2,] 2 4
#> [3,] 1 5

matrix(data = 1:10, nrow = 3, ncol = 3)
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9

The matrices m1 to m4 all contained numeric data. However, data of type logical or character can also be stored in matrix form:

# A matrix of logical values:
m5 <- matrix(data = 1:18 %% 4 == 0, nrow = 3, ncol = 6, byrow = TRUE)
m5
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
#> [1,] FALSE FALSE FALSE  TRUE FALSE FALSE
#> [2,] FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE FALSE FALSE

# A matrix of character values:
m6 <- matrix(sample(letters, size = 16), nrow = 4, ncol = 4, byrow = FALSE)
m6
#>      [,1] [,2] [,3] [,4]
#> [1,] "u"  "d"  "s"  "a" 
#> [2,] "j"  "c"  "z"  "f" 
#> [3,] "m"  "b"  "e"  "w" 
#> [4,] "n"  "t"  "h"  "y"

Indexing matrices

Retrieving values from a matrix m works similarly to indexing vectors. First, we will consider numeric indexing. Due to the two-dimensional nature of a matrix, we now need to specify two indices in square brackets: the number of the desired row, and the number of the desired column, separated by a comma. Thus, to get or change the value of row r and column c of a matrix m we need to evaluate m[r, c]. Just as with vectors, providing multiple numeric indices selects the corresponding rows or columns. When the value of r or c is left unspecified, all rows or columns are selected.

# Selecting cells, rows, or columns of matrices: ---- 
m1[2, 3]  # in m1: select row 2, column 3
#> y 
#> 6
m2[3, 1]  # in m2: select row 3, column 1
#> x 
#> 3

m1[2,  ]  # in m1: select row 2, all columns
#> [1] 4 5 6
m2[ , 1]  # in m2: select column 1, all rows
#> [1] 1 2 3

m3
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20
m3[2, 3:4] # in m3: select row 2, columns 3 to 4
#> [1] 7 8
m3[3:5, 2] # in m3: select rows 3 to 5, column 2
#> [1] 10 14 18

m4[]  # in m4: select all rows and all columns (i.e., all of m4)
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20

Similarly, we can extend the notion of logical indexing to matrices:

m4 > 10  # returns a matrix of logical values
#>       [,1]  [,2] [,3] [,4]
#> [1,] FALSE FALSE TRUE TRUE
#> [2,] FALSE FALSE TRUE TRUE
#> [3,] FALSE FALSE TRUE TRUE
#> [4,] FALSE FALSE TRUE TRUE
#> [5,] FALSE FALSE TRUE TRUE
typeof(m4 > 10)
#> [1] "logical"
m4[m4 > 10]  # indexing of matrices
#>  [1] 11 12 13 14 15 16 17 18 19 20
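
Logical indexing of matrices can also be used to locate or replace values. The following sketch works on a copy named m_demo (a hypothetical name, with the same content as m4) and uses the arr.ind argument of which() to obtain row and column positions:

```r
m_demo <- matrix(data = 1:20, nrow = 5, ncol = 4)  # same content as m4

# Row and column positions of all values above 18:
pos <- which(m_demo > 18, arr.ind = TRUE)
pos                      # rows 4 and 5 of column 4

# Replace all values above 10 by 0:
m_demo[m_demo > 10] <- 0
sum(m_demo)              # 55 (only the values 1 to 10 remain)
```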

Just as with vectors, we can apply functions to matrices. Typical examples include:

# Applying functions to matrices: ---- 
is.matrix(m1)
#> [1] TRUE
typeof(m2)
#> [1] "integer"

# Note the difference between: 
is.numeric(m3) # type of m3? (1 value)
#> [1] TRUE
is.na(m3)      # NA values in m3? (many values)
#>       [,1]  [,2]  [,3]  [,4]
#> [1,] FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE
#> [5,] FALSE FALSE FALSE FALSE

# Computations with matrices: 
sum(m1)
#> [1] 45
max(m2)
#> [1] 9
mean(m3)
#> [1] 10.5
colSums(m3)  # column sums of m3
#> [1] 45 50 55 60
rowSums(m4)  # row sums of m4
#> [1] 34 38 42 46 50
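
For functions without a dedicated row or column variant, the base R function apply() offers a more general approach that is not covered in the text above. As a sketch (on a hypothetical copy m_rows of m3), its MARGIN argument selects rows (1) or columns (2):

```r
m_rows <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE)  # same content as m3

apply(m_rows, MARGIN = 1, FUN = sum)  # row sums: 10 26 42 58 74
apply(m_rows, MARGIN = 2, FUN = max)  # column maxima: 17 18 19 20

# The row sums agree with the specialized function:
all.equal(apply(m_rows, MARGIN = 1, FUN = sum), rowSums(m_rows))  # TRUE
```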

Just as length() provides crucial information about a vector, some functions are specifically designed to provide the dimensions of rectangular data structures:

ncol(m4)  # number of columns 
#> [1] 4
nrow(m4)  # number of rows
#> [1] 5
dim(m4)   # dimensions as vector c(rows, columns)
#> [1] 5 4

A typical function in the context of matrices is t() for transposing (i.e., swapping the rows and columns of) a matrix:

t(m4)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    3    4    5
#> [2,]    6    7    8    9   10
#> [3,]   11   12   13   14   15
#> [4,]   16   17   18   19   20
t(m5)
#>       [,1]  [,2]  [,3]
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE  TRUE FALSE
#> [3,] FALSE FALSE FALSE
#> [4,]  TRUE FALSE  TRUE
#> [5,] FALSE FALSE FALSE
#> [6,] FALSE  TRUE FALSE

Data frames

Table 2.3 was rectangular in containing three rows (values for the variables name, gender, and age) and five columns (one for each person, plus an initial column indicating the variable name in each row). This is a perfectly valid table, but not the type of table typically used in R.

Typical tables of data in R also combine several vectors into a larger data structure, but use the individual vectors as columns, rather than rows. Such a combination of several vectors (as columns) is shown in Table 2.4:

Table 2.4: The same facts about five people (as a table with 5 rows and 3 columns).
name     gender   age
Adam     male     21
Ben      male     19
Cecily   female   20
David    male     48
Evelyn   misc     45

Importantly, Table 2.4 provides exactly the same information as Table 2.3 and as the three individual vectors (name, gender, and age) above, but in the shape of a table that uses our previous vectors as its columns, rather than as its rows.
As all elements of an (atomic) vector in R need to have the same data type (e.g., name contains character data, whereas age contains numeric data), the information on each person cannot be stored as an atomic vector, as it comprises multiple data types. Instead, we represent each person as a row (aka. an observation) of the table.

Creating a data frame from vectors works by using the data.frame() function. The following assigns the resulting data frame to a dummy object df, so that we can poke and probe it later:

df <- data.frame(name, gender, age,
                 stringsAsFactors = TRUE)  # strings become factors (the default before R 4.0.0)
df
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
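
Before working with a data frame, it helps to inspect its dimensions and structure. The following sketch builds a separate data frame df_demo (a hypothetical name) from the same values and applies some functions that were introduced for vectors and matrices above:

```r
df_demo <- data.frame(name   = c("Adam", "Ben", "Cecily", "David", "Evelyn"),
                      gender = c("male", "male", "female", "male", "misc"),
                      age    = c(21, 19, 20, 48, 45))

dim(df_demo)     # 5 3 (rows, columns)
nrow(df_demo)    # 5 observations (people)
ncol(df_demo)    # 3 variables
names(df_demo)   # "name" "gender" "age"
str(df_demo)     # compact overview of all variables and their types
```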

As data frames are the most common way of storing data in R, there is a special form of indexing that allows retrieving the variables of a data frame (i.e., the columns of a data frame) as vectors.

Name-based indexing of data frames

When a table tb has column names (e.g., a column called nm), we can retrieve the corresponding vector by name-based indexing (aka. name indexing). This is the most convenient and most frequent way of accessing variables (i.e., columns) of tables (e.g., data frames). To use this form of indexing, we use a special dollar sign notation: Adding $ and the name of the desired variable nm to the table’s object name tb yields its column nm as a vector. This sounds complicated, but is actually very easy:

tb$nm

In case of our data frame df, we can access its 1st and 2nd columns by their respective names:

names(df)  # prints the (column) names
#> [1] "name"   "gender" "age"

df$name
#> [1] Adam   Ben    Cecily David  Evelyn
#> Levels: Adam Ben Cecily David Evelyn
df$gender
#> [1] male   male   female male   misc  
#> Levels: female male misc
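
The dollar sign notation is not the only way of retrieving a column by its name. As a sketch (on a small hypothetical data frame df_cols), double square brackets or a column name in the second index position return the same vector, whereas single brackets without a comma return a data frame:

```r
df_cols <- data.frame(name = c("Adam", "Ben"),
                      age  = c(21, 19))

identical(df_cols$age, df_cols[["age"]])  # TRUE: same vector
identical(df_cols$age, df_cols[ , "age"]) # TRUE: same vector

df_cols["age"]                  # a data frame with 1 column
is.data.frame(df_cols["age"])   # TRUE
```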

Indexing data frames

Note that everything we have learned about numeric and logical indexing of vectors and matrices (above) also applies to data frames. Thus, we can also use numerical indexing on a data frame, just as with matrices (above). For instance, to get all rows of the first column, we can specify the data frame’s name, followed by [ , 1]:

df[ , 1]  # get (all rows and) the 1st column of df
#> [1] Adam   Ben    Cecily David  Evelyn
#> Levels: Adam Ben Cecily David Evelyn
df[ , 2]  # get (all rows and) the 2nd column of df 
#> [1] male   male   female male   misc  
#> Levels: female male misc

Thus, these two expressions retrieve the 1st and 2nd column of the data frame df (as vectors), respectively. As retrieving columns is a very common task in R, the name-based indexing introduced above (e.g., df$name) is usually the more convenient way of accessing the variables (columns) of a data frame.

Logical indexing on data frames is particularly powerful in allowing us to select particular rows (based on conditions specified on columns of the same data frame):

df[df$gender == "male", ]
#>    name gender age
#> 1  Adam   male  21
#> 2   Ben   male  19
#> 4 David   male  48
df[df$age < 21, ]
#>     name gender age
#> 2    Ben   male  19
#> 3 Cecily female  20
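
Conditions on several columns can be combined, and the selected rows can be restricted to particular columns. The following sketch uses a separate data frame df_sel (a hypothetical name) and also shows the base R function subset(), which is not discussed in the text above, as an alternative notation:

```r
df_sel <- data.frame(name   = c("Adam", "Ben", "Cecily", "David", "Evelyn"),
                     gender = c("male", "male", "female", "male", "misc"),
                     age    = c(21, 19, 20, 48, 45))

# Rows of males under 30, columns name and age:
r_1 <- df_sel[df_sel$gender == "male" & df_sel$age < 30, c("name", "age")]
r_1   # rows 1 and 2 (Adam and Ben)

# The same selection with subset():
r_2 <- subset(df_sel, gender == "male" & age < 30, select = c(name, age))
all.equal(r_1, r_2)  # TRUE
```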

Note that the different types of indexing can be flexibly combined. For instance, the following command uses

  • logical indexing (to select rows of df with an age value below 30)
  • numerical indexing (to select only columns 1 and 2)
  • name indexing (to get the variable name, as a vector), and
  • numerical indexing (to select the 3rd element of this vector):
df[df$age < 30, c(1, 2)]$name[3]
#> [1] Cecily
#> Levels: Adam Ben Cecily David Evelyn

In practice, such complex combinations are rarely necessary or useful. For instance, the following expressions retrieve the exact same result as the complex one, but are semantically very different:

df[3, 1]
#> [1] Cecily
#> Levels: Adam Ben Cecily David Evelyn
df$name[3]
#> [1] Cecily
#> Levels: Adam Ben Cecily David Evelyn

Strings as factors

Note that the data.frame() function has an argument stringsAsFactors. This argument determines whether so-called string variables (i.e., of data type “character”) are converted into factors (i.e., categorical variables, which are internally represented as integer values with text labels) when generating a data frame. To the chagrin of generations of R users, the default of this argument used to be TRUE for several decades — which essentially meant that any character variable in a data frame was converted into a factor unless the user had specified stringsAsFactors = FALSE. As this caused much confusion, the default has been changed with the release of R version 4.0.0 (on 2020-04-24) to stringsAsFactors = FALSE. This shows that the R gods at https://cran.r-project.org/ are responding to user feedback. However, as any such changes are unlikely to happen quickly, it is safer to explicitly set the arguments of a function. To see the difference between both settings, consider the following example:

df_1 <- data.frame(name, gender, age, 
                   stringsAsFactors = FALSE)  # new default (since R 4.0.0)
df_2 <- data.frame(name, gender, age, 
                   stringsAsFactors = TRUE)   # old default (up to R 3.6.3)

# Both data frames look identical:
df_1  
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
df_2
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45

Printing the two data frames df_1 and df_2 shows us no difference between them. However, as the first two variables (i.e., name and gender) were string variables (i.e., of type “character”), they remained character variables in df_1, but were converted into factors in df_2.

Let’s retrieve the first column of each data frame (as a vector). Using name-based indexing, we can easily retrieve and print the first column (i.e., with a name of name) of either data frame:

df_1$name
#> [1] "Adam"   "Ben"    "Cecily" "David"  "Evelyn"
df_2$name
#> [1] Adam   Ben    Cecily David  Evelyn
#> Levels: Adam Ben Cecily David Evelyn

Note the differences in the printed outputs. The output of df_1$name looks just like any other character vector (with five elements, each consisting of a name). By contrast, the output of df_2$name prints the same names, but without the characteristic double quotes around each name, and with a second line that starts with “Levels:” and seems to repeat the names of the first line. Before clarifying what this means, check the other variable in both df_1 and df_2 that used to be a character vector (i.e., gender):

df_1$gender
#> [1] "male"   "male"   "female" "male"   "misc"
df_2$gender
#> [1] male   male   female male   misc  
#> Levels: female male misc

Again, df_1$gender appears to be a character vector, but df_2$gender has been converted into something else. This time, the line beginning with “Levels:” only contains each of the gender labels once, and in alphabetical order.

In case you’re not confused yet, compare the outputs of the following commands:

typeof(df_1$name)
#> [1] "character"
typeof(df_2$name)
#> [1] "integer"

Whereas we would expect df_1$name to be of type character, it may come as a surprise that df_2$name is of type integer. Given that df_2$name contains integers, we might be tempted to try out arithmetic functions like:

max(df_2$name)
sum(df_2$name)
mean(df_2$name)

If we try to evaluate these expressions, max() and sum() yield Error messages, and mean() returns NA with a Warning. How can we make sense of all this?

The magic word here is factor. As the argument stringsAsFactors = TRUE suggests, the character strings of the name and gender vectors have been converted into factors when defining df_2. Factors are categorical variables that only care about whether two values belong to the same or to different groups. Internally, R encodes each factor level as a numeric value (an integer). But as we never want to calculate with these numeric values (as they have no meaning beyond being either the same or different), each value is also assigned a label, which is shown when printing the values of a factor.
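To make the relation between the integer codes and the labels of a factor visible, the following minimal sketch converts a character vector into a factor and inspects both parts. The vector gender is re-created here for illustration (its values mirror the gender column shown above), and the name gender_f is a hypothetical choice:

```r
# Illustrative vector (values mirror the gender column printed above):
gender <- c("male", "male", "female", "male", "misc")
gender_f <- factor(gender)  # convert the character vector into a factor

levels(gender_f)        # the distinct labels (sorted alphabetically)
#> [1] "female" "male"   "misc"
as.integer(gender_f)    # the integer code stored for each element
#> [1] 2 2 1 2 3
typeof(gender_f)        # how R stores the values internally
#> [1] "integer"
class(gender_f)         # how R treats the object
#> [1] "factor"
as.character(gender_f)  # recover the labels as a character vector
#> [1] "male"   "male"   "female" "male"   "misc"
```

Each integer code is simply the position of the corresponding label in the vector of levels, which explains why typeof() reports "integer" even though the printed values look like text.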

A quick way of checking that we’re dealing with a factor is the is.factor() function:

is.factor(df_1$name)
#> [1] FALSE
is.factor(df_2$name)
#> [1] TRUE

is.factor(df_1$gender)
#> [1] FALSE
is.factor(df_2$gender)
#> [1] TRUE

Factor variables are often useful (e.g., for distinguishing between groups in statistical designs). But it is premature to assume that any character variable should be a factor when including the variable in a data frame. Thus, it is a good thing that the default argument of the data.frame() function has been changed to stringsAsFactors = FALSE in R version 4.0.0.
Whoever wants factors can still get and use them — but novice users no longer need to deal with them all the time.
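One way of still getting a factor when we want one is to convert an individual column after creating the data frame. The following sketch assumes that the vectors name, gender, and age contain the values printed in the data frames above (they are reconstructed here from that output), and uses the hypothetical name df_3 for the new data frame:

```r
# Assumed vectors, reconstructed from the printed data frames above:
name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)

# Keep all strings as characters when creating the data frame:
df_3 <- data.frame(name, gender, age, stringsAsFactors = FALSE)

# Convert only the variable that actually denotes groups:
df_3$gender <- factor(df_3$gender)

is.factor(df_3$name)    # names remain character strings
#> [1] FALSE
is.factor(df_3$gender)  # gender is now a factor
#> [1] TRUE

# A factor can then be used for grouping, e.g., the mean age per gender:
tapply(df_3$age, df_3$gender, mean)
```

This keeps the decision explicit: only variables that are meant to be categorical become factors.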

  • data frames: data.frame() vs. as.data.frame()
  • tibbles: as_tibble() (of tibble package) converts a data frame into a tibble
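As a rough illustration of the first distinction, data.frame() typically builds a new data frame from vectors, whereas as.data.frame() converts an existing object (e.g., a matrix) into a data frame. The object names m, df_m, and df_v below are hypothetical. The tibble variant is not shown, as it requires the tibble package to be installed:

```r
# Convert an existing object (here: a matrix) into a data frame:
m <- matrix(1:6, nrow = 2)  # illustrative 2 x 3 matrix
df_m <- as.data.frame(m)

is.data.frame(df_m)
#> [1] TRUE
dim(df_m)
#> [1] 2 3
names(df_m)  # default column names for an unnamed matrix
#> [1] "V1" "V2" "V3"

# Build a new data frame from individual vectors:
df_v <- data.frame(id = 1:2, label = c("a", "b"))
names(df_v)
#> [1] "id"    "label"
```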

Ways of accessing and manipulating tables:

Applying functions to tables:

  • Checking for NA values (in vectors or tables) by using is.na() function.
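A minimal sketch of how is.na() can be used on vectors and tables is shown below. The objects v and df_na are hypothetical examples that are not defined elsewhere in this section:

```r
# Missing values in a vector:
v <- c(1, NA, 3)  # illustrative vector with one missing value
is.na(v)          # which elements are missing?
#> [1] FALSE  TRUE FALSE
sum(is.na(v))     # how many elements are missing?
#> [1] 1
mean(v)           # most functions return NA if any value is missing
#> [1] NA
mean(v, na.rm = TRUE)  # remove NA values before computing
#> [1] 2

# Missing values in a table:
df_na <- data.frame(x = c(1, NA, 3),
                    y = c("a", "b", NA))
is.na(df_na)           # a logical matrix with the same shape as df_na
colSums(is.na(df_na))  # number of missing values per column
any(is.na(df_na))      # is there any missing value at all?
#> [1] TRUE
```

Applied to a data frame, is.na() returns a logical matrix, so functions like sum() or colSums() can count the missing values.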

2.3.3 Practice

The following practice exercises allow you to check your understanding of this section.

Accessing and evaluating matrices

Assuming the definitions of the matrices m5 and m6 from above, i.e.,

m5
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
#> [1,] FALSE FALSE FALSE  TRUE FALSE FALSE
#> [2,] FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE FALSE FALSE
m6
#>      [,1] [,2] [,3] [,4]
#> [1,] "u"  "d"  "s"  "a" 
#> [2,] "j"  "c"  "z"  "f" 
#> [3,] "m"  "b"  "e"  "w" 
#> [4,] "n"  "t"  "h"  "y"
  • predict, evaluate, and explain the result of the following R expressions:
m5[2, 6]
m5[2, ]
m5 == FALSE
sum(m5)
t(t(m5))

m6[2, 3]
m6[ , 4]
m6[nrow(m6), (ncol(m6) - 1)]
m6 == "e"
toupper(m6[4, ])

Numeric indexing of data frames

Assuming the data frame df_2 (from above),

df_2
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
  • predict, evaluate and explain what happens in the following commands (in terms of numeric indexing):
df_2[]
df_2[ , 1]
df_2[1:nrow(df_2), c(1)]
df_2[nrow(df_2):1, c(1)]
df_2[rep(1, 3), c(1, 2)]
df_2$name[3]
# compare: 
df_2[1:nrow(df_2), 1:ncol(df_2)]
df_2[1:nrow(df_2), ncol(df_2):1]
df_2[nrow(df_2):1, ncol(df_2):1]

Logical indexing of data frames

Assuming the data frame df_1 (from above),

df_1
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
  • predict, evaluate and explain what happens in the following commands (in terms of logical indexing):
df_1[ , 3] > 30
df_1[df_1$age > 30, ]
df_1[df_1$gender != "male", c(1, 3, 2)]
df_1$name[df_1$gender == "male"]
sum(df_1$age[df_1$gender == "male"])

Data frames with factors

  • Given that our definition of df_2 used stringsAsFactors = TRUE (see above), predict, evaluate and explain what happens in the following commands:
nchar(as.character(df_2$name[3]))
as.numeric(df_2$name[3]) + 1
mean(as.numeric(df_2$name))
  • Why would the following commands (which are simpler variants of the last three expressions) yield errors or warnings?
nchar(df_2$name[3])
df_2$name[3] + 1
mean(df_2$name)
  • What would happen if the same commands were used on df_1 (from above)?
nchar(df_1$name[3])
df_1$name[3] + 1
mean(df_1$name)

  1. Fun fact: The English writer Evelyn Waugh (1903–1966) and his wife Evelyn Gardner (1903–1994) had the same given name.↩︎