1.5 Rectangular data structures

As we have seen (in the previous Section 1.4), vectors are linear (or one-dimensional) sequences: They have a length, but no width — or only a trivial width of 1. By combining several vectors, we get a rectangular data structure — usually a matrix or a rectangular table — that arranges data points in two dimensions: horizontal rows and vertical columns. In a rectangular data structure, all rows and all columns are of equal length.

The most common rectangular data structures in R are matrices (of type matrix) and rectangular tables (of type data.frame or tibble). As all three concepts denote rectangular data structures, the differences between them concern details of their contents and their implementation. We mostly distinguish between two sub-types: Matrices (Section 1.5.1) and rectangular tables (Section 1.5.2).

1.5.1 Matrices

When a rectangle of data contains data of the same type in all cells (i.e., all rows and columns), we call this a matrix of data. Despite its rectangular shape, a matrix is not a rectangular table (which consists of multiple columns). Instead, an R matrix is a vector (i.e., a linear data structure that stores only a single mode/type of data) with an additional attribute that describes its shape: Rather than just extending in one dimension (i.e., the vector’s length), the matrix distributes data elements in two dimensions (with the dim attribute denoting its number of rows and columns).
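
To make the idea of "a vector with a dim attribute" concrete, here is a minimal sketch. The object names v_flat and m_auto are hypothetical and only serve this illustration. The code adds a dim attribute to a plain vector and compares the result to a matrix created by matrix():

```r
# A plain integer vector (hypothetical example object):
v_flat <- 1:20
is.matrix(v_flat)
#> [1] FALSE

# Adding a dim attribute (5 rows, 4 columns) turns the vector into a matrix:
dim(v_flat) <- c(5, 4)
is.matrix(v_flat)
#> [1] TRUE
attr(v_flat, "dim")
#> [1] 5 4

# The same cells, as arranged by matrix():
m_auto <- matrix(data = 1:20, nrow = 5)
all(v_flat == m_auto)
#> [1] TRUE
```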

Creating matrices

Matrices can easily be created from a vector by the matrix() function, which accepts some data (e.g., a numeric vector 1:20) and one or more additional arguments that specify the shape and arrangement of the desired matrix:

# Reshaping a vector into a matrix: 
(m0 <- matrix(data = 1:20, nrow = 5))
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20

When only nrow is specified, the number of columns is derived from the length of data (here: 20 / 5 = 4 columns). Alternatively, we can set the number of columns explicitly and let R derive the number of rows:

(m1 <- matrix(data = 1:20, ncol = 4))
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20

Similarly, the elements of data are arranged in a by-column fashion by default. If we want to change this default behavior, we can set the argument byrow = TRUE:

(m2 <- matrix(data = 1:20, nrow = 5, byrow = TRUE))
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20
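
The relation between the two filling orders can be checked with a short sketch (the names by_row and by_col are hypothetical). It assumes the t() function for transposing a matrix, which is introduced further below: Filling a matrix by row gives the same arrangement as filling the transposed shape by column and then swapping rows and columns.

```r
# Filling 1:20 by row into 5 rows (and 4 columns):
by_row <- matrix(data = 1:20, nrow = 5, byrow = TRUE)

# Filling 1:20 by column (the default) into 4 rows (and 5 columns):
by_col <- matrix(data = 1:20, nrow = 4)

# Transposing by_col yields the same shape and the same cells as by_row:
dim(t(by_col))
#> [1] 5 4
all(by_row == t(by_col))
#> [1] TRUE
```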

Alternatively, a matrix can be created by combining multiple vectors (provided that they have the same data type) by binding them together. Based on the desired arrangement of vectors, we have two binding functions: The cbind() function treats each vector as a column; the rbind() function treats each vector as a row:

# Creating 3 vectors: 
x <- 1:3
y <- 4:6
z <- 7:9

# Combining vectors (of the same length): ---- 
(m3 <- cbind(x, y, z))  # combine as columns
#>      x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
(m4 <- rbind(x, y, z))  # combine as rows
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9

The resulting matrices m3 and m4 differ not only in the arrangement of their elements (by-column vs. by-row), but also in their names: Combining vectors as columns uses the vector names as column names, whereas combining vectors as rows uses the vector names as row names.
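
Names can also be added to an existing matrix. The following sketch uses the base R functions rownames() and colnames(); the object m_named and its labels are hypothetical. It also shows that names can be used for selecting cells (indexing is covered in more detail below):

```r
# A matrix without names (hypothetical example object):
m_named <- matrix(data = 1:6, nrow = 2)

# Assign row and column names:
rownames(m_named) <- c("r1", "r2")
colnames(m_named) <- c("a", "b", "c")
m_named
#>    a b c
#> r1 1 3 5
#> r2 2 4 6

# Names can then be used to select a cell:
m_named["r2", "b"]
#> [1] 4
```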

In summary, the matrix() function turns existing data into a matrix, with arguments for the number of rows nrow, the number of columns ncol, and a logical argument byrow that arranges data in a by-row vs. by-column fashion. By contrast, the cbind() and rbind() functions combine multiple vectors into a matrix.

Note that reshaping data into a rectangle is subject to multiple constraints. For instance, what happens when we specify arguments for both nrow and ncol, but the product of nrow * ncol does not match the number of elements in data? And what happens when the vectors that we want to bind into a matrix are not of the same length or type? We can simply find this out by trying.

# Data (as vectors):
v <- 1:10
m <- 1:3
n <- 4:5
o <- letters[1:3]

# Matrices: 
matrix(data = v, nrow = 3, ncol = 3)
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9
matrix(data = v, nrow = 4, ncol = 4)
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    5    9    3
#> [2,]    2    6   10    4
#> [3,]    3    7    1    5
#> [4,]    4    8    2    6

cbind(m, n)
#>      m n
#> [1,] 1 4
#> [2,] 2 5
#> [3,] 3 4
rbind(m, n)
#>   [,1] [,2] [,3]
#> m    1    2    3
#> n    4    5    4

cbind(m, o)
#>      m   o  
#> [1,] "1" "a"
#> [2,] "2" "b"
#> [3,] "3" "c"
rbind(m, o)
#>   [,1] [,2] [,3]
#> m "1"  "2"  "3" 
#> o "a"  "b"  "c"

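Note that four of these six commands issued Warning messages, as the number of elements did not fit neatly into a matrix (of the specified size): When there are too few elements, R recycles them to fill the remaining cells; when matrix() receives too many elements, the surplus elements are dropped. By contrast, cbind(m, o) and rbind(m, o) issued no warning, but silently converted the numbers of m into character strings. Thus, R still evaluated each expression and created a matrix in each case, but the resulting matrices may not always have been the ones we expected to obtain.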
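
The way in which matrix() handles a mismatch can be made explicit. The following sketch (with hypothetical object names) suppresses the warnings and compares the results to matrices whose data were recycled or truncated by hand, using the base R function rep_len():

```r
# Hypothetical example data (10 elements):
v_short <- 1:10

# Too few elements: data are recycled to fill all 16 cells
# (the warning is suppressed here):
m_rec <- suppressWarnings(matrix(data = v_short, nrow = 4, ncol = 4))
all(m_rec == matrix(data = rep_len(v_short, 16), nrow = 4))
#> [1] TRUE

# Too many elements: only the first 9 elements are used:
m_cut <- suppressWarnings(matrix(data = v_short, nrow = 3, ncol = 3))
all(m_cut == matrix(data = v_short[1:9], nrow = 3))
#> [1] TRUE
```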

The matrices m0 to m4 all contained numeric data. However, data of type “logical” or “character” can also be stored in matrix form:

# A matrix of logical values:
(m5 <- matrix(data = 1:18 %% 4 == 0, nrow = 3, ncol = 6, byrow = TRUE))
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
#> [1,] FALSE FALSE FALSE  TRUE FALSE FALSE
#> [2,] FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE FALSE FALSE

# A matrix of character values:
(m6 <- matrix(sample(letters, size = 16), nrow = 4, ncol = 4, byrow = FALSE))
#>      [,1] [,2] [,3] [,4]
#> [1,] "u"  "d"  "s"  "a" 
#> [2,] "j"  "c"  "z"  "f" 
#> [3,] "m"  "b"  "e"  "w" 
#> [4,] "n"  "t"  "h"  "y"

Thus, as long as the matrix() function receives data of only a single data type and is well-specified (in the sense that the number of elements in data and those specified by nrow or ncol do not contradict each other), the resulting matrices are straightforward. And if there are conflicts, R still tries to create some sensible data structure, but it takes some expertise to explain and predict what happens in these cases.
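
If we prefer an error over a silently recycled or truncated matrix, we can check the shape before creating the matrix. The following is only a sketch: strict_matrix() is a hypothetical helper function (not part of base R) that stops whenever the number of elements in data does not match nrow * ncol:

```r
# Hypothetical helper (not part of base R): only create a matrix
# if the data fill the requested shape exactly.
strict_matrix <- function(data, nrow, ncol, byrow = FALSE) {
  if (length(data) != nrow * ncol) {
    stop("length(data) = ", length(data),
         " does not match nrow * ncol = ", nrow * ncol)
  }
  matrix(data = data, nrow = nrow, ncol = ncol, byrow = byrow)
}

dim(strict_matrix(1:20, nrow = 5, ncol = 4))
#> [1] 5 4

# A mismatch now yields an error (caught here to show its message):
tryCatch(strict_matrix(1:10, nrow = 4, ncol = 4),
         error = function(e) conditionMessage(e))
#> [1] "length(data) = 10 does not match nrow * ncol = 16"
```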

The shape and type of matrices

As R matrices are implemented as re-shaped variants of atomic vectors, we can check their properties by the same generic functions. For instance:

mode(m0)
#> [1] "numeric"
typeof(m0)
#> [1] "integer"
length(m0)
#> [1] 20

However, note that adding the 2-dimensional dim attribute (with elements for the number of rows and columns of a matrix) allows for more precise characterizations of matrices that would not work (or return different results) for vectors:

is.vector(m0)
#> [1] FALSE
is.matrix(m0)
#> [1] TRUE
dim(m0)
#> [1] 5 4

Thus, the shape of a matrix is better described by dim(x) than by length(x).

Practice

  1. What would the following generic functions return for a vector v0?
v0 <- 1:20
is.vector(v0)
is.matrix(v0)
dim(v0)
  2. If x is a matrix, what is the result of dim(x)[1] * dim(x)[2]? Predict and check this for an example.
(x <- matrix(data = 1:20, nrow = 5))
dim(x)[1] * dim(x)[2]
length(x)

Indexing matrices

Retrieving values from a matrix m works similarly to indexing vectors. First, we will consider numeric indexing. Due to the two-dimensional nature of a matrix, we now need to specify two indices in square brackets: the number of the desired row, and the number of the desired column, separated by a comma. Thus, to get or change the value of row r and column c of a matrix m we need to evaluate m[r, c]. Just as with vectors, providing multiple numeric indices selects the corresponding rows or columns. When the value of r or c is left unspecified, all rows or columns are selected.

# Selecting cells, rows, or columns of matrices: ---- 
m1[2, 3]  # in m1: select row 2, column 3
#> [1] 12
m1[2,  ]  # in m1: select row 2, all columns
#> [1]  2  7 12 17

m2[3, 1]  # in m2: select row 3, column 1
#> [1] 9
m2[ , 1]  # in m2: select column 1, all rows
#> [1]  1  5  9 13 17

m3[2, 2:3]  # in m3: select row 2, columns 2 to 3
#> y z 
#> 5 8
m3[1:3, 2]  # in m3: select rows 1 to 3, column 2
#> [1] 4 5 6

m4[2, ]   # in m4: select row 2
#> [1] 4 5 6
m4[ , 2]  # in m4: select col 2
#> x y z 
#> 2 5 8
m4[]      # in m4: select all rows and all columns (i.e., all of m4)
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9
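
The outputs above show that selecting a single row or column returns a plain vector, rather than a matrix. The following sketch (using a hypothetical matrix m_idx) illustrates this and shows how the drop argument of the [ operator preserves the matrix shape:

```r
m_idx <- matrix(data = 1:20, nrow = 5)  # hypothetical example matrix

# Selecting a single row returns a plain vector (dimensions are dropped):
dim(m_idx[2, ])
#> NULL

# With drop = FALSE, the result remains a matrix (here: 1 row, 4 columns):
dim(m_idx[2, , drop = FALSE])
#> [1] 1 4

# Negative indices exclude rows or columns (here: all rows except row 1):
dim(m_idx[-1, ])
#> [1] 4 4
```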

Similarly, we can extend the notion of logical indexing to matrices:

m4 > 5  # returns a matrix of logical values
#>    [,1]  [,2]  [,3]
#> x FALSE FALSE FALSE
#> y FALSE FALSE  TRUE
#> z  TRUE  TRUE  TRUE
typeof(m4 > 5)
#> [1] "logical"
m4[m4 > 5]  # indexing of matrices
#> [1] 7 8 6 9
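
Logical indexing can also be used to locate or change values. The following sketch works on a copy (the hypothetical object m_log, with the same values as m4), so that the matrices defined above remain unchanged. It relies on the base R function which() with its arr.ind argument:

```r
m_log <- rbind(x = 1:3, y = 4:6, z = 7:9)  # same values as m4 above

# Positions (row and column numbers) of all values greater than 5:
pos <- which(m_log > 5, arr.ind = TRUE)
nrow(pos)  # number of matching cells
#> [1] 4

# Replace all values greater than 5 by 0:
m_log[m_log > 5] <- 0
m_log
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    0
#> z    0    0    0
```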

Just as with vectors, we can apply generic functions to matrices, provided that the function is applicable to the type of data stored in the matrix. Typical examples include:

# Applying functions to matrices: ---- 
is.matrix(m1)
#> [1] TRUE
typeof(m2)
#> [1] "integer"

# Note the differences between: 
is.numeric(m3)  # type of m3? (1 value)
#> [1] TRUE
is.na(m3)       # NA values in m3? (many values)
#>          x     y     z
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE

# Computations with matrices: 
sum(m1)
#> [1] 210
max(m2)
#> [1] 20
mean(m3)
#> [1] 5
colSums(m3)  # column sums of m3
#>  x  y  z 
#>  6 15 24
rowSums(m4)  # row sums of m4
#>  x  y  z 
#>  6 15 24

Note that some of these functions required the matrices m1 to m4 to be of specific data types and would not have worked for matrices of another type.
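
The functions colSums() and rowSums() are special cases of a more general pattern: applying some function to every row or every column of a matrix. A minimal sketch of this pattern uses the base R function apply() on a hypothetical matrix m_app (with the same values as m3):

```r
m_app <- cbind(x = 1:3, y = 4:6, z = 7:9)  # same values as m3 above

# MARGIN = 1 applies FUN to each row, MARGIN = 2 to each column:
apply(m_app, MARGIN = 2, FUN = sum)  # same values as colSums(m_app)
#>  x  y  z 
#>  6 15 24
apply(m_app, MARGIN = 1, FUN = max)  # row maxima
#> [1] 7 8 9
```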

Regarding the shape of matrices, we saw above that dim(x) provides more specific information about the shape of a matrix x than length(x). As the result of dim(x) is a vector (with elements specifying the number of rows and columns of x), there exist specializations that directly yield the number of rows or columns:

length(m1)  # length (of vector)
#> [1] 20
dim(m1)     # dimensions as vector c(rows, columns)
#> [1] 5 4

nrow(m1)    # number of rows
#> [1] 5
ncol(m1)    # number of columns 
#> [1] 4

Another common function in the context of matrices is t() for transposing (i.e., swapping the rows and columns of) a matrix:

t(m4)
#>      x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
t(m5)
#>       [,1]  [,2]  [,3]
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE  TRUE FALSE
#> [3,] FALSE FALSE FALSE
#> [4,]  TRUE FALSE  TRUE
#> [5,] FALSE FALSE FALSE
#> [6,] FALSE  TRUE FALSE

Practice

Assuming the definitions of m5 and m6 from above, predict, evaluate, and explain the result of the following expressions:

m5[2, 6]
m5[2, ]
m5 == FALSE
sum(m5)
t(t(m5))

m6[2, 3]
m6[ , 4]
m6[nrow(m6), (ncol(m6) - 1)]
m6 == "e"
toupper(m6[4, ])

Arrays and tables

Table 1.1 (in Section 1.2.1) mentioned arrays and tables as data structures for storing homogeneous data with more than two dimensions in R. Objects of type array are generalizations of vectors and matrices that allow for multiple dimensions (i.e., more complex shapes). By contrast, R objects of type table store the frequency counts of factor combinations (aka a “contingency table”) and thus are a special kind of array.

As arrays and tables are mostly used for storing multi-dimensional data (with 3 or more dimensions), we do not cover them here. However, much of what we know about vectors and matrices can be generalized to arrays and tables.
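
Although arrays and tables are not covered in detail here, a brief sketch may illustrate how the logic of matrices generalizes. The objects a_demo and t_demo are hypothetical examples; array() and table() are base R functions:

```r
# A 3-dimensional array (2 rows x 3 columns x 4 layers):
a_demo <- array(data = 1:24, dim = c(2, 3, 4))
dim(a_demo)
#> [1] 2 3 4
a_demo[1, 2, 3]  # row 1, column 2, layer 3
#> [1] 15

# A table of frequency counts (here: for a single character vector):
t_demo <- table(c("a", "b", "a", "c", "a"))
names(t_demo)      # the counted values
#> [1] "a" "b" "c"
as.vector(t_demo)  # their frequencies
#> [1] 3 1 1
```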

1.5.2 Rectangular tables (data frames and tibbles)

As matrices (and vectors) contain data of only one type (e.g., all cells contain numeric, character, or logical data), we need another data structure for heterogeneous data. As we will see, rectangular tables are the most frequent way of storing data throughout this book.

Why do we write “rectangular table”, rather than just “table”? Somewhat confusingly, R provides not only one, but several variants of rectangular tables (i.e., data structures in the shape of a rectangle). And unfortunately, the base R data structure table refers to yet another, \(n\)-dimensional data structure that was briefly mentioned above as a special type of array (see the end of Section 1.5.1 and ?base::table for details). Additionally, the R package data.table (Barrett et al., 2024) provides considerable extensions of R’s data.frame construct and allows faster transformations of large data files. The ubiquity and ambiguity of the “table” concept is why we often use the clumsy term “rectangular table”, rather than just “table”, when referring to the most common shape of data. However, at this early point in our R careers, we can assume that the term “table” will primarily refer to a data frame or tibble.

The need for storing heterogeneous data is nothing exotic or unusual. In fact, even the most simple datasets require mixing multiple types of data. For instance, imagine that we want to store a dataset that contains basic information on a group of people:

  • their names,
  • their gender,
  • their age (in years),
  • their height (in cm).

Each of these four variables can be stored as a vector (the first two of type character, the others of type numeric). To store all four variables in a single data structure, we can combine the four vectors into a rectangular table. In R, the four vectors form the columns of a rectangular table, rather than its rows.

The most common rectangular data structures in R are data frames. So-called tibbles are a simpler version of data frames that is used in the tidyverse (see Chapter 5 on Tibbles). Here is how we can describe some people on four dimensions (aka variables) by creating four short vectors and combining them into a data frame:

# Create some vectors (of different types, but same length): -----  
name   <- c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno")
gender <- c("male", "female", "female", "female", "female", "male", "male")
age    <- c(21, 23, 22, 19, 21, 18, 24)
height <- c(165, 170, 168, 172, 158, 185, 182)

# Combine 4 vectors (of equal length) into a data frame: 
df <- data.frame(name, gender, age, height, 
                 stringsAsFactors = TRUE)
df    # Note: Vectors are the columns of the data frame!
#>     name gender age height
#> 1   Adam   male  21    165
#> 2 Bertha female  23    170
#> 3 Cecily female  22    168
#> 4   Dora female  19    172
#> 5    Eve female  21    158
#> 6   Nero   male  18    185
#> 7   Zeno   male  24    182

The created data frame (named df) is a two-dimensional object consisting of 7 rows (cases/observations) and 4 columns (variables/measures). As with all data structures, we can apply a range of functions to obtain information about the df object:

is.matrix(df)
#> [1] FALSE
is.data.frame(df)
#> [1] TRUE

# What are the dimensions of df?
dim(df)  # 7 rows (cases) x 4 columns (variables)
#> [1] 7 4

# Note: 
# sum(df)  # would yield an error

We can easily turn any data frame into a tibble, by using the as_tibble() function of the tidyverse package tibble (Müller & Wickham, 2023):

# Turn df into a tibble tb: 
tb <- tibble::as_tibble(df)
dim(tb)  # 7 cases (rows) x 4 variables (columns), as df
#> [1] 7 4

At this point, a tibble is just a simpler and more convenient type of data frame. One advantage of tibbles is that they can be printed more easily to the screen. For instance, printing a tibble always shows its dimensions (as in dim(tb)) and the data type of each of its variables (i.e., each column):

tb  # print tb
#> # A tibble: 7 × 4
#>   name   gender   age height
#>   <fct>  <fct>  <dbl>  <dbl>
#> 1 Adam   male      21    165
#> 2 Bertha female    23    170
#> 3 Cecily female    22    168
#> 4 Dora   female    19    172
#> 5 Eve    female    21    158
#> 6 Nero   male      18    185
#> 7 Zeno   male      24    182

We will learn more about tibbles later (see Chapter 5 on Tibbles).

Please remember: Both data frames and tibbles have columns — rather than rows — that consist of atomic and linear vectors. The fact that vectors are homogeneous data constructs (see Table 1.1) is the reason for referring to the columns of a data frame as its variables. By contrast, the rows of a data frame can contain heterogeneous data types and are referred to as its cases or observations.

Practice

  1. Re-thinking rectangular tables:
  • If rectangular tables (i.e., of type data.frame or tibble) consist of columns of variables (vectors), what happens when we provide a numeric index to such tables?

Solution

Let’s try to apply numeric indices to rectangular tables (e.g., our data frame df and tibble tb):

df[1]
#>     name
#> 1   Adam
#> 2 Bertha
#> 3 Cecily
#> 4   Dora
#> 5    Eve
#> 6   Nero
#> 7   Zeno
df[c(1, 3)]
#>     name age
#> 1   Adam  21
#> 2 Bertha  23
#> 3 Cecily  22
#> 4   Dora  19
#> 5    Eve  21
#> 6   Nero  18
#> 7   Zeno  24
tb[1:3]
#> # A tibble: 7 × 3
#>   name   gender   age
#>   <fct>  <fct>  <dbl>
#> 1 Adam   male      21
#> 2 Bertha female    23
#> 3 Cecily female    22
#> 4 Dora   female    19
#> 5 Eve    female    21
#> 6 Nero   male      18
#> 7 Zeno   male      24

Answer: Numeric indexing of a rectangular table (with only one index) selects variables, but returns them as columns of a table, rather than as vectors. The reason for this is that rectangular tables are implemented as lists, with each column as one of their elements.
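
The list nature of rectangular tables can be verified directly. In the following sketch (with a hypothetical data frame people_l), single square brackets return a smaller data frame, whereas double square brackets extract the column itself as a vector:

```r
people_l <- data.frame(name = c("Adam", "Bertha"),
                       age  = c(21, 23))  # hypothetical example

is.list(people_l)  # a data frame is a list (of columns)
#> [1] TRUE
length(people_l)   # its length is the number of columns
#> [1] 2

class(people_l[2])    # single brackets return a data frame
#> [1] "data.frame"
class(people_l[[2]])  # double brackets return the column vector
#> [1] "numeric"
```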

  2. Family table:

Create a data frame or tibble that contains the names, ages, and family relations of your (or some famous) family (including at least three generations). What are the dimensions of the resulting table and the data types of all variables involved?

1.5.3 Working with rectangular tables

As rectangular tables (of type data.frame or tibble) are two-dimensional data structures, we also need two numeric index values to denote specific cells: A 1st index for specifying rows (or cases), and a 2nd index for specifying columns (or variables). With two indices, selecting cells, rows (cases), or columns (variables) of a data frame or tibble works just like selecting the corresponding cells, rows, or columns in matrices:

# Selecting cells, rows or columns: ----- 
df[5, 3]  # cell in row 5, column 3: 21 (age of Eve)
#> [1] 21
df[6, ]   # row 6
#>   name gender age height
#> 6 Nero   male  18    185
df[ , 4]  # column 4
#> [1] 165 170 168 172 158 185 182

In addition to numeric indexing (i.e., selecting cells, rows, or columns by their numeric indices), we can also select columns by their name. In case we do not know the variable names of a table, we can use the names() function to obtain them (as a vector):

names(df)    # yields the names of all variables (columns)
#> [1] "name"   "gender" "age"    "height"
names(df)[4] # the name of the 4th variable
#> [1] "height"

When selecting a column/variable of a rectangular table by its name, we can combine the name of the data structure (e.g., df) with the $ operator, followed by the name of the variable to be selected:

# Selecting variables (columns) by their name (with the $ operator):
df$gender  # returns gender vector
#> [1] male   female female female female male   male  
#> Levels: female male
df$age     # returns age vector
#> [1] 21 23 22 19 21 18 24
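
The $ operator is not the only way of selecting a column by its name. The following sketch (using a hypothetical data frame staff) shows three notations that return the same vector for a data frame:

```r
staff <- data.frame(name = c("Adam", "Bertha", "Cecily"),
                    age  = c(21, 23, 22))  # hypothetical example

staff$age        # by name, with $
#> [1] 21 23 22
staff[["age"]]   # by name, with [[ ]]
#> [1] 21 23 22
staff[ , "age"]  # by name, as a column index
#> [1] 21 23 22

# The name can also be stored in a variable (which does not work with $):
var <- "age"
staff[[var]]
#> [1] 21 23 22
```

Note that tibbles behave differently in one of these cases: Selecting a single column of a tibble by a column index (as in tb[ , "age"]) returns a one-column tibble, rather than a vector.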

As a rectangular table was created by combining atomic vectors (into the columns of df or tb), selecting a variable by its name yields a vector again. Thus, we can apply any vector function to the variables (columns/vectors) of a rectangular table:

# Applying functions to columns of df:
df$gender == "male"
#> [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
sum(df$gender == "male")  # Note: TRUE is 1, FALSE is 0
#> [1] 3

df$age < 21
#> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
df$age[df$age < 21]   # Which age values are below 21?
#> [1] 19 18
df$name[df$age < 21]  # What are the names of people below 21?
#> [1] Dora Nero
#> Levels: Adam Bertha Cecily Dora Eve Nero Zeno

mean(df$height)
#> [1] 171.4286
df$height < 170
#> [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
df$gender[df$height < 170]
#> [1] male   female female
#> Levels: female male

Adding new variables to a rectangular table is easy: To create a new variable x of a given table t, we simply assign something (a vector of length nrow(t)) to a new variable name (using the t$x notation). The variable type of x depends on our assignment:

df
#>     name gender age height
#> 1   Adam   male  21    165
#> 2 Bertha female  23    170
#> 3 Cecily female  22    168
#> 4   Dora female  19    172
#> 5    Eve female  21    158
#> 6   Nero   male  18    185
#> 7   Zeno   male  24    182
dim(df)   # 7 cases (rows) x 4 variables (columns)
#> [1] 7 4
nrow(df)  # 7 rows
#> [1] 7

# Create a new variable:
df$may_drink <- rep(NA, nrow(df))  # initialize a new variable (column) with unknown (NA) values
df # => may_drink was added as a new column to df, all instances are NA
#>     name gender age height may_drink
#> 1   Adam   male  21    165        NA
#> 2 Bertha female  23    170        NA
#> 3 Cecily female  22    168        NA
#> 4   Dora female  19    172        NA
#> 5    Eve female  21    158        NA
#> 6   Nero   male  18    185        NA
#> 7   Zeno   male  24    182        NA

# Assign values: A person may drink (alcohol, in the US),  
df$may_drink <- (df$age >= 21)  # if s/he is 21 (or older)
df
#>     name gender age height may_drink
#> 1   Adam   male  21    165      TRUE
#> 2 Bertha female  23    170      TRUE
#> 3 Cecily female  22    168      TRUE
#> 4   Dora female  19    172     FALSE
#> 5    Eve female  21    158      TRUE
#> 6   Nero   male  18    185     FALSE
#> 7   Zeno   male  24    182      TRUE

# Note:
# - we did not use an if-then statement
# - we did not specify separate TRUE vs. FALSE cases
# - we can assign and set new variables in 1 step:

df$is_female <- (df$gender == "female")
df
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 4   Dora female  19    172     FALSE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 6   Nero   male  18    185     FALSE     FALSE
#> 7   Zeno   male  24    182      TRUE     FALSE
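
New variables need not be logical, and variables can also be removed again. The following sketch works on a small hypothetical data frame (named crew) and uses the base R function ifelse() to derive a character variable from a condition. Assigning NULL to a variable removes it from the table:

```r
crew <- data.frame(name = c("Adam", "Dora", "Nero"),
                   age  = c(21, 19, 18))  # hypothetical example

# A new character variable, derived from a condition:
crew$age_group <- ifelse(crew$age >= 21, "21+", "<21")
crew
#>   name age age_group
#> 1 Adam  21       21+
#> 2 Dora  19       <21
#> 3 Nero  18       <21

# Remove a variable by assigning NULL:
crew$age_group <- NULL
names(crew)
#> [1] "name" "age"
```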

Practice

  • Add two more logical variables to df: A variable is_tall that is TRUE if and only if someone is taller than 170 cm, and a variable short_name that is TRUE if and only if someone’s name is not more than four characters long.

Hint: The nchar() function yields the length of a character string.

Subsetting rectangular tables

An alternative way for selecting a subset of a table is provided by the subset() function, which typically takes the form subset(x, condition), where x is some table of data and condition is a logical test that imposes criteria on one or more variables in x. For instance, if we wanted to select the cases (rows) of df that satisfy certain requirements regarding age or gender, we could specify those as follows:

# Subsetting by a condition:
subset(df, age > 20)
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 7   Zeno   male  24    182      TRUE     FALSE
subset(df, gender == "male")
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 6 Nero   male  18    185     FALSE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

# Multiple conditions:
subset(df, age > 20 | gender == "male")  # logical OR
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 6   Nero   male  18    185     FALSE     FALSE
#> 7   Zeno   male  24    182      TRUE     FALSE
subset(df, age > 20 & gender == "male")  # logical AND
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

However, note that the subset() function is only a convenient way of indexing by [...]. We can re-write the four commands from above as:

# Subsetting by a condition:
df[df$age > 20, ]  # note: columns are referenced explicitly (as df$age)
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 7   Zeno   male  24    182      TRUE     FALSE
df[df$gender == "male", ]
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 6 Nero   male  18    185     FALSE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

# Multiple conditions:
df[df$age > 20 | df$gender == "male", ]  # logical OR
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 6   Nero   male  18    185     FALSE     FALSE
#> 7   Zeno   male  24    182      TRUE     FALSE
df[df$age > 20 & df$gender == "male", ]  # logical AND
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

As using subset() can have unanticipated consequences, its help page even contains the warning “This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [ …” (see footnote 25). Thus, indexing data structures by [] is both safer and more general.
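
One concrete difference between both approaches concerns missing values: subset() treats missing values in the condition as FALSE, whereas logical indexing by [] returns a row of NA values for each of them. The following sketch uses hypothetical data (named scores) to show this difference:

```r
scores <- data.frame(id    = 1:4,
                     score = c(10, NA, 30, 5))  # hypothetical data

# subset() silently drops rows for which the condition is NA:
nrow(subset(scores, score > 8))
#> [1] 2

# Logical indexing by [ ] keeps such cases (as rows of NA values):
nrow(scores[scores$score > 8, ])
#> [1] 3

# Wrapping the condition in which() drops the NA cases explicitly:
nrow(scores[which(scores$score > 8), ])
#> [1] 2
```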

1.5.4 Changing variable types

When working with vectors or rectangles of data, we often need or want to convert the type of a variable into another one. To convert a variable, we simply assign it to itself (so that all its values will be preserved) and wrap a type conversion function (as.character(), as.integer(), as.logical(), as.numeric(), or as.factor()) around it:

levels(df$gender)  # currently a so-called "factor" variable
#> [1] "female" "male"
typeof(df$gender)  # of type "integer"
#> [1] "integer"
df$gender <- as.character(df$gender)  # convert into a character variable
typeof(df$gender)  # now of type "character"
#> [1] "character"

df$gender <- as.factor(df$gender)  # convert from "character" into a "factor"
df$gender
#> [1] male   female female female female male   male  
#> Levels: female male
typeof(df$gender)  # again of type "integer"
#> [1] "integer"

typeof(df$age)  # numeric "double"
#> [1] "double"
df$age <- as.integer(df$age)  # convert from "double" to "integer"
typeof(df$age)  # "integer"
#> [1] "integer"
df$age <- as.numeric(df$age)  # convert from "integer" to numeric "double"
typeof(df$age)  # numeric "double"
#> [1] "double"
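
Some conversions yield results that are easy to misread. The following sketch (with a hypothetical factor f_num) shows two cases: Converting a factor with number-like levels into numbers returns its internal integer codes, unless we convert it into character data first. And converting a decimal number into an integer truncates it, rather than rounding it:

```r
f_num <- factor(c("10", "20", "10"))  # hypothetical factor with number-like levels

as.numeric(f_num)                # returns the internal integer codes
#> [1] 1 2 1
as.numeric(as.character(f_num))  # returns the numbers shown as labels
#> [1] 10 20 10

as.integer(2.9)  # conversion to integer truncates (does not round)
#> [1] 2
```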

Practice

  1. What happens when you convert a vector v <- 0:3 into logical data by using the as.logical() conversion function? (Predict the outcome, then check it, and verify your understanding by reading the documentation of ?as.logical.)
v <- 0:3
v
as.logical(v)
  2. What happens when you convert the outcome of as.logical(v) into numeric data by using the as.numeric() conversion function? (Predict the outcome, then check it, and verify your understanding by reading the documentation of ?as.numeric.)
vl <- as.logical(v)
vl
as.numeric(vl)

References

Barrett, T., Dowle, M., Srinivasan, A., Gorecki, J., Chirico, M., Hocking, T., & Schwendinger, B. (2024). data.table: Extension of ‘data.frame’. Retrieved from https://r-datatable.com
Müller, K., & Wickham, H. (2023). tibble: Simple data frames. Retrieved from https://CRAN.R-project.org/package=tibble
Wickham, H. (2014a). Advanced R (1st ed.). Retrieved from http://adv-r.had.co.nz/

  25. For details, see the discussion of non-standard evaluation (in Wickham, 2014a).