## 1.5 Rectangular data structures

As we have seen (in the previous Section 1.4), vectors are linear (or one-dimensional) sequences: They have a length, but no width — or only a trivial width of 1. By combining several vectors, we get a rectangular data structure — usually a matrix or a rectangular table — that arranges data points in two dimensions: horizontal rows and vertical columns. In a rectangular data structure, all rows and all columns are of equal length.

The most common rectangular data structures in R are matrices (of type matrix) and rectangular tables (of type data.frame or tibble). As all three denote rectangular data structures, the differences between them concern details of their contents and their implementation. We mostly distinguish between two sub-types: Matrices (Section 1.5.1) and rectangular tables (Section 1.5.2).

### 1.5.1 Matrices

When a rectangle of data contains data of the same type in all cells (i.e., all rows and columns), we call this a matrix of data. Despite its rectangular shape, a matrix is not a rectangular table (which consists of multiple columns). Instead, an R matrix is a vector (i.e., a linear data structure that stores only a single mode/type of data) with an additional attribute that describes its shape: Rather than just extending in one dimension (i.e., the vector’s length), the matrix distributes data elements in two dimensions (with the dim attribute denoting its number of rows and columns).
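
To see that a matrix is a vector with a dim attribute, we can add this attribute to an ordinary vector by hand. The following is only a minimal sketch (the object name v6 is chosen here for illustration and does not appear elsewhere in this section):

```r
v6 <- 1:6            # an ordinary integer vector
is.matrix(v6)        # FALSE: no dim attribute yet
attributes(v6)       # NULL

dim(v6) <- c(2, 3)   # add a dim attribute: 2 rows, 3 columns
is.matrix(v6)        # TRUE: the same data, now shaped as a matrix
attributes(v6)       # a list with one element: dim
length(v6)           # still 6 elements
```

In practice, the matrix() function introduced next is the more common way of creating matrices.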

#### Creating matrices

Matrices can easily be created from a vector by the matrix() function, which accepts some data (e.g., a numeric vector 1:20) and one or more additional arguments that specify the shape and arrangement of the desired matrix:

# Reshaping a vector into a matrix:
(m0 <- matrix(data = 1:20, nrow = 5))
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20

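By default, the number of columns of the new matrix is inferred from the length of data and the number of rows (here: 20 / 5 = 4), but it can also be set explicitly (in which case the number of rows is inferred):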

(m1 <- matrix(data = 1:20, ncol = 4))
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20

Similarly, the elements of data are arranged in a by-column fashion by default. If we want to change this default behavior, we can set the argument byrow = TRUE:

(m2 <- matrix(data = 1:20, nrow = 5, byrow = TRUE))
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20

Alternatively, a matrix can be created by combining multiple vectors (provided that they have the same data type) by binding them together. Based on the desired arrangement of vectors, we have two binding functions: The cbind() function treats each vector as a column; the rbind() function treats each vector as a row:

# Creating 3 vectors:
x <- 1:3
y <- 4:6
z <- 7:9

# Combining vectors (of the same length): ----
(m3 <- cbind(x, y, z))  # combine as columns
#>      x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
(m4 <- rbind(x, y, z))  # combine as rows
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9

The resulting matrices m3 and m4 differ not only in the arrangement of their elements (by-column vs. by-row), but also in their names: Combining vectors as columns uses the vector names as column names, whereas combining vectors as rows uses the vector names as row names.
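
The difference in names can be inspected with the colnames() and rownames() functions. The following sketch re-creates both matrices under the illustrative names by_col and by_row (which are not used elsewhere in this section) and also shows that the names can serve as indices:

```r
x <- 1:3
y <- 4:6
z <- 7:9

by_col <- cbind(x, y, z)  # vectors as columns (like m3)
by_row <- rbind(x, y, z)  # vectors as rows (like m4)

colnames(by_col)  # "x" "y" "z"
rownames(by_col)  # NULL
colnames(by_row)  # NULL
rownames(by_row)  # "x" "y" "z"

by_col[ , "y"]    # select a column by its name
by_row["y", ]     # select a row by its name
```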

In summary, the matrix() function turns existing data into a matrix, with arguments for the number of rows nrow, the number of columns ncol, and a logical argument byrow that arranges data in a by-row vs. by-column fashion. By contrast, the cbind() and rbind() functions combine multiple vectors into a matrix.

Note that reshaping data into a rectangle is subject to multiple constraints. For instance, what happens when we specify arguments for both nrow and ncol, but the product of nrow * ncol does not match the number of elements in data? And what happens when the vectors that we want to bind into a matrix are not of the same length or type? We can simply find this out by trying.

# Data (as vectors):
v <- 1:10
m <- 1:3
n <- 4:5
o <- letters[1:3]

# Matrices:
matrix(data = v, nrow = 3, ncol = 3)
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9
matrix(data = v, nrow = 4, ncol = 4)
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    5    9    3
#> [2,]    2    6   10    4
#> [3,]    3    7    1    5
#> [4,]    4    8    2    6

cbind(m, n)
#>      m n
#> [1,] 1 4
#> [2,] 2 5
#> [3,] 3 4
rbind(m, n)
#>   [,1] [,2] [,3]
#> m    1    2    3
#> n    4    5    4

cbind(m, o)
#>      m   o
#> [1,] "1" "a"
#> [2,] "2" "b"
#> [3,] "3" "c"
rbind(m, o)
#>   [,1] [,2] [,3]
#> m "1"  "2"  "3"
#> o "a"  "b"  "c"

Note that most of these commands created warning messages, as the number of elements did not fit neatly into a matrix of the specified size. (The last two commands ran without a warning, but silently converted the numbers in m into character strings.) R still evaluated each expression and created a matrix in each case, by truncating or recycling the data, but the resulting matrices may not have been the ones we expected to obtain.
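
As warnings are easily overlooked, it can help to check for them explicitly. The following sketch defines a small helper function gives_warning() (a hypothetical name, not part of base R) that uses tryCatch() to report whether evaluating an expression signals a warning:

```r
# Hypothetical helper: TRUE if evaluating expr signals a warning, else FALSE
gives_warning <- function(expr) {
  tryCatch({ expr; FALSE }, warning = function(w) TRUE)
}

v <- 1:10
m <- 1:3
n <- 4:5
o <- letters[1:3]

gives_warning(matrix(data = v, nrow = 3, ncol = 3))  # TRUE
gives_warning(matrix(data = v, nrow = 4, ncol = 4))  # TRUE
gives_warning(cbind(m, n))                           # TRUE
gives_warning(rbind(m, n))                           # TRUE
gives_warning(cbind(m, o))                           # FALSE
gives_warning(rbind(m, o))                           # FALSE
```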

The matrices m0 to m4 all contained numeric data. However, data of type “logical” or “character” can also be stored in matrix form:

# A matrix of logical values:
(m5 <- matrix(data = 1:18 %% 4 == 0, nrow = 3, ncol = 6, byrow = TRUE))
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
#> [1,] FALSE FALSE FALSE  TRUE FALSE FALSE
#> [2,] FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE FALSE FALSE

# A matrix of character values:
(m6 <- matrix(sample(letters, size = 16), nrow = 4, ncol = 4, byrow = FALSE))
#>      [,1] [,2] [,3] [,4]
#> [1,] "u"  "d"  "s"  "a"
#> [2,] "j"  "c"  "z"  "f"
#> [3,] "m"  "b"  "e"  "w"
#> [4,] "n"  "t"  "h"  "y"

Thus, as long as the matrix() function receives data of only a single data type and is well-specified (in the sense that the number of elements in data and those specified by nrow or ncol do not contradict each other), the resulting matrices are straightforward. And if there are conflicts, R still tries to create some sensible data structure, but it takes some expertise to explain and predict what happens in these cases.

#### The shape and type of matrices

As R matrices are implemented as re-shaped variants of atomic vectors, we can check their properties by the same generic functions. For instance:

mode(m0)
#> [1] "numeric"
typeof(m0)
#> [1] "integer"
length(m0)
#> [1] 20

However, note that adding the 2-dimensional dim attribute (with elements for the number of rows and columns of a matrix) allows for more precise characterizations of matrices that would not work (or return different results) for vectors:

is.vector(m0)
#> [1] FALSE
is.matrix(m0)
#> [1] TRUE
dim(m0)
#> [1] 5 4

Thus, the shape of a matrix is better described by dim(x) than by length(x).

#### Practice

1. What would the following generic functions return for a vector v0?
v0 <- 1:20
is.vector(v0)
is.matrix(v0)
dim(v0)
1. If x is a matrix, what is the result of dim(x)[1] * dim(x)[2]? Predict and check this for an example.
(x <- matrix(data = 1:20, nrow = 5))
dim(x)[1] * dim(x)[2]
length(x)

#### Indexing matrices

Retrieving values from a matrix m works similarly to indexing vectors. First, we will consider numeric indexing. Due to the two-dimensional nature of a matrix, we now need to specify two indices in square brackets: the number of the desired row, and the number of the desired column, separated by a comma. Thus, to get or change the value of row r and column c of a matrix m we need to evaluate m[r, c]. Just as with vectors, providing multiple numeric indices selects the corresponding rows or columns. When the value of r or c is left unspecified, all rows or columns are selected.

# Selecting cells, rows, or columns of matrices: ----
m1[2, 3]  # in m1: select row 2, column 3
#> [1] 12
m1[2,  ]  # in m1: select row 2, all columns
#> [1]  2  7 12 17

m2[3, 1]  # in m2: select row 3, column 1
#> [1] 9
m2[ , 1]  # in m2: select column 1, all rows
#> [1]  1  5  9 13 17

m3[2, 2:3]  # in m3: select row 2, columns 2 to 3
#> y z
#> 5 8
m3[1:3, 2]  # in m3: select rows 1 to 3, column 2
#> [1] 4 5 6

m4[2, ]   # in m4: select row 2
#> [1] 4 5 6
m4[ , 2]  # in m4: select col 2
#> x y z
#> 2 5 8
m4[]      # in m4: select all rows and all columns (i.e., all of m4)
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9
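
The same index notation also allows changing values, as mentioned above. A minimal sketch, using a copy named mc (an illustrative name) so that the matrices m1 to m4 remain unchanged:

```r
mc <- matrix(data = 1:20, nrow = 5)  # same content as m1

mc[2, 3] <- 0L     # change one cell (row 2, column 3)
mc[5,  ] <- 99L    # change all cells of row 5 (the value is recycled)
mc[ , 1] <- 11:15  # replace column 1 by a vector of matching length
mc                 # print the modified matrix
```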

Similarly, we can extend the notion of logical indexing to matrices:

m4 > 5  # returns a matrix of logical values
#>    [,1]  [,2]  [,3]
#> x FALSE FALSE FALSE
#> y FALSE FALSE  TRUE
#> z  TRUE  TRUE  TRUE
typeof(m4 > 5)
#> [1] "logical"
m4[m4 > 5]  # indexing of matrices
#> [1] 7 8 6 9
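
The order of the selected values (7 8 6 9) may seem surprising, but follows from the fact that a matrix stores its elements column by column. The following sketch illustrates this with a matrix mm that has the same content as m4 (the names mm and mm2 are chosen here for illustration):

```r
mm <- rbind(x = 1:3, y = 4:6, z = 7:9)  # same content as m4

as.vector(mm)                  # elements in storage order: 1 4 7 2 5 8 3 6 9
which(mm > 5)                  # positions of TRUE values in that order: 3 6 8 9
which(mm > 5, arr.ind = TRUE)  # the same cells as pairs of row and column

mm2 <- mm
mm2[mm2 > 5] <- 0L             # logical indexing also works for assignment
mm2
```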

Just as with vectors, we can apply generic functions to matrices, provided that the function is applicable to the type of data stored in the matrix. Typical examples include:

# Applying functions to matrices: ----
is.matrix(m1)
#> [1] TRUE
typeof(m2)
#> [1] "integer"

# Note the differences between:
is.numeric(m3)  # type of m3? (1 value)
#> [1] TRUE
is.na(m3)       # NA values in m3? (many values)
#>          x     y     z
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE

# Computations with matrices:
sum(m1)
#> [1] 210
max(m2)
#> [1] 20
mean(m3)
#> [1] 5
colSums(m3)  # column sums of m3
#>  x  y  z
#>  6 15 24
rowSums(m4)  # row sums of m4
#>  x  y  z
#>  6 15 24

Note that some of these functions required the matrices m1 to m4 to be of specific data types and would not have worked for matrices of another type.
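
A brief sketch of this dependence on data types, using three small matrices with illustrative names (m_num, m_lgl, and m_chr). The call to tryCatch() only serves to catch the error, so that the code runs through:

```r
m_num <- matrix(1:6, nrow = 2)           # numeric (integer) data
m_lgl <- m_num > 3                       # logical data
m_chr <- matrix(letters[1:6], nrow = 2)  # character data

sum(m_num)  # 21
sum(m_lgl)  # 3 (each TRUE counts as 1)

# sum() is not defined for character data and signals an error:
res <- tryCatch(sum(m_chr), error = function(e) "error")
res         # "error"
```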

Regarding the shape of matrices, we saw above that dim(x) provides more specific information about the shape of a matrix x than length(x). As the result of dim(x) is a vector (with elements specifying the number of rows and columns of x), there exist specializations that directly yield the number of rows or columns:

length(m1)  # length (of vector)
#> [1] 20
dim(m1)     # dimensions as vector c(rows, columns)
#> [1] 5 4

nrow(m1)    # number of rows
#> [1] 5
ncol(m1)    # number of columns
#> [1] 4

Another common function in the context of matrices is t() for transposing (i.e., swapping the rows and columns of) a matrix:

t(m4)
#>      x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
t(m5)
#>       [,1]  [,2]  [,3]
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE  TRUE FALSE
#> [3,] FALSE FALSE FALSE
#> [4,]  TRUE FALSE  TRUE
#> [5,] FALSE FALSE FALSE
#> [6,] FALSE  TRUE FALSE

#### Practice

Assuming the definitions of m5 and m6 from above, predict, evaluate, and explain the result of the following expressions:

m5[2, 6]
m5[2, ]
m5 == FALSE
sum(m5)
t(t(m5))

m6[2, 3]
m6[ , 4]
m6[nrow(m6), (ncol(m6) - 1)]
m6 == "e"
toupper(m6[4, ])

#### Arrays and tables

Table 1.1 (in Section 1.2.1) mentioned arrays and tables as data structures for storing homogeneous data with more than two dimensions in R. Essentially, objects of type array are generalizations of vectors and matrices that allow for multiple dimensions (i.e., more complex shapes). By contrast, R objects of type table store the frequency counts of factor combinations (aka. a “contingency table”) and thus are a special kind of array.

As arrays and tables are mostly used for storing multi-dimensional data (with 3 or more dimensions), we do not cover them here. However, much of what we know about vectors and matrices can be generalized to them.
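
Just to give a first impression, here is a minimal sketch of both data structures (the object names a and tab are chosen for illustration). Note that indexing an array requires one index per dimension:

```r
a <- array(data = 1:24, dim = c(2, 3, 4))  # 2 rows, 3 columns, 4 layers
is.array(a)   # TRUE
dim(a)        # 2 3 4
a[1, 2, 3]    # row 1, column 2, layer 3: 15

tab <- table(c("a", "b", "a", "c", "a"))   # frequency counts of values
tab
tab[["a"]]    # 3
```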

### 1.5.2 Rectangular tables (data frames and tibbles)

As matrices (and vectors) contain data of only one type (e.g., all cells contain numeric, character, or logical data), we need another data structure for heterogeneous data. As we will see, rectangular tables are the most frequent way of storing data throughout this book.

Why do we write “rectangular table,” rather than just “table”? Somewhat confusingly, R provides not only one, but several variants of rectangular tables (i.e., data structures in the shape of a rectangle). And unfortunately, the base R data structure table refers to yet another, n-dimensional data structure that was briefly mentioned above as a special type of array (see the end of Section 1.5.1 and ?base::table for details). Additionally, the R package data.table provides considerable extensions of R’s data.frame construct and allows faster transformations of large data files. The ubiquity and ambiguity of the “table” concept is why we often use the clumsy term “rectangular table,” rather than just “table,” when referring to the most common shape of data. However, at this early point in our R careers, we can assume that the term “table” will primarily refer to a data frame or tibble.

The need for storing heterogeneous data is nothing exotic or unusual. In fact, even the most simple datasets require mixing multiple types of data. For instance, imagine that we want to store a dataset that contains basic information on a group of people:

• their names,
• their gender,
• their age (in years),
• their height (in cm).

Each of these four variables can be stored as a vector (the first two of type character, the others of type numeric). To store all four variables in a single data structure, we can combine the four vectors into a rectangular table. In R, the four vectors form the columns of a rectangular table, rather than its rows.

The most common rectangular data structures in R are data frames, whereas so-called tibbles are a simpler version of a data frame that is used in the tidyverse (see Chapter 5 on Tibbles). Here is how we can describe some people on four dimensions (aka. variables) by creating four short vectors and combining them into a data frame:

# Create some vectors (of different types, but same length): -----
name   <- c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno")
gender <- c("male", "female", "female", "female", "female", "male", "male")
age    <- c(21, 23, 22, 19, 21, 18, 24)
height <- c(165, 170, 168, 172, 158, 185, 182)

# Combine 4 vectors (of equal length) into a data frame:
df <- data.frame(name, gender, age, height,
                 stringsAsFactors = TRUE)
df    # Note: Vectors are the columns of the data frame!
#>     name gender age height
#> 1   Adam   male  21    165
#> 2 Bertha female  23    170
#> 3 Cecily female  22    168
#> 4   Dora female  19    172
#> 5    Eve female  21    158
#> 6   Nero   male  18    185
#> 7   Zeno   male  24    182

The created data frame (named df) is a two-dimensional object consisting of 7 rows (cases/observations) and 4 columns (variables/measures). As with all data structures, we can apply a range of functions to obtain information about the df object:

is.matrix(df)
#> [1] FALSE
is.data.frame(df)
#> [1] TRUE

# What are the dimensions of df?
dim(df)  # 7 rows (cases) x 4 columns (variables)
#> [1] 7 4

# Note:
# sum(df)  # would yield an error

We can easily turn any data frame into a tibble by using the as_tibble() function of the tidyverse package tibble:

# Turn df into a tibble tb:
tb <- tibble::as_tibble(df)
dim(tb)  # 7 cases (rows) x 4 variables (columns), as df
#> [1] 7 4

At this point, a tibble is just a simpler and more convenient type of data frame. One advantage of tibbles is that they can be printed more easily to the screen. For instance, printing a tibble always shows its dimensions (as in dim(tb)) and the data type of each of its variables (i.e., each column):

tb  # print tb
#> # A tibble: 7 × 4
#>   name   gender   age height
#>   <fct>  <fct>  <dbl>  <dbl>
#> 1 Adam   male      21    165
#> 2 Bertha female    23    170
#> 3 Cecily female    22    168
#> 4 Dora   female    19    172
#> 5 Eve    female    21    158
#> 6 Nero   male      18    185
#> 7 Zeno   male      24    182

Please remember: Both data frames and tibbles have columns — rather than rows — that consist of atomic and linear vectors. The fact that vectors are homogeneous data constructs (see Table 1.1) is the reason for referring to the columns of a data frame as its variables. By contrast, the rows of a data frame can contain heterogeneous data types and are referred to as its cases or observations.

#### Practice

1. Re-thinking rectangular tables:
• If rectangular tables (i.e., of type data.frame or tibble) consist of columns of variables (vectors), what happens when we provide a numeric index to such tables?

#### Solution

Let’s try to apply numeric indices to rectangular tables (e.g., our data frame df and tibble tb):

df[1]
#>     name
#> 1   Adam
#> 2 Bertha
#> 3 Cecily
#> 4   Dora
#> 5    Eve
#> 6   Nero
#> 7   Zeno
df[c(1, 3)]
#>     name age
#> 1   Adam  21
#> 2 Bertha  23
#> 3 Cecily  22
#> 4   Dora  19
#> 5    Eve  21
#> 6   Nero  18
#> 7   Zeno  24
tb[1:3]
#> # A tibble: 7 × 3
#>   name   gender   age
#>   <fct>  <fct>  <dbl>
#> 1 Adam   male      21
#> 2 Bertha female    23
#> 3 Cecily female    22
#> 4 Dora   female    19
#> 5 Eve    female    21
#> 6 Nero   male      18
#> 7 Zeno   male      24

Answer: Numeric indexing of a rectangular table (with only one index) selects variables, but returns them as columns of a table, rather than as vectors. The reason for this is that rectangular tables are implemented as lists, with each column as one of their elements.
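
The list nature of data frames can be verified directly. In the following sketch (which uses a small data frame with the illustrative name d), single brackets return a table, whereas double brackets return the column as a vector:

```r
d <- data.frame(name = c("Adam", "Bertha"),
                age  = c(21, 23))

is.list(d)           # TRUE: a data frame is a list of columns
length(d)            # 2: the number of columns (not of cells)

d[2]                 # single brackets: a data frame with 1 column
d[[2]]               # double brackets: the column as a vector
is.data.frame(d[2])  # TRUE
is.vector(d[[2]])    # TRUE
```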

1. Family table:

Create a data frame or tibble that contains the names, ages, and family relations of your (or some famous) family (including at least three generations). What are the dimensions of the resulting tibble and the data types of all variables involved?

### 1.5.3 Working with rectangular tables

As rectangular tables (of type data.frame or tibble) are two-dimensional data structures, we also need two numeric index values to denote specific cells: A 1st index for specifying rows (or cases), and a 2nd index for specifying columns (or variables). With two indices, selecting cells, rows (cases), or columns (variables) of a data frame or tibble works just like selecting the corresponding cells, rows, or columns in matrices:

# Selecting cells, rows or columns: -----
df[5, 3]  # cell in row 5, column 3: 21 (age of Eve)
#> [1] 21
df[6, ]   # row 6
#>   name gender age height
#> 6 Nero   male  18    185
df[ , 4]  # column 4
#> [1] 165 170 168 172 158 185 182

In addition to numeric indexing (i.e., selecting cells, rows, or columns by their numeric indices), we can also select columns by their name. In case we do not know the variable names of a table, we can use the names() function to obtain them (as a vector):

names(df)    # yields the names of all variables (columns)
#> [1] "name"   "gender" "age"    "height"
names(df)[4] # the name of the 4th variable
#> [1] "height"

When selecting a column/variable of a rectangular table by its name, we can combine the name of the data structure (e.g., df) with the $ operator, followed by the name of the variable to be selected:

# Selecting variables (columns) by their name (with the $ operator):
df$gender  # returns gender vector
#> [1] male   female female female female male   male
#> Levels: female male
df$age     # returns age vector
#> [1] 21 23 22 19 21 18 24
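
The $ operator is not the only way of selecting a variable by its name. The following sketch (using a small data frame with the illustrative name d) shows two bracket-based alternatives that return the same vector. Using double brackets is helpful when the name of a variable is stored in another object (here: col_name, a name chosen for illustration):

```r
d <- data.frame(name = c("Adam", "Bertha", "Cecily"),
                age  = c(21, 23, 22))

d$age        # by name, with the $ operator
d[["age"]]   # by name, with double brackets
d[ , "age"]  # by name, in the column position

col_name <- "age"  # a variable name stored as a character string
d[[col_name]]      # works with double brackets
d$col_name         # NULL: $ looks for a column called "col_name"
```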

As a rectangular table was created by combining atomic vectors (into the columns of df or tb), selecting a variable by its name yields a vector again. Thus, we can apply any vector function to the variables (columns/vectors) of a rectangular table:

# Applying functions to columns of df:
df$gender == "male"
#> [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
sum(df$gender == "male")  # Note: TRUE is 1, FALSE is 0
#> [1] 3

df$age < 21
#> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
df$age[df$age < 21]   # Which age values are below 21?
#> [1] 19 18
df$name[df$age < 21]  # What are the names of people below 21?
#> [1] Dora Nero
#> Levels: Adam Bertha Cecily Dora Eve Nero Zeno

mean(df$height)
#> [1] 171.4286
df$height < 170
#> [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
df$gender[df$height < 170]
#> [1] male   female female
#> Levels: female male

Adding new variables to a rectangular table is easy: To create a new variable x of a given table t, we simply assign something (a vector of length nrow(t)) to a new variable name (using the t$x notation). The variable type of x depends on our assignment:

df
#>     name gender age height
#> 1   Adam   male  21    165
#> 2 Bertha female  23    170
#> 3 Cecily female  22    168
#> 4   Dora female  19    172
#> 5    Eve female  21    158
#> 6   Nero   male  18    185
#> 7   Zeno   male  24    182
dim(df)   # 7 cases (rows) x 4 variables (columns)
#> [1] 7 4
nrow(df)  # 7 rows
#> [1] 7

# Create a new variable:
df$may_drink <- rep(NA, nrow(df))  # initialize a new variable (column) with unknown (NA) values
df  # => may_drink was added as a new column to df, all instances are NA
#>     name gender age height may_drink
#> 1   Adam   male  21    165        NA
#> 2 Bertha female  23    170        NA
#> 3 Cecily female  22    168        NA
#> 4   Dora female  19    172        NA
#> 5    Eve female  21    158        NA
#> 6   Nero   male  18    185        NA
#> 7   Zeno   male  24    182        NA

# Assign values: A person may drink (alcohol, in the US),
df$may_drink <- (df$age >= 21)  # if s/he is 21 (or older)
df
#>     name gender age height may_drink
#> 1   Adam   male  21    165      TRUE
#> 2 Bertha female  23    170      TRUE
#> 3 Cecily female  22    168      TRUE
#> 4   Dora female  19    172     FALSE
#> 5    Eve female  21    158      TRUE
#> 6   Nero   male  18    185     FALSE
#> 7   Zeno   male  24    182      TRUE

# Note:
# - we did not use an if-then statement
# - we did not specify separate TRUE vs. FALSE cases
# - we can assign and set new variables in 1 step:
df$is_female <- (df$gender == "female")
df
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 4   Dora female  19    172     FALSE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 6   Nero   male  18    185     FALSE     FALSE
#> 7   Zeno   male  24    182      TRUE     FALSE

#### Practice

• Add two more logical variables to df: A variable is_tall that is TRUE if and only if someone is taller than 170 cm, and a variable short_name that is TRUE if and only if someone’s name is not more than four characters long. Hint: The nchar() function yields the length of a character string, but requires character data (as df$name is a factor here, convert it with as.character() first).

#### Subsetting rectangular tables

An alternative way of selecting a subset of a table is provided by the subset() function, which typically takes the form subset(x, condition), where x is some table of data and condition is a logical test that imposes criteria on one or more variables in x.
For instance, if we wanted to select the cases (rows) of df that satisfy certain requirements regarding age or gender, we could specify those as follows:

# Subsetting by a condition:
subset(df, age > 20)
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 7   Zeno   male  24    182      TRUE     FALSE
subset(df, gender == "male")
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 6 Nero   male  18    185     FALSE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

# Multiple conditions:
subset(df, age > 20 | gender == "male")  # logical OR
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 6   Nero   male  18    185     FALSE     FALSE
#> 7   Zeno   male  24    182      TRUE     FALSE
subset(df, age > 20 & gender == "male")  # logical AND
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

However, note that the subset() function is only a convenient way of indexing by [...].
We can re-write the four commands from above as follows (addressing each variable as a column of df, so that the conditions do not depend on free-standing vectors of the same name):

# Subsetting by a condition:
df[df$age > 20, ]
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 7   Zeno   male  24    182      TRUE     FALSE
df[df$gender == "male", ]
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 6 Nero   male  18    185     FALSE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

# Multiple conditions:
df[df$age > 20 | df$gender == "male", ]  # logical OR
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 6   Nero   male  18    185     FALSE     FALSE
#> 7   Zeno   male  24    182      TRUE     FALSE
df[df$age > 20 & df$gender == "male", ]  # logical AND
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

As using subset() can have unanticipated consequences, its help page even contains the warning “This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [ ….” Thus, indexing data structures by [] is both safer and more general.

### 1.5.4 Changing variable types

When working with vectors or rectangles of data, we often need or want to convert the type of a variable into another one. To convert a variable, we simply assign it to itself (so that all its values will be preserved) and wrap a type conversion function (as.character(), as.integer(), as.logical(), as.numeric(), or as.factor()) around it:

levels(df$gender)  # currently a so-called "factor" variable
#> [1] "female" "male"
typeof(df$gender)  # of type "integer"
#> [1] "integer"
df$gender <- as.character(df$gender)  # convert into a character variable
typeof(df$gender)  # now of type "character"
#> [1] "character"

df$gender <- as.factor(df$gender)  # convert from "character" into a "factor"
df$gender
#> [1] male   female female female female male   male
#> Levels: female male
typeof(df$gender)  # again of type "integer"
#> [1] "integer"

typeof(df$age)  # numeric "double"
#> [1] "double"
df$age <- as.integer(df$age)  # convert from "double" to "integer"
typeof(df$age)  # "integer"
#> [1] "integer"
df$age <- as.numeric(df$age)  # convert from "integer" to numeric "double"
typeof(df$age)  # numeric "double"
#> [1] "double"
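
Some type conversions can yield surprising results. The following sketch (with the illustrative object name f) shows two common cases: Converting a factor into numbers returns its internal level codes, rather than the numbers shown as its labels, and converting decimal numbers into integers drops their decimals:

```r
f <- factor(c("10", "20", "10"))  # a factor whose labels look like numbers

as.numeric(f)                # 1 2 1: the internal level codes
as.numeric(as.character(f))  # 10 20 10: the values of the labels

as.integer(c(1.2, 2.9))      # 1 2: decimals are dropped, not rounded
```

Thus, it is a good idea to check the result of a type conversion, rather than assuming that all values were preserved.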

#### Practice

1. What happens when you convert a vector v <- 0:3 into logical data by using the as.logical() conversion function? (Predict the outcome, then check it, and verify your understanding by reading the documentation of ?as.logical.)
v <- 0:3
v
as.logical(v)
1. What happens when you convert the outcome of as.logical(v) into numeric data by using the as.numeric() conversion function? (Predict the outcome, then check it, and verify your understanding by reading the documentation of ?as.numeric.)
vl <- as.logical(v)
vl
as.numeric(vl)

### References

Dowle, M., & Srinivasan, A. (2021). data.table: Extension of ‘data.frame’. Retrieved from https://CRAN.R-project.org/package=data.table
Müller, K., & Wickham, H. (2020). tibble: Simple data frames. Retrieved from https://CRAN.R-project.org/package=tibble