1.5 Rectangular data structures
As we have seen (in the previous Section 1.4), vectors are linear (or one-dimensional) sequences: They have a length, but no width — or only a trivial width of 1. By combining several vectors, we get a rectangular data structure — usually a matrix or a rectangular table — that arranges data points in two dimensions: horizontal rows and vertical columns. In a rectangular data structure, all rows and all columns are of equal length.
The most common rectangular data structures in R are matrices (of type matrix) and rectangular tables (of type data.frame or tibble). As all three denote rectangular data structures, the differences between them concern details of their contents and their implementation. We mostly distinguish between two sub-types:
Matrices (Section 1.5.1) and rectangular tables (Section 1.5.2).
1.5.1 Matrices
When a rectangle of data contains data of the same type in all cells (i.e., all rows and columns), we call this a matrix of data.
Despite its rectangular shape, a matrix is not a rectangular table (which consists of multiple columns).
Instead, an R matrix is a vector (i.e., a linear data structure that stores only a single mode/type of data) with an additional attribute that describes its shape: Rather than just extending in one dimension (i.e., the vector’s length), the matrix distributes data elements in two dimensions (with the dim attribute denoting its number of rows and columns).
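The claim that a matrix is a vector with a shape attribute can be checked directly. The following is a minimal sketch (the object name v is only illustrative) that turns an ordinary vector into a matrix by assigning a dim attribute:

```r
v <- 1:6                 # an ordinary integer vector
is.matrix(v)             # FALSE: no dim attribute yet

dim(v) <- c(2, 3)        # add a shape: 2 rows, 3 columns
is.matrix(v)             # TRUE: same data, now with a dim attribute
attributes(v)            # a list that contains only $dim
length(v)                # still 6 elements
```

Note that the data elements remain unchanged; only the description of their shape was added.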
Creating matrices
Matrices can easily be created from a vector by the matrix() function, which accepts some data (e.g., a numeric vector 1:20) and one or more additional arguments that specify the shape and arrangement of the desired matrix:
# Reshaping a vector into a matrix:
(m0 <- matrix(data = 1:20, nrow = 5))
#> [,1] [,2] [,3] [,4]
#> [1,] 1 6 11 16
#> [2,] 2 7 12 17
#> [3,] 3 8 13 18
#> [4,] 4 9 14 19
#> [5,] 5 10 15 20
When only nrow is specified, the number of columns is derived from the length of data (here: 20 / 5 = 4). Alternatively, we can set the number of columns explicitly (and let R derive the number of rows):
(m1 <- matrix(data = 1:20, ncol = 4))
#> [,1] [,2] [,3] [,4]
#> [1,] 1 6 11 16
#> [2,] 2 7 12 17
#> [3,] 3 8 13 18
#> [4,] 4 9 14 19
#> [5,] 5 10 15 20
Similarly, the elements of data are arranged in a by-column fashion by default.
If we want to change this default behavior, we can set the argument byrow = TRUE:
(m2 <- matrix(data = 1:20, nrow = 5, byrow = TRUE))
#> [,1] [,2] [,3] [,4]
#> [1,] 1 2 3 4
#> [2,] 5 6 7 8
#> [3,] 9 10 11 12
#> [4,] 13 14 15 16
#> [5,] 17 18 19 20
Alternatively, a matrix can be created by binding multiple vectors together, provided that they have the same data type. Depending on the desired arrangement of vectors, we have two binding functions:
The cbind() function treats each vector as a column; the rbind() function treats each vector as a row:
# Creating 3 vectors:
x <- 1:3
y <- 4:6
z <- 7:9
# Combining vectors (of the same length): ----
(m3 <- cbind(x, y, z)) # combine as columns
#> x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
(m4 <- rbind(x, y, z)) # combine as rows
#> [,1] [,2] [,3]
#> x 1 2 3
#> y 4 5 6
#> z 7 8 9
The resulting matrices m3 and m4 differ not only in the arrangement of their elements (by-column vs. by-row), but also in their names:
Combining vectors as columns uses the vector names as column names, whereas
combining vectors as rows uses the vector names as row names.
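The row and column names of a matrix can be inspected and changed by the base R functions rownames() and colnames(). The following sketch uses two short vectors; the object names m_col and m_row and the labels "first" and "second" are made up for illustration:

```r
x <- 1:3
y <- 4:6

m_col <- cbind(x, y)   # vectors become columns
m_row <- rbind(x, y)   # vectors become rows

colnames(m_col)        # "x" "y"
rownames(m_col)        # NULL (no row names)
rownames(m_row)        # "x" "y"

# Names can be changed and used for indexing:
colnames(m_col) <- c("first", "second")
m_col[ , "second"]     # 4 5 6
```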
In summary, the matrix() function turns existing data into a matrix, with arguments for the number of rows nrow, the number of columns ncol, and a logical argument byrow that arranges data in a by-row vs. by-column fashion.
By contrast, the cbind() and rbind() functions combine multiple vectors into a matrix.
Note that reshaping data into a rectangle is subject to multiple constraints.
For instance, what happens when we specify arguments for both nrow and ncol, but the product nrow * ncol does not match the number of elements in data?
And what happens when the vectors that we want to bind into a matrix are not of the same length or type?
We can find this out by trying:
# Data (as vectors):
v <- 1:10
m <- 1:3
n <- 4:5
o <- letters[1:3]
# Matrices:
matrix(data = v, nrow = 3, ncol = 3)
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
matrix(data = v, nrow = 4, ncol = 4)
#> [,1] [,2] [,3] [,4]
#> [1,] 1 5 9 3
#> [2,] 2 6 10 4
#> [3,] 3 7 1 5
#> [4,] 4 8 2 6
cbind(m, n)
#> m n
#> [1,] 1 4
#> [2,] 2 5
#> [3,] 3 4
rbind(m, n)
#> [,1] [,2] [,3]
#> m 1 2 3
#> n 4 5 4
cbind(m, o)
#> m o
#> [1,] "1" "a"
#> [2,] "2" "b"
#> [3,] "3" "c"
rbind(m, o)
#> [,1] [,2] [,3]
#> m "1" "2" "3"
#> o "a" "b" "c"
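Note that most of these commands created warning messages, as the number of elements did not fit neatly into a matrix (of the specified size). R still evaluated each expression and created a matrix in each case: When there are too few elements, R recycles them from the start; when there are too many, the surplus is dropped; and when vectors of different types are bound together, all values are coerced to a common type (here: character). However, the resulting matrices may not be the ones we expected to obtain.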
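To avoid such surprises, we can compare the number of elements with the requested shape before reshaping, or catch the warning explicitly. The following is only a sketch: the names n_rows, n_cols, and res are hypothetical, and tryCatch() is base R's general mechanism for handling conditions:

```r
v <- 1:10
n_rows <- 3
n_cols <- 3

# Does the data fill the requested shape exactly?
length(v) == n_rows * n_cols   # FALSE: 10 elements vs. 9 cells

# Catch the warning rather than silently accepting the result:
res <- tryCatch(
  matrix(data = v, nrow = n_rows, ncol = n_cols),
  warning = function(w) conditionMessage(w)  # return the warning text
)
is.character(res)   # TRUE if a warning was raised
```

When no warning occurs, res contains the matrix; otherwise it contains the text of the warning message.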
The matrices m0 to m4 all contained numeric data.
However, data of type “logical” or “character” can also be stored in matrix form:
# A matrix of logical values:
(m5 <- matrix(data = 1:18 %% 4 == 0, nrow = 3, ncol = 6, byrow = TRUE))
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] FALSE FALSE FALSE TRUE FALSE FALSE
#> [2,] FALSE TRUE FALSE FALSE FALSE TRUE
#> [3,] FALSE FALSE FALSE TRUE FALSE FALSE
# A matrix of character values:
(m6 <- matrix(sample(letters, size = 16), nrow = 4, ncol = 4, byrow = FALSE))
#> [,1] [,2] [,3] [,4]
#> [1,] "u" "d" "s" "a"
#> [2,] "j" "c" "z" "f"
#> [3,] "m" "b" "e" "w"
#> [4,] "n" "t" "h" "y"
Thus, as long as the matrix() function receives data of only a single data type and is well-specified (in the sense that the number of elements in data and those specified by nrow or ncol do not contradict each other), the resulting matrices are straightforward.
And if there are conflicts, R still tries to create some sensible data structure, but it takes some expertise to explain and predict what happens in these cases.
The shape and type of matrices
As R matrices are implemented as re-shaped variants of atomic vectors, we can check their properties by the same generic functions. For instance:
mode(m0)
#> [1] "numeric"
typeof(m0)
#> [1] "integer"
length(m0)
#> [1] 20
However, note that adding the 2-dimensional dim attribute (with elements for the number of rows and columns of a matrix) allows for more precise characterizations of matrices that would not work (or return different results) for vectors:
is.vector(m0)
#> [1] FALSE
is.matrix(m0)
#> [1] TRUE
dim(m0)
#> [1] 5 4
Thus, the shape of a matrix is better described by dim(x) than by length(x).
Practice
- What would the following generic functions return for a vector v0?
v0 <- 1:20
is.vector(v0)
is.matrix(v0)
dim(v0)
- If x is a matrix, what is the result of dim(x)[1] * dim(x)[2]? Predict and check this for an example.
(x <- matrix(data = 1:20, nrow = 5))
dim(x)[1] * dim(x)[2]
length(x)
Indexing matrices
Retrieving values from a matrix m works similarly to indexing vectors.
First, we will consider numeric indexing.
Due to the two-dimensional nature of a matrix, we now need to specify two indices in square brackets:
the number of the desired row, and the number of the desired column, separated by a comma.
Thus, to get or change the value of row r and column c of a matrix m, we need to evaluate m[r, c].
Just as with vectors, providing multiple numeric indices selects the corresponding rows or columns.
When the value of r or c is left unspecified, all rows or columns are selected.
# Selecting cells, rows, or columns of matrices: ----
m1[2, 3] # in m1: select row 2, column 3
#> [1] 12
m1[2, ] # in m1: select row 2, all columns
#> [1] 2 7 12 17
m2[3, 1] # in m2: select row 3, column 1
#> [1] 9
m2[ , 1] # in m2: select column 1, all rows
#> [1] 1 5 9 13 17
m3[2, 2:3] # in m3: select row 2, columns 2 to 3
#> y z
#> 5 8
m3[1:3, 2] # in m3: select rows 1 to 3, column 2
#> [1] 4 5 6
m4[2, ] # in m4: select row 2
#> [1] 4 5 6
m4[ , 2] # in m4: select col 2
#> x y z
#> 2 5 8
m4[] # in m4: select all rows and all columns (i.e., all of m4)
#> [,1] [,2] [,3]
#> x 1 2 3
#> y 4 5 6
#> z 7 8 9
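The examples above show that selecting a single row or column returns a plain vector, rather than a matrix. If the result should remain a matrix, the indexing operator accepts the argument drop = FALSE. A minimal sketch (with illustrative object names):

```r
m <- matrix(1:6, nrow = 2)

r1 <- m[1, ]                # a single row, simplified into a vector
r2 <- m[1, , drop = FALSE]  # a single row, kept as a 1 x 3 matrix

is.matrix(r1)   # FALSE
is.matrix(r2)   # TRUE
dim(r2)         # 1 3
```

Keeping the matrix shape can be useful when later code relies on dim(), nrow(), or ncol().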
Similarly, we can extend the notion of logical indexing to matrices:
m4 > 5 # returns a matrix of logical values
#> [,1] [,2] [,3]
#> x FALSE FALSE FALSE
#> y FALSE FALSE TRUE
#> z TRUE TRUE TRUE
typeof(m4 > 5)
#> [1] "logical"
m4[m4 > 5] # indexing of matrices
#> [1] 7 8 6 9
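Logical indexing of a matrix returns a vector of values (collected column by column), so that the positions of these values are lost. If we need the positions, the base R function which() with the argument arr.ind = TRUE returns them as row and column indices. The following sketch assumes a small matrix m like m4 above; the names pos and m_capped are made up for illustration:

```r
m <- rbind(x = 1:3, y = 4:6, z = 7:9)

m[m > 5]                             # values, column by column: 7 8 6 9

pos <- which(m > 5, arr.ind = TRUE)  # positions as a 2-column matrix
pos                                  # columns: row, col

# Logical indexing also works for changing values:
m_capped <- m
m_capped[m_capped > 5] <- 5L         # replace all values above 5 by 5
m_capped
```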
Just as with vectors, we can apply generic functions to matrices, provided that the function is applicable to the type of data stored in the matrix. Typical examples include:
# Applying functions to matrices: ----
is.matrix(m1)
#> [1] TRUE
typeof(m2)
#> [1] "integer"
# Note the differences between:
is.numeric(m3) # type of m3? (1 value)
#> [1] TRUE
is.na(m3) # NA values in m3? (many values)
#> x y z
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE
# Computations with matrices:
sum(m1)
#> [1] 210
max(m2)
#> [1] 20
mean(m3)
#> [1] 5
colSums(m3) # column sums of m3
#> x y z
#> 6 15 24
rowSums(m4) # row sums of m4
#> x y z
#> 6 15 24
Note that some of these functions required the matrices m1 to m4 to be of specific data types and would not have worked for matrices of another type.
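For instance, arithmetic functions like sum() require numeric (or logical) data, whereas functions that only concern the shape of a matrix work for any type. A short sketch (with illustrative matrices m_num and m_chr) that catches the resulting error by tryCatch():

```r
m_num <- matrix(1:4, nrow = 2)
m_chr <- matrix(letters[1:4], nrow = 2)

sum(m_num)    # 10
dim(m_chr)    # 2 2 (shape functions work for any type)

# sum() is not defined for character data:
res <- tryCatch(sum(m_chr), error = function(e) "error")
res           # "error"
```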
Regarding the shape of matrices, we saw above that dim(x) provides more specific information about the shape of a matrix x than length(x). As the result of dim(x) is a vector (with elements specifying the number of rows and columns of x), there exist specializations that directly yield the number of rows or columns:
length(m1) # length (of vector)
#> [1] 20
dim(m1) # dimensions as vector c(rows, columns)
#> [1] 5 4
nrow(m1) # number of rows
#> [1] 5
ncol(m1) # number of columns
#> [1] 4
Another common function in the context of matrices is t() for transposing (i.e., swapping the rows and columns of) a matrix:
t(m4)
#> x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
t(m5)
#> [,1] [,2] [,3]
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE TRUE FALSE
#> [3,] FALSE FALSE FALSE
#> [4,] TRUE FALSE TRUE
#> [5,] FALSE FALSE FALSE
#> [6,] FALSE TRUE FALSE
Practice
Assuming the definitions of m5 and m6 from above, predict, evaluate, and explain the result of the following expressions:
m5[2, 6]
m5[2, ]
m5 == FALSE
sum(m5)
t(t(m5))
m6[2, 3]
m6[ , 4]
m6[nrow(m6), (ncol(m6) - 1)]
m6 == "e"
toupper(m6[4, ])
Arrays and tables
Table 1.1 (in Section 1.2.1) mentioned arrays and tables as data structures for storing homogeneous data with more than two dimensions in R. Essentially, objects of type array are generalizations of vectors and matrices that allow for multiple dimensions (i.e., more complex shapes). By contrast, R objects of type table store the frequency counts of factor combinations (also known as a “contingency table”) and thus are a special kind of array.
As arrays and tables are mostly used for storing multi-dimensional data (with 3 or more dimensions), we do not cover them here. However, much of what we know about vectors and matrices can be generalized to arrays and tables.
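To illustrate this generalization, the following brief sketch creates a 3-dimensional array and a simple frequency table by the base R functions array() and table(). The object names a, answers, and counts are only illustrative:

```r
# A 3-dimensional array (2 rows x 3 columns x 4 layers):
a <- array(1:24, dim = c(2, 3, 4))
dim(a)        # 2 3 4
a[2, 3, 4]    # indexing with 3 indices: 24

# A frequency table (counts per category):
answers <- c("yes", "no", "yes", "yes")
counts <- table(answers)
counts        # no: 1, yes: 3
```

As with matrices, indexing an array requires one index per dimension.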
1.5.2 Rectangular tables (data frames and tibbles)
As matrices (and vectors) contain data of only one type (e.g., all cells contain numeric, character, or logical data), we need another data structure for heterogeneous data. As we will see, rectangular tables are the most frequent way of storing data throughout this book.
Why do we write “rectangular table,” rather than just “table?”
Somewhat confusingly, R provides not only one, but several variants of rectangular tables (i.e., data structures in the shape of a rectangle).
And unfortunately, the base R data structure table refers to yet another, \(n\)-dimensional data structure that was briefly mentioned above as a special type of array (see the end of Section 1.5.1 and ?base::table for details).
Additionally, the R package data.table (Dowle & Srinivasan, 2021) provides considerable extensions of R’s data.frame construct and allows faster transformations of large data files.
The ubiquity and ambiguity of the “table” concept is why we often use the clumsy term “rectangular table,” rather than just “table,” when referring to the most common shape of data. However, at this early point in our R careers, we can assume that the term “table” will primarily refer to a data frame or tibble.
The need for storing heterogeneous data is nothing exotic or unusual. In fact, even the most simple datasets require mixing multiple types of data. For instance, imagine that we want to store a dataset that contains basic information on a group of people:
- their names,
- their gender,
- their age (in years),
- their height (in cm).
Each of these four variables can be stored as a vector (the first two of type character, the others of type numeric).
To store all four variables in a single data structure, we can combine the four vectors into a rectangular table.
In R, the four vectors form the columns of a rectangular table, rather than its rows.
The most common rectangular data structures in R are data frames, whereas so-called tibbles are a simpler version of a data frame that is used in the tidyverse (see Chapter 5 on Tibbles). Here is how we can describe some people on four dimensions (also known as variables) by creating four short vectors and combining them into a data frame:
# Create some vectors (of different types, but same length): -----
name <- c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno")
gender <- c("male", "female", "female", "female", "female", "male", "male")
age <- c(21, 23, 22, 19, 21, 18, 24)
height <- c(165, 170, 168, 172, 158, 185, 182)
# Combine 4 vectors (of equal length) into a data frame:
df <- data.frame(name, gender, age, height,
stringsAsFactors = TRUE)
df # Note: Vectors are the columns of the data frame!
#> name gender age height
#> 1 Adam male 21 165
#> 2 Bertha female 23 170
#> 3 Cecily female 22 168
#> 4 Dora female 19 172
#> 5 Eve female 21 158
#> 6 Nero male 18 185
#> 7 Zeno male 24 182
The created data frame (named df) is a two-dimensional object consisting of 7 rows (cases/observations) and 4 columns (variables/measures).
As with all data structures, we can apply a range of functions to obtain information about the df object:
is.matrix(df)
#> [1] FALSE
is.data.frame(df)
#> [1] TRUE
# What are the dimensions of df?
dim(df) # 7 rows (cases) x 4 columns (variables)
#> [1] 7 4
# Note:
# sum(df) # would yield an error
We can easily turn any data frame into a tibble by using the as_tibble() function of the tidyverse package tibble (Müller & Wickham, 2020):
# Turn df into a tibble tb:
tb <- tibble::as_tibble(df)
dim(tb) # 7 cases (rows) x 4 variables (columns), as df
#> [1] 7 4
At this point, a tibble is just a simpler and more convenient type of data frame.
One advantage of tibbles is that they can be printed more easily to the screen.
For instance, printing a tibble always shows its dimensions (as in dim(tb)) and the data type of each of its variables (i.e., each column):
tb # print tb
#> # A tibble: 7 × 4
#> name gender age height
#> <fct> <fct> <dbl> <dbl>
#> 1 Adam male 21 165
#> 2 Bertha female 23 170
#> 3 Cecily female 22 168
#> 4 Dora female 19 172
#> 5 Eve female 21 158
#> 6 Nero male 18 185
#> 7 Zeno male 24 182
We will learn more about tibbles later (see Chapter 5 on Tibbles).
Please remember: Both data frames and tibbles have columns — rather than rows — that consist of atomic and linear vectors. The fact that vectors are homogeneous data constructs (see Table 1.1) is the reason for referring to the columns of a data frame as its variables. By contrast, the rows of a data frame can contain heterogeneous data types and are referred to as its cases or observations.
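The difference between homogeneous columns and heterogeneous rows can be checked with a small example. The following sketch uses a hypothetical data frame people (to leave df unchanged) and the base R function sapply() to apply class() to each column:

```r
people <- data.frame(name = c("Adam", "Bertha"),
                     age = c(21, 23),
                     stringsAsFactors = FALSE)

# Each column is a vector of a single type:
sapply(people, class)   # name: "character", age: "numeric"
is.vector(people$age)   # TRUE

# A row mixes types and therefore remains a data frame:
row_1 <- people[1, ]
is.data.frame(row_1)    # TRUE
```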
Practice
- Re-thinking rectangular tables:
- If rectangular tables (i.e., of type data.frame or tibble) consist of columns of variables (vectors), what happens when we provide a numeric index to such tables?
Solution
Let’s try to apply numeric indices to rectangular tables (e.g., our data frame df and tibble tb):
df[1]
#> name
#> 1 Adam
#> 2 Bertha
#> 3 Cecily
#> 4 Dora
#> 5 Eve
#> 6 Nero
#> 7 Zeno
df[c(1, 3)]
#> name age
#> 1 Adam 21
#> 2 Bertha 23
#> 3 Cecily 22
#> 4 Dora 19
#> 5 Eve 21
#> 6 Nero 18
#> 7 Zeno 24
tb[1:3]
#> # A tibble: 7 × 3
#> name gender age
#> <fct> <fct> <dbl>
#> 1 Adam male 21
#> 2 Bertha female 23
#> 3 Cecily female 22
#> 4 Dora female 19
#> 5 Eve female 21
#> 6 Nero male 18
#> 7 Zeno male 24
Answer: Numeric indexing of a rectangular table (with only one index) selects variables, but returns them as columns of a table, rather than as vectors. The reason for this is that rectangular tables are implemented as lists, with each column as one of their elements.
- Family table:
Create a data frame or tibble that contains the names, ages, and family relations of your (or some famous) family (including at least three generations). What are the dimensions of the resulting table and the data types of all variables involved?
1.5.3 Working with rectangular tables
As rectangular tables (of type data.frame or tibble) are two-dimensional data structures, we also need two numeric index values to denote specific cells: A 1st index for specifying rows (or cases), and a 2nd index for specifying columns (or variables). With two indices, selecting cells, rows (cases), or columns (variables) of a data frame or tibble works just like selecting the corresponding cells, rows, or columns in matrices:
# Selecting cells, rows or columns: -----
df[5, 3] # cell in row 5, column 3: 21 (age of Eve)
#> [1] 21
df[6, ] # row 6
#> name gender age height
#> 6 Nero male 18 185
df[ , 4] # column 4
#> [1] 165 170 168 172 158 185 182
In addition to numeric indexing (i.e., selecting cells, rows, or columns by their numeric indices), we can also select columns by their name. In case we do not know the variable names of a table, we can use the names() function to obtain them (as a vector):
names(df) # yields the names of all variables (columns)
#> [1] "name" "gender" "age" "height"
names(df)[4] # the name of the 4th variable
#> [1] "height"
When selecting a column/variable of a rectangular table by its name, we can combine the name of the data structure (e.g., df) with the $ operator, followed by the name of the variable to be selected:
# Selecting variables (columns) by their name (with the $ operator):
df$gender # returns gender vector
#> [1] male female female female female male male
#> Levels: female male
df$age # returns age vector
#> [1] 21 23 22 19 21 18 24
As a rectangular table was created by combining atomic vectors (into the columns of df or tb), selecting a variable by its name yields a vector again. Thus, we can apply any vector function to the variables (columns/vectors) of a rectangular table:
# Applying functions to columns of df:
df$gender == "male"
#> [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE
sum(df$gender == "male") # Note: TRUE is 1, FALSE is 0
#> [1] 3
df$age < 21
#> [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
df$age[df$age < 21] # Which age values are below 21?
#> [1] 19 18
df$name[df$age < 21] # What are the names of people below 21?
#> [1] Dora Nero
#> Levels: Adam Bertha Cecily Dora Eve Nero Zeno
mean(df$height)
#> [1] 171.4286
df$height < 170
#> [1] TRUE FALSE TRUE FALSE TRUE FALSE FALSE
df$gender[df$height < 170]
#> [1] male female female
#> Levels: female male
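Besides the $ operator, columns can also be selected by name within brackets. As data frames are implemented as lists of columns, single brackets return a (smaller) data frame, whereas double brackets return the column as a vector. The following sketch uses a hypothetical data frame people and an illustrative variable var:

```r
people <- data.frame(name = c("Adam", "Bertha", "Cecily"),
                     age = c(21, 23, 22))

is.list(people)          # TRUE: a data frame is a list of columns

people["age"]            # single brackets: a data frame with 1 column
people[["age"]]          # double brackets: the column as a vector

identical(people$age, people[["age"]])   # TRUE

# Double brackets also work when the name is stored in a variable:
var <- "age"
mean(people[[var]])      # 22
```

The last variant can be convenient when column names are not known in advance (e.g., inside functions).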
Adding new variables to a rectangular table is easy: To create a new variable x of a given table t, we simply assign something (a vector of length nrow(t)) to a new variable name (using the t$x notation). The variable type of x depends on our assignment:
df
#> name gender age height
#> 1 Adam male 21 165
#> 2 Bertha female 23 170
#> 3 Cecily female 22 168
#> 4 Dora female 19 172
#> 5 Eve female 21 158
#> 6 Nero male 18 185
#> 7 Zeno male 24 182
dim(df) # 7 cases (rows) x 4 variables (columns)
#> [1] 7 4
nrow(df) # 7 rows
#> [1] 7
# Create a new variable:
df$may_drink <- rep(NA, nrow(df)) # initialize a new variable (column) with unknown (NA) values
df # => may_drink was added as a new column to df, all instances are NA
#> name gender age height may_drink
#> 1 Adam male 21 165 NA
#> 2 Bertha female 23 170 NA
#> 3 Cecily female 22 168 NA
#> 4 Dora female 19 172 NA
#> 5 Eve female 21 158 NA
#> 6 Nero male 18 185 NA
#> 7 Zeno male 24 182 NA
# Assign values: A person may drink (alcohol, in the US),
df$may_drink <- (df$age >= 21) # if s/he is 21 (or older)
df
#> name gender age height may_drink
#> 1 Adam male 21 165 TRUE
#> 2 Bertha female 23 170 TRUE
#> 3 Cecily female 22 168 TRUE
#> 4 Dora female 19 172 FALSE
#> 5 Eve female 21 158 TRUE
#> 6 Nero male 18 185 FALSE
#> 7 Zeno male 24 182 TRUE
# Note:
# - we did not use an if-then statement
# - we did not specify separate TRUE vs. FALSE cases
# - we can assign and set new variables in 1 step:
df$is_female <- (df$gender == "female")
df
#> name gender age height may_drink is_female
#> 1 Adam male 21 165 TRUE FALSE
#> 2 Bertha female 23 170 TRUE TRUE
#> 3 Cecily female 22 168 TRUE TRUE
#> 4 Dora female 19 172 FALSE TRUE
#> 5 Eve female 21 158 TRUE TRUE
#> 6 Nero male 18 185 FALSE FALSE
#> 7 Zeno male 24 182 TRUE FALSE
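New variables need not be logical. The following sketch (using a hypothetical data frame people and made-up variable names) computes a numeric variable from an existing one, adds a constant by recycling a single value, and removes a variable by assigning NULL to it:

```r
people <- data.frame(name = c("Adam", "Bertha"),
                     height = c(165, 170))

# A numeric variable computed from an existing one:
people$height_m <- people$height / 100   # height in meters
people$height_m                          # 1.65 1.70

# A single value is recycled to all rows:
people$group <- "A"
people$group                             # "A" "A"

# Assigning NULL removes a variable:
people$group <- NULL
names(people)                            # "name" "height" "height_m"
```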
Practice
- Add two more logical variables to df: A variable is_tall that is TRUE if and only if someone is taller than 170 cm, and a variable short_name that is TRUE if and only if someone’s name is not more than four characters long.
Hint: The nchar() function yields the length of a character string. As df$name is a factor, it needs to be converted by as.character() first.
Subsetting rectangular tables
An alternative way for selecting a subset of a table is provided by the subset() function, which typically takes the form subset(x, condition), where x is some table of data and condition is a logical test that imposes criteria on one or more variables in x.
For instance, if we wanted to select the cases (rows) of df that satisfy certain requirements regarding age or gender, we could specify those as follows:
# Subsetting by a condition:
subset(df, age > 20)
#> name gender age height may_drink is_female
#> 1 Adam male 21 165 TRUE FALSE
#> 2 Bertha female 23 170 TRUE TRUE
#> 3 Cecily female 22 168 TRUE TRUE
#> 5 Eve female 21 158 TRUE TRUE
#> 7 Zeno male 24 182 TRUE FALSE
subset(df, gender == "male")
#> name gender age height may_drink is_female
#> 1 Adam male 21 165 TRUE FALSE
#> 6 Nero male 18 185 FALSE FALSE
#> 7 Zeno male 24 182 TRUE FALSE
# Multiple conditions:
subset(df, age > 20 | gender == "male") # logical OR
#> name gender age height may_drink is_female
#> 1 Adam male 21 165 TRUE FALSE
#> 2 Bertha female 23 170 TRUE TRUE
#> 3 Cecily female 22 168 TRUE TRUE
#> 5 Eve female 21 158 TRUE TRUE
#> 6 Nero male 18 185 FALSE FALSE
#> 7 Zeno male 24 182 TRUE FALSE
subset(df, age > 20 & gender == "male") # logical AND
#> name gender age height may_drink is_female
#> 1 Adam male 21 165 TRUE FALSE
#> 7 Zeno male 24 182 TRUE FALSE
However, note that the subset() function is only a convenient way of indexing by [...].
We can re-write the four commands from above as:
# Subsetting by a condition:
df[df$age > 20, ]
#> name gender age height may_drink is_female
#> 1 Adam male 21 165 TRUE FALSE
#> 2 Bertha female 23 170 TRUE TRUE
#> 3 Cecily female 22 168 TRUE TRUE
#> 5 Eve female 21 158 TRUE TRUE
#> 7 Zeno male 24 182 TRUE FALSE
df[df$gender == "male", ]
#> name gender age height may_drink is_female
#> 1 Adam male 21 165 TRUE FALSE
#> 6 Nero male 18 185 FALSE FALSE
#> 7 Zeno male 24 182 TRUE FALSE
# Multiple conditions:
df[df$age > 20 | df$gender == "male", ] # logical OR
#> name gender age height may_drink is_female
#> 1 Adam male 21 165 TRUE FALSE
#> 2 Bertha female 23 170 TRUE TRUE
#> 3 Cecily female 22 168 TRUE TRUE
#> 5 Eve female 21 158 TRUE TRUE
#> 6 Nero male 18 185 FALSE FALSE
#> 7 Zeno male 24 182 TRUE FALSE
df[df$age > 20 & df$gender == "male", ] # logical AND
#> name gender age height may_drink is_female
#> 1 Adam male 21 165 TRUE FALSE
#> 7 Zeno male 24 182 TRUE FALSE
As using subset() can have unanticipated consequences, its help page even contains the warning “This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [ ….” (see note 25).
Thus, indexing data structures by [] is both safer and more general.
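One detail in which both approaches differ concerns missing values: subset() drops cases for which the condition is NA, whereas logical indexing by [] returns a row of NA values for them (unless the condition is wrapped in which()). The following sketch illustrates this with a hypothetical data frame people and a made-up helper function older_than():

```r
people <- data.frame(name = c("Adam", "Bertha", "Cecily"),
                     age = c(21, NA, 19))

subset(people, age > 20)            # 1 row: NA counts as FALSE
people[people$age > 20, ]           # 2 rows: includes a row of NAs
people[which(people$age > 20), ]    # 1 row: which() ignores NA

# Indexing by [] does not rely on how variable names are evaluated,
# which makes it easy to use inside functions:
older_than <- function(data, min_age) {
  data[which(data$age > min_age), ]
}
older_than(people, 20)              # 1 row (Adam)
```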
1.5.4 Changing variable types
When working with vectors or rectangles of data, we often need or want to convert the type of a variable into another one. To convert a variable, we simply assign it to itself (so that all its values will be preserved) and wrap a type conversion function (as.character(), as.integer(), as.logical(), as.numeric(), or as.factor()) around it:
levels(df$gender) # currently a so-called "factor" variable
#> [1] "female" "male"
typeof(df$gender) # of type "integer"
#> [1] "integer"
df$gender <- as.character(df$gender) # convert into a character variable
typeof(df$gender) # now of type "character"
#> [1] "character"
df$gender <- as.factor(df$gender) # convert from "character" into a "factor"
df$gender
#> [1] male female female female female male male
#> Levels: female male
typeof(df$gender) # again of type "integer"
#> [1] "integer"
typeof(df$age) # numeric "double"
#> [1] "double"
df$age <- as.integer(df$age) # convert from "double" to "integer"
typeof(df$age) # "integer"
#> [1] "integer"
df$age <- as.numeric(df$age) # convert from "integer" to numeric "double"
typeof(df$age) # numeric "double"
#> [1] "double"
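Not all conversions preserve all values. As factors are stored as integers (see typeof(df$gender) above), converting a factor of number-like labels by as.numeric() yields its internal level codes, rather than the labels. The following sketch (with illustrative objects f and x) shows this and two other cases in which information is lost:

```r
# A factor with number-like labels:
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1: internal level codes
as.numeric(as.character(f))   # 10 20 10: the labels as numbers

# Converting to integer truncates decimals:
as.integer(3.9)               # 3

# Strings that are not numbers become NA (with a warning):
x <- suppressWarnings(as.numeric(c("1.5", "abc")))
x                             # 1.5 NA
```

Thus, it is a good idea to inspect a variable after converting it.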
Practice
- What happens when you convert a vector v <- 0:3 into logical data by using the as.logical() conversion function? (Predict the outcome, then check it, and verify your understanding by reading the documentation of ?as.logical.)
v <- 0:3
v
as.logical(v)
- What happens when you convert the outcome of as.logical(v) into numeric data by using the as.numeric() conversion function? (Predict the outcome, then check it, and verify your understanding by reading the documentation of ?as.numeric.)
vl <- as.logical(v)
vl
as.numeric(vl)
References
25. For details, see this discussion of non-standard evaluation (in Wickham, 2014a).