## 2.3 Data structures

So far, we used *scalar* objects (i.e., objects with a `length()` of 1).
To combine multiple scalars in one object, we need to construct larger *data structures*.
In R, we distinguish between the following *structures* of data objects (aka. *data shapes*):

1. scalars (i.e., individual objects, vectors of length 1)
2. vectors (one dimension, i.e., 1D)
3. tables (two dimensions, i.e., 2D)
4. arrays (\(n\) dimensions, i.e., \(n\)D)
5. non-rectangular or unstructured data

The key data structures covered in this book are scalars, vectors, and tables (1.–3.). In R, data structures are further distinguished by whether they contain only a single data type or multiple data types. Thus, Table 2.2 distinguishes between data structures for “homogeneous” vs. “heterogeneous” data types.

Dimensions | Homogeneous data types | Heterogeneous data types |
---|---|---|
1D | atomic vector | list |
2D | matrix | table (data frame/tibble) |
nD | array | |

Although Table 2.2 contains five different data structures, only two of them are by far the most important ones for our purposes:

- **vectors** are linear (1-dimensional) data structures. So-called *atomic* vectors only contain a single data type and have a length of 1 or more elements.
- **tables** are rectangular (2-dimensional) data structures that can contain data of different types (in different columns). The terms *data frames* and *tibbles* denote two slightly different types of tables.

A good question is: Where are scalar objects in Table 2.2? The answer is: R is a vector-based language. Thus, even scalar objects are represented as (atomic, i.e., homogeneous) vectors (of length \(1\)).
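A quick check illustrates this point. The following is only a minimal sketch (the object name `a` is chosen for illustration, and `is.vector()` is discussed in more detail below):

```
# A single number is a vector of length 1:
a <- 5
length(a)
#> [1] 1
is.vector(a)
#> [1] TRUE
```
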

### 2.3.1 Vectors

Vectors are by far the most common and most important data structure in R. Essentially, a vector is an ordered sequence of elements with three common properties:

- its *type* of elements (tested by `typeof()`);
- its *length* (tested by `length()`);
- optional *attributes* or meta-data (tested by `attributes()`).

More specifically, there are two types of vectors:

- in *atomic vectors*, all elements are of the same type
- in *lists*, elements can have different types

The vast majority of vectors we will encounter are *atomic vectors* (i.e., all elements of the same type), but lists are often used in R for storing a variety of data types in a common object (e.g., in statistical analyses).

Atomic vectors can contain elements of any basic data type (as long as all elements are of the same type).
The simplest way of creating a vector is by using the `c()` function (think *chain*, *combine*, or *concatenate*) on a number of objects:

```
# Create vectors:
v_lg <- c(TRUE, FALSE) # logical vector
v_n1 <- c(1, pi, 4.5) # numeric vector (double)
v_n2 <- c(2L, 3L, 5L) # numeric vector (integer)
v_cr <- c("hi", "Hallo", "salut") # character vector
```

Whenever we encounter a new vector, the first thing to do is to check its type and length:

```
# type:
typeof(v_n1)
#> [1] "double"
typeof(v_cr)
#> [1] "character"
# length:
length(v_lg)
#> [1] 2
length(v_n2)
#> [1] 3
```

Beyond these elementary functions, the majority of functions in R can be applied to vectors.
However, most functions require a particular data type to work properly.
For instance, a common operation on an existing vector consists in *sorting* its elements, which is achieved by the `sort()` function (which returns a sorted copy and leaves the original vector unchanged).
An argument `decreasing` is set to `FALSE` by default, but can be set to `TRUE` if sorting in decreasing order is desired:

```
x <- c(4, 6, 2)

sort(x)
#> [1] 2 4 6
sort(x, decreasing = TRUE)
#> [1] 6 4 2
```

What happens when we apply `sort()` to other data types?

```
y <- c(TRUE, FALSE, TRUE, FALSE)
sort(y)
#> [1] FALSE FALSE TRUE TRUE
z <- c("A", "N", "T")
sort(z, decreasing = TRUE)
#> [1] "T" "N" "A"
```

This shows that generic R functions like `sort()` often work with multiple data types.
However, many functions simply require specific data types and would not work with others.
For instance, as most mathematical functions require numeric objects to work, the following would create an error:

`sum("A", "B", "C") # would yield an error`

However, remember that vectors of logical values can be interpreted as numbers (`FALSE` as 0 and `TRUE` as 1):

```
v_lg2 <- c(FALSE, TRUE, FALSE)
v_nm2 <- c(4, 5)

c(v_lg2, v_nm2)
#> [1] 0 1 0 4 5
mean(v_lg2)
#> [1] 0.3333333
```
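Interpreting logical values as numbers is useful for counting. As a minimal sketch (the vector `x` and the threshold of 3 are chosen for illustration), applying `sum()` or `mean()` to the result of a logical test yields the number or the proportion of elements for which the test is `TRUE`:

```
# Counting and computing proportions via logical tests:
x <- c(4, 6, 2)
x > 3       # a test that yields a logical vector
#> [1]  TRUE  TRUE FALSE
sum(x > 3)  # how many elements exceed 3?
#> [1] 2
mean(x > 3) # which proportion of elements exceeds 3?
#> [1] 0.6666667
```
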

As *attributes* are optional, most vectors have no attributes:

```
v_n2
#> [1] 2 3 5
attributes(v_n2)
#> NULL
```

The most common attribute of a vector \(v\) is the *names* of its elements, which can be set or retrieved by `names(v)`:

```
# Setting names:
names(v_n2) <- c("A", "B", "C")
names(v_cr) <- c("en", "de", "fr")
# Getting names:
names(v_n2)
#> [1] "A" "B" "C"
```

Other attributes can be defined as name-value pairs (using `attr(v, name) <- value`) and inspected by `attributes()`, `str()`, or `structure()`:

```
# Adding attributes:
attr(v_cr, "my_dictionary") <- "Words to greet people"
# Viewing attributes:
attributes(v_n2)
#> $names
#> [1] "A" "B" "C"
attributes(v_cr)
#> $names
#> [1] "en" "de" "fr"
#>
#> $my_dictionary
#> [1] "Words to greet people"
# Inspecting a vector's structure:
str(v_cr)
#> Named chr [1:3] "hi" "Hallo" "salut"
#> - attr(*, "names")= chr [1:3] "en" "de" "fr"
#> - attr(*, "my_dictionary")= chr "Words to greet people"
structure(v_cr)
#> en de fr
#> "hi" "Hallo" "salut"
#> attr(,"my_dictionary")
#> [1] "Words to greet people"
```

There exists an `is.vector()` function in R, but it does not only test if an object is a vector.
Instead, it returns `TRUE` only if the object is a vector with no attributes other than names.

To test if an object `v` actually *is* a vector, we can use `is.atomic(v) | is.list(v)` (i.e., test if it is an atomic vector or a list) or use an auxiliary `is_vector()` function of various packages (e.g., **purrr**):

```
# (1) A vector with only names:
is.vector(v_n2)
#> [1] TRUE
# (2) A vector with other attributes:
is.vector(v_cr)
#> [1] FALSE
is.atomic(v_cr)
#> [1] TRUE
purrr::is_vector(v_cr)
#> [1] TRUE
```

#### Combining vectors

The `c()` function used to combine objects into vectors can also be used to combine scalars and vectors, or multiple vectors:

```
# Combining scalar objects and vectors (defined above):
v1 <- 1
v2 <- c(2, 3)
v3 <- c(4, 5)

v4 <- c(v1, v2, v3) # but the result is only 1 vector, not 2 or 3:
v4
#> [1] 1 2 3 4 5
```

Note that the new vector `v4` is still a simple (flat) vector, rather than a vector containing other vectors (i.e., `c()` *flattens* hierarchical vector structures into vectors).

#### Coercion of data types

When combining different data types, they are *coerced* into a single data type.
The result is either a numeric vector (when mixing truth values and numeric objects) or a character vector (when mixing anything with characters):

```
# Combining different data types:
x <- c(TRUE, 2L, 3.0) # logical, integer, double
x
#> [1] 1 2 3
typeof(x)
#> [1] "double"
y <- c(TRUE, "two") # logical, character
y
#> [1] "TRUE" "two"
typeof(y)
#> [1] "character"
z <- c(TRUE, 2, "three") # logical, numeric, character
z
#> [1] "TRUE" "2" "three"
typeof(z)
#> [1] "character"
```
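Coercion can also be requested explicitly. The following sketch uses the base R functions `as.numeric()` and `as.character()`, which are not discussed in the text above, and contrasts `c()` with `list()`, which keeps the type of each element (the object name `l` is chosen for illustration):

```
# Explicit coercion (with as.*() functions):
as.numeric(c(TRUE, FALSE))
#> [1] 1 0
as.character(c(1, 2, 3))
#> [1] "1" "2" "3"
as.numeric(c("1", "2", "three")) # "three" cannot be interpreted as a number
#> [1]  1  2 NA
#> Warning message: NAs introduced by coercion

# A list avoids coercion by keeping the type of each element:
l <- list(TRUE, 2, "three")
sapply(l, typeof)
#> [1] "logical"   "double"    "character"
```
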

#### Vector creation functions

The `c()` function is used for combining existing vectors.
However, for creating vectors that contain more than just a few elements (i.e., vectors with larger `length()` values), using the `c()` function and then typing all vector elements becomes impractical.
Useful functions and shortcuts to generate continuous or regular sequences are the colon operator `:`, and the functions `seq()` and `rep()`:

`m:n` generates a numeric sequence (in steps of \(1\) or \(-1\)) from `m` to `n`:

```
# Colon operator (with by = 1):
s1 <- 0:10
s1
#> [1] 0 1 2 3 4 5 6 7 8 9 10
s2 <- 10:0
all.equal(s1, rev(s2))
#> [1] TRUE
```

`seq()` generates numeric sequences from an initial number `from` to a final number `to` and allows either setting the step-width `by` or the length of the sequence `length.out`:

```
# Sequences with seq():
s3 <- seq(0, 10, 1) # is short for:
s3
#> [1] 0 1 2 3 4 5 6 7 8 9 10
s4 <- seq(from = 0, to = 10, by = 1)
all.equal(s3, s4)
#> [1] TRUE
all.equal(s1, s3)
#> [1] TRUE
# Note: seq() is more flexible:
s5 <- seq(0, 10, by = 2.5) # set step size
s5
#> [1] 0.0 2.5 5.0 7.5 10.0
s6 <- seq(0, 10, length.out = 5) # set output length
all.equal(s5, s6)
#> [1] TRUE
```

`rep()` replicates the values provided in its first argument `x` either `times` times or each element `each` times:

```
# Replicating vectors (with rep):
s7 <- rep(c(0, 1), 3) # is short for:
s7
#> [1] 0 1 0 1 0 1
s8 <- rep(x = c(0, 1), times = 3)
all.equal(s7, s8)
#> [1] TRUE
# but differs from:
s9 <- rep(x = c(0, 1), each = 3)
s9
#> [1] 0 0 0 1 1 1
```

Whereas `:` and `seq()` create numeric vectors, `rep()` can be used with other data types:

```
rep(c(TRUE, FALSE), times = 2)
#> [1] TRUE FALSE TRUE FALSE
rep(c("A", "B"), each = 2)
#> [1] "A" "A" "B" "B"
```
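The comparisons above used `all.equal()` rather than `identical()`. A short sketch suggests why: the colon operator yields integers here, whereas `seq()` with a numeric `by` argument yields doubles, so the two sequences contain the same values but differ in type. The sketch also shows that the `each` and `times` arguments of `rep()` can be combined (in which case `each` is applied first):

```
s1 <- 0:10
s3 <- seq(0, 10, 1)
typeof(s1)
#> [1] "integer"
typeof(s3)
#> [1] "double"
identical(s1, s3) # FALSE, as the types differ
#> [1] FALSE
all.equal(s1, s3) # TRUE, as the values are equal
#> [1] TRUE

# Combining each and times in rep():
rep(c("A", "B"), each = 2, times = 2)
#> [1] "A" "A" "B" "B" "A" "A" "B" "B"
```
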

#### Random sampling from a population

A frequent situation when working with R is that we want a sequence of elements (i.e., a vector) that are randomly drawn from a given population. The `sample()` function allows drawing a sample of size `size` from a population `x`.
A logical argument `replace` specifies whether the sample is to be drawn with or without replacement.
Not surprisingly, the population `x` is provided as a vector of elements and the result of `sample()` is another vector of length `size`:

```
# Sampling vector elements (with sample):
sample(x = 1:3, size = 10, replace = TRUE)
#> [1] 1 2 1 2 3 1 2 2 2 2
# Note:
# sample(1:3, 10)
# would yield an error (as replace = FALSE by default).
# Note:
one_to_ten <- 1:10
sample(one_to_ten, size = 10, replace = FALSE) # drawing without replacement
#> [1] 3 5 8 1 6 7 2 10 4 9
sample(one_to_ten, size = 10, replace = TRUE) # drawing with replacement
#> [1] 2 4 3 7 10 2 5 3 1 3
```

As the `x` argument of `sample()` accepts non-numeric vectors, we can use the function to generate sequences of random events. For instance, we can use character vectors to sample sequences of letters or words (which can be used to represent random events):

```
# Random letter/word sequences:
sample(x = c("A", "B", "C"), size = 10, replace = TRUE)
#> [1] "C" "A" "B" "C" "C" "C" "C" "B" "A" "B"
sample(x = c("he", "she", "is", "good", "lucky", "sad"), size = 5, replace = TRUE)
#> [1] "lucky" "she" "good" "is" "good"
# Binary sample (coin flip):
coin <- c("H", "T") # 2 events: Heads or Tails
sample(coin, 5, TRUE) # is short for:
#> [1] "T" "H" "T" "T" "H"
sample(x = coin, size = 5, replace = TRUE) # flip coin 5 times
#> [1] "H" "H" "T" "T" "H"
# Flipping 10,000 coins:
coins_10000 <- sample(x = coin, size = 10000, replace = TRUE) # flip coin 10,000 times
table(coins_10000) # overview of 10,000 flips
#> coins_10000
#> H T
#> 5049 4951
```
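As `sample()` draws randomly, re-running the commands above will typically yield different results. The following sketch relies on two things that are not covered in the text above and are therefore assumptions of this illustration: the base R function `set.seed()` (with an arbitrary seed value of 123) for making random draws reproducible, and the `prob` argument of `sample()` for drawing events with unequal probabilities:

```
# Reproducible sampling (the seed value 123 is arbitrary):
set.seed(123)
draw_1 <- sample(x = 1:6, size = 5, replace = TRUE)
set.seed(123)
draw_2 <- sample(x = 1:6, size = 5, replace = TRUE)
identical(draw_1, draw_2) # same seed, same sample
#> [1] TRUE

# A biased coin (with probabilities of 80% for "H" and 20% for "T"):
biased <- sample(x = c("H", "T"), size = 10, replace = TRUE, prob = c(0.8, 0.2))
length(biased)
#> [1] 10
```
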

#### Accessing and changing vectors

Having found various ways of storing R objects in vectors, we need to ask:

- How can we access, test for, or replace individual vector elements?

These tasks are summarily known as *indexing* or *subsetting*.
As this is an extremely common and important task, there are many ways of accessing and changing vector elements.
We will only cover the two most important ones here (but Chapter 4 Subsetting of Wickham (2019a) lists six different ways):

*Numerical indexing/subsetting* provides a numeric (vector of) value(s) denoting the *position(s)* of the desired elements in a vector in square brackets `[]`. Given a character vector `ABC` (of length 5):

```
ABC <- c("Anna", "Ben", "Cecily", "David", "Eve")
ABC
#> [1] "Anna" "Ben" "Cecily" "David" "Eve"
```

here are two ways of accessing particular elements of this vector:

```
ABC[3]
#> [1] "Cecily"
ABC[c(2, 4)]
#> [1] "Ben" "David"
```

Rather than merely accessing these elements, we can also change these elements by assigning new values to them:

```
ABC[1] <- "Annabelle"
ABC[c(2, 3)] <- c("Benjamin", "Cecilia")
ABC
#> [1] "Annabelle" "Benjamin" "Cecilia" "David" "Eve"
```

Providing negative indices yields all elements of a vector except for the ones at the specified positions:

```
ABC[-1]
#> [1] "Benjamin" "Cecilia" "David" "Eve"
ABC[c(-2, -4, -5)]
#> [1] "Annabelle" "Cecilia"
```

Even providing non-existent or missing (`NA`) indices yields sensible results:

```
ABC[99] # accessing a non-existent position, vs.
#> [1] NA
ABC[NA] # accessing a missing (NA) position
#> [1] NA NA NA NA NA
```

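Note that missing values are contagious in R:
Asking for the `NA`-th element of a vector yields a vector of the same length with only `NA` values (and names).
This happens because a plain `NA` is a logical value, so that it is treated as a logical index (a form of indexing described next) and recycled to the length of the vector.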
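The difference between a logical and a numeric missing index can be made visible with a small sketch. It uses the constant `NA_integer_` (a missing value of type integer), which is part of base R but not mentioned in the text above:

```
ABC <- c("Annabelle", "Benjamin", "Cecilia", "David", "Eve")
typeof(NA)       # a plain NA is of type logical
#> [1] "logical"
ABC[NA]          # logical index: recycled to the length of ABC
#> [1] NA NA NA NA NA
ABC[NA_integer_] # integer index: a single missing position
#> [1] NA
```
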

*Logical indexing/subsetting* provides a logical (vector of) value(s) in square brackets `[]`. The provided vector of `TRUE` or `FALSE` values is typically of the same length as the indexed vector `v`.

For instance, assuming a numeric vector `one_to_ten`:

```
one_to_ten <- 1:10
one_to_ten
#> [1] 1 2 3 4 5 6 7 8 9 10
```

we could select its elements in the first and third position by:

```
one_to_ten[c(TRUE, FALSE, TRUE, FALSE, FALSE,
             FALSE, FALSE, FALSE, FALSE, FALSE)]
#> [1] 1 3
```

The same can be achieved in two steps by defining a vector of logical indices and then using it as an index to our numeric vector `one_to_ten`:

```
my_ix_v <- c(TRUE, FALSE, TRUE, FALSE, FALSE,
             FALSE, FALSE, FALSE, FALSE, FALSE)
one_to_ten[my_ix_v]
#> [1] 1 3
```

Explicitly defining a vector of logical values quickly becomes impractical, especially for longer vectors.
However, the same can be achieved implicitly by using a logical test of the vector `v` as the logical index values of vector `v`:

```
my_ix_v <- (one_to_ten > 5)

one_to_ten[my_ix_v]
#> [1] 6 7 8 9 10
```

Using a test on the *same* vector to generate the indices to a vector is a very powerful tool for getting subsets of a vector (which is why *indexing* is also referred to as *subsetting*).
Essentially, the R expression within the square brackets `[]` asks a question about a vector and the logical indexing construct returns the elements for which this question is answered in the affirmative (i.e., the indexing vector yields `TRUE`).
Here are some examples:

```
one_to_ten[one_to_ten < 3 | one_to_ten > 8]
#> [1] 1 2 9 10
one_to_ten[one_to_ten %% 2 == 0]
#> [1] 2 4 6 8 10
one_to_ten[!is.na(one_to_ten)]
#> [1] 1 2 3 4 5 6 7 8 9 10
ABC[ABC != "Eve"]
#> [1] "Annabelle" "Benjamin" "Cecilia" "David"
ABC[nchar(ABC) == 5]
#> [1] "David"
ABC[substr(ABC, 3, 3) == "n"]
#> [1] "Annabelle" "Benjamin"
```

The `which()` function provides a bridge from logical to numerical indexing, as `which(v)` returns the numeric indices of those elements of a logical vector `v` (typically the result of a logical test) that are `TRUE`:

```
which(one_to_ten > 8)
#> [1] 9 10
which(nchar(ABC) > 7)
#> [1] 1 2
```

Thus, the following expressions use both types of indexing to yield identical results:

```
one_to_ten[which(one_to_ten > 8)] # numerical indexing
#> [1] 9 10
one_to_ten[one_to_ten > 8] # logical indexing
#> [1] 9 10
ABC[which(nchar(ABC) > 7)] # numerical indexing
#> [1] "Annabelle" "Benjamin"
ABC[nchar(ABC) > 7] # logical indexing
#> [1] "Annabelle" "Benjamin"
```

Note that both numerical and logical indexing use square brackets `[]` directly following the name of the object to be indexed.
By contrast, functions always provide their arguments in round parentheses `()`.

#### Example

Suppose we know the following facts about five people:

variable | p_1 | p_2 | p_3 | p_4 | p_5 |
---|---|---|---|---|---|
name | Adam | Ben | Cecily | David | Evelyn |
gender | male | male | female | male | misc |
age | 21 | 19 | 20 | 48 | 45 |

How would we encode this information in R?

Note that we know the same three facts about each person and the leftmost column in Table 2.3 specifies this type of information (i.e., a *variable*).
A straightforward way of representing these facts in R would consist in defining a vector for each variable.

```
name <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age <- c(21, 19, 20, 48, 45)
```

In this solution, we encode the two vectors `name` and `gender` as character data, whereas the vector `age` encodes numeric data.
Note that `gender` is often encoded as numeric values (e.g., as 0 vs. 1) or as logical values (e.g., `female?`: `TRUE` vs. `FALSE`), but this creates problems (or rather incomplete accounts) when there are more than two gender values to consider.^{8}

Equipped with these three vectors, we can now employ numeric and logical indexing to ask and answer a wide range of questions about these people. Note that the three vectors have the same length (as they describe the same set of people). If we assume that a particular position in a vector always refers to the same person, we can use one of the vectors to index the same or any other vector. This is a very common and immensely powerful idea to select vector elements (or here: properties of people) based on their values on other variables.

As an exercise, try predicting the results of the following expressions and describe what we are asking for in each case in your own words (including the type of indexing). Then evaluate each expression to check your prediction.

```
name[c(-1)]
name[gender != "male"]
name[age >= 21]

gender[3:5]
gender[nchar(name) > 5]
gender[age > 30]

age[c(1, 3, 5)]
age[(name != "Ben") & (name != "Cecily")]
age[gender == "female"]
```

Here are the results:

```
name[c(-1)] # get names of all non-first people
#> [1] "Ben" "Cecily" "David" "Evelyn"
name[gender != "male"] # get names of non-male people
#> [1] "Cecily" "Evelyn"
name[age >= 21] # get names of people with an age of 21 or older
#> [1] "Adam" "David" "Evelyn"

gender[3:5] # get 3rd to 5th gender values
#> [1] "female" "male" "misc"
gender[nchar(name) > 5] # get gender of people with a name of more than 5 letters
#> [1] "female" "misc"
gender[age > 30] # get gender of people over 30
#> [1] "male" "misc"

age[c(1, 3, 5)] # get age values of certain positions
#> [1] 21 20 45
age[(name != "Ben") & (name != "Cecily")] # get age of people whose name is not "Ben" and not "Cecily"
#> [1] 21 48 45
age[gender == "female"] # get age values of all people with "female" gender values
#> [1] 20
```

The first command in each triple used numerical indexing, whereas the other two commands in each triple used logical indexing.
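Indexing one vector by a test on another can be combined with the functions for vectors introduced earlier. The following sketch re-creates the three vectors and answers a few additional questions (which are chosen for illustration):

```
name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)

mean(age[gender == "male"]) # mean age of male people
#> [1] 29.33333
name[age == max(age)]       # name of the oldest person
#> [1] "David"
name[which(age < 21)]       # names of people younger than 21
#> [1] "Ben"    "Cecily"
sum(gender != "male")       # number of non-male people
#> [1] 2
```
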

From this example, it is only a small step to study tables.

### 2.3.2 Tables

*Tables* generally store data in a *two-dimensional* (2D) format (i.e., a grid containing rows and columns).
When all rows and all columns have the same length, the resulting structure is *rectangular*.

As Table 2.2 has already shown, we distinguish between two main types of 2D-data structures in R:

- *matrices* are *homogeneous* with respect to their data (i.e., contain only a single data type)
- *tables* (called *data frames* or *tibbles* in R) are *heterogeneous*: They can contain different data types in different columns.

R distinguishes between tables that are *data frames* and tables that are *tibbles*.
But as tibbles are actually another (simpler) type of data frame, we can ignore this distinction here (and will reconsider it when introducing the **tibble** package in Chapter 4).

Another confusing aspect is that the term “table” is sometimes used as a super-category for any rectangular data structure (i.e., including data frames and matrices, e.g., in the title of this section). In R, the flexibility or vagueness of the term “table” is made possible as R defines no corresponding object type. However, it makes sense to distinguish between matrices and data frames, which is why we will discuss these two types of tables next.

#### Matrices

When a rectangle of data contains data of the same type in all cells (i.e., all rows and columns), we get a *matrix* of data.
Matrices can be created from vectors (of the same data type) by binding them together.
The `rbind()` function treats each vector as a row;
the `cbind()` function treats each vector as a column:

```
# Creating 3 vectors:
x <- 1:3
y <- 4:6
z <- 7:9

# Combining vectors (of the same length): ----
m1 <- rbind(x, y, z) # combine as rows
m1
#> [,1] [,2] [,3]
#> x 1 2 3
#> y 4 5 6
#> z 7 8 9
m2 <- cbind(x, y, z) # combine as columns
m2
#> x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
```

A more direct way of creating matrices is the `matrix()` function.
It has arguments for `data`, for the number of rows `nrow`, the number of columns `ncol`, and a logical argument `byrow` that arranges `data` in a by-row vs. by-column fashion:

```
# Putting a vector into a rectangular matrix:
m3 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE)
m3
#> [,1] [,2] [,3] [,4]
#> [1,] 1 2 3 4
#> [2,] 5 6 7 8
#> [3,] 9 10 11 12
#> [4,] 13 14 15 16
#> [5,] 17 18 19 20
m4 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = FALSE)
m4
#> [,1] [,2] [,3] [,4]
#> [1,] 1 6 11 16
#> [2,] 2 7 12 17
#> [3,] 3 8 13 18
#> [4,] 4 9 14 19
#> [5,] 5 10 15 20
```

Note that the following commands all create Warning messages, as the number of elements provided does not fit neatly into a matrix (of the required size). Shorter vectors are recycled and surplus elements are dropped:

```
m <- 1:2
n <- 3:5

rbind(m, n)
#> [,1] [,2] [,3]
#> m 1 2 1
#> n 3 4 5
cbind(m, n)
#> m n
#> [1,] 1 3
#> [2,] 2 4
#> [3,] 1 5
matrix(data = 1:10, nrow = 3, ncol = 3)
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
```

The matrices `m1` to `m4` all contained numeric data.
However, data of type logical or character can also be stored in matrix form:

```
# A matrix of logical values:
m5 <- matrix(data = 1:18 %% 4 == 0, nrow = 3, ncol = 6, byrow = TRUE)
m5
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] FALSE FALSE FALSE TRUE FALSE FALSE
#> [2,] FALSE TRUE FALSE FALSE FALSE TRUE
#> [3,] FALSE FALSE FALSE TRUE FALSE FALSE
# A matrix of character values:
m6 <- matrix(sample(letters, size = 16), nrow = 4, ncol = 4, byrow = FALSE)
m6
#> [,1] [,2] [,3] [,4]
#> [1,] "u" "d" "s" "a"
#> [2,] "j" "c" "z" "f"
#> [3,] "m" "b" "e" "w"
#> [4,] "n" "t" "h" "y"
```

#### Indexing matrices

Retrieving values from a matrix `m` works similarly to indexing vectors.
First, we will consider *numeric* indexing.
Due to the two-dimensional nature of a matrix, we now need to specify *two* indices in square brackets:
the number of the desired row, and the number of the desired column, separated by a comma.
Thus, to get or change the value of row `r` and column `c` of a matrix `m` we need to evaluate `m[r, c]`.
Just as with vectors, providing multiple numeric indices selects the corresponding rows or columns.
When the value of `r` or `c` is left unspecified, *all* rows or columns are selected.

```
# Selecting cells, rows, or columns of matrices: ----
m1[2, 3] # in m1: select row 2, column 3
#> y
#> 6
m2[3, 1] # in m2: select row 3, column 1
#> x
#> 3
m1[2, ] # in m1: select row 2, all columns
#> [1] 4 5 6
m2[ , 1] # in m2: select column 1, all rows
#> [1] 1 2 3
m3
#> [,1] [,2] [,3] [,4]
#> [1,] 1 2 3 4
#> [2,] 5 6 7 8
#> [3,] 9 10 11 12
#> [4,] 13 14 15 16
#> [5,] 17 18 19 20
m3[2, 3:4] # in m3: select row 2, columns 3 to 4
#> [1] 7 8
m3[3:5, 2] # in m3: select rows 3 to 5, column 2
#> [1] 10 14 18
m4[] # in m4: select all rows and all columns (i.e., all of m4)
#> [,1] [,2] [,3] [,4]
#> [1,] 1 6 11 16
#> [2,] 2 7 12 17
#> [3,] 3 8 13 18
#> [4,] 4 9 14 19
#> [5,] 5 10 15 20
```

Similarly, we can extend the notion of logical indexing to matrices:

```
m4 > 10 # returns a matrix of logical values
#> [,1] [,2] [,3] [,4]
#> [1,] FALSE FALSE TRUE TRUE
#> [2,] FALSE FALSE TRUE TRUE
#> [3,] FALSE FALSE TRUE TRUE
#> [4,] FALSE FALSE TRUE TRUE
#> [5,] FALSE FALSE TRUE TRUE
typeof(m4 > 10)
#> [1] "logical"
m4[m4 > 10] # indexing of matrices
#> [1] 11 12 13 14 15 16 17 18 19 20
```
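As with vectors, logical indexing of a matrix can also be used for counting or changing elements. The following sketch re-creates `m4` and works on a copy (the name `m_new` and the replacement value of 0 are chosen for illustration):

```
m4 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = FALSE)
sum(m4 > 10)            # number of cells with values above 10
#> [1] 10
m_new <- m4             # copy, so that m4 remains unchanged
m_new[m_new > 10] <- 0L # replace all values above 10 by 0
m_new[2, ]              # row 2 of m_new
#> [1] 2 7 0 0
m4[2, ]                 # row 2 of m4 (unchanged)
#> [1]  2  7 12 17
```
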

Just as with vectors, we can apply functions to matrices. Typical examples include:

```
# Applying functions to matrices: ----
is.matrix(m1)
#> [1] TRUE
typeof(m2)
#> [1] "integer"
# Note the difference between:
is.numeric(m3) # type of m3? (1 value)
#> [1] TRUE
is.na(m3) # NA values in m3? (many values)
#> [,1] [,2] [,3] [,4]
#> [1,] FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE
#> [5,] FALSE FALSE FALSE FALSE
# Computations with matrices:
sum(m1)
#> [1] 45
max(m2)
#> [1] 9
mean(m3)
#> [1] 10.5
colSums(m3) # column sums of m3
#> [1] 45 50 55 60
rowSums(m4) # row sums of m4
#> [1] 34 38 42 46 50
```

Just as `length()` provides crucial information about a vector, some functions are specifically designed to provide the dimensions of rectangular data structures:

```
ncol(m4) # number of columns
#> [1] 4
nrow(m4) # number of rows
#> [1] 5
dim(m4) # dimensions as vector c(rows, columns)
#> [1] 5 4
```

A typical function in the context of matrices is `t()` for transposing (i.e., swapping the rows and columns of) a matrix:

```
t(m4)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1 2 3 4 5
#> [2,] 6 7 8 9 10
#> [3,] 11 12 13 14 15
#> [4,] 16 17 18 19 20
t(m5)
#> [,1] [,2] [,3]
#> [1,] FALSE FALSE FALSE
#> [2,] FALSE TRUE FALSE
#> [3,] FALSE FALSE FALSE
#> [4,] TRUE FALSE TRUE
#> [5,] FALSE FALSE FALSE
#> [6,] FALSE TRUE FALSE
```

#### Data frames

Table 2.3 was rectangular in containing three rows (values for the variables `name`, `gender`, and `age`)
and five columns (one for each person), plus an initial column indicating the variable name in each row.
This is a perfectly valid table, but not the type of table typically used in R.

Typical tables of data in R also combine several vectors into a larger data structure, but use the individual vectors as columns, rather than rows. Such a combination of several vectors (as columns) is shown in Table 2.4:

name | gender | age |
---|---|---|
Adam | male | 21 |
Ben | male | 19 |
Cecily | female | 20 |
David | male | 48 |
Evelyn | misc | 45 |

Importantly, Table 2.4 provides exactly the same information as Table 2.3 and as the three individual vectors (`name`, `gender`, and `age`) above, but in the shape of a table that uses our previous vectors as its *columns*, rather than as its rows.

As all elements of an (atomic) vector in R need to have the same data type (e.g., `name` contains character data, whereas `age` contains numeric data), the information on each person (which contains multiple data types) cannot be stored as an atomic vector.
Instead, we represent each person as a *row* (aka. an *observation*) of the table.

Creating a data frame from vectors works by using the `data.frame()` function.
The following assigns the resulting data frame to a dummy object `df`, so that we can poke and probe it later:

```
df <- data.frame(name, gender, age)
df
#> name gender age
#> 1 Adam male 21
#> 2 Ben male 19
#> 3 Cecily female 20
#> 4 David male 48
#> 5 Evelyn misc 45
```
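Before indexing a data frame, it helps to check its shape. The following sketch re-creates `df` and applies functions used for vectors and matrices above. It also suggests why data frames appear in the “heterogeneous” column of Table 2.2: internally, a data frame is a list of column vectors.

```
name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)
df <- data.frame(name, gender, age)

dim(df)           # dimensions as c(rows, columns)
#> [1] 5 3
nrow(df)          # number of rows (people)
#> [1] 5
ncol(df)          # number of columns (variables)
#> [1] 3
is.data.frame(df)
#> [1] TRUE
typeof(df)        # a data frame is a list (of column vectors)
#> [1] "list"
length(df)        # its length is the number of columns
#> [1] 3
```
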

As data frames are the most common way of storing data in R, there is a special form of indexing that allows retrieving the variables of a data frame (i.e., the columns of a data frame) as vectors.

#### Name-based indexing of data frames

When a table `tb` has column names (e.g., a column called `nm`), we can retrieve the corresponding vector by *name-based indexing* (aka. *name indexing*). This is the most convenient and most frequent way of accessing variables (i.e., columns) of tables (e.g., data frames).
To use this form of indexing, we use a special dollar sign notation: Adding `$` and the name of the desired variable `nm` to the table’s object name `tb` yields its column `nm` as a vector. This sounds complicated, but is actually very easy:

`tb$nm`

In case of our data frame `df`, we can access its 1st and 2nd columns by their respective names:

```
names(df) # prints the (column) names
#> [1] "name" "gender" "age"
df$name
#> [1] Adam Ben Cecily David Evelyn
#> Levels: Adam Ben Cecily David Evelyn
df$gender
#> [1] male male female male misc
#> Levels: female male misc
```

#### Indexing data frames

Note that everything we have learned about numeric and logical indexing of vectors and matrices (above) also applies to data frames.
Thus, we can also use numerical indexing on a data frame, just as with matrices (above).
For instance, to get all rows of the first column, we can specify the data frame’s name, followed by `[ , 1]`:

```
df[ , 1] # get (all rows and) the 1st column of df
#> [1] Adam Ben Cecily David Evelyn
#> Levels: Adam Ben Cecily David Evelyn
df[ , 2] # get (all rows and) the 2nd column of df
#> [1] male male female male misc
#> Levels: female male misc
```

Thus, these two expressions retrieve the 1st and 2nd column of the data frame `df` (as vectors), respectively.
As this is a very common task in R, the name-based indexing introduced above (e.g., `df$name`) usually provides an easier way of accessing the variables (columns) of a data frame.

Logical indexing on data frames is particularly powerful in allowing us to select particular rows (based on conditions specified on columns of the same data frame):

```
df[df$gender == "male", ]
#> name gender age
#> 1 Adam male 21
#> 2 Ben male 19
#> 4 David male 48
df[df$age < 21, ]
#> name gender age
#> 2 Ben male 19
#> 3 Cecily female 20
```

Note that the different types of indexing can be flexibly combined. For instance, the following command uses

- logical indexing (to select rows of `df` with an age value below 30)
- numerical indexing (to select only columns 1 and 2)
- name indexing (to get the variable `name`, as a vector), and
- numerical indexing (to select the 3rd element of this vector):

```
df[df$age < 30, c(1, 2)]$name[3]
#> [1] Cecily
#> Levels: Adam Ben Cecily David Evelyn
```

In practice, such complex combinations are rarely necessary or useful. For instance, the following expressions retrieve the exact same result as the complex one, but are semantically very different:

```
df[3, 1]
#> [1] Cecily
#> Levels: Adam Ben Cecily David Evelyn
df$name[3]
#> [1] Cecily
#> Levels: Adam Ben Cecily David Evelyn
```
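Columns can also be selected by providing their names as character strings within the square brackets. This is a variant not discussed in the text above and shown here only as a sketch. The examples select the numeric `age` column, so that the results do not depend on how the character variables are stored:

```
name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)
df <- data.frame(name, gender, age)

df[df$gender == "male", "age"] # logical row index, column selected by name
#> [1] 21 19 48
df$age[df$name == "Cecily"]    # name indexing, then logical indexing of a vector
#> [1] 20
df[["age"]]                    # double brackets also return a column as a vector
#> [1] 21 19 20 48 45
df[df$age > 40, c("name", "age")] # several columns by name (yields a data frame)
#>     name age
#> 4  David  48
#> 5 Evelyn  45
```
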

#### Strings as factors

Note that the `data.frame()` function has an argument `stringsAsFactors`.
This argument determines whether so-called *string* variables (i.e., of data type “character”) are converted into *factors* (i.e., categorical variables, which are internally represented as integer values with text labels) when generating a data frame.
To the chagrin of generations of R users, the default of this argument used to be `TRUE` for several decades, which essentially meant that any character variable in a data frame was converted into a factor unless the user had specified `stringsAsFactors = FALSE`. As this caused much confusion, the default has been changed with the release of R version 4.0.0 (on 2020-04-24) to `stringsAsFactors = FALSE`. This shows that the R gods at https://cran.r-project.org/ are responding to user feedback. However, as any such changes are unlikely to happen quickly, it is safer to explicitly set the arguments of a function.
To see the difference between both settings, consider the following example:

```
df_1 <- data.frame(name, gender, age,
                   stringsAsFactors = FALSE) # new default (since R 4.0.0)
df_2 <- data.frame(name, gender, age,
                   stringsAsFactors = TRUE) # old default (up to R 3.6.3)

# Both data frames look identical:
df_1
#> name gender age
#> 1 Adam male 21
#> 2 Ben male 19
#> 3 Cecily female 20
#> 4 David male 48
#> 5 Evelyn misc 45
df_2
#> name gender age
#> 1 Adam male 21
#> 2 Ben male 19
#> 3 Cecily female 20
#> 4 David male 48
#> 5 Evelyn misc 45
```

Printing the two data frames `df_1` and `df_2` shows us no difference between them.
However, as the first two variables (i.e., `name` and `gender`) were string variables (i.e., of type “character”), they remained character variables in `df_1`, but were converted into factors in `df_2`.

Let’s retrieve the first column of each data frame (as a vector).
Using named indexing, we can easily retrieve and print the first column (i.e., with a name of `name`) of either data frame:

```
df_1$name
#> [1] "Adam" "Ben" "Cecily" "David" "Evelyn"
df_2$name
#> [1] Adam Ben Cecily David Evelyn
#> Levels: Adam Ben Cecily David Evelyn
```

Note the differences in the printed outputs.
The output of `df_1$name` looks just like any other character vector (with five elements, each consisting of a name).
By contrast, the output of `df_2$name` prints the same names, but without the characteristic double quotes around each name, and with a second line that starts with “Levels:” and seems to repeat the names of the first line.
Before clarifying what this means, check the other variable in both `df_1` and `df_2` that used to be a character vector (i.e., `gender`):

```
df_1$gender
#> [1] "male"   "male"   "female" "male"   "misc"
df_2$gender
#> [1] male   male   female male   misc
#> Levels: female male misc
```

Again, `df_1$gender` appears to be a character vector, but `df_2$gender` has been converted into something else.
This time, the line beginning with “Levels:” only contains each of the gender labels once, and in alphabetical order.

In case you’re not confused yet, compare the outputs of the following commands:

```
typeof(df_1$name)
#> [1] "character"
typeof(df_2$name)
#> [1] "integer"
```

Whereas `df_1$name` was expected to be of type *character*, it should come as a surprise to see that `df_2$name` is of type *integer*.
Given that `df_2$name` contains *integers*, we might be tempted to try out arithmetic functions like:

```
max(df_2$name)
sum(df_2$name)
mean(df_2$name)
```

If we try to evaluate these expressions, we get either Warnings or Error messages. How can we make sense of all this?

The magic word here is *factor*.
As the argument `stringsAsFactors = TRUE` suggests, the character strings of the `name` and `gender` vectors have been converted into *factors* when defining `df_2`.
Factors are categorical variables that only care about whether two values belong to the same or to different groups.
Actually, R internally encodes them as numeric values (integers) for each factor level. But as we never want to calculate with these numeric values (as they have no meaning beyond being either the same or different), they are also assigned a label, which is shown when printing the values of a factor.
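
The internal encoding of a factor can be made visible with a minimal sketch. It assumes a stand-alone vector `gender_chr` (a hypothetical name, holding the same values as the `gender` variable above) and only uses the base R functions `factor()`, `levels()`, and `as.integer()`:

```r
# Hypothetical stand-alone vector with the same values as the gender variable:
gender_chr <- c("male", "male", "female", "male", "misc")
gender_fct <- factor(gender_chr)  # convert the character vector into a factor

levels(gender_fct)      # labels of the factor levels (sorted alphabetically)
#> [1] "female" "male"   "misc"
as.integer(gender_fct)  # integer codes that R stores for each element
#> [1] 2 2 1 2 3
typeof(gender_fct)      # storage type of a factor
#> [1] "integer"
```

Each integer code is merely the position of an element's label in `levels(gender_fct)`, which is why computing with these codes is rarely meaningful.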

A quick way of checking that we’re dealing with a factor is the `is.factor()` function:

```
is.factor(df_1$name)
#> [1] FALSE
is.factor(df_2$name)
#> [1] TRUE
is.factor(df_1$gender)
#> [1] FALSE
is.factor(df_2$gender)
#> [1] TRUE
```

Factor variables are often useful (e.g., for distinguishing between groups in statistical designs).
But it is premature to assume that any character variable should be a factor when including the variable in a data frame.
Thus, it is a good thing that the default argument in the `data.frame()` function has been changed to `stringsAsFactors = FALSE` in R v4.0.0.

Whoever wants factors can still get and use them — but novice users no longer need to deal with them all the time.
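
One way of obtaining a factor only where it is wanted is to convert an individual column explicitly. The following is a minimal sketch under stated assumptions: `df_demo` is a hypothetical data frame (not the `df_1` from above), and `factor()` is used as one of several possible ways of creating a factor:

```r
# Hypothetical data frame with two character columns:
df_demo <- data.frame(name   = c("Adam", "Ben", "Cecily"),
                      gender = c("male", "male", "female"),
                      stringsAsFactors = FALSE)

# Convert only the gender column into a factor:
df_demo$gender <- factor(df_demo$gender)

is.factor(df_demo$name)    # name remains a character vector
#> [1] FALSE
is.factor(df_demo$gender)  # gender is now a factor
#> [1] TRUE
table(df_demo$gender)      # count the cases per factor level
```

Setting `stringsAsFactors = FALSE` explicitly makes the sketch behave identically in R versions before and after 4.0.0.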

- data frames: `data.frame()` vs. `as.data.frame()`

- tibbles: `as_tibble()` (of the **tibble** package) converts a data frame into a tibble
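
The difference between the two base R functions can be sketched as follows. The objects `name`, `age`, and `m` are hypothetical and only serve as illustration: `data.frame()` builds a table from vectors, whereas `as.data.frame()` converts an existing object (here: a matrix) into a data frame:

```r
# Create a data frame from (hypothetical) vectors:
name <- c("Adam", "Ben")
age  <- c(21, 19)
df_a <- data.frame(name, age)

# Convert an existing (hypothetical) matrix into a data frame:
m    <- matrix(1:6, nrow = 2)
df_b <- as.data.frame(m)

names(df_a)  # column names are taken from the vector names
#> [1] "name" "age"
names(df_b)  # a matrix without column names yields default names
#> [1] "V1" "V2" "V3"
```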

Ways of accessing and manipulating tables:

Applying functions to tables:

- Checking for `NA` values (in vectors or tables) by using the `is.na()` function.
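
As a minimal sketch of such a check, the following code applies `is.na()` to a vector and to a data frame. The objects `v` and `df_na` are hypothetical examples, and `colSums()` is just one possible way of summarizing the result:

```r
# Hypothetical vector and data frame containing missing values:
v     <- c(21, NA, 20, 48, NA)
df_na <- data.frame(name = c("Adam", NA, "Cecily"),
                    age  = c(21, 19, NA))

is.na(v)               # logical vector that marks missing elements
#> [1] FALSE  TRUE FALSE FALSE  TRUE
sum(is.na(v))          # number of missing values in v
#> [1] 2
is.na(df_na)           # logical matrix with one value per cell of the table
colSums(is.na(df_na))  # number of missing values in each column
```

Because `TRUE` counts as 1 and `FALSE` as 0 in arithmetic, summing the result of `is.na()` yields the number of missing values.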

### 2.3.3 Practice

The following practice exercises allow you to check your understanding of this section.

#### Accessing and evaluating matrices

Assuming the definitions of the matrices `m5` and `m6` from above, i.e.,

```
m5
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
#> [1,] FALSE FALSE FALSE  TRUE FALSE FALSE
#> [2,] FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE FALSE FALSE
m6
#>      [,1] [,2] [,3] [,4]
#> [1,] "u"  "d"  "s"  "a"
#> [2,] "j"  "c"  "z"  "f"
#> [3,] "m"  "b"  "e"  "w"
#> [4,] "n"  "t"  "h"  "y"
```

- predict, evaluate, and explain the result of the following R expressions:

```
m5[2, 6]
m5[2, ]
m5 == FALSE
sum(m5)
t(t(m5))

m6[2, 3]
m6[ , 4]
m6[nrow(m6), (ncol(m6) - 1)]
m6 == "e"
toupper(m6[4, ])
```

#### Numeric indexing of data frames

Assuming the data frame `df_2` (from above),

```
df_2
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
```

- predict, evaluate and explain what happens in the following commands (in terms of *numeric* indexing):

```
df_2[ , 1]
df_2[1:nrow(df_2), c(1)]
df_2[nrow(df_2):1, c(1)]
df_2[rep(1, 3), c(1, 2)]
df_2$name[3]

# compare:
df_2[1:nrow(df_2), 1:ncol(df_2)]
df_2[1:nrow(df_2), ncol(df_2):1]
df_2[nrow(df_2):1, ncol(df_2):1]
```

#### Logical indexing of data frames

Assuming the data frame `df_1` (from above),

```
df_1
#>     name gender age
#> 1   Adam   male  21
#> 2    Ben   male  19
#> 3 Cecily female  20
#> 4  David   male  48
#> 5 Evelyn   misc  45
```

- predict, evaluate and explain what happens in the following commands (in terms of *logical* indexing):

```
df_1[ , 3] > 30
df_1[df_1$age > 30, ]
df_1[df_1$gender != "male", c(1, 3, 2)]
df_1$name[df_1$gender == "male"]
sum(df_1$age[df_1$gender == "male"])
```

#### Data frames with factors

- Given that our definition of `df_2` used `stringsAsFactors = TRUE` (see above), predict, evaluate and explain what happens in the following commands:

```
nchar(as.character(df_2$name[3]))
as.numeric(df_2$name[3]) + 1
mean(as.numeric(df_2$name))
```

- Why would the following commands (which are simpler variants of the last three expressions) yield errors or warnings?

```
nchar(df_2$name[3])
df_2$name[3] + 1
mean(df_2$name)
```

- What would happen if the same commands were used on `df_1` (from above)?

```
nchar(df_1$name[3])
df_1$name[3] + 1
mean(df_1$name)
```

Fun fact: The English writer Evelyn Waugh (1903–1966) and his wife Evelyn Gardner (1903–1994) had the same given name.