## 2.3 Vectors

We mentioned above that every data object can be described by its shape and its type.
Whereas we addressed the issue of data types (and the related term of *modes*, see Section 2.2.2), we have not yet discussed the *shape* of data objects.

All objects defined so far shared the same shape:
They were *vectors* that only contained a single element.
In R, vectors of length 1 are known as *scalars*.

Vectors are by far the most common and most important data structure in R. Essentially, a vector is an ordered sequence of elements with three common properties:

- its *type* of elements (tested by `typeof()`);
- its *length* (tested by `length()`);
- optional *attributes* or meta-data (tested by `attributes()`).

More specifically, there are two types of vectors:

- in *atomic vectors*, all elements are of the same type
- in *lists*, elements can have different types

The vast majority of vectors we will encounter are *atomic vectors* (i.e., all elements of the same type), but lists are often used in R for storing a variety of data types in a common object (e.g., in statistical analyses).
It is important to understand that the term “atomic” in “atomic vectors” refers to the *type* of the vector, rather than its *shape* or *length*: An atomic vector can contain one or more elements (i.e., can have different lengths), but all of its elements must be of the same type.
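
The difference between the two kinds of vectors can be sketched as follows. The object names `v_atomic` and `v_list` are chosen for illustration only, and the example assumes the base R functions `c()` (introduced below) and `list()` for creating the two kinds of vectors:

```r
# An atomic vector: all elements share one type.
v_atomic <- c(1.5, 2, 3)
typeof(v_atomic)
#> [1] "double"
length(v_atomic)
#> [1] 3

# A list: elements can have different types.
v_list <- list(1.5, "two", TRUE)
typeof(v_list)
#> [1] "list"
length(v_list)
#> [1] 3
sapply(v_list, typeof)  # type of each list element
#> [1] "double"    "character" "logical"

# Both objects are vectors, but only the first one is atomic:
is.atomic(v_atomic)
#> [1] TRUE
is.atomic(v_list)
#> [1] FALSE
```

Note that both objects have the same length, but differ in whether their elements must share a type.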

How can we create new vectors?
We already encountered a basic way of creating a vector above:
Creating a new data object by assigning a value to an object name (using the `<-` operator).
As any scalar object already *is* a vector (of length 1), we actually are asking:
How can we combine objects or vectors into new vectors?
The simplest way of creating a vector is by using the `c()` function (think *chain*, *combine*, or *concatenate*) on a number of objects:

```
# Create vectors:
v_lg <- c(TRUE, FALSE) # logical vector
v_n1 <- c(1, pi, 4.5) # numeric vector (double)
v_n2 <- c(2L, 3L, 5L) # numeric vector (integer)
v_cr <- c("hi", "Hallo", "salut") # character vector
```

The vectors defined by combining existing vectors with the `c()` function typically are longer vectors than their constituents.

Whenever encountering a new vector, a typical thing to do is testing for its type and its length:

```
# type:
typeof(v_n1)
#> [1] "double"
typeof(v_cr)
#> [1] "character"
# length:
length(v_lg)
#> [1] 2
length(v_n2)
#> [1] 3
```

Beyond these elementary functions, the majority of functions in R can be applied to vectors.
However, most functions require a particular data type to work properly.
For instance, a common operation on vectors consists in *sorting* their elements, which is achieved by the `sort()` function (which returns a sorted copy and leaves the original vector unchanged).
An argument `decreasing` is set to `FALSE` by default, but can be set to `TRUE` if sorting in decreasing order is desired:

```
x <- c(4, 6, 2)
sort(x)
#> [1] 2 4 6
sort(x, decreasing = TRUE)
#> [1] 6 4 2
```

What happens when we apply `sort()` to other data types?

```
y <- c(TRUE, FALSE, TRUE, FALSE)
sort(y)
#> [1] FALSE FALSE  TRUE  TRUE
z <- c("A", "N", "T")
sort(z, decreasing = TRUE)
#> [1] "T" "N" "A"
```

This shows that generic R functions like `sort()` often work with multiple data types.
However, many functions simply require specific data types and would not work with others.
For instance, as most mathematical functions require numeric objects to work, the following would create an error:

`sum("A", "B", "C") # would yield an error`
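
If we want to see that such a call fails without interrupting a script, one option is to wrap it in base R's `tryCatch()`. The following is only a minimal sketch. The fallback string `"sum() failed"` and the object name `res` are made up for this illustration:

```r
# Catch the error instead of stopping the script:
res <- tryCatch(
  sum("A", "B", "C"),
  error = function(e) "sum() failed"  # hypothetical fallback value
)
res
#> [1] "sum() failed"

# The same function works on numeric input:
sum(1, 2, 3)
#> [1] 6
```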

However, remember that vectors of logical values can be interpreted as numbers (`FALSE` as 0 and `TRUE` as 1):

```
v_lg2 <- c(FALSE, TRUE, FALSE)
v_nm2 <- c(4, 5)
c(v_lg2, v_nm2)
#> [1] 0 1 0 4 5
mean(v_lg2)
#> [1] 0.3333333
```
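
A practical consequence of this interpretation is that `sum()` and `mean()` can count and summarize the outcomes of a logical test. A brief sketch, with a vector `v` whose values are invented for this example:

```r
v <- c(3, 8, 1, 9, 6)  # hypothetical example values
v > 5                  # logical test on each element
#> [1] FALSE  TRUE FALSE  TRUE  TRUE
sum(v > 5)             # number of elements above 5
#> [1] 3
mean(v > 5)            # proportion of elements above 5
#> [1] 0.6
```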

As *attributes* are optional, most (atomic) vectors have no attributes:

```
v_n2
#> [1] 2 3 5
attributes(v_n2)
#> NULL
```

The most common attributes of a vector \(v\) are the *names* of its elements, which can be set or retrieved by `names(v)`:

```
# Setting names:
names(v_n2) <- c("A", "B", "C")
names(v_cr) <- c("en", "de", "fr")
# Getting names:
names(v_n2)
#> [1] "A" "B" "C"
```

Other attributes can be defined as name-value pairs (using `attr(v, name) <- value`) and inspected by `attributes()`, `str()`, or `structure()`:

```
# Adding attributes:
attr(v_cr, "my_dictionary") <- "Words to greet people"
# Viewing attributes:
attributes(v_n2)
#> $names
#> [1] "A" "B" "C"
attributes(v_cr)
#> $names
#> [1] "en" "de" "fr"
#>
#> $my_dictionary
#> [1] "Words to greet people"
# Inspecting a vector's structure:
str(v_cr)
#> Named chr [1:3] "hi" "Hallo" "salut"
#> - attr(*, "names")= chr [1:3] "en" "de" "fr"
#> - attr(*, "my_dictionary")= chr "Words to greet people"
structure(v_cr)
#> en de fr
#> "hi" "Hallo" "salut"
#> attr(,"my_dictionary")
#> [1] "Words to greet people"
```
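
As a complement, the following sketch shows that names can also be supplied when creating a vector, that a single attribute can be retrieved with `attr()`, and that all attributes can be removed again. The object name `greet` is hypothetical and merely mirrors the `v_cr` example above:

```r
greet <- c(en = "hi", de = "Hallo", fr = "salut")  # names set at creation
names(greet)
#> [1] "en" "de" "fr"

attr(greet, "my_dictionary") <- "Words to greet people"
attr(greet, "my_dictionary")   # retrieve a single attribute by its name
#> [1] "Words to greet people"

attributes(greet) <- NULL      # remove all attributes (including names)
greet
#> [1] "hi"    "Hallo" "salut"
attributes(greet)
#> NULL
```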

There exists an `is.vector()` function in R, but it does not only test if an object is a vector.
Instead, it returns `TRUE` only if the object is a vector with no attributes other than names.

To test if an object `v` actually *is* a vector, we can use `is.atomic(v) | is.list(v)` (i.e., test if it is an atomic vector or a list) or use an auxiliary `is_vector()` function of various packages (e.g., **purrr**):

```
# (1) A vector with only names:
is.vector(v_n2)
#> [1] TRUE
# (2) A vector with other attributes:
is.vector(v_cr)
#> [1] FALSE
is.atomic(v_cr)
#> [1] TRUE
purrr::is_vector(v_cr)
#> [1] TRUE
```

### 2.3.1 Creating vectors

We have already seen that using the assignment operator `<-` creates new data objects and that the `c()` function allows combining objects into vectors.
We can think of `c()` as combining objects into vectors, but when the objects being combined are already stored as vectors, we are actually creating longer vectors out of shorter ones:

```
# Combining scalar objects and vectors (into longer vectors):
v1 <- 1 # is the same as v1 <- c(1)
v2 <- c(2, 3)

v3 <- c(v1, v2, 4) # but the result is only 1 vector, not 2 or 3:
v3
#> [1] 1 2 3 4
```

Note that the new vector `v3` is still a vector, rather than some higher-order structure containing other vectors (i.e., `c()` *flattens* hierarchical vector structures into vectors).
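
The flattening behavior of `c()` can be contrasted with `list()`, which keeps its arguments as separate elements. This is only an illustrative sketch. The object names `a`, `b`, `flat`, and `nested` are not part of the examples above:

```r
a <- c(1, 2)
b <- c(3, 4)

flat <- c(a, b)       # one atomic vector with 4 elements
flat
#> [1] 1 2 3 4
length(flat)
#> [1] 4
c(1, c(2, c(3, 4)))   # nested calls of c() are flattened as well
#> [1] 1 2 3 4

nested <- list(a, b)  # a list keeps both vectors as separate elements
length(nested)
#> [1] 2
```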

#### Coercion of data types

When combining different data types, they are *coerced* into a single data type.
The result is either a numeric vector (when mixing truth values and numeric objects) or a character vector (when mixing anything with characters):

```
# Combining different data types:
x <- c(TRUE, 2L, 3.0) # logical, integer, double
x
#> [1] 1 2 3
typeof(x)
#> [1] "double"
y <- c(TRUE, "two") # logical, character
y
#> [1] "TRUE" "two"
typeof(y)
#> [1] "character"
z <- c(TRUE, 2, "three") # logical, numeric, character
z
#> [1] "TRUE"  "2"     "three"
typeof(z)
#> [1] "character"
```
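
The examples above suggest an ordering of types in which the most general type wins (logical, then integer, then double, then character). The following sketch checks this step by step and adds the explicit coercion functions `as.numeric()`, `as.character()`, and `as.logical()` of base R, which are not discussed in the text above:

```r
# Implicit coercion: the most general type determines the result.
typeof(c(TRUE, FALSE))         # only logical values
#> [1] "logical"
typeof(c(TRUE, 2L))            # logical and integer
#> [1] "integer"
typeof(c(TRUE, 2L, 3.5))       # ... and double
#> [1] "double"
typeof(c(TRUE, 2L, 3.5, "4"))  # ... and character
#> [1] "character"

# Explicit coercion: request a type conversion directly.
as.numeric(c(TRUE, FALSE))
#> [1] 1 0
as.character(c(1.5, 2))
#> [1] "1.5" "2"
as.logical(c(0, 1, 2))         # 0 becomes FALSE, other numbers TRUE
#> [1] FALSE  TRUE  TRUE
```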

#### Vector creation functions

The `c()` function is used for combining existing vectors.
However, for creating vectors that contain more than just a few elements (i.e., vectors with larger `length()` values), using the `c()` function and then typing all vector elements becomes impractical.
Useful functions and shortcuts to generate continuous or regular sequences are the colon operator `:`, and the functions `seq()` and `rep()`:

`m:n` generates a numeric sequence (in steps of \(1\) or \(-1\)) from `m` to `n`:

```
# Colon operator (with by = 1):
s1 <- 0:10
s1
#>  [1]  0  1  2  3  4  5  6  7  8  9 10
s2 <- 10:0
all.equal(s1, rev(s2))
#> [1] TRUE
```

`seq()` generates numeric sequences from an initial number `from` to a final number `to` and allows either setting the step-width `by` or the length of the sequence `length.out`:

```
# Sequences with seq():
s3 <- seq(0, 10, 1) # is short for:
s3
#>  [1]  0  1  2  3  4  5  6  7  8  9 10
s4 <- seq(from = 0, to = 10, by = 1)
all.equal(s3, s4)
#> [1] TRUE
all.equal(s1, s3)
#> [1] TRUE
# Note: seq() is more flexible:
s5 <- seq(0, 10, by = 2.5) # set step size
s5
#> [1]  0.0  2.5  5.0  7.5 10.0
s6 <- seq(0, 10, length.out = 5) # set output length
all.equal(s5, s6)
#> [1] TRUE
```
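
A detail worth noting is why these comparisons use `all.equal()`: the colon operator yields integers here, whereas `seq()` with a `by` argument of type double yields doubles, so that a strict comparison by `identical()` fails. The following sketch illustrates this and also shows a decreasing sequence with a negative step size:

```r
typeof(0:10)
#> [1] "integer"
typeof(seq(0, 10, by = 1))
#> [1] "double"

identical(0:10, seq(0, 10, by = 1))  # strict: same values AND same type?
#> [1] FALSE
all.equal(0:10, seq(0, 10, by = 1))  # lenient: (nearly) equal values?
#> [1] TRUE

seq(10, 0, by = -2.5)                # decreasing sequence
#> [1] 10.0  7.5  5.0  2.5  0.0
```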

`rep()` replicates the values provided in its first argument `x` either `times` times or each element `each` times:

```
# Replicating vectors (with rep):
s7 <- rep(c(0, 1), 3) # is short for:
s7
#> [1] 0 1 0 1 0 1
s8 <- rep(x = c(0, 1), times = 3)
all.equal(s7, s8)
#> [1] TRUE
# but differs from:
s9 <- rep(x = c(0, 1), each = 3)
s9
#> [1] 0 0 0 1 1 1
```

Whereas `:` and `seq()` create numeric vectors, `rep()` can be used with other data types:

```
rep(c(TRUE, FALSE), times = 2)
#> [1] TRUE FALSE TRUE FALSE
rep(c("A", "B"), each = 2)
#> [1] "A" "A" "B" "B"
```
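
The arguments of `rep()` can also be combined or given as vectors. The following sketch shows three variants that the examples above do not cover (the `length.out` argument of `rep()` is part of base R, but not mentioned in the text):

```r
# each is applied first, then the result is repeated times:
rep(c("A", "B"), each = 2, times = 2)
#> [1] "A" "A" "B" "B" "A" "A" "B" "B"

# A vector of times repeats each element a different number of times:
rep(c("A", "B"), times = c(3, 1))
#> [1] "A" "A" "A" "B"

# Repeat until a desired output length is reached:
rep(c(0, 1), length.out = 5)
#> [1] 0 1 0 1 0
```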

#### Random sampling from a population

A frequent situation when working with R is that we want a sequence of elements (i.e., a vector) that are randomly drawn from a given population. The `sample()` function allows drawing a sample of size `size` from a population `x`.
A logical argument `replace` specifies whether the sample is to be drawn with or without replacement.
Not surprisingly, the population `x` is provided as a vector of elements and the result of `sample()` is another vector of length `size`:

```
# Sampling vector elements (with sample):
sample(x = 1:3, size = 10, replace = TRUE)
#> [1] 1 2 1 2 3 1 2 2 2 2
# Note:
# sample(1:3, 10)
# would yield an error (as replace = FALSE by default).
# Note:
one_to_ten <- 1:10
sample(one_to_ten, size = 10, replace = FALSE) # drawing without replacement
#> [1] 3 5 8 1 6 7 2 10 4 9
sample(one_to_ten, size = 10, replace = TRUE) # drawing with replacement
#> [1] 2 4 3 7 10 2 5 3 1 3
```
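
Because `sample()` draws randomly, repeating the calls above will typically yield different results. If a sample needs to be reproducible, a common approach is to fix the state of the random number generator with `set.seed()` before drawing. In this sketch, the seed value `123` and the object names `draw_1` and `draw_2` are arbitrary choices:

```r
set.seed(123)  # arbitrary seed value
draw_1 <- sample(x = 1:6, size = 5, replace = TRUE)

set.seed(123)  # resetting the same seed ...
draw_2 <- sample(x = 1:6, size = 5, replace = TRUE)

identical(draw_1, draw_2)  # ... reproduces the same sample
#> [1] TRUE
length(draw_1)             # the result has length size
#> [1] 5
```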

As the `x` argument of `sample()` accepts non-numeric vectors, we can use the function to generate sequences of random events. For instance, we can use character vectors to sample sequences of letters or words (which can be used to represent random events):

```
# Random letter/word sequences:
sample(x = c("A", "B", "C"), size = 10, replace = TRUE)
#> [1] "C" "A" "B" "C" "C" "C" "C" "B" "A" "B"
sample(x = c("he", "she", "is", "good", "lucky", "sad"), size = 5, replace = TRUE)
#> [1] "lucky" "she" "good" "is" "good"
# Binary sample (coin flip):
coin <- c("H", "T") # 2 events: Heads or Tails
sample(coin, 5, TRUE) # is short for:
#> [1] "T" "H" "T" "T" "H"
sample(x = coin, size = 5, replace = TRUE) # flip coin 5 times
#> [1] "H" "H" "T" "T" "H"
# Flipping 10,000 coins:
coins_10000 <- sample(x = coin, size = 10000, replace = TRUE) # flip coin 10,000 times
table(coins_10000) # overview of 10,000 flips
#> coins_10000
#> H T
#> 5049 4951
```

### 2.3.2 Accessing and changing vectors

Having found various ways of storing R objects in vectors, we need to ask:

- How can we access, test for, or replace individual vector elements?

These tasks are summarily known as *indexing* or *subsetting*.
As these are extremely common and important tasks, there are many ways of accessing and changing vector elements.
We will only cover the two most important ones here (but Chapter 4 Subsetting of Wickham (2019a) lists six different ways):

#### 1. Numerical indexing/subsetting

*Numerical indexing/subsetting* provides a numeric (vector of) value(s) denoting the *position(s)* of the desired elements in a vector in square brackets `[]`.
Given a character vector `ABC` (of length 5):

```
ABC <- c("Anna", "Ben", "Cecily", "David", "Eve")
ABC
#> [1] "Anna"   "Ben"    "Cecily" "David"  "Eve"
```

here are two ways of accessing particular elements of this vector:

```
ABC[3]
#> [1] "Cecily"
ABC[c(2, 4)]
#> [1] "Ben"   "David"
```

Rather than merely accessing these elements, we can also change these elements by assigning new values to them:

```
ABC[1] <- "Annabelle"
ABC[c(2, 3)] <- c("Benjamin", "Cecilia")
ABC
#> [1] "Annabelle" "Benjamin"  "Cecilia"   "David"     "Eve"
```

Providing negative indices yields all elements of a vector except for the ones at the specified positions:

```
ABC[-1]
#> [1] "Benjamin" "Cecilia"  "David"    "Eve"
ABC[c(-2, -4, -5)]
#> [1] "Annabelle" "Cecilia"
```
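
Two details of negative indexing are easy to overlook: it returns a new vector without altering the original, and positive and negative indices cannot be mixed in one index. A minimal sketch with a hypothetical vector `v` (the fallback string `"error"` is only used to make the failure visible):

```r
v <- c("a", "b", "c", "d")

v[-(1:2)]  # drop the first two elements
#> [1] "c" "d"
v          # the original vector is unchanged
#> [1] "a" "b" "c" "d"

# Mixing positive and negative indices yields an error:
res <- tryCatch(v[c(-1, 2)], error = function(e) "error")
res
#> [1] "error"
```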

Even providing non-existent or missing (`NA`) indices yields sensible results:

```
ABC[99] # accessing a non-existent position, vs.
#> [1] NA
ABC[NA] # accessing a missing (NA) position
#> [1] NA NA NA NA NA
```

Note that missing values are addictive in R:
Asking for the `NA`-th element of a vector yields a vector of the same length with only `NA` values (and names).
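
One way to make sense of this result is that a plain `NA` is of type logical, so that it acts as a logical index that is recycled to the length of the vector. By contrast, a missing *integer* index (`NA_integer_`) denotes a single unknown position. A brief sketch with a hypothetical vector `v`:

```r
v <- c("a", "b", "c")

typeof(NA)       # a plain NA is a logical value
#> [1] "logical"
v[NA]            # logical index, recycled to the length of v
#> [1] NA NA NA
v[NA_integer_]   # a single missing position
#> [1] NA
v[c(1, NA)]      # a known and a missing position
#> [1] "a" NA
```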

#### 2. Logical indexing/subsetting

*Logical indexing/subsetting* provides a logical (vector of) value(s) in square brackets `[]`.
The provided vector of `TRUE` or `FALSE` values is typically of the same length as the indexed vector `v`.

For instance, assuming a numeric vector `one_to_ten`:

```
one_to_ten <- 1:10
one_to_ten
#>  [1]  1  2  3  4  5  6  7  8  9 10
```

we could select its elements in the first and third position by:

```
one_to_ten[c(TRUE, FALSE, TRUE, FALSE, FALSE,
             FALSE, FALSE, FALSE, FALSE, FALSE)]
#> [1] 1 3
```

The same can be achieved in two steps by defining a vector of logical indices and then using it as an index to our numeric vector `one_to_ten`:

```
my_ix_v <- c(TRUE, FALSE, TRUE, FALSE, FALSE,
             FALSE, FALSE, FALSE, FALSE, FALSE)
one_to_ten[my_ix_v]
#> [1] 1 3
```
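
The logical index is *typically* as long as the indexed vector, but it does not have to be: a shorter logical index is recycled until it matches the length of the indexed vector. The following sketch uses a hypothetical vector `nums` to show this behavior:

```r
nums <- 1:10

nums[c(TRUE, FALSE)]  # index is recycled: every other element
#> [1] 1 3 5 7 9
nums[TRUE]            # a single TRUE selects all elements
#>  [1]  1  2  3  4  5  6  7  8  9 10
nums[FALSE]           # a single FALSE selects no elements
#> integer(0)
```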

Explicitly defining a vector of logical values quickly becomes impractical, especially for longer vectors.
However, the same can be achieved implicitly by using a logical test of the vector `v` as the logical index values of vector `v`:

```
my_ix_v <- (one_to_ten > 5)
one_to_ten[my_ix_v]
#> [1]  6  7  8  9 10
```

Using a test on the *same* vector to generate the indices to a vector is a very powerful tool for getting subsets of a vector (which is why *indexing* is also referred to as *subsetting*).
Essentially, the R expression within the square brackets `[]` asks a question about a vector and the logical indexing construct returns the elements for which this question is answered in the affirmative (i.e., the indexing vector yields `TRUE`).
Here are some examples:

```
one_to_ten[one_to_ten < 3 | one_to_ten > 8]
#> [1]  1  2  9 10
one_to_ten[one_to_ten %% 2 == 0]
#> [1]  2  4  6  8 10
one_to_ten[!is.na(one_to_ten)]
#>  [1]  1  2  3  4  5  6  7  8  9 10
ABC[ABC != "Eve"]
#> [1] "Annabelle" "Benjamin"  "Cecilia"   "David"
ABC[nchar(ABC) == 5]
#> [1] "David"
ABC[substr(ABC, 3, 3) == "n"]
#> [1] "Annabelle" "Benjamin"
```

The `which()` function provides a bridge from logical to numerical indexing, as `which(v)` takes a logical vector `v` (typically the result of a test) and returns the numeric positions of those elements that are `TRUE`:

```
which(one_to_ten > 8)
#> [1] 9 10
which(nchar(ABC) > 7)
#> [1] 1 2
```

Thus, the following expressions use both types of indexing to yield identical results:

```
one_to_ten[which(one_to_ten > 8)] # numerical indexing
#> [1]  9 10
one_to_ten[one_to_ten > 8] # logical indexing
#> [1]  9 10
ABC[which(nchar(ABC) > 7)] # numerical indexing
#> [1] "Annabelle" "Benjamin"
ABC[nchar(ABC) > 7] # logical indexing
#> [1] "Annabelle" "Benjamin"
```
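
The two types of indexing agree here because the vectors contain no missing values. When a vector contains `NA` values, the results can differ, as `which()` only returns the positions of `TRUE` values, whereas a logical index passes `NA` values on. A small sketch with a hypothetical vector `v`:

```r
v <- c(4, NA, 9, 10)

v > 8            # the test is NA for the missing element
#> [1] FALSE    NA  TRUE  TRUE
v[v > 8]         # logical indexing keeps an NA
#> [1] NA  9 10
v[which(v > 8)]  # numerical indexing via which() drops it
#> [1]  9 10
```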

Note that both numerical and logical indexing use square brackets `[]` directly following the name of the object to be indexed.
By contrast, functions always provide their arguments in round parentheses `()`.

#### Example

Suppose we know the following facts about five people:

| | p_1 | p_2 | p_3 | p_4 | p_5 |
|---|---|---|---|---|---|
| name | Adam | Ben | Cecily | David | Evelyn |
| gender | male | male | female | male | misc |
| age | 21 | 19 | 20 | 48 | 45 |

How would we encode this information in R?

Note that we know the same three facts about each person and the leftmost column in Table 2.1 specifies this type of information (i.e., a *variable*).
A straightforward way of representing these facts in R would consist in defining a vector for each variable:

```
name <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age <- c(21, 19, 20, 48, 45)
```

In this solution, we encode the two vectors `name` and `gender` as character data, whereas the vector `age` encodes numeric data.
Note that `gender` is often encoded as numeric values (e.g., as 0 vs. 1) or as logical values (e.g., `female?`: `TRUE` vs. `FALSE`), but this creates problems (or rather incomplete accounts) when there are more than two gender values to consider.^{8}

Equipped with these three vectors, we can now employ numeric and logical indexing to ask and answer a wide range of questions about these people. Note that the three vectors have the same length (as they describe the same set of people). If we assume that a particular position in a vector always refers to the same person, we can use one of the vectors to index the same or any other vector. This is a very common and immensely powerful idea to select vector elements (or here: properties of people) based on their values on other variables.

As an exercise, try predicting the results of the following expressions and describe what we are asking for in each case in your own words (including the type of indexing). Then evaluate each expression to check your prediction.

```
name[c(-1)]
name[gender != "male"]
name[age >= 21]

gender[3:5]
gender[nchar(name) > 5]
gender[age > 30]

age[c(1, 3, 5)]
age[(name != "Ben") & (name != "Cecily")]
age[gender == "female"]
```

Here are the results:

```
name[c(-1)] # get names of all non-first people
#> [1] "Ben"    "Cecily" "David"  "Evelyn"
name[gender != "male"] # get names of non-male people
#> [1] "Cecily" "Evelyn"
name[age >= 21] # get names of people with an age of 21 or older
#> [1] "Adam"   "David"  "Evelyn"

gender[3:5] # get 3rd to 5th gender values
#> [1] "female" "male"   "misc"
gender[nchar(name) > 5] # get gender of people with a name of more than 5 letters
#> [1] "female" "misc"
gender[age > 30] # get gender of people over 30
#> [1] "male" "misc"

age[c(1, 3, 5)] # get age values of certain positions
#> [1] 21 20 45
age[(name != "Ben") & (name != "Cecily")] # get age of people whose name is not "Ben" and not "Cecily"
#> [1] 21 48 45
age[gender == "female"] # get age values of all people with "female" gender values
#> [1] 20
```

The first command in each triple used numerical indexing, whereas the other two commands in each triple used logical indexing.
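
Indexing one vector by a test on another can also be combined with functions that summarize the selected elements. The following sketch re-creates the three vectors from above, so that it can be run on its own. The particular questions asked are merely examples:

```r
name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)

sum(gender == "male")        # number of people with "male" gender values
#> [1] 3
mean(age[gender == "male"])  # mean age of these people
#> [1] 29.33333
name[age == max(age)]        # name of the oldest person
#> [1] "David"
name[which(age < 21)]        # names of people under 21 (via positions)
#> [1] "Ben"    "Cecily"
```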

Atomic vectors are the key data structure in R.
In Chapter 3 on Data structures, we will learn that atomic vectors can assume different shapes (e.g., as matrices) and can be combined into more complex data structures (e.g., lists and rectangular tables).
Here, we will now focus on *functions*, which are the type of object that allows us to do things with data (stored as vectors or other data structures).

### References

Wickham, H. (2019a). *Advanced R* (2nd ed.). Chapman & Hall/CRC. https://adv-r.hadley.nz/

8. Fun fact: The English writer Evelyn Waugh (1903–1966) and his wife Evelyn Gardner (1903–1994) had the same given name.