2.3 Vectors

We mentioned above that every data object can be described by its shape and its type. Whereas we addressed the issue of data types (and the related term of modes, see Section 2.2.2), we have not yet discussed the shape of data objects.

All objects defined so far shared the same shape: They were vectors that contained only a single element. In R, vectors of length 1 are known as scalars.

Vectors are by far the most common and most important data structure in R. Essentially, a vector is an ordered sequence of elements with three common properties:

  1. its type of elements (tested by typeof());
  2. its length (tested by length());
  3. optional attributes or meta-data (tested by attributes()).

More specifically, there are two types of vectors:

  • in atomic vectors, all elements are of the same type
  • in lists, elements can have different types

The vast majority of vectors we will encounter are atomic vectors (i.e., all elements of the same type), but lists are often used in R for storing a variety of data types in a common object (e.g., in statistical analyses). It is important to understand that the term “atomic” in “atomic vectors” refers to the type of the vector, rather than its shape or length: Atomic vectors can contain any number of elements (i.e., can have different lengths), but all of their elements must be of the same type.
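
The difference between the two kinds of vectors can be illustrated with a minimal sketch (the object names v_atomic and l_mixed are made up for this illustration):

```r
# Atomic vector: all elements share one type
v_atomic <- c(1.5, 2, 3)
typeof(v_atomic)
#> [1] "double"

# List: each element keeps its own type
l_mixed <- list(1.5, "two", TRUE)
typeof(l_mixed)
#> [1] "list"
sapply(l_mixed, typeof)  # type of each list element
#> [1] "double"    "character" "logical"
```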

How can we create new vectors? We already encountered a basic way of creating a vector above: Creating a new data object by assigning a value to an object name (using the <- operator). As any scalar object already is a vector (of length 1), we actually are asking: How can we combine objects or vectors into new vectors? The simplest way of creating a vector is by using the c() function (think chain, combine, or concatenate) on a number of objects:

# Create vectors:
v_lg <- c(TRUE, FALSE)   # logical vector
v_n1 <- c(1, pi, 4.5)    # numeric vector (double)
v_n2 <- c(2L, 3L, 5L)    # numeric vector (integer)
v_cr <- c("hi", "Hallo", "salut")  # character vector

The vectors defined by combining existing vectors with the c() function typically are longer vectors than their constituents.

Whenever encountering a new vector, a typical thing to do is testing for its type and its length:

# type:
typeof(v_n1)
#> [1] "double"
typeof(v_cr)
#> [1] "character"
# length: 
length(v_lg)
#> [1] 2
length(v_n2)
#> [1] 3

Beyond these elementary functions, the majority of functions in R can be applied to vectors. However, most functions require a particular data type to work properly. For instance, a common operation on vectors is sorting their elements, which is achieved by the sort() function. Note that sort() returns a sorted copy and leaves the original vector unchanged, unless the result is assigned back to it. Its argument decreasing is set to FALSE by default, but can be set to TRUE if sorting in decreasing order is desired:

x <- c(4, 6, 2)

sort(x)
#> [1] 2 4 6
sort(x, decreasing = TRUE)
#> [1] 6 4 2
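
As a brief sketch of how sort() treats its input, the following stores the sorted result under a new name (x_sorted, chosen here only for illustration) and shows that the original vector keeps its order:

```r
x <- c(4, 6, 2)
x_sorted <- sort(x)  # store the sorted copy under a new name

x         # the original vector is unchanged
#> [1] 4 6 2
x_sorted  # the sorted copy
#> [1] 2 4 6
```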

What happens when we apply sort() to other data types?

y <- c(TRUE, FALSE, TRUE, FALSE)
sort(y)
#> [1] FALSE FALSE  TRUE  TRUE

z <- c("A", "N", "T")
sort(z, decreasing = TRUE)
#> [1] "T" "N" "A"

This shows that generic R functions like sort() often work with multiple data types. However, many functions simply require specific data types and would not work with others. For instance, as most mathematical functions require numeric objects to work, the following would create an error:

sum("A", "B", "C")  # would yield an error
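
To observe such an error without interrupting a script, the call can be wrapped in tryCatch(). This is only a sketch: the name msg is hypothetical, and the exact wording of the error message may differ between R versions, so it is not shown here:

```r
msg <- tryCatch(
  sum("A", "B", "C"),
  error = function(e) conditionMessage(e)  # return the error message as a string
)

is.character(msg)  # TRUE: sum() signalled an error, which was caught
#> [1] TRUE
```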

However, remember that vectors of logical values can be interpreted as numbers (FALSE as 0 and TRUE as 1):

v_lg2 <- c(FALSE, TRUE, FALSE)
v_nm2 <- c(4, 5)

c(v_lg2, v_nm2)
#> [1] 0 1 0 4 5
mean(v_lg2)
#> [1] 0.3333333
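
This interpretation is often used for counting: Applying sum() to the result of a logical test counts the elements that pass the test, and mean() yields their proportion. A small sketch with made-up data (the vector scores is hypothetical):

```r
scores <- c(4, 9, 7, 2, 8)  # hypothetical data

sum(scores > 5)   # number of elements greater than 5
#> [1] 3
mean(scores > 5)  # proportion of elements greater than 5
#> [1] 0.6
```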

As attributes are optional, most (atomic) vectors have no attributes:

v_n2
#> [1] 2 3 5
attributes(v_n2)
#> NULL

The most common attribute of a vector \(v\) is the names of its elements, which can be set or retrieved by names(v):

# Setting names:
names(v_n2) <- c("A", "B", "C")
names(v_cr) <- c("en", "de", "fr")

# Getting names:
names(v_n2)
#> [1] "A" "B" "C"

Other attributes can be defined as name-value pairs using attr(v, name) <- value and inspected by attributes(), str() or structure():

# Adding attributes:
attr(v_cr, "my_dictionary") <- "Words to greet people"

# Viewing attributes:
attributes(v_n2)
#> $names
#> [1] "A" "B" "C"
attributes(v_cr)
#> $names
#> [1] "en" "de" "fr"
#> 
#> $my_dictionary
#> [1] "Words to greet people"

# Inspecting a vector's structure: 
str(v_cr)
#>  Named chr [1:3] "hi" "Hallo" "salut"
#>  - attr(*, "names")= chr [1:3] "en" "de" "fr"
#>  - attr(*, "my_dictionary")= chr "Words to greet people"
structure(v_cr)
#>      en      de      fr 
#>    "hi" "Hallo" "salut" 
#> attr(,"my_dictionary")
#> [1] "Words to greet people"

There exists an is.vector() function in R, but it does not merely test whether an object is a vector. Instead, it returns TRUE only if the object is a vector with no attributes other than names.

To test if an object v actually is a vector, we can use is.atomic(v) | is.list(v) (i.e., test if it is an atomic vector or a list) or use an auxiliary is_vector() function of various packages (e.g., purrr):

# (1) A vector with only names:
is.vector(v_n2)  
#> [1] TRUE

# (2) A vector with other attributes:
is.vector(v_cr)
#> [1] FALSE
is.atomic(v_cr)
#> [1] TRUE
purrr::is_vector(v_cr)
#> [1] TRUE
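
The base R test described above could be wrapped in a small helper function. The following is a sketch only: is_vec() is a hypothetical name and not part of base R or any package:

```r
# Hypothetical helper: TRUE for atomic vectors and lists
is_vec <- function(v) {
  is.atomic(v) || is.list(v)
}

v <- c(en = "hi", de = "Hallo")
attr(v, "note") <- "greetings"  # add an attribute other than names

is.vector(v)          # FALSE, due to the additional attribute
#> [1] FALSE
is_vec(v)             # TRUE: still an atomic vector
#> [1] TRUE
is_vec(list(1, "a"))  # TRUE: a list
#> [1] TRUE
is_vec(mean)          # FALSE: a function is not a vector
#> [1] FALSE
```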

2.3.1 Creating vectors

We have already seen that using the assignment operator <- creates new data objects and that the c() function allows combining objects into vectors. We can think of c() as combining objects into vectors, but when the objects being combined are already stored as vectors, we are actually creating longer vectors out of shorter ones:

# Combining scalar objects and vectors (into longer vectors): 
v1 <- 1        # is the same as v1 <- c(1)
v2 <- c(2, 3)

v3 <- c(v1, v2, 4)     # but the result is only 1 vector, not 2 or 3: 
v3
#> [1] 1 2 3 4

Note that the new vector v3 is still a vector, rather than some higher-order structure containing other vectors (i.e., c() flattens hierarchical vector structures into vectors).
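
The flattening behavior of c() can be contrasted with list(), which keeps its arguments as separate elements. A minimal sketch (the names nested and kept are chosen for illustration):

```r
nested <- c(1, c(2, c(3, 4)))  # nested calls to c() are flattened
nested
#> [1] 1 2 3 4
length(nested)
#> [1] 4

kept <- list(1, c(2, 3), 4)    # list() keeps 3 separate elements
length(kept)
#> [1] 3
```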

Coercion of data types

When combining different data types, they are coerced into a single data type. The result is either a numeric vector (when mixing truth values and numeric objects) or a character vector (when mixing anything with characters):

# Combining different data types:
x <- c(TRUE, 2L, 3.0)  # logical, integer, double
x
#> [1] 1 2 3
typeof(x)
#> [1] "double"

y <- c(TRUE, "two")  # logical, character
y
#> [1] "TRUE" "two"
typeof(y)
#> [1] "character"

z <- c(TRUE, 2, "three")  # logical, numeric, character
z
#> [1] "TRUE"  "2"     "three"
typeof(z)
#> [1] "character"
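
These results follow a fixed order of coercion for atomic vectors (logical, then integer, then double, then character): The combined vector takes the most general type among its elements. A short sketch of the pairwise steps:

```r
typeof(c(TRUE, 2L))     # logical + integer
#> [1] "integer"
typeof(c(2L, 3.5))      # integer + double
#> [1] "double"
typeof(c(3.5, "four"))  # double + character
#> [1] "character"
```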

Vector creation functions

The c() function is used for combining existing vectors. However, for creating vectors that contain more than just a few elements (i.e., vectors with larger length() values), using the c() function and then typing all vector elements becomes impractical. Useful functions and shortcuts to generate continuous or regular sequences are the colon operator :, and the functions seq() and rep():

  • m:n generates a numeric sequence (in steps of \(1\) or \(-1\)) from m to n:
# Colon operator (with by = 1):
s1 <- 0:10
s1
#>  [1]  0  1  2  3  4  5  6  7  8  9 10
s2 <- 10:0
all.equal(s1, rev(s2))
#> [1] TRUE
  • seq() generates numeric sequences from an initial number (argument from) to a final number (argument to) and allows setting either the step width (argument by) or the length of the sequence (argument length.out):
# Sequences with seq():
s3 <- seq(0, 10, 1)  # is short for: 
s3
#>  [1]  0  1  2  3  4  5  6  7  8  9 10
s4 <- seq(from = 0, to = 10, by = 1)
all.equal(s3, s4)
#> [1] TRUE
all.equal(s1, s3)
#> [1] TRUE

# Note: seq() is more flexible:
s5 <- seq(0, 10, by = 2.5)        # set step size
s5
#> [1]  0.0  2.5  5.0  7.5 10.0
s6 <- seq(0, 10, length.out = 5)  # set output length
all.equal(s5, s6)
#> [1] TRUE
  • rep() replicates the values provided in its first argument x: The argument times repeats the entire vector, whereas the argument each repeats each element:
# Replicating vectors (with rep):
s7 <- rep(c(0, 1), 3)  # is short for:
s7
#> [1] 0 1 0 1 0 1
s8 <- rep(x = c(0, 1), times = 3)
all.equal(s7, s8)
#> [1] TRUE

# but differs from:
s9 <- rep(x = c(0, 1), each = 3)
s9
#> [1] 0 0 0 1 1 1

Whereas : and seq() create numeric vectors, rep() can be used with other data types:

rep(c(TRUE, FALSE), times = 2)
#> [1]  TRUE FALSE  TRUE FALSE
rep(c("A", "B"), each = 2)
#> [1] "A" "A" "B" "B"
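
Two further variations may be useful: seq() can count downwards when by is negative, and rep() accepts each and times together. The following sketch uses the made-up names s_down and r_both:

```r
s_down <- seq(10, 0, by = -2.5)  # decreasing sequence
s_down
#> [1] 10.0  7.5  5.0  2.5  0.0

r_both <- rep(c("A", "B"), each = 2, times = 2)  # each is applied before times
r_both
#> [1] "A" "A" "B" "B" "A" "A" "B" "B"
```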

Random sampling from a population

A frequent situation when working with R is that we want a sequence of elements (i.e., a vector) that are randomly drawn from a given population. The sample() function allows drawing a sample of size size from a population x. A logical argument replace specifies whether the sample is to be drawn with or without replacement. Not surprisingly, the population x is provided as a vector of elements and the result of sample() is another vector of length size:

# Sampling vector elements (with sample):
sample(x = 1:3, size = 10, replace = TRUE)
#>  [1] 1 2 1 2 3 1 2 2 2 2
# Note:
# sample(1:3, 10)  
# would yield an error (as replace = FALSE by default). 

# Note:
one_to_ten <- 1:10
sample(one_to_ten, size = 10, replace = FALSE)  # drawing without replacement
#>  [1]  3  5  8  1  6  7  2 10  4  9
sample(one_to_ten, size = 10, replace = TRUE)   # drawing with replacement
#>  [1]  2  4  3  7 10  2  5  3  1  3

As the x argument of sample() accepts non-numeric vectors, we can use the function to generate sequences of random events. For instance, we can use character vectors to sample sequences of letters or words (which can be used to represent random events):

# Random letter/word sequences:
sample(x = c("A", "B", "C"), size = 10, replace = TRUE)
#>  [1] "C" "A" "B" "C" "C" "C" "C" "B" "A" "B"
sample(x = c("he", "she", "is", "good", "lucky", "sad"), size = 5, replace = TRUE)
#> [1] "lucky" "she"   "good"  "is"    "good"

# Binary sample (coin flip): 
coin <- c("H", "T")    # 2 events: Heads or Tails
sample(coin, 5, TRUE)  # is short for: 
#> [1] "T" "H" "T" "T" "H"
sample(x = coin, size = 5, replace = TRUE)    # flip coin 5 times
#> [1] "H" "H" "T" "T" "H"

# Flipping 10,000 coins:
coins_10000 <- sample(x = coin, size = 10000, replace = TRUE)  # flip coin 10,000 times
table(coins_10000)  # overview of 10,000 flips
#> coins_10000
#>    H    T 
#> 5049 4951
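
As sample() draws randomly, its results typically differ on every call. When a reproducible result is needed, the random number generator can be initialized with set.seed() before sampling. A minimal sketch (the seed value 123 is arbitrary and the names draw_1 and draw_2 are made up; the sampled values are not shown, as they depend on the seed):

```r
set.seed(123)  # initialize the random number generator
draw_1 <- sample(1:6, size = 5, replace = TRUE)

set.seed(123)  # reset to the same seed
draw_2 <- sample(1:6, size = 5, replace = TRUE)

identical(draw_1, draw_2)  # same seed, same sample
#> [1] TRUE
```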

2.3.2 Accessing and changing vectors

Having found various ways of storing R objects in vectors, we need to ask:

  • How can we access, test for, or replace individual vector elements?

These tasks are summarily known as indexing or subsetting. As this is an extremely common and important task, there are many ways of accessing and changing vector elements. We will only cover the two most important ones here (but Chapter 4 Subsetting of Wickham (2019a) lists six different ways):

1. Numerical indexing/subsetting

Numerical indexing/subsetting provides a numeric (vector of) value(s) denoting the position(s) of the desired elements in a vector in square brackets []. Given a character vector ABC (of length 5):

ABC <- c("Anna", "Ben", "Cecily", "David", "Eve")
ABC
#> [1] "Anna"   "Ben"    "Cecily" "David"  "Eve"

here are two ways of accessing particular elements of this vector:

ABC[3]
#> [1] "Cecily"
ABC[c(2, 4)]
#> [1] "Ben"   "David"

Rather than merely accessing these elements, we can also change these elements by assigning new values to them:

ABC[1] <- "Annabelle"
ABC[c(2, 3)] <- c("Benjamin", "Cecilia")
ABC
#> [1] "Annabelle" "Benjamin"  "Cecilia"   "David"     "Eve"

Providing negative indices yields all elements of a vector except for the ones at the specified positions:

ABC[-1]
#> [1] "Benjamin" "Cecilia"  "David"    "Eve"
ABC[c(-2, -4, -5)]
#> [1] "Annabelle" "Cecilia"

Even providing non-existent or missing (NA) indices yields sensible results:

ABC[99]  # accessing a non-existent position, vs. 
#> [1] NA
ABC[NA]  # accessing a missing (NA) position
#> [1] NA NA NA NA NA

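Note that missing values are contagious in R: Asking for the NA-th element of a vector yields a vector of the same length that contains only NA values (and NA names, if the vector is named). This happens because a plain NA is a logical value, which is recycled to the length of the indexed vector.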
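
The length of the result depends on the type of the missing index. A brief sketch, using a made-up vector v: A plain NA is of type logical and is recycled, whereas the integer constant NA_integer_ denotes a single unknown position:

```r
v <- c("a", "b", "c")  # hypothetical vector

v[NA]           # logical NA: recycled to the length of v
#> [1] NA NA NA
v[NA_integer_]  # integer NA: a single unknown position
#> [1] NA
```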

2. Logical indexing/subsetting

Logical indexing/subsetting provides a logical (vector of) value(s) in square brackets []. The provided vector of TRUE or FALSE values is typically of the same length as the indexed vector v.

For instance, assuming a numeric vector one_to_ten:

one_to_ten <- 1:10 
one_to_ten
#>  [1]  1  2  3  4  5  6  7  8  9 10

we could select its elements in the first and third position by:

one_to_ten[c(TRUE, FALSE, TRUE, FALSE, FALSE, 
             FALSE, FALSE, FALSE, FALSE, FALSE)]
#> [1] 1 3

The same can be achieved in two steps by defining a vector of logical indices and then using it as an index to our numeric vector one_to_ten:

my_ix_v <- c(TRUE, FALSE, TRUE, FALSE, FALSE, 
             FALSE, FALSE, FALSE, FALSE, FALSE)
one_to_ten[my_ix_v]
#> [1] 1 3

Explicitly defining a vector of logical values quickly becomes impractical, especially for longer vectors. However, the same can be achieved implicitly by using a logical test of the vector v as the logical index values of vector v:

my_ix_v <- (one_to_ten > 5)
one_to_ten[my_ix_v]
#> [1]  6  7  8  9 10

Using a test on the same vector to generate the indices to a vector is a very powerful tool for getting subsets of a vector (which is why indexing is also referred to as subsetting). Essentially, the R expression within the square brackets [] asks a question about a vector and the logical indexing construct returns the elements for which this question is answered in the affirmative (i.e., the indexing vector yields TRUE). Here are some examples:

one_to_ten[one_to_ten < 3 | one_to_ten > 8]
#> [1]  1  2  9 10
one_to_ten[one_to_ten %% 2 == 0]
#> [1]  2  4  6  8 10
one_to_ten[!is.na(one_to_ten)]
#>  [1]  1  2  3  4  5  6  7  8  9 10

ABC[ABC != "Eve"]
#> [1] "Annabelle" "Benjamin"  "Cecilia"   "David"
ABC[nchar(ABC) == 5]
#> [1] "David"
ABC[substr(ABC, 3, 3) == "n"]
#> [1] "Annabelle" "Benjamin"

The which() function provides a bridge from logical to numerical indexing: Given a logical vector (typically the result of a test on a vector v), which() returns the numeric indices of those elements that are TRUE:

which(one_to_ten > 8)
#> [1]  9 10
which(nchar(ABC) > 7)
#> [1] 1 2

Thus, the following expressions use both types of indexing to yield identical results:

one_to_ten[which(one_to_ten > 8)]  # numerical indexing
#> [1]  9 10
one_to_ten[one_to_ten > 8]         # logical indexing
#> [1]  9 10

ABC[which(nchar(ABC) > 7)]  # numerical indexing 
#> [1] "Annabelle" "Benjamin"
ABC[nchar(ABC) > 7]         # logical indexing
#> [1] "Annabelle" "Benjamin"
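
The two forms agree here because neither vector contains missing values. With missing values, they can differ, as which() only returns the positions of TRUE values. A small sketch with a made-up vector w:

```r
w <- c(1, NA, 9, 10)  # hypothetical vector with a missing value

w[w > 8]         # logical indexing: the NA comparison yields an NA element
#> [1] NA  9 10
w[which(w > 8)]  # numerical indexing: which() drops the NA
#> [1]  9 10
```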

Note that both numerical and logical indexing use square brackets [] directly following the name of the object to be indexed. By contrast, functions always provide their arguments in round parentheses ().

Example

Suppose we know the following facts about five people:

Table 2.1: Some facts about five people (as 3 vectors, each of which has 5 elements).

          p_1    p_2    p_3      p_4     p_5
name      Adam   Ben    Cecily   David   Evelyn
gender    male   male   female   male    misc
age       21     19     20       48      45

How would we encode this information in R?

Note that we know the same three facts about each person and the leftmost column in Table 2.1 specifies this type of information (i.e., a variable). A straightforward way of representing these facts in R would consist in defining a vector for each variable:

name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)

In this solution, we encode the two vectors name and gender as character data, whereas the vector age encodes numeric data. Note that gender is often encoded as numeric values (e.g., as 0 vs. 1) or as logical values (e.g., female?: TRUE vs. FALSE), but this creates problems (or rather incomplete accounts) when there are more than two gender values to consider (see footnote 8).

Equipped with these three vectors, we can now employ numeric and logical indexing to ask and answer a wide range of questions about these people. Note that the three vectors have the same length (as they describe the same set of people). If we assume that a particular position in a vector always refers to the same person, we can use one of the vectors to index the same or any other vector. This is a very common and immensely powerful idea to select vector elements (or here: properties of people) based on their values on other variables.

As an exercise, try predicting the results of the following expressions and describe what we are asking for in each case in your own words (including the type of indexing). Then evaluate each expression to check your prediction.

name[c(-1)]
name[gender != "male"]
name[age >= 21]

gender[3:5]
gender[nchar(name) > 5]
gender[age > 30]

age[c(1, 3, 5)]
age[(name != "Ben") & (name != "Cecily")]
age[gender == "female"]

Here are the results:

name[c(-1)]              # get names of all non-first people
#> [1] "Ben"    "Cecily" "David"  "Evelyn"
name[gender != "male"]   # get names of non-male people
#> [1] "Cecily" "Evelyn"
name[age >= 21]          # get names of people with an age of 21 or older
#> [1] "Adam"   "David"  "Evelyn"

gender[3:5]              # get 3rd to 5th gender values
#> [1] "female" "male"   "misc"
gender[nchar(name) > 5]  # get gender of people with a name of more than 5 letters
#> [1] "female" "misc"
gender[age > 30]         # get gender of people over 30 
#> [1] "male" "misc"

age[c(1, 3, 5)]          # get age values of certain positions
#> [1] 21 20 45
age[(name != "Ben") & (name != "Cecily")]  # get age of people whose name is not "Ben" and not "Cecily"
#> [1] 21 48 45
age[gender == "female"]  # get age values of all people with "female" gender values
#> [1] 20

The first command in each triple used numerical indexing, whereas the other two commands in each triple used logical indexing.
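
Indexing one vector by a test on another can also be combined with other functions. The following sketch repeats the three vectors (so that it runs on its own) and adds three further queries; max() and order() are base R functions that have not been introduced above:

```r
name   <- c("Adam", "Ben", "Cecily", "David", "Evelyn")
gender <- c("male", "male", "female", "male", "misc")
age    <- c(21, 19, 20, 48, 45)

mean(age[gender == "male"])  # mean age of people with "male" gender values
#> [1] 29.33333
name[age == max(age)]        # name of the oldest person (logical indexing)
#> [1] "David"
name[order(age)]             # names sorted by age (numerical indexing)
#> [1] "Ben"    "Cecily" "Adam"   "Evelyn" "David"
```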

Atomic vectors are the key data structure in R. In Chapter 3 on Data structures, we will learn that atomic vectors can assume different shapes (e.g., as matrices) and can be combined into more complex data structures (e.g., lists and rectangular tables). Here, we will now focus on functions, which are the type of object that allows us to do things with data (stored as vectors or other data structures).

References

Wickham, H. (2019a). Advanced R (2nd ed.). Chapman & Hall/CRC. https://adv-r.hadley.nz/

  8. Fun fact: The English writer Evelyn Waugh (1903–1966) and his wife Evelyn Gardner (1903–1994) had the same given name.