5 Data Structures

Data structures are containers for data values. They determine how objects are stored in R, and some of them can hold values of more than one type.

We’ve met vectors already. Vectors are the most important type of object in R. There are several others that are more complicated than vectors.

In this chapter, we’ll walk through each type of data structure to see what they are and how they work.

  • Factors. We can think about factors as vectors with categorical labels.
  • Matrices and arrays. A matrix is an extension of a vector to two dimensions. An array is a multidimensional vector.
  • Lists. Lists are a general form of vector in which the various elements need not be of the same type. Lists can contain other objects, such as vectors, lists and data frames.
  • Data frames. Data frames are matrix-like structures, in which the columns can be of different types. We can think about data frames as “data matrices” with one row per observational unit.

5.1 Factor

A factor is a vector used to specify a discrete classification (grouping) of the components of other vectors of the same length [1]. We can use factors to represent a categorical variable and to label data items according to their group.

To create a factor, use the function factor().

flavor <- c("chocolate", "vanilla", "strawberry", "mint", 
            "coffee", "strawberry", "vanilla", "pistachio")
flavor_f <- factor(flavor)
flavor_f
## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla   
## [8] pistachio 
## Levels: chocolate coffee mint pistachio strawberry vanilla

levels

A factor has an attribute called levels. Levels are the different values that a factor can take.

attributes(flavor_f)
## $levels
## [1] "chocolate"  "coffee"     "mint"       "pistachio"  "strawberry"
## [6] "vanilla"   
## 
## $class
## [1] "factor"

Use levels() to get the levels of a factor.

levels(flavor_f)
## [1] "chocolate"  "coffee"     "mint"       "pistachio"  "strawberry"
## [6] "vanilla"

nlevels() returns the number of levels of a factor.

nlevels(flavor_f)
## [1] 6
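
Internally, a factor stores one integer code per element, and each code is a position in the levels. The following is a minimal sketch of that relationship using the same flavor vector; the names codes and counts are only illustrative.

```r
flavor <- c("chocolate", "vanilla", "strawberry", "mint",
            "coffee", "strawberry", "vanilla", "pistachio")
flavor_f <- factor(flavor)

# Integer codes: each value is the position of its label in levels(flavor_f)
codes <- as.integer(flavor_f)
codes
## [1] 1 6 5 3 2 5 6 4

# Looking the codes up in the levels recovers the original labels
levels(flavor_f)[codes]

# table() counts how many items fall into each level
counts <- table(flavor_f)
counts
```

Because the labels are stored only once in the levels, a factor is a compact way to hold a categorical variable with many repeated values.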

We can manually set the order of the levels with the levels argument of factor(). Use the ordered argument to indicate whether the levels should be regarded as ordered in the order given. By default, the levels are stored in alphabetical order.

factor(flavor)
## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla   
## [8] pistachio 
## Levels: chocolate coffee mint pistachio strawberry vanilla
factor(flavor, levels = c("strawberry", "vanilla", "chocolate", "coffee", "mint", "pistachio"))
## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla
## [8] pistachio
## Levels: strawberry vanilla chocolate coffee mint pistachio
factor(flavor, levels = c("strawberry", "vanilla", "chocolate", "coffee", "mint", "pistachio"),
       ordered = TRUE)
## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla
## [8] pistachio
## Levels: strawberry < vanilla < chocolate < coffee < mint < pistachio

A more meaningful example is when the order actually matters. For example, we conducted a survey and asked respondents how they felt about the statement “A.I. is going to change the world.” Respondents gave one of the following responses: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.

survey_results <- factor(
  c("Disagree", "Neutral", "Strongly Disagree",
    "Neutral", "Agree", "Strongly Agree",
    "Disagree", "Strongly Agree", "Neutral",
    "Strongly Disagree", "Neutral", "Agree"),
  levels = c("Strongly Disagree", "Disagree",
             "Neutral", "Agree", "Strongly Agree"),
  ordered = TRUE)

survey_results
##  [1] Disagree          Neutral           Strongly Disagree Neutral          
##  [5] Agree             Strongly Agree    Disagree          Strongly Agree   
##  [9] Neutral           Strongly Disagree Neutral           Agree            
## 5 Levels: Strongly Disagree < Disagree < Neutral < ... < Strongly Agree
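
Once a factor is ordered, comparison operators and functions such as max() follow the level order. The sketch below rebuilds the same survey vector so that it runs on its own; n_agree is an illustrative name.

```r
survey_results <- factor(
  c("Disagree", "Neutral", "Strongly Disagree",
    "Neutral", "Agree", "Strongly Agree",
    "Disagree", "Strongly Agree", "Neutral",
    "Strongly Disagree", "Neutral", "Agree"),
  levels = c("Strongly Disagree", "Disagree",
             "Neutral", "Agree", "Strongly Agree"),
  ordered = TRUE)

# Comparison operators respect the level order of an ordered factor
n_agree <- sum(survey_results >= "Agree")
n_agree
## [1] 4

# Counts are reported in level order, not alphabetical order
table(survey_results)

# min() and max() are defined for ordered factors
max(survey_results)
```

With an unordered factor, a comparison such as >= returns NA with a warning, so ordered = TRUE is what makes these operations meaningful.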

5.2 Matrix

A matrix is an extension of a vector to two dimensions. Just to show what that means:

a <- 1 : 6
dim(a) #initially NULL
## NULL
dim(a) <- c(2, 3)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

In practice, use the function matrix() to generate a new matrix, and specify the numbers of rows and columns.

a <- matrix(data = 1 : 6, nrow = 2, ncol = 3)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
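
As the output shows, matrix() fills the matrix column by column by default. The byrow argument of matrix() switches to row-wise filling. A short sketch, with a_col and a_row as illustrative names:

```r
# Default: values fill down the columns
a_col <- matrix(data = 1:6, nrow = 2, ncol = 3)

# byrow = TRUE: values fill across the rows
a_row <- matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)
a_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

# dim() reports the number of rows, then the number of columns
dim(a_row)
## [1] 2 3
```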

matrix indexing

We can refer to part of a matrix using the indexing operator [].

a[2, 2] #second row and second column
## [1] 4
a[1 : 2, 1 : 2] #first two rows and first two columns
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
a[1,] #first row
## [1] 1 3 5
a[,1] #first column
## [1] 1 2
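
A few further indexing forms are worth knowing. The sketch below assumes the same 2 x 3 matrix a; first_row and big are illustrative names.

```r
a <- matrix(data = 1:6, nrow = 2, ncol = 3)

# Selecting a single row returns a plain vector by default
a[1, ]
## [1] 1 3 5

# drop = FALSE keeps the result as a 1 x 3 matrix
first_row <- a[1, , drop = FALSE]
dim(first_row)
## [1] 1 3

# A logical condition selects the matching elements (in column order)
big <- a[a > 3]
big
## [1] 4 5 6

# A negative index excludes a column
a[, -1]
##      [,1] [,2]
## [1,]    3    5
## [2,]    4    6
```

Using drop = FALSE is helpful when later code expects a matrix regardless of how many rows or columns were selected.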

cbind(), rbind()

cbind() and rbind() combine matrices together by binding columns and rows.

m1 <- matrix(1:9, ncol = 3, nrow = 3)
m2 <- matrix(10:12, ncol = 1, nrow = 3)
m3 <- matrix(10:12, ncol = 3, nrow = 1)

cbind() combines matrices by columns.

m1
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
m2
##      [,1]
## [1,]   10
## [2,]   11
## [3,]   12
cbind(m1, m2)
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

rbind() combines matrices by rows.

m1
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
m3
##      [,1] [,2] [,3]
## [1,]   10   11   12
rbind(m1, m3)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## [4,]   10   11   12

Note: A matrix stores data of a single type.
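
One consequence of the single-type rule is that mixed values are silently coerced to a common type. A small sketch of this behavior (m_mixed, m_num and m_bound are illustrative names):

```r
# Mixing numbers and strings: every element is coerced to character
m_mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
typeof(m_mixed)
## [1] "character"

# Binding a numeric matrix to a character column coerces the whole result
m_num <- matrix(1:4, nrow = 2)
m_bound <- cbind(m_num, c("x", "y"))
typeof(m_bound)
## [1] "character"
m_bound[1, 1]
## [1] "1"
```

When columns need to keep different types, a data frame is usually the better choice.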

5.3 Array

A matrix is a special, two-dimensional array. An array is a multidimensional vector. Vectors and arrays are stored the same way internally.

b <- 1 : 12
dim(b) <- c(2, 3, 2)
b
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

A more natural way to create an array is to use the function array().

b <- array(1:12, dim = c(2,3,2))
b
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
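
Array indexing works like matrix indexing, with one index per dimension. The sketch below uses the same array b; layer2 is an illustrative name.

```r
b <- array(1:12, dim = c(2, 3, 2))

# One index per dimension: row 1, column 2, layer 2
b[1, 2, 2]
## [1] 9

# Leaving an index empty selects everything along that dimension;
# the second layer is an ordinary 2 x 3 matrix
layer2 <- b[, , 2]
dim(layer2)
## [1] 2 3

# A single index treats the array as its underlying vector
b[9]
## [1] 9
```

The last call illustrates that an array is a vector with a dim attribute: position 9 of the vector and position [1, 2, 2] of the array refer to the same value.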

5.4 List

A list is a vector where each element can be of a different data type.

To generate a list, use list(). We can name each component in a list.

book <- list(title = "Nineteen Eighty-Four: A Novel", 
             author = "George Orwell", 
             published_year = 1949, 
             pages = 328)
book
## $title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## $author
## [1] "George Orwell"
## 
## $published_year
## [1] 1949
## 
## $pages
## [1] 328

list indexing

Lists can be indexed by position or name.

By position.

book[3]
## $published_year
## [1] 1949
book[-3]
## $title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## $author
## [1] "George Orwell"
## 
## $pages
## [1] 328
book[[3]]
## [1] 1949
book[c(2, 3)]
## $author
## [1] "George Orwell"
## 
## $published_year
## [1] 1949

By name, using $ or [[ ]] with a quoted name. With $, R accepts partial matching of element names.

book$title 
## [1] "Nineteen Eighty-Four: A Novel"
book$t
## [1] "Nineteen Eighty-Four: A Novel"
book[["title"]]
## [1] "Nineteen Eighty-Four: A Novel"
book[c("title", "author")]
## $title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## $author
## [1] "George Orwell"

Note: Using [] for list indexing results in another list. If we want to access the elements of the list, we should use the double brackets [[]] as the indexing operator or the dollar sign $ to access the named components.
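
The difference between the two operators can be checked directly. The following sketch rebuilds the book list so that it runs on its own; doubled is an illustrative name.

```r
book <- list(title = "Nineteen Eighty-Four: A Novel",
             author = "George Orwell",
             published_year = 1949,
             pages = 328)

# Single brackets return a list of length one
class(book["pages"])
## [1] "list"

# Double brackets return the element itself
class(book[["pages"]])
## [1] "numeric"

# Arithmetic therefore needs [[ ]] or $;
# book["pages"] * 2 would raise an error because a list is not a number
doubled <- book[["pages"]] * 2
doubled
## [1] 656

# Unlike $, [[ ]] matches names exactly unless exact = FALSE is given
book[["t"]]
## NULL
book[["t", exact = FALSE]]
## [1] "Nineteen Eighty-Four: A Novel"
```

Exact matching makes [[ ]] the safer choice in scripts, where a partial match with $ could pick up an unintended element.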

A list can contain other lists.

The fact that a list can contain a list makes it a recursive object in R. Functions can also be recursive, which we’ll discuss later.

books <- list("this list references another list", book)
books
## [[1]]
## [1] "this list references another list"
## 
## [[2]]
## [[2]]$title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## [[2]]$author
## [1] "George Orwell"
## 
## [[2]]$published_year
## [1] 1949
## 
## [[2]]$pages
## [1] 328

To access nested elements, we can stack up the square brackets.

books[[2]][["pages"]]
## [1] 328
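
There are several equivalent ways to reach a nested element. A brief sketch, rebuilding book and books so that the example is self-contained:

```r
book <- list(title = "Nineteen Eighty-Four: A Novel",
             author = "George Orwell",
             published_year = 1949,
             pages = 328)
books <- list("this list references another list", book)

# Stacked double brackets, by position and then by name
books[[2]][["pages"]]
## [1] 328

# Double brackets followed by $
books[[2]]$pages
## [1] 328

# A vector inside [[ ]] indexes recursively: element 2, then its element 4
books[[c(2, 4)]]
## [1] 328

# str() prints a compact overview of the nested structure
str(books)
```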

5.5 Data frame

A data frame is a list with class data.frame.

Data frames are used to store spreadsheet-like data. A data frame has rows and columns. Each column can hold a different type of data, but all columns must have the same length. The columns must have names. The components of the data frame can be vectors, factors, numeric matrices, lists, or other data frames.

Data frames are particularly good for representing observational data.

To create a data frame, use data.frame().

laureate <- c("Bob Dylan", "Mo Yan", "Ernest Hemingway", "Winston Churchill", "Bertrand Russell")
year <- c(2016, 2012, 1954, 1953, 1950)
country <- c("United States", "China", "United States", "United Kingdom", "United Kingdom")
genre <- c("poetry, songwriting", "novel, short story", "novel, short story, screenplay", "history, essay, memoirs", "philosophy")

nobel_prize_literature <- data.frame(laureate, year, country, genre)
nobel_prize_literature
##            laureate year        country                          genre
## 1         Bob Dylan 2016  United States            poetry, songwriting
## 2            Mo Yan 2012          China             novel, short story
## 3  Ernest Hemingway 1954  United States novel, short story, screenplay
## 4 Winston Churchill 1953 United Kingdom        history, essay, memoirs
## 5  Bertrand Russell 1950 United Kingdom                     philosophy

Note: A data frame is not a matrix; it is a list interpreted as a data frame.

mode(nobel_prize_literature)
## [1] "list"
class(nobel_prize_literature)
## [1] "data.frame"
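
Because a data frame is a list of columns, the usual list functions apply to it. The sketch below rebuilds the same data frame and inspects it as a list; the printed column classes assume R 4.0 or later, where strings are not converted to factors by default.

```r
laureate <- c("Bob Dylan", "Mo Yan", "Ernest Hemingway",
              "Winston Churchill", "Bertrand Russell")
year <- c(2016, 2012, 1954, 1953, 1950)
country <- c("United States", "China", "United States",
             "United Kingdom", "United Kingdom")
genre <- c("poetry, songwriting", "novel, short story",
           "novel, short story, screenplay",
           "history, essay, memoirs", "philosophy")
nobel_prize_literature <- data.frame(laureate, year, country, genre)

# A data frame passes the list test
is.list(nobel_prize_literature)
## [1] TRUE

# Its length is the number of columns, not the number of rows
length(nobel_prize_literature)
## [1] 4

# dim() gives rows and columns, as for a matrix
dim(nobel_prize_literature)
## [1] 5 4

# Each column keeps its own type
sapply(nobel_prize_literature, class)
```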

data frame indexing

We can refer to the components of a data frame by name using the list operators $ or [[]].

nobel_prize_literature$laureate
## [1] "Bob Dylan"         "Mo Yan"            "Ernest Hemingway" 
## [4] "Winston Churchill" "Bertrand Russell"
nobel_prize_literature[["laureate"]]
## [1] "Bob Dylan"         "Mo Yan"            "Ernest Hemingway" 
## [4] "Winston Churchill" "Bertrand Russell"

We can also use matrix-style indexing with [row, column].

nobel_prize_literature[1,]
##    laureate year       country               genre
## 1 Bob Dylan 2016 United States poetry, songwriting

Logical conditions are allowed, and actually frequently used.

nobel_prize_literature$laureate[nobel_prize_literature$country == "United Kingdom"]
## [1] "Winston Churchill" "Bertrand Russell"
nobel_prize_literature$country == "United Kingdom"
## [1] FALSE FALSE FALSE  TRUE  TRUE
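
Logical conditions can also be combined with matrix-style indexing to filter whole rows. The sketch below uses a cut-down copy of the data frame (without the genre column) so that it stays short; nobel, uk and before_2000 are illustrative names.

```r
nobel <- data.frame(
  laureate = c("Bob Dylan", "Mo Yan", "Ernest Hemingway",
               "Winston Churchill", "Bertrand Russell"),
  year = c(2016, 2012, 1954, 1953, 1950),
  country = c("United States", "China", "United States",
              "United Kingdom", "United Kingdom"))

# Rows that satisfy a condition, all columns (note the trailing comma)
uk <- nobel[nobel$country == "United Kingdom", ]
nrow(uk)
## [1] 2

# Rows that satisfy a condition, selected columns only
before_2000 <- nobel[nobel$year < 2000, c("laureate", "year")]
before_2000$laureate
## [1] "Ernest Hemingway"  "Winston Churchill" "Bertrand Russell"

# subset() expresses the same filter without repeating the data frame name
subset(nobel, year < 2000, select = c(laureate, year))
```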

5.6 Summary

  1. All elements of a vector must have the same type, or mode.

  2. Lists are a general form of vector. Elements of a list need not be of the same type, or mode. Lists can contain other objects, such as vectors, lists and data frames (recursive data structures). Lists provide a convenient way to return the results of a statistical computation.

  3. The underlying storage mechanism for an array (including a matrix) is a vector.

a <- matrix(data = 1:6, nrow = 2, ncol = 3)
mode(a)
## [1] "numeric"
class(a)
## [1] "matrix" "array"
b <- array(1:12, dim = c(2,3,2))
mode(b)
## [1] "numeric"
class(b)
## [1] "array"
  4. Factors provide compact ways to handle categorical data.

  1. W. N. Venables, D. M. Smith and the R Core Team. (2022). An Introduction to R.