5 Data Structures
Data structures are containers that hold data values. They define how objects are stored in R, and different structures accept different kinds of values.
We’ve met vectors already. Vectors are the most important type of object in R. There are several others that are more complicated than vectors.
In this chapter, we’ll walk through each type of data structure: what it is and how it works.
- Factors. We can think about factors as vectors with categorical labels.
- Matrices and arrays. A matrix is an extension of a vector to two dimensions. An array is a multidimensional vector.
- Lists. Lists are a general form of vector in which the various elements need not be of the same type. Lists can contain other objects, such as vectors, lists and data frames.
- Data frames. Data frames are matrix-like structures, in which the columns can be of different types. We can think about data frames as “data matrices” with one row per observational unit.
5.1 Factor
A factor is a vector used to specify a discrete classification (grouping) of the components of other vectors of the same length (Venables, Smith, and the R Core Team, 2022). We can use factors to represent a categorical variable and to label data items according to their group.
To create a factor, use the function factor().
flavor <- c("chocolate", "vanilla", "strawberry", "mint",
            "coffee", "strawberry", "vanilla", "pistachio")
flavor_f <- factor(flavor)
flavor_f
## [1] chocolate vanilla strawberry mint coffee strawberry vanilla pistachio
## Levels: chocolate coffee mint pistachio strawberry vanilla
levels
A factor has an attribute called levels. Levels are the different values that a factor can take.
## $levels
## [1] "chocolate" "coffee" "mint" "pistachio" "strawberry" "vanilla"
##
## $class
## [1] "factor"
levels() gets the levels of a factor.
## [1] "chocolate" "coffee" "mint" "pistachio" "strawberry" "vanilla"
nlevels() returns the number of levels of a factor.
## [1] 6
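The three outputs above are consistent with calls along the following lines. The exact calls are not shown in the text, so this is a minimal sketch that assumes attributes(), levels() and nlevels() were applied to flavor_f.

```r
# Rebuild the factor from the chapter so the block runs on its own.
flavor <- c("chocolate", "vanilla", "strawberry", "mint",
            "coffee", "strawberry", "vanilla", "pistachio")
flavor_f <- factor(flavor)

attributes(flavor_f)  # a list with two components: $levels and $class
levels(flavor_f)      # the six distinct flavors, sorted alphabetically
nlevels(flavor_f)     # 6
```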
We can manually set the order of the levels by using the argument levels in the function factor(). Use the argument ordered to determine if the levels should be regarded as ordered in the order given. By default, the levels are stored in alphabetical order.
## [1] chocolate vanilla strawberry mint coffee strawberry vanilla pistachio
## Levels: chocolate coffee mint pistachio strawberry vanilla
## [1] chocolate vanilla strawberry mint coffee strawberry vanilla pistachio
## Levels: strawberry vanilla chocolate coffee mint pistachio
factor(flavor, levels = c("strawberry", "vanilla", "chocolate", "coffee", "mint", "pistachio"),
       ordered = TRUE)
## [1] chocolate vanilla strawberry mint coffee strawberry vanilla pistachio
## Levels: strawberry < vanilla < chocolate < coffee < mint < pistachio
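One thing to watch when setting levels by hand: any value that does not appear in levels becomes NA, which is easy to trigger with a misspelled label. The sketch below illustrates this with a small made-up vector; the names drinks, good and bad are hypothetical and do not come from the chapter.

```r
# A small made-up vector, used only for illustration.
drinks <- c("mint", "coffee", "mint")

# Levels given by hand keep the order we supply.
good <- factor(drinks, levels = c("mint", "coffee"))
levels(good)   # "mint" "coffee"

# "cofee" is misspelled, so "coffee" matches no level and turns into NA.
bad <- factor(drinks, levels = c("mint", "cofee"))
is.na(bad)     # FALSE TRUE FALSE
```

Checking anyNA() on a freshly built factor is a cheap way to catch this kind of mistake.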
A more meaningful example is when the order actually matters. For example, we conducted a survey and asked respondents how they felt about the statement “A.I. is going to change the world.” Respondents gave one of the following responses: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.
survey_results <- factor(
  c("Disagree", "Neutral", "Strongly Disagree",
    "Neutral", "Agree", "Strongly Agree",
    "Disagree", "Strongly Agree", "Neutral",
    "Strongly Disagree", "Neutral", "Agree"),
  levels = c("Strongly Disagree", "Disagree",
             "Neutral", "Agree", "Strongly Agree"),
  ordered = TRUE)
survey_results
## [1] Disagree Neutral Strongly Disagree Neutral Agree Strongly Agree
## [7] Disagree Strongly Agree Neutral Strongly Disagree Neutral Agree
## Levels: Strongly Disagree < Disagree < Neutral < Agree < Strongly Agree
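Declaring the levels as ordered is what makes comparisons between responses meaningful. As a rough sketch, using a shorter, made-up set of responses (the names survey_levels and responses are hypothetical):

```r
survey_levels <- c("Strongly Disagree", "Disagree",
                   "Neutral", "Agree", "Strongly Agree")

# Four made-up responses, stored as an ordered factor.
responses <- factor(c("Disagree", "Neutral", "Strongly Agree", "Agree"),
                    levels = survey_levels, ordered = TRUE)

responses >= "Agree"   # FALSE FALSE TRUE TRUE
max(responses)         # Strongly Agree
table(responses)       # counts per level, including levels with zero responses
```

With an unordered factor, the comparison operators are not meaningful and return NA with a warning.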
5.2 Matrix
A matrix is an extension of a vector to two dimensions. To show what that means:
## NULL
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
In practice, use the function matrix() to generate a new matrix, specifying the number of rows and columns.
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
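The two outputs above can be reproduced along these lines. This is a sketch under the assumption that the first example assigned a dim attribute to a plain vector; the variable names x and m are ours.

```r
x <- 1:6
dim(x)             # NULL: a plain vector has no dim attribute

dim(x) <- c(2, 3)  # give the same six values two rows and three columns
x                  # now prints as a 2 x 3 matrix, filled column by column

# The usual way: let matrix() set the dimensions.
m <- matrix(1:6, nrow = 2, ncol = 3)
all(x == m)        # TRUE
```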
matrix indexing
We can refer to part of a matrix using the indexing operator [].
## [1] 4
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [1] 1 3 5
## [1] 1 2
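The indexing calls themselves are not shown above. Assuming m is the 2 x 3 matrix from the previous example, calls like the following would give those four results:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

m[2, 2]    # 4: row 2, column 2
m[, 1:2]   # all rows, first two columns (still a matrix)
m[1, ]     # 1 3 5: the first row, returned as a vector
m[, 1]     # 1 2: the first column, returned as a vector
```

Leaving an index empty selects everything along that dimension.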
cbind(), rbind()
cbind() and rbind() combine matrices together by binding columns and rows.
m1 <- matrix(1:9, ncol = 3, nrow = 3)
m2 <- matrix(10:12, ncol = 1, nrow = 3)
m3 <- matrix(10:12, ncol = 3, nrow = 1)
cbind() combines matrices by columns.
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [,1]
## [1,] 10
## [2,] 11
## [3,] 12
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
rbind() combines matrices by rows.
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [,1] [,2] [,3]
## [1,] 10 11 12
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [4,] 10 11 12
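Put together, the binding steps above might look like the sketch below. The result names by_col and by_row are hypothetical.

```r
m1 <- matrix(1:9, ncol = 3, nrow = 3)
m2 <- matrix(10:12, ncol = 1, nrow = 3)
m3 <- matrix(10:12, ncol = 3, nrow = 1)

by_col <- cbind(m1, m2)  # m2 becomes a fourth column: 3 rows, 4 columns
by_row <- rbind(m1, m3)  # m3 becomes a fourth row: 4 rows, 3 columns

dim(by_col)  # 3 4
dim(by_row)  # 4 3
```

cbind() needs matching row counts and rbind() needs matching column counts, which is why m2 and m3 have different shapes.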
Note: A matrix stores data of a single type.
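A quick way to see this is to build a matrix from mixed values and check the resulting type. The example below is our own illustration; the name mixed is hypothetical.

```r
# Numbers, a string and a logical value are all coerced to one common type.
mixed <- matrix(c(1, "a", TRUE, 2.5), nrow = 2)

typeof(mixed)  # "character"
mixed[1, 2]    # "TRUE": the logical value is now a string
```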
5.3 Array
A matrix is a special, two-dimensional array. An array is a multidimensional vector. Vectors and arrays are stored the same way internally.
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
A more natural way to create an array is to use the function array().
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
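Both outputs in this section can be reproduced roughly as follows. We assume the first one came from setting a three-element dim attribute on a vector; the names x and a are ours.

```r
# A vector becomes an array once it has a dim attribute.
x <- 1:12
dim(x) <- c(2, 3, 2)   # 2 rows, 3 columns, 2 layers

# The same array built directly with array().
a <- array(1:12, dim = c(2, 3, 2))

all(x == a)   # TRUE
a[2, 3, 2]    # 12: row 2, column 3 of the second layer
```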
5.4 List
A list is a vector where each element can be of a different data type.
To generate a list, use list(). We can name each component in a list.
book <- list(title = "Nineteen Eighty-Four: A Novel",
             author = "George Orwell",
             published_year = 1949,
             pages = 328)
book
## $title
## [1] "Nineteen Eighty-Four: A Novel"
##
## $author
## [1] "George Orwell"
##
## $published_year
## [1] 1949
##
## $pages
## [1] 328
list indexing
Lists can be indexed by position or name.
By position.
## $published_year
## [1] 1949
## $title
## [1] "Nineteen Eighty-Four: A Novel"
##
## $author
## [1] "George Orwell"
##
## $pages
## [1] 328
## [1] 1949
## $author
## [1] "George Orwell"
##
## $published_year
## [1] 1949
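The calls behind these four outputs are not shown. Indexing expressions like the ones below would produce them, assuming the book list defined earlier:

```r
book <- list(title = "Nineteen Eighty-Four: A Novel",
             author = "George Orwell",
             published_year = 1949,
             pages = 328)

book[3]     # a list holding only published_year
book[-3]    # a list with everything except the third element
book[[3]]   # 1949: the element itself, not a list
book[2:3]   # a list holding author and published_year
```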
By name using $ or [[""]]. With $, R accepts partial matching of element names.
## [1] "Nineteen Eighty-Four: A Novel"
## [1] "Nineteen Eighty-Four: A Novel"
## [1] "Nineteen Eighty-Four: A Novel"
## $title
## [1] "Nineteen Eighty-Four: A Novel"
##
## $author
## [1] "George Orwell"
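As a sketch, the outputs above could come from calls such as these. Which abbreviation the original used for partial matching is unknown, so book$t is an assumption.

```r
book <- list(title = "Nineteen Eighty-Four: A Novel",
             author = "George Orwell",
             published_year = 1949,
             pages = 328)

book$title                  # full name with $
book[["title"]]             # full name with [[ ]]
book$t                      # partial match: only "title" starts with "t"
book[c("title", "author")]  # several names at once, returned as a list

# [[ ]] matches names exactly by default, so an abbreviation finds nothing.
book[["t"]]                 # NULL
```

Partial matching is convenient at the console but fragile in scripts, since adding a second element that starts with the same letters silently changes the result to NULL.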
Note: Using [] for list indexing results in another list. If we want to access the elements of the list, we should use the double brackets [[]] as the indexing operator or use the dollar sign $ to access the named components.
A list can contain other lists.
The fact that a list can contain a list makes it a recursive object in R. Functions can also be recursive, which we’ll discuss later.
## [[1]]
## [1] "this list references another list"
##
## [[2]]
## [[2]]$title
## [1] "Nineteen Eighty-Four: A Novel"
##
## [[2]]$author
## [1] "George Orwell"
##
## [[2]]$published_year
## [1] 1949
##
## [[2]]$pages
## [1] 328
To access nested elements, we can stack up the square brackets.
## [1] 328
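A nested list like the one printed above, and the stacked indexing that reaches into it, might be written as follows. The name nested is hypothetical.

```r
book <- list(title = "Nineteen Eighty-Four: A Novel",
             author = "George Orwell",
             published_year = 1949,
             pages = 328)

# A list whose second element is itself a list.
nested <- list("this list references another list", book)

nested[[2]][["pages"]]  # 328: second element, then its "pages" component
nested[[2]][[4]]        # 328: the same value, selected by position
nested[[2]]$pages       # 328: [[ ]] and $ can be mixed
```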
5.5 Data frame
A data frame is a list with class data.frame.
Data frames are used to store spreadsheet-like data. A data frame has rows and columns. Each column can hold a different type of data, but all columns must have the same length. The columns must have names. The components of the data frame can be vectors, factors, numeric matrices, lists, or other data frames.
Data frames are particularly good for representing observational data.
To create a data frame, use data.frame().
laureate <- c("Bob Dylan", "Mo Yan", "Ernest Hemingway", "Winston Churchill", "Bertrand Russell")
year <- c(2016, 2012, 1954, 1953, 1950)
country <- c("United States", "China", "United States", "United Kingdom", "United Kingdom")
genre <- c("poetry, songwriting", "novel, short story", "novel, short story, screenplay", "history, essay, memoirs", "philosophy")
nobel_prize_literature <- data.frame(laureate, year, country, genre)
nobel_prize_literature
## laureate year country genre
## 1 Bob Dylan 2016 United States poetry, songwriting
## 2 Mo Yan 2012 China novel, short story
## 3 Ernest Hemingway 1954 United States novel, short story, screenplay
## 4 Winston Churchill 1953 United Kingdom history, essay, memoirs
## 5 Bertrand Russell 1950 United Kingdom philosophy
Note: A data frame is not a matrix; it is a list interpreted as a data frame.
## [1] "list"
## [1] "data.frame"
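The two outputs above match what typeof() and class() return for a data frame. A small sketch with a cut-down, two-row data frame (the name df is ours):

```r
df <- data.frame(laureate = c("Bob Dylan", "Mo Yan"),
                 year = c(2016, 2012))

typeof(df)     # "list": how the object is stored
class(df)      # "data.frame": how R treats it
is.list(df)    # TRUE
is.matrix(df)  # FALSE
```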
data frame indexing
We can refer to the components of a data frame by name using the list operators $ or [[]].
## [1] "Bob Dylan" "Mo Yan" "Ernest Hemingway" "Winston Churchill" "Bertrand Russell"
## [1] "Bob Dylan" "Mo Yan" "Ernest Hemingway" "Winston Churchill" "Bertrand Russell"
Or using matrix-like notation.
## laureate year country genre
## 1 Bob Dylan 2016 United States poetry, songwriting
Logical conditions are allowed and are in fact frequently used.
## [1] "Winston Churchill" "Bertrand Russell"
## [1] FALSE FALSE FALSE TRUE TRUE
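The indexing styles in this section can be sketched as below. We use a three-column copy of the data named nobel and assume the logical condition tested the country column; both choices are ours, since the original calls are not shown.

```r
nobel <- data.frame(
  laureate = c("Bob Dylan", "Mo Yan", "Ernest Hemingway",
               "Winston Churchill", "Bertrand Russell"),
  year = c(2016, 2012, 1954, 1953, 1950),
  country = c("United States", "China", "United States",
              "United Kingdom", "United Kingdom"),
  stringsAsFactors = FALSE  # keep text columns as character in any R version
)

nobel$laureate         # a column by name, list style
nobel[["laureate"]]    # the same column
nobel[1, ]             # first row, matrix style

is_uk <- nobel$country == "United Kingdom"
is_uk                  # FALSE FALSE FALSE TRUE TRUE
nobel$laureate[is_uk]  # "Winston Churchill" "Bertrand Russell"
nobel[is_uk, c("laureate", "year")]  # matching rows, selected columns
```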
5.6 Summary
- All elements of a vector must have the same type, or mode.
- Lists are a general form of vector. Elements of a list need not be of the same type, or mode. Lists can contain other objects, such as vectors, lists and data frames. Lists provide a convenient way to return the results of a statistical computation.
- The underlying storage mechanism for an array (including a matrix) is a vector.
## [1] "numeric"
## [1] "matrix" "array"
## [1] "numeric"
## [1] "array"
- Factors provide compact ways to handle categorical data.
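The mode and class outputs shown in this summary are consistent with the checks below. This is a sketch with objects m and a of our own choosing.

```r
m <- matrix(1:6, nrow = 2)
a <- array(1:12, dim = c(2, 3, 2))

mode(m)       # "numeric"
class(m)      # "matrix" "array" in R 4.0.0 and later
mode(a)       # "numeric"
class(a)      # "array"

# Dropping the dimensions recovers the underlying vector.
as.vector(m)  # 1 2 3 4 5 6
```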
W. N. Venables, D. M. Smith and the R Core Team. (2022). An Introduction to R.