Chapter 8 Data Structures
In the introduction section of these notes, we concentrated on data frames, which we created and manipulated using dplyr. There are other data structures that are used in R, and it is useful to learn how to manipulate them. Furthermore, it is also useful to be able to use base R functionality to do certain manipulations on a data frame. This chapter will cover manipulation of different data structures within R, using both some tidyverse functionality and base R tools.
8.1 Vectors
R operates on vectors, where we think of a vector as a collection of objects, usually numbers. The first thing we need to be able to do is define an arbitrary collection using the c() function. The “c” stands for concatenation. We are essentially telling R to ‘bind’ together these numbers into a simple object, commonly known as a vector.
## [1] 1 2 3 4
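The command that produced this output is not shown above. A minimal sketch that gives the same result follows; the variable name x is an assumption made for illustration.

```r
# Bind four numbers together into a single vector
x <- c(1, 2, 3, 4)
x
```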
There are many other ways to define vectors, especially for structures that include integers or repeated sequences. The function rep(x, times) is a quick base R function that repeats x the number of times specified by times. The rep() function has other arguments that you might like to explore using the ?rep command. One of them repeats each element a number of times rather than repeating the entire sequence; this argument is named each rather than times. Below we use some quick R functionality to create a vector of twos, as well as a vector of strings containing ’A’s and ’B’s.
## [1] 2 2 2 2 2
## [1] "A" "B" "A" "B" "A" "B"
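The calls behind these two outputs are not shown; one plausible reconstruction is sketched below. The variable names are assumptions, and the last line is an extra illustration of the each argument rather than something taken from the output above.

```r
twos    <- rep(2, 5)                   # the value 2 repeated five times
ab      <- rep(c('A', 'B'), times = 3) # the whole sequence A, B repeated three times
ab_each <- rep(c('A', 'B'), each = 3)  # each element repeated three times: A A A B B B
twos
ab
ab_each
```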
Finally, we can also define a sequence of numbers using the seq(from, to, by, length.out) function, which expects the user to supply at most three of its four main arguments; the remaining one is determined by the others. The arguments are from, to, by, and length.out. Here from is the starting point of the sequence, to is the ending point, by is the difference between any two successive elements, and length.out is the total number of elements in the vector. Here are several ways we can create a sequence of integers. Notice that the last two examples actually create non-integer values, showing off the flexibility of the seq command. When working strictly with integers, there are shortcuts, such as the 1:4 given below.
## [1] 1 2 3 4
## [1] 1 2 3 4
## [1] 1 2 3 4
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
## [1] 1.0 1.4 1.8 2.2 2.6 3.0 3.4 3.8 4.2 4.6 5.0
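The five calls that generated this output are not shown. The sketch below is one set of calls that reproduces it; the exact arguments the author used are an assumption, and the names s1 through s5 are purely illustrative.

```r
s1 <- seq(from = 1, to = 4, by = 1)  # start, end, and step size
s2 <- seq(1, 4)                      # by defaults to 1
s3 <- 1:4                            # integer shortcut
s4 <- seq(1, 5, by = 0.5)            # steps of one half
s5 <- seq(1, 5, length.out = 11)     # 11 evenly spaced values, so the step is 0.4
s1; s2; s3; s4; s5
```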
If we have two vectors and we wish to combine them, we can again use the c() function.
## [1] 1 2 3 4 5 6
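A short sketch of that combination, assuming two hypothetical vectors x and y whose contents were chosen to match the output above:

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)
z <- c(x, y)   # the elements of x followed by the elements of y
z
```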
8.1.1 Accessing Vector Elements
Suppose I have defined a vector and I am interested in accessing whatever is in the first spot of the vector, or perhaps the 3rd or 5th element. To do that we use the [] notation, where the square bracket represents a subscript.
## [1] "A"
## [1] "D"
This sub-scripting notation can get more complicated. For example I might want the 2nd and 3rd element or the 3rd through 5th elements.
## [1] "B" "C"
## [1] "C" "D" "F"
Finally, I might be interested in getting the entire vector except for a certain element. To do this, R allows us to use the square bracket notation with a negative index number. A negative will tell R to remove that element and return all others.
## [1] "B" "C" "D" "F"
## [1] "C" "D" "F"
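The vector definition and the subscripting calls are missing from the text above. The sketch below reconstructs them from the printed results; the name vec and the exact index expressions are assumptions.

```r
vec <- c('A', 'B', 'C', 'D', 'F')

vec[1]         # first element: "A"
vec[4]         # fourth element: "D"
vec[c(2, 3)]   # 2nd and 3rd elements: "B" "C"
vec[3:5]       # 3rd through 5th elements: "C" "D" "F"
vec[-1]        # everything except the first element
vec[-c(1, 2)]  # everything except the first two elements
```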
Now is a good time to address what the [1] is doing in our output. Because vectors are often very long and might span multiple lines, R is trying to help us by telling us the index number of the leftmost value. If we have a very long vector, the second line of values will start with the index of the first value on the second line.
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
Here the [1] is telling me that a is the first element of the vector and the [20] is telling me that t is the 20th element of the vector.
8.1.2 Scalar Functions Applied to Vectors
It is very common to want to perform some operation on all the elements of a vector simultaneously. For example, I might want to take the absolute value of every element. Functions that are inherently defined on single values will almost always apply the function to each element of the vector if given a vector.
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5
## [1] 5 4 3 2 1 0 1 2 3 4 5
## [1] 6.737947e-03 1.831564e-02 4.978707e-02 1.353353e-01 3.678794e-01
## [6] 1.000000e+00 2.718282e+00 7.389056e+00 2.008554e+01 5.459815e+01
## [11] 1.484132e+02
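The three outputs above are consistent with the following sketch, in which abs() and exp() are applied to every element at once. The vector name x and the choice of exp() as the third call are inferred from the printed values, not stated in the text.

```r
x <- -5:5   # the integers from -5 to 5
x
abs(x)      # absolute value of every element
exp(x)      # e raised to the power of every element
```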
8.1.3 Vector Algebra
All algebra done with vectors will be done element-wise by default. For matrix and vector multiplication as usually defined by mathematicians, use %*% instead of *. So two vectors added together result in their individual elements being summed.
## [1] 6 8 10 12
## [1] 5 12 21 32
But if we wished to take the inner product, a common linear algebra calculation, we would need to use the %*% operator. The inner product is a scalar, which R returns as a 1x1 matrix.
## [,1]
## [1,] 70
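The element-wise and inner-product results above can be reproduced with the sketch below. The vectors x and y are assumptions chosen so that the sums, products, and inner product match the printed values.

```r
x <- c(1, 2, 3, 4)
y <- c(5, 6, 7, 8)

x + y     # element-wise sum:     6  8 10 12
x * y     # element-wise product: 5 12 21 32
x %*% y   # inner product, returned as a 1x1 matrix containing 70
```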
R does another trick when doing vector algebra. If the lengths of the two vectors don’t match, R will recycle the elements of the shorter vector to come up with a vector the same length as the longer. This is potentially confusing and can cause problems if you are not aware of this behavior. Some exercises below will give you some insight into these issues. To understand how recycling works, let’s consider taking a vector and adding 1.
## [1] 2 3 4 5
We notice that R produces a vector with 1
added to each value of the vector. How it does this in the background is by recycling the shorter vector to be of the same size as the larger vector. That is, it created a hidden structure c(1,1,1,1)
, which was then added to the vector x
. But what if we try to add a shorter vector like c(1,2)
instead?
## [1] 2 4 4 6
Notice that R recycled the values within c(1,2), creating a hidden structure c(1,2,1,2), which was then added to the vector x. If this is what the user intended, then everything is okay. However, if it was NOT the intention, then we must be careful when doing simple vector calculations to provide the exact details of what we are looking to do (i.e. create an equal length vector of exactly the elements we want added). Be aware that R provided no warning or message above; it did the recycling implicitly, expecting the user to know what is going on!
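The recycling behavior described above can be sketched as follows. The vector x is an assumption consistent with the printed results; the last two lines are extra illustrations that are not part of the original output.

```r
x <- c(1, 2, 3, 4)

x + 1         # 1 is recycled to c(1,1,1,1):     2 3 4 5
x + c(1, 2)   # c(1,2) is recycled to c(1,2,1,2): 2 4 4 6

# Spelling out the recycled vector makes the intent explicit
x + rep(c(1, 2), times = 2)

# If the longer length is not a multiple of the shorter one, R still
# recycles, but this time it also issues a warning
x + c(1, 2, 3)
```

Writing the recycled vector out with rep() costs a few keystrokes but leaves no doubt about which values are being added.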
8.1.4 Commonly Used Vector Functions
Function | Result |
---|---|
min(x) | Minimum value in vector x |
max(x) | Maximum value in vector x |
length(x) | Number of elements in vector x |
sum(x) | Sum of all the elements in vector x |
mean(x) | Mean of the elements in vector x |
median(x) | Median of the elements in vector x |
var(x) | Variance of the elements in vector x |
sd(x) | Standard deviation of the elements in x |
pmax(x,y) | Pairwise maximum of x and y |
pmin(x,y) | Pairwise minimum of x and y |
Putting this all together, we can perform tedious calculations with ease. To demonstrate how scalars, vectors, and functions of them work together, we will calculate the variance of 5 numbers. Recall that sample variance (\(s^2\)) is defined as
\[ s^2_x=\frac{\sum\limits_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}{n-1} \]
Here are some common statistical calculations one might do in an STA 270 or STA 570 course:
## [1] 6
## [1] -4 -2 0 2 4
## [1] 16 4 0 4 16
## [1] 40
## [1] 5
## [1] 10
## [1] 10
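The step-by-step variance calculation that produced these seven outputs is not shown. The sketch below reproduces them, assuming the five numbers are 2, 4, 6, 8, 10 (the only values consistent with a mean of 6 and the printed deviations); the variable names are illustrative.

```r
x <- c(2, 4, 6, 8, 10)

xbar <- mean(x)               # sample mean: 6
xbar
x - xbar                      # deviations from the mean: -4 -2 0 2 4
(x - xbar)^2                  # squared deviations: 16 4 0 4 16
sum((x - xbar)^2)             # sum of squared deviations: 40
n <- length(x)                # sample size: 5
n
sum((x - xbar)^2) / (n - 1)   # sample variance by hand: 10
var(x)                        # built-in function gives the same answer: 10
```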
8.2 Matrices
We often want to store numerical data in a square or rectangular format and mathematicians will call these “matrices”. These will have two dimensions, rows and columns. To create a matrix in R we can create it directly using the matrix()
command which requires the data to fill the matrix with, and optionally, some information about the number of rows and columns:
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Notice that because we only gave it six values, the information about the number of columns is redundant and could be left off, and R would figure out how many columns are needed. Next notice that the order that R chose to fill in the matrix was to fill in the first column, then the second, and then the third. If we wanted to fill the matrix in order of the rows first, then we’d use the optional byrow=TRUE argument.
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
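The two matrix() calls behind the outputs above are not shown. A sketch that reproduces both layouts follows; the names W and W_byrow are assumptions for illustration.

```r
# Default: values fill the first column, then the second, then the third
W <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
W

# byrow = TRUE: values fill the first row, then the second
W_byrow <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
W_byrow

# ncol can be omitted because six values in two rows imply three columns
matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
```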
The alternative to the matrix() command is to create two columns as individual vectors and just push them together. Or we could treat the same vectors as rows and stack them instead. To do this we’ll use a group of functions that bind vectors together. To join two column vectors together, we’ll use cbind, and to bind rows together we’ll use the rbind function.
## a b
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2] [,3]
## a 1 2 3
## b 4 5 6
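The binding calls are missing above. The following sketch reproduces both outputs, assuming two vectors named a and b (the names are taken from the row and column labels in the printed matrices).

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

cbind(a, b)   # a and b become the columns of a 3 x 2 matrix
rbind(a, b)   # a and b become the rows of a 2 x 3 matrix
```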
Notice that doing this has provided R with some names for the individual rows and columns. I can change these using the commands colnames()
and rownames()
.
M <- matrix(1:6, nrow=3, ncol=2, byrow=TRUE)
colnames(M) <- c('Column1', 'Column2') # set column labels
rownames(M) <- c('Row1', 'Row2','Row3') # set row labels
M
## Column1 Column2
## Row1 1 2
## Row2 3 4
## Row3 5 6
Accessing a particular element of a matrix is done in a similar manner as with vectors, using the [ ]
notation, but this time we must specify which row and which column. Notice that this scheme always is [row, col]
.
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [1] 4
## [1] 1 4
I might want to grab a single row or a single column out of a matrix, which is sometimes referred to as taking a slice of the matrix. I could figure out how long that vector is and list every index, but often I’m too lazy. Instead I can just specify the particular row or column I want and leave the other index blank.
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [1] 1 4
## [1] 4 5 6
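The indexing calls for the matrix outputs above are not shown. The sketch below reconstructs them under the assumption that M is the 3 x 2 matrix printed above, filled by column.

```r
M <- matrix(1:6, nrow = 3, ncol = 2)
M

M[1, 2]         # row 1, column 2: 4
M[1, c(1, 2)]   # row 1, columns 1 and 2: 1 4

M[1, ]          # all of row 1 (column index left blank): 1 4
M[, 2]          # all of column 2 (row index left blank): 4 5 6
```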
8.3 Data Frames
Matrices are great for mathematical operations, but I also want to be able to store data that is not numerical. For example, I might want to store a categorical variable such as manufacturer brand. To generalize our concept of a matrix to include these types of data, we will create a structure called a data.frame. These are very much like a simple Excel spreadsheet where each column represents a different trait or measurement type and each row represents an individual.
Perhaps the easiest way to create a data frame is to just type the columns of data
data <- tibble(
Name = c('Bob','Jeff','Mary'),
Score = c(90, 75, 92)
)
data <- data.frame(
Name = c('Bob','Jeff','Mary'),
Score = c(90, 75, 92)
)
# Show the data.frame
data
## Name Score
## 1 Bob 90
## 2 Jeff 75
## 3 Mary 92
Because a data frame feels like a matrix, R also allows matrix notation for accessing particular values.
Format | Result |
---|---|
[a,b] | Element in row a and column b |
[a,] | All of row a |
[,b] | All of column b |
Because the columns have meaning and we have given them column names, it is desirable to access an element by the name of the column as opposed to the column number. In a large Excel spreadsheet it gets annoying trying to recall which column something was in. “Was total biomass in column P or Q?” A system where the name is used to define the column Total.Biomass is much nicer to work with and leads to fewer mistakes.
## [1] "Bob" "Jeff" "Mary"
## [1] "Jeff"
I can mix the [ ]
notation with the column names. The following is also acceptable:
## [1] "Bob" "Jeff" "Mary"
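The calls that produced the three outputs above are missing. The sketch below is a plausible reconstruction using the data frame defined earlier; the exact expressions are assumptions inferred from the printed results.

```r
data <- data.frame(
  Name  = c('Bob', 'Jeff', 'Mary'),
  Score = c(90, 75, 92)
)

data$Name        # the whole Name column as a vector
data$Name[2]     # second element of that column: "Jeff"
data[, 'Name']   # matrix-style indexing, with a column name instead of a number

# Note: with a tibble instead of a base data.frame, the last call would
# return a one-column tibble rather than a vector.
```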
The next thing we might wish to do is add a new column to a preexisting data frame. There are two ways to do this. First, we could use the cbind()
function to bind two data frames together. Second we could reference a new column name and assign values to it.
Second.score <- data.frame(Score2=c(41,42,43)) # another data.frame
data <- cbind( data, Second.score ) # squish them together
data
## Name Score Score2
## 1 Bob 90 41
## 2 Jeff 75 42
## 3 Mary 92 43
# if you assign a value to a column that doesn't exist, R will create it
data$Score3 <- c(61,62,63) # the Score3 column will be created
data
## Name Score Score2 Score3
## 1 Bob 90 41 61
## 2 Jeff 75 42 62
## 3 Mary 92 43 63
Data frames are very commonly used, and many commonly used functions take a data= argument, with the variables named in the other arguments assumed to be columns of the given data frame. Unfortunately this is not universally supported by all functions, and you must look at the help file for the function you are interested in.
Data frames are also very restrictive in that the shape of the data must be rectangular. If I try to create a new column that doesn’t have enough rows, R will complain.
## Error in `$<-.data.frame`(`*tmp*`, Score4, value = c(1, 2)): replacement has 2 rows, data has 3
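The assignment that triggered this error is not shown. Judging from the message, it was an attempt to add a two-element column named Score4 to a three-row data frame. The sketch below reproduces that under this assumption, wrapping the assignment in tryCatch() so the message can be inspected without halting a script.

```r
data <- data.frame(
  Name  = c('Bob', 'Jeff', 'Mary'),
  Score = c(90, 75, 92)
)

# Try to add a column with only two values to a data frame with three rows
msg <- tryCatch(
  { data$Score4 <- c(1, 2); 'no error' },
  error = function(e) conditionMessage(e)
)
msg          # the text of the error R raised
ncol(data)   # still 2: the failed assignment did not change the data frame
```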
8.3.1 data.frames vs tibbles
Previously we’ve been using data.frame and tibble objects interchangeably, but now is a good time to make a distinction. Essentially a tibble is a data.frame that does more type checking and less coercion during creation and manipulation. So a tibble does less (automatically) and complains more. The rationale for this is that while coercion between data types can be helpful, it often disguises errors that take a long time to track down. On the whole, it is better to force the user to do the coercion explicitly rather than hope that R magically does the right thing.
Second, the printing methods for tibbles prevent them from showing too many rows or columns. This is a very convenient and more user-friendly way to show the data. We can control how many rows or columns are printed using the options() command, which sets global options.
Options | Result |
---|---|
options(tibble.print_max = n, tibble.print_min = m) | If there are more than n rows, print only the first m. |
options(tibble.print_max = Inf) | Always print all the rows. |
options(tibble.width = Inf) | Always print all columns, regardless of the width of the display device. |
Third, tibbles support column names that a base data frame does not handle gracefully. For example, by default data.frame() will alter column names that begin with a number or contain a space in order to make them syntactically valid. Tibbles allow such names as they are, although they must be enclosed in back-quotes when referring to them.
# the tribble() function just creates a tibble, but specifying the information
# in rows. This can be beneficial in creating small data sets by hand.
example <- tribble(
~'1984', ~"Is Awesome",
'George', 20,
'Orwell', 87)
example %>% select( `1984`, `Is Awesome` )
## # A tibble: 2 × 2
## `1984` `Is Awesome`
## <chr> <dbl>
## 1 George 20
## 2 Orwell 87
8.3.2 Access via [ ] vs [[ ]]
To grab elements of a data frame, we have been using [ ], which generally returns the desired rows and columns as a data frame. If we wanted to force R to return the result as a vector, we could add another layer of de-referencing using the double square bracket notation. The way to think of this is that [ ] is the sub-setting function, which returns the same object structure as you send in (just smaller), while [[ ]] is the extractor function, which returns whatever data structure the extracted element has.
## 'data.frame': 3 obs. of 1 variable:
## $ Name: chr "Bob" "Jeff" "Mary"
## chr [1:3] "Bob" "Jeff" "Mary"
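The two calls behind the str() output above are not shown. The sketch below reproduces the contrast, assuming the data frame defined earlier and single-index selection by column name.

```r
data <- data.frame(
  Name  = c('Bob', 'Jeff', 'Mary'),
  Score = c(90, 75, 92)
)

str(data['Name'])     # [ ] keeps the structure: a data frame with one column
str(data[['Name']])   # [[ ]] extracts the column itself: a character vector
```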
The tidyverse command pull() is also an efficient way to de-reference the structure. Pulling a column returns a vector as well, and pull() can be used in a pipeline. This contrasts with the previously discussed select() command, which returns the chosen columns as a data frame.
## 'data.frame': 3 obs. of 1 variable:
## $ Name: chr "Bob" "Jeff" "Mary"
## chr [1:3] "Bob" "Jeff" "Mary"
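A sketch of the select() versus pull() comparison that would give the output above, assuming the dplyr package is installed and the data frame is the one defined earlier:

```r
library(dplyr)

data <- data.frame(
  Name  = c('Bob', 'Jeff', 'Mary'),
  Score = c(90, 75, 92)
)

data %>% select(Name) %>% str()   # select() returns a one-column data frame
data %>% pull(Name) %>% str()     # pull() returns the column as a vector
```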
8.4 Lists
Data frames are useful for organizing and storing data, but sometimes we need to store diverse pieces of information that don’t fit neatly into a data frame. For these cases, we can use a more flexible data structure known as a list. A list can be thought of as a collection (or vector) of objects, where there is no requirement for each element to be of the same type.
For example, let’s say we want to store information about a family. This would include a spouse’s name, as well as a list of siblings’ names, since someone may have more than one sibling. Additionally, I might want to include information about pets and the age of each pet. The key point is that there’s no requirement for the number of pets to match the number of siblings or spouses—lists allow this kind of flexibility in organizing data.
spouse <- 'Micky'
sibs <- c('Tina','Caroline','Brandon','John')
pets <- c('Beau','Tess','Kaylee')
pet.ages <- c(8, 7, 4)
Family <- list(Spouse=spouse, Siblings=sibs, Pets=pets, Pet.Ages=pet.ages) # Create the list
str(Family) # show the structure of object
## List of 4
## $ Spouse : chr "Micky"
## $ Siblings: chr [1:4] "Tina" "Caroline" "Brandon" "John"
## $ Pets : chr [1:3] "Beau" "Tess" "Kaylee"
## $ Pet.Ages: num [1:3] 8 7 4
Notice that the object Family is a list of four elements. The first is the single string containing the spouse’s name. This is followed by two more vectors of strings, each of different length. Finally attached is a numerical vector. Lists do not require their elements to match in type or in length.
Accessing any element of a list uses an indexing scheme similar to matrices and vectors.
## $Pets
## [1] "Beau" "Tess" "Kaylee"
## [1] "Beau" "Tess" "Kaylee"
Generally you will want to access a list element in whatever format it was stored, without keeping the surrounding list, so the [[ ]] form is usually the one to use. We can similarly access a list element by its numerical position.
## [1] "Micky"
## [1] "Beau" "Tess" "Kaylee"
There is a second way I can access elements. For data frames it was convenient to use the notation DataFrame$ColumnName
and we will use the same convention for lists. Actually a data frame is just a list with the requirement that each list element is a vector and all vectors are of the same length. To access the pets names we can use the following notation:
## [1] "Beau" "Tess" "Kaylee"
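The access calls for the list outputs above are missing. The sketch below shows the three styles discussed in this section; the specific expressions are assumptions reconstructed from the printed results.

```r
Family <- list(
  Spouse   = 'Micky',
  Siblings = c('Tina', 'Caroline', 'Brandon', 'John'),
  Pets     = c('Beau', 'Tess', 'Kaylee'),
  Pet.Ages = c(8, 7, 4)
)

Family['Pets']     # [ ] returns a list of length one (printed with $Pets)
Family[['Pets']]   # [[ ]] returns the character vector itself

Family[[1]]        # access by position: the spouse's name
Family[[3]]        # the third element is the pets vector

Family$Pets        # $ notation, equivalent to Family[['Pets']]
```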
To add something new to the list object, we can just make an assignment in a similar fashion as we did for data.frame and just assign a value to a slot that doesn’t (yet!) exist.
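The assignment itself is not shown here. Based on the Kids element that appears in the later str(Family) output, it was presumably something like the following sketch (the list is redefined so that the example runs on its own).

```r
Family <- list(
  Spouse   = 'Micky',
  Siblings = c('Tina', 'Caroline', 'Brandon', 'John'),
  Pets     = c('Beau', 'Tess', 'Kaylee'),
  Pet.Ages = c(8, 7, 4)
)

# Kids does not exist yet, so this assignment creates a fifth element
Family$Kids <- c('Elise', 'Casey')
names(Family)
```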
We can also add extremely complicated items to the list. Here we’ll add a data.frame as another list element.
# Recall that we had previously defined a data.frame called "data"
Family$RandomDataFrame <- data # Assign it to be a list element
str(Family)
## List of 6
## $ Spouse : chr "Micky"
## $ Siblings : chr [1:4] "Tina" "Caroline" "Brandon" "John"
## $ Pets : chr [1:3] "Beau" "Tess" "Kaylee"
## $ Pet.Ages : num [1:3] 8 7 4
## $ Kids : chr [1:2] "Elise" "Casey"
## $ RandomDataFrame:'data.frame': 3 obs. of 4 variables:
## ..$ Name : chr [1:3] "Bob" "Jeff" "Mary"
## ..$ Score : num [1:3] 90 75 92
## ..$ Score2: num [1:3] 41 42 43
## ..$ Score3: num [1:3] 61 62 63
Now we see that the list Family
has six elements and some of those elements are pretty complicated. It is not uncommon to use lists of lists which have a leveled nesting structure and follow a similar syntax to just a single list.
Family$FamilyNest <- Family # Create a list within a list that contains all of Family
Family$Spouse # first level list
## [1] "Micky"
## [1] "Micky"
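Only the first-level access is shown in the code above, although two results are printed. The second was presumably a nested access along the lines of this sketch, which uses a shortened, hypothetical version of the Family list so that it runs on its own.

```r
Family <- list(
  Spouse = 'Micky',
  Pets   = c('Beau', 'Tess', 'Kaylee')
)
Family$FamilyNest <- Family   # nest a copy of the list inside itself

Family$Spouse                          # first-level element
Family$FamilyNest$Spouse               # same value, reached through the nested list
Family[['FamilyNest']][['Spouse']]     # equivalent, using [[ ]] at each level
```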
Most users first run into lists because many statistical procedures return their results in a list object. When a user asks R to perform a regression, the output returned is a list object, and we’ll need to grab particular information from that object afterwards. For example, the output from a t-test in R is a list:
x <- c(5.1, 4.9, 5.6, 4.2, 4.8, 4.5, 5.3, 5.2) # some toy data
results <- t.test(x, alternative='less', mu=5) # do a t-test
str(results) # examine the resulting object
## List of 10
## $ statistic : Named num -0.314
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 7
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.381
## $ conf.int : num [1:2] -Inf 5.25
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num 4.95
## ..- attr(*, "names")= chr "mean of x"
## $ null.value : Named num 5
## ..- attr(*, "names")= chr "mean"
## $ stderr : num 0.159
## $ alternative: chr "less"
## $ method : chr "One Sample t-test"
## $ data.name : chr "x"
## - attr(*, "class")= chr "htest"
We see that results is actually a list with 10 elements in it. To access the p-value we could use:
## [1] 0.3813385
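The extraction call is missing above. Since the element is named p.value in the str() output, the usual list notation applies, as in this sketch (the t-test is repeated so the example runs on its own):

```r
x <- c(5.1, 4.9, 5.6, 4.2, 4.8, 4.5, 5.3, 5.2)     # the toy data from above
results <- t.test(x, alternative = 'less', mu = 5)

results$p.value        # extract the p-value by name
results[['p.value']]   # equivalent, using [[ ]]
results$conf.int       # other elements are reached the same way
```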
We have previously accessed information from models using the broom package, but there are many simple native R commands if you need to quickly access information. It is important to be able to explore data objects using the str() command shown throughout this chapter. R will often output a simplified object that can mask other important information one may be interested in, or obscure how to access the data. For example, asking R to print the object results will hide the complex list structure from the output. Instead, there is a hidden print function defined specifically for objects created by the t.test() function that outputs a simple summarized version.
##
## One Sample t-test
##
## data: x
## t = -0.31399, df = 7, p-value = 0.3813
## alternative hypothesis: true mean is less than 5
## 95 percent confidence interval:
## -Inf 5.251691
## sample estimates:
## mean of x
## 4.95
8.5 Exercises
Exercise 1
Create a vector of three elements (2,4,6) and name that vector vec_a
. Create a second vector, vec_b
, that contains \((8,10,12)\). Add these two vectors together and name the result vec_c
.
Exercise 2
Create a vector, named vec_d
, that contains only two elements (14,20). Add this vector to vec_a
. What is the result and what do you think R did (look up the recycling rule using Google)? What is the warning message that R gives you?
Exercise 3
Next add \(5\) to the vector vec_a. What is the result and what did R do? Why doesn’t it give you a warning message similar to what you saw in the previous problem?
Exercise 4
Generate the vector of integers \(\left\{ 1,2,\dots,5\right\}\) in two different ways.
a) First using the seq()
function
b) Using the a:b
shortcut.
Exercise 5
Generate the vector of even numbers \(\left\{ 2,4,6,\dots,20\right\}\)
a) Using the seq()
function and
b) Using the a:b shortcut and some subsequent algebra. Hint: Generate a sequence of integers then multiply by 2.
Exercise 6
Generate a vector of \(21\) elements that are evenly placed between \(0\) and \(1\) using the seq()
command and name this vector x
.
Exercise 7
Generate the vector \(\left\{ 2,4,8,2,4,8,2,4,8\right\}\) using the rep()
command to replicate the vector c(2,4,8)
.
Exercise 8
Generate the vector \(\left\{ 2,2,2,2,4,4,4,4,8,8,8,8\right\}\) using the rep()
command. You might need to check the help file for rep()
. In particular, look at the optional argument each=
.
Exercise 9
The vector letters
is a built-in vector to R and contains the lower case English alphabet.
a) Extract the 9th element of the letters vector.
b) Extract the sub-vector that contains the 9th, 11th, and 19th elements.
c) Extract the sub-vector that contains everything except the last two elements.
Exercise 10
In this problem, we will work with the matrix
\[ \left[\begin{array}{ccccc} 2 & 4 & 6 & 8 & 10\\ 12 & 14 & 16 & 18 & 20\\ 22 & 24 & 26 & 28 & 30 \end{array}\right]\]
a) Create the matrix in two ways and save the resulting matrix as M.
- Create the matrix using some combination of the seq() and matrix() commands.
- Create the same matrix by some combination of multiple seq() commands and either the rbind() or cbind() command.
b) Extract the second row out of M
.
c) Extract the element in the third row and second column of M
.
Exercise 11
Create and manipulate a data frame.
a) Create a data.frame
named my.trees
that has the following columns:
- Girth = {8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0}
- Height= {70, 65, 63, 72, 81, 83, 66}
- Volume= {10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6}
Complete the following without using dplyr
functions
b) Extract the third observation (i.e. the third row)
c) Extract the Girth column referring to it by name (don’t use a numerical value based on column position).
d) Print out a data frame of all the observations except for the fourth observation. (i.e. Remove the fourth observation/row.)
e) Use the which()
command to create a vector of row indices that have a girth
greater than 10. Call that vector index
.
f) Use the index
vector to create a small data set with just the large girth trees.
g) Use the index
vector to create a small data set with just the small girth trees.
Exercise 12
The following code creates a data.frame
and then has two different methods for removing the rows with NA
values in the column Grade
. Explain the difference between the two.
Exercise 13
Creation of data frames is usually done by binding together vectors while using seq
and rep
commands. However often we need to create a data frame that contains all possible combinations of several variables. The function expand.grid()
addresses this need.
A fun example of using this function is making several graphs of the standard normal distribution versus the t-distribution. Use the expand.grid
function to create a data.frame
with all combinations of x=seq(-4,4,by=.01)
, dist=c('Normal','t')
, and df=c(2,3,4,5,10,15,20,30)
. Use the dplyr::mutate
command with the if_else
command to generate the function heights y
using either dt(x,df)
or dnorm(x)
depending on what is in the distribution column.
Exercise 14
Create and manipulate a list.
a) Create a list named my.test with elements
- x = c(4,5,6,7,8,9,10)
- y = c(34,35,41,40,45,47,51)
- slope = 2.82
- p.value = 0.000131
b) Extract the second element in the list.
c) Extract the element named p.value
from the list.
Exercise 15
The function lm()
creates a linear model, which is a general class of model that includes both regression and ANOVA. We will call this on a data frame and examine the results. For this problem, there isn’t much to figure out, but rather the goal is to recognize the data structures being used in common analysis functions.
a) There are many data sets that are included with R and its packages. One of which is the trees
data which is a data set of \(n=31\) cherry trees. Load this data set into your current workspace using the command:
b) Examine the data frame using the str()
command. Look at the help file for the data using the command help(trees)
or ?trees
.
c) Perform a regression relating the volume of lumber produced to the girth and height of the tree using the following command
d) Use the str()
command to inspect m
. Extract the model coefficients from this list.