Chapter 8 Data Structures

library(tidyverse)

In the introduction section of these notes, we concentrated on data frames which we created and manipulated using dplyr. There are other data structures that are used in R and it is useful to learn how to manipulate those other data structures. Furthermore, it is also useful to be able to use base R functionality to do certain manipulations on a data frame. This chapter will cover manipulation of different data structures within R, using both some tidyverse functionality and base R tools.

8.1 Vectors

R operates on vectors where we think of a vector as a collection of objects, usually numbers. The first thing we need to be able to do is define an arbitrary collection using the c() function. The “c” stands for concatenation. We essentially are telling R to ‘bind’ together these numbers into a simple object, commonly known as a vector.

# Define the vector of numbers 1, ..., 4
c(1,2,3,4)
## [1] 1 2 3 4

There are many other ways to define vectors especially for structures that include integers or repeated sequences. The function rep(x, times) is a quick base R function that repeats x a the number times specified by times. There are substructures of the rep() function that you might like to explore using the ?rep() command. This includes repeating each element of a number of times rather than the entire sequence, this sub-function is denoted by by rather than times. Below we use some quick R functionality to create a vector of twos, as well as a vector of strings containing ’A’s and ’B’s.

rep(2, 5)              # repeat 2 five times... 2 2 2 2 2
## [1] 2 2 2 2 2
rep( c('A','B'), 3 )   # repeat A B three times  A B A B A B
## [1] "A" "B" "A" "B" "A" "B"

Finally, we can also define a sequence of numbers using the seq(from, to, by, length.out) function which expects the user to supply 3 out of 4 possible arguments. The possible arguments are from, to, by, and length.out. From is the starting point of the sequence, to is the ending point, by is the difference between any two successive elements, and length.out is the total number of elements in the vector. Here are many ways we can create a sequence of integers. Notice that the last two examples actually create real numbers, showing off the flexibility of the seq command. When working strictly with integers, there are many shortcuts, such as the 1:4 given below.

seq(from=1, to=4, by=1)
## [1] 1 2 3 4
seq(1,4)        # 'by' has a default of 1   
## [1] 1 2 3 4
1:4             # a shortcut for seq(1,4)
## [1] 1 2 3 4
seq(1,5, by=.5)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
seq(1,5, length.out=11) 
##  [1] 1.0 1.4 1.8 2.2 2.6 3.0 3.4 3.8 4.2 4.6 5.0

If we have two vectors and we wish to combine them, we can again use the c() function.

vec1 <- c(1,2,3)
vec2 <- c(4,5,6)
vec3 <- c(vec1, vec2)
vec3
## [1] 1 2 3 4 5 6

8.1.1 Accessing Vector Elements

Suppose I have defined a vector

foo <- c('A', 'B', 'C', 'D', 'F')

and I am interested in accessing whatever is in the first spot of the vector. Or perhaps the 3rd or 5th element. To do that we use the [] notation, where the square bracket represents a subscript.

foo[1]  # First element in vector foo
## [1] "A"
foo[4]  # Fourth element in vector foo
## [1] "D"

This sub-scripting notation can get more complicated. For example I might want the 2nd and 3rd element or the 3rd through 5th elements.

foo[c(2,3)]  # elements 2 and 3
## [1] "B" "C"
foo[3:5]     # elements 3 to 5
## [1] "C" "D" "F"

Finally, I might be interested in getting the entire vector except for a certain element. To do this, R allows us to use the square bracket notation with a negative index number. A negative will tell R to remove that element and return all others.

foo[-1]          # everything but the first element
## [1] "B" "C" "D" "F"
foo[ -1*c(1,2) ] # everything but the first two elements
## [1] "C" "D" "F"

Now is a good time to address what is the [1] doing in our output? Because vectors are often very long and might span multiple lines, R is trying to help us by telling us the index number of the left most value. If we have a very long vector, the second line of values will start with the index of the first value on the second line.

# The letters vector is a vector of all 26 lower-case letters
letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

Here the [1] is telling me that a is the first element of the vector and the [20] is telling me that t is the 20th element of the vector.

8.1.2 Scalar Functions Applied to Vectors

It is very common to want to perform some operation on all the elements of a vector simultaneously. For example, I might want take the absolute value of every element. Functions that are inherently defined on single values will almost always apply the function to each element of the vector if given a vector.

x <- -5:5
x
##  [1] -5 -4 -3 -2 -1  0  1  2  3  4  5
abs(x) # find the absolute value of each element
##  [1] 5 4 3 2 1 0 1 2 3 4 5
exp(x) # find e^{value} of each element of the vector
##  [1] 6.737947e-03 1.831564e-02 4.978707e-02 1.353353e-01 3.678794e-01
##  [6] 1.000000e+00 2.718282e+00 7.389056e+00 2.008554e+01 5.459815e+01
## [11] 1.484132e+02

8.1.3 Vector Algebra

All algebra done with vectors will be done element-wise by default. For matrix and vector multiplication as usually defined by mathematicians, use %*% instead of *. So two vectors added together result in their individual elements being summed.

x <- 1:4
y <- 5:8
x + y
## [1]  6  8 10 12
x * y
## [1]  5 12 21 32

But if we wished to take the inner-product, a common linear algebra calculation, we would need to use the %*% structure. This will take the inner-product as returning a scalar, which in R will be a 1x1 matrix.

x %*% y
##      [,1]
## [1,]   70

R does another trick when doing vector algebra. If the lengths of the two vectors don’t match, R will recycle the elements of the shorter vector to come up with vector the same length as the longer. This is potentially confusing and can cause problems if you are not aware of this behavior. Some exercises below will give you some insight with these issues. To understand how recycling work, lets consider taking a vector and adding 1.

x <- 1:4
x + 1
## [1] 2 3 4 5

We notice that R produces a vector with 1 added to each value of the vector. How it does this in the background is by recycling the shorter vector to be of the same size as the larger vector. That is, it created a hidden structure c(1,1,1,1), which was then added to the vector x. But what if we try to add a shorter vector like c(1,2) instead?

x + c(1,2)
## [1] 2 4 4 6

Notice that R recycled the values within c(1,2), creating a hidden structure c(1,2,1,2), which was then added to the vector x. If this is what the user intended, then everything is okay. However, if it was NOT the intention, then we must be careful when doing simple vector calculations to provide the exact details of what we are looking to do (i.e. create an equal length vector of exactly the elements we want added). Be aware, R provided no warning or messages above, it did the recycling implicitly expecting the user to know what is going on!

8.1.4 Commonly Used Vector Functions

Function Result
min(x) Minimum value in vector x
max(x) Maximum value in vector x
length(x) Number of elements in vector x
sum(x) Sum of all the elements in vector x
mean(x) Mean of the elements in vector x
median(x) Median of the elements in vector x
var(x) Variance of the elements in vector x
sd(x) Standard deviation of the elements in x
pmax(x,y) Pairwise maximum of x and y
pmin(x,y) Pairwise minimum of x and y

Putting this all together, we can easily perform tedious calculations with ease. To demonstrate how scalars, vectors, and functions of them work together, we will calculate the variance of 5 numbers. Recall that sample variance (\(s^2\)) is defined as

\[ s^2_x=\frac{\sum\limits_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}{n-1} \]

Here are some common statistical calculations one might do in an STA 270 or STA 570 course:

x <- c(2,4,6,8,10)
xbar <- mean(x)         # calculate the mean
xbar
## [1] 6
x - xbar                # calculate the deviations 
## [1] -4 -2  0  2  4
(x-xbar)^2              # calculated squared deviations
## [1] 16  4  0  4 16
sum( (x-xbar)^2 )       # find the sum of all squared deviations
## [1] 40
n <- length(x)          # how many data points do we have
n
## [1] 5
sum((x-xbar)^2)/(n-1)   # calculating the sample variance by hand using what we setup above
## [1] 10
var(x)                  # built-in variance function (does all the above)
## [1] 10

8.2 Matrices

We often want to store numerical data in a square or rectangular format and mathematicians will call these “matrices”. These will have two dimensions, rows and columns. To create a matrix in R we can create it directly using the matrix() command which requires the data to fill the matrix with, and optionally, some information about the number of rows and columns:

W <- matrix( c(1,2,3,4,5,6), nrow=2, ncol=3)
W
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Notice that because we only gave it six values, the information the number of columns is redundant and could be left off and R would figure out how many columns are needed. Next notice that the order that R chose to fill in the matrix was to fill in the first column then the second, and then the third. If we wanted to fill the matrix in order of the rows first, then we’d use the optional byrow=TRUE argument.

W <- matrix( c(1,2,3,4,5,6), nrow=2, byrow=TRUE )
W
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

The alternative to the matrix() command is we could create two columns as individual vectors and just push them together. Or we could have made three rows and lump them by rows instead. To do this we’ll use a group of functions that bind vectors together. To join two column vectors together, we’ll use cbind and to bind rows together we’ll use the rbind function

a <- c(1,2,3)
b <- c(4,5,6)
cbind(a,b)  # Column Bind:  a,b are columns in resultant matrix 
##      a b
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
rbind(a,b)  # Row Bind:     a,b are   rows  in resultant matrix    
##   [,1] [,2] [,3]
## a    1    2    3
## b    4    5    6

Notice that doing this has provided R with some names for the individual rows and columns. I can change these using the commands colnames() and rownames().

M <- matrix(1:6, nrow=3, ncol=2, byrow=TRUE) 
colnames(M) <- c('Column1', 'Column2')       # set column labels
rownames(M) <- c('Row1', 'Row2','Row3')      # set row labels
M
##      Column1 Column2
## Row1       1       2
## Row2       3       4
## Row3       5       6

Accessing a particular element of a matrix is done in a similar manner as with vectors, using the [ ] notation, but this time we must specify which row and which column. Notice that this scheme always is [row, col].

M1 <- matrix(1:6, nrow=3, ncol=2)
M1
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
M1[1,2]     # Grab row 1, column 2 value
## [1] 4
M1[1, 1:2]  # Grab row 1, and columns 1 and 2.
## [1] 1 4

I might want to grab a single row or a single column out of a matrix, which is sometimes referred to as taking a slice of the matrix. I could figure out how long that vector is, but often I’m too lazy. Instead I can just specify the specify the particular row or column I want.

M1
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
M1[1, ]    # grab the 1st row 
## [1] 1 4
M1[ ,2]    # grab second column (the spaces are optional...)
## [1] 4 5 6

8.3 Data Frames

Matrices are great for mathematical operations, but I also want to be able to store data that is numerical. For example I might want to store a categorical variable such as manufacturer brand. To generalize our concept of a matrix to include these types of data, we will create a structure called a data.frame. These are very much like a simple Excel spreadsheet where each column represents a different trait or measurement type and each row will represent an individual.

Perhaps the easiest way to create a data frame is to just type the columns of data

data <- tibble(
  Name  = c('Bob','Jeff','Mary'),   
  Score = c(90, 75, 92)
)
data <- data.frame(
  Name  = c('Bob','Jeff','Mary'),   
  Score = c(90, 75, 92)
)
# Show the data.frame 
data
##   Name Score
## 1  Bob    90
## 2 Jeff    75
## 3 Mary    92

Because a data frame feels like a matrix, R also allows matrix notation for accessing particular values.

Format Result
[a,b] Element in row a and column b
[a,] All of row a
[,b] All of column b

Because the columns have meaning and we have given them column names, it is desirable to want to access an element by the name of the column as opposed to the column number. A large Excel spreadsheet can get annoying recalling which column something was in. “Was total biomass in column P or Q?” A system where the name is used to define the column Total.Biomass is much nicer to work with and helps make fewer mistakes.

data$Name       # The $-sign means to reference a column by its label
## [1] "Bob"  "Jeff" "Mary"
data$Name[2]    # Notice that data$Name results in a vector, which I can manipulate
## [1] "Jeff"

I can mix the [ ] notation with the column names. The following is also acceptable:

data[, 'Name']   # Grab the column labeled 'Name'
## [1] "Bob"  "Jeff" "Mary"

The next thing we might wish to do is add a new column to a preexisting data frame. There are two ways to do this. First, we could use the cbind() function to bind two data frames together. Second we could reference a new column name and assign values to it.

Second.score <- data.frame(Score2=c(41,42,43))  # another data.frame
data <-  cbind( data, Second.score )            # squish them together
data
##   Name Score Score2
## 1  Bob    90     41
## 2 Jeff    75     42
## 3 Mary    92     43
# if you assign a value to a column that doesn't exist, R will create it
data$Score3 <- c(61,62,63) # the Score3 column will created
data
##   Name Score Score2 Score3
## 1  Bob    90     41     61
## 2 Jeff    75     42     62
## 3 Mary    92     43     63

Data frames are very commonly used and many commonly used functions will take a data= argument and all other arguments are assumed to be in the given data frame. Unfortunately this is not universally supported by all functions and you must look at the help file for the function you are interested in.

Data frames are also very restrictive in that the shape of the data must be rectangular. If I try to create a new column that doesn’t have enough rows, R will complain.

data$Score4 <- c(1,2)
## Error in `$<-.data.frame`(`*tmp*`, Score4, value = c(1, 2)): replacement has 2 rows, data has 3

8.3.1 data.frames vs tibbles

Previously we’ve been using data.frame and tibble objects interchangeably, but now is a good time make a distinction. Essentially a tibble is a data.frame that does more type checking and less coercion during creation and manipulation. So a tibble does less (automatically) and complains more. The rational for this is that while coercion between data types can be helpful, it often disguises errors that take a long time to track down. On the whole, is better to force the user to do the coercion explicitly rather than hope that R magically does the right thing.

Second, the printing methods for tibbles prevent it from showing too many rows or columns. This is a very convenient and more user-friendly way to show the data. We can control how many rows or columns are printed using the options() command, which sets all of the global options.

Options Result
options(tibble.print_max = n, tibble.print_min = m) if there are more than n rows, print only the first m.
options(tibble.print_max = Inf) Always print all the rows.
options(tibble.width = Inf) Always print all columns, regardless of the width of the display device.

Third, tibbles support column names that would be rejected by a data frame. For example, a data frame will not allow columns to begin with a number, nor can column names contain a space. These are allowable by tibbles, although they are required to be enclosed by back-quotes when referring to them.

# the tribble() function just creates a tibble, but specifying the information
# in rows. This can be beneficial in creating small data sets by hand.
example <- tribble(
  ~'1984', ~"Is Awesome",
  'George',   20,
  'Orwell',   87)

example %>% select( `1984`, `Is Awesome` )
## # A tibble: 2 × 2
##   `1984` `Is Awesome`
##   <chr>         <dbl>
## 1 George           20
## 2 Orwell           87

8.3.2 Access via [ ] vs [[ ]]

To grab elements of a data frame, we have been using [ ], which returns the desired rows and columns as a data frame. If we wanted to force R to return the result as a vector, we could force another layer of de-referencing using the double square bracket notation. The way to think of this is that [] is the sub-setting function returns the same object structure as you send in (just smaller). The [[ ]] is the extractor function and it will return the data structure of whatever you are aiming to extract.

str(data['Name'])   # Returns a data frame with just the Name column.
## 'data.frame':    3 obs. of  1 variable:
##  $ Name: chr  "Bob" "Jeff" "Mary"
str(data[['Name']]) # Returns a vector that is the Name column.
##  chr [1:3] "Bob" "Jeff" "Mary"

The tidyverse command pull() is also an efficient way to de-reference the structure. Pulling data will result in a vector as well, and can be used in a pipeline. This is a good differentiation from the previously discussed select() command, which provides the columns as a data frame.

data %>% select('Name') %>% str()
## 'data.frame':    3 obs. of  1 variable:
##  $ Name: chr  "Bob" "Jeff" "Mary"
data %>% pull('Name') %>% str()
##  chr [1:3] "Bob" "Jeff" "Mary"

8.4 Lists

Data frames are useful for organizing and storing data, but sometimes we need to store diverse pieces of information that don’t fit neatly into a data frame. For these cases, we can use a more flexible data structure known as a list. A list can be thought of as a collection (or vector) of objects, where there is no requirement for each element to be of the same type.

For example, let’s say we want to store information about a family. This would include a spouse’s name, as well as a list of siblings’ names, since someone may have more than one sibling. Additionally, I might want to include information about pets and the age of each pet. The key point is that there’s no requirement for the number of pets to match the number of siblings or spouses—lists allow this kind of flexibility in organizing data.

spouse <- 'Micky'
sibs <- c('Tina','Caroline','Brandon','John')
pets <- c('Beau','Tess','Kaylee')
pet.ages <- c(8, 7, 4)
Family <- list(Spouse=spouse, Siblings=sibs, Pets=pets, Pet.Ages=pet.ages) # Create the list
str(Family) # show the structure of object
## List of 4
##  $ Spouse  : chr "Micky"
##  $ Siblings: chr [1:4] "Tina" "Caroline" "Brandon" "John"
##  $ Pets    : chr [1:3] "Beau" "Tess" "Kaylee"
##  $ Pet.Ages: num [1:3] 8 7 4

Notice that the object Family is a list of four elements. The first is the single string containing the spouse’s name. This is followed by two more vectors of strings, each of different length. Finally attached is a numerical vector. Lists do not require that any level within match.

Accessing any element of a list uses an indexing scheme similar to matrices and vectors.

Family[  'Pets'  ]   # Return list with just the Pets element
## $Pets
## [1] "Beau"   "Tess"   "Kaylee"
Family[[ 'Pets' ]]   # Return element as whatever structure it is
## [1] "Beau"   "Tess"   "Kaylee"

Generally you will want to access a list element as whatever format it was stored and I don’t want to keep the list. We can similarly access a list based on a numerical location.

Family[[ 1 ]]    # First element of the list is Spouse!
## [1] "Micky"
Family[[ 3 ]]    # Third element of the list is the vector of pets
## [1] "Beau"   "Tess"   "Kaylee"

There is a second way I can access elements. For data frames it was convenient to use the notation DataFrame$ColumnName and we will use the same convention for lists. Actually a data frame is just a list with the requirement that each list element is a vector and all vectors are of the same length. To access the pets names we can use the following notation:

Family$Pets         # Using the '$' notation
## [1] "Beau"   "Tess"   "Kaylee"

To add something new to the list object, we can just make an assignment in a similar fashion as we did for data.frame and just assign a value to a slot that doesn’t (yet!) exist.

Family$Kids <- c('Elise', 'Casey')

We can also add extremely complicated items to my list. Here we’ll add a data.frame as another list element.

# Recall that we previous had defined a data.frame called "data"
Family$RandomDataFrame <- data  # Assign it to be a list element
str(Family)
## List of 6
##  $ Spouse         : chr "Micky"
##  $ Siblings       : chr [1:4] "Tina" "Caroline" "Brandon" "John"
##  $ Pets           : chr [1:3] "Beau" "Tess" "Kaylee"
##  $ Pet.Ages       : num [1:3] 8 7 4
##  $ Kids           : chr [1:2] "Elise" "Casey"
##  $ RandomDataFrame:'data.frame': 3 obs. of  4 variables:
##   ..$ Name  : chr [1:3] "Bob" "Jeff" "Mary"
##   ..$ Score : num [1:3] 90 75 92
##   ..$ Score2: num [1:3] 41 42 43
##   ..$ Score3: num [1:3] 61 62 63

Now we see that the list Family has six elements and some of those elements are pretty complicated. It is not uncommon to use lists of lists which have a leveled nesting structure and follow a similar syntax to just a single list.

Family$FamilyNest <- Family # Create a list within a list that contains all of Family
Family$Spouse # first level list
## [1] "Micky"
Family$FamilyNest$Spouse # second level list
## [1] "Micky"

The place that most users will run into lists is that the output of many statistical procedures will return the results in a list object. When a user asks R to perform a regression, the output returned is a list object, and we’ll need to grab particular information from that object afterwards. For example, the output from a t-test in R is a list:

x <- c(5.1, 4.9, 5.6, 4.2, 4.8, 4.5, 5.3, 5.2)   # some toy data
results <- t.test(x, alternative='less', mu=5)   # do a t-test
str(results)                                     # examine the resulting object
## List of 10
##  $ statistic  : Named num -0.314
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 7
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.381
##  $ conf.int   : num [1:2] -Inf 5.25
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num 4.95
##   ..- attr(*, "names")= chr "mean of x"
##  $ null.value : Named num 5
##   ..- attr(*, "names")= chr "mean"
##  $ stderr     : num 0.159
##  $ alternative: chr "less"
##  $ method     : chr "One Sample t-test"
##  $ data.name  : chr "x"
##  - attr(*, "class")= chr "htest"

We see that result is actually a list with 10 elements in it. To access the p-value we could use:

results$p.value
## [1] 0.3813385

We have previously accessed information from models using the broom package, but there are many simple native R commands if you need to quickly access information. It is important to be able to explore data objects using the str() command shown throughout this chapter. R will often output a simplified object that can mask other important information one may be interested in, or obscures how to access the data. For example, asking R to print the object results will hide the complex list structure from the output. Instead, there is an hidden print function defined specifically for objects created by the t.test() function that output a simple summarized version.

results
## 
##  One Sample t-test
## 
## data:  x
## t = -0.31399, df = 7, p-value = 0.3813
## alternative hypothesis: true mean is less than 5
## 95 percent confidence interval:
##      -Inf 5.251691
## sample estimates:
## mean of x 
##      4.95

8.5 Exercises

Exercise 1

Create a vector of three elements (2,4,6) and name that vector vec_a. Create a second vector, vec_b, that contains \((8,10,12)\). Add these two vectors together and name the result vec_c.

Exercise 2

Create a vector, named vec_d, that contains only two elements (14,20). Add this vector to vec_a. What is the result and what do you think R did (look up the recycling rule using Google)? What is the warning message that R gives you?

Exercise 3

Next add \(5\) to the vector vec_a. What is the result and what did R do? Why doesn’t in give you a warning message similar to what you saw in the previous problem?

Exercise 4

Generate the vector of integers \(\left\{ 1,2,\dots5\right\}\) in two different ways.

a) First using the seq() function

b) Using the a:b shortcut.

Exercise 5

Generate the vector of even numbers \(\left\{ 2,4,6,\dots,20\right\}\)

a) Using the seq() function and

b) Using the a:b shortcut and some subsequent algebra. Hint: Generate a sequence of integers then multiple by 2.

Exercise 6

Generate a vector of \(21\) elements that are evenly placed between \(0\) and \(1\) using the seq() command and name this vector x.

Exercise 7

Generate the vector \(\left\{ 2,4,8,2,4,8,2,4,8\right\}\) using the rep() command to replicate the vector c(2,4,8).

Exercise 8

Generate the vector \(\left\{ 2,2,2,2,4,4,4,4,8,8,8,8\right\}\) using the rep() command. You might need to check the help file for rep(). In particular, look at the optional argument each=.

Exercise 9

The vector letters is a built-in vector to R and contains the lower case English alphabet.

a) Extract the 9th element of the letters vector.

b) Extract the sub-vector that contains the 9th, 11th, and 19th elements.

c) Extract the sub-vector that contains everything except the last two elements.

Exercise 10

In this problem, we will work with the matrix

\[ \left[\begin{array}{ccccc} 2 & 4 & 6 & 8 & 10\\ 12 & 14 & 16 & 18 & 20\\ 22 & 24 & 26 & 28 & 30 \end{array}\right]\]

a) Create the matrix in two ways and save the resulting matrix as M.

  1. Create the matrix using some combination of the seq() and matrix() commands.

  2. Create the same matrix by some combination of multiple seq() commands and either the rbind() or cbind() command.

b) Extract the second row out of M.

c) Extract the element in the third row and second column of M.

Exercise 11

Create and manipulate a data frame.

a) Create a data.frame named my.trees that has the following columns:

  • Girth = {8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0}
  • Height= {70, 65, 63, 72, 81, 83, 66}
  • Volume= {10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6}

Complete the following without using dplyr functions

b) Extract the third observation (i.e. the third row)

c) Extract the Girth column referring to it by name (don’t use a numerical value based on column position).

d) Print out a data frame of all the observations except for the fourth observation. (i.e. Remove the fourth observation/row.)

e) Use the which() command to create a vector of row indices that have a girth greater than 10. Call that vector index.

f) Use the index vector to create a small data set with just the large girth trees.

g) Use the index vector to create a small data set with just the small girth trees.

Exercise 12

The following code creates a data.frame and then has two different methods for removing the rows with NA values in the column Grade. Explain the difference between the two.

df <- data.frame(name= c('Alice','Bob','Charlie','Daniel'),
                 Grade = c(6,8,NA,9))

df[ -which(  is.na(df$Grade) ), ]
df[  which( !is.na(df$Grade) ), ]

Exercise 13

Creation of data frames is usually done by binding together vectors while using seq and rep commands. However often we need to create a data frame that contains all possible combinations of several variables. The function expand.grid() addresses this need.

expand.grid( F1=c('A','B'), F2=c('x','w','z'), replicate=1:2 )

A fun example of using this function is making several graphs of the standard normal distribution versus the t-distribution. Use the expand.grid function to create a data.frame with all combinations of x=seq(-4,4,by=.01), dist=c('Normal','t'), and df=c(2,3,4,5,10,15,20,30). Use the dplyr::mutate command with the if_else command to generate the function heights y using either dt(x,df) or dnorm(x) depending on what is in the distribution column.

expand.grid( x=seq(-4,4,by=.01), 
             dist=c('Normal','t'), 
             df=c(2,3,4,5,10,15,20,30) ) %>%
  mutate( y = if_else(dist == 't', dt(x, df), dnorm(x) ) ) %>%
  ggplot( aes( x=x, y=y, color=dist) ) + 
  geom_line() + 
  facet_wrap(~df)

Exercise 14

Create and manipulate a list.

a) Create a list named my.test with elements

  • x = c(4,5,6,7,8,9,10)
  • y = c(34,35,41,40,45,47,51)
  • slope = 2.82
  • p.value = 0.000131

b) Extract the second element in the list.

c) Extract the element named p.value from the list.

Exercise 15

The function lm() creates a linear model, which is a general class of model that includes both regression and ANOVA. We will call this on a data frame and examine the results. For this problem, there isn’t much to figure out, but rather the goal is to recognize the data structures being used in common analysis functions.

a) There are many data sets that are included with R and its packages. One of which is the trees data which is a data set of \(n=31\) cherry trees. Load this data set into your current workspace using the command:

data(trees)     # load trees data.frame

b) Examine the data frame using the str() command. Look at the help file for the data using the command help(trees) or ?trees.

c) Perform a regression relating the volume of lumber produced to the girth and height of the tree using the following command

m <- lm( Volume ~ Girth + Height, data=trees)

d) Use the str() command to inspect m. Extract the model coefficients from this list.