Chapter 13 Data Reshaping

# library(tidyr)   # for the gather/spread commands
# library(dplyr)   # for the join stuff
library(tidyverse) # dplyr, tidyr, ggplot2, etc.

I have a YouTube Video Lecture for this chapter.

Most of the time, our data is in the form of a data frame and we are interested in exploring the relationships. However most procedures in R expect the data to show up in a ‘long’ format where each row is an observation and each column is a covariate. In practice, the data is often not stored like that and the data comes to us with repeated observations included on a single row. This is often done as a memory saving technique or because there is some structure in the data that makes the ‘wide’ format attractive. As a result, we need a way to convert data from ‘wide’ to ‘long’ and vice-versa.

Next we need a way to squish two data frames together. It is often advantageous to store data that would be be repeated separately in a different table so that a particular piece of information lives in only one location. This makes the data easier to modify, and more likely to maintain consistence. However, this practice requires that, when necessary, we can add information to a table, that might involve a lot of duplicated rows.

13.1 cbind & rbind

Base R has two functions for squishing two data frames together, but they assume that the data frames are aligned correctly. The c and r parts of cbind and rbind correspond to if we are pushing columns together or rows together.

# Define a tibble using a row-wise layout.
df1 <- tribble(            
  ~ID,  ~First,
  1,   'Alice',
  2,   'Bob',
  3,   'Charlie')

# Define a tibble by columns
df2 <- tibble( Last = c('Anderson', 'Barker', 'Cooper') )

# Squish the two tibbles together by columns - results in a data.frame
People <- cbind( df1, df2 )
People
##   ID   First     Last
## 1  1   Alice Anderson
## 2  2     Bob   Barker
## 3  3 Charlie   Cooper

Squishing together by rows is easy as well.

df3 <- tibble( ID=4, First='Daniel', Last='Davidson' )

People <- rbind( People, df3 )
People
##   ID   First     Last
## 1  1   Alice Anderson
## 2  2     Bob   Barker
## 3  3 Charlie   Cooper
## 4  4  Daniel Davidson

If both inputs have row or column names, then they don’t need to be appropriate arranged as rbind and cbind can figure out how to order them.

There are equivalent functions in the dplyr package that do the same job, but will work more consistently. For example, if we try to bind a row of data that has more columns, bind_rows() will introduce a column of NA values to the smaller data set. Furthermore, if the column orders are mixed up

df4 <- tibble(First='Elise', ID=5, Last='Erikson') # Notice changed columns order! 
rbind(People, df4)  # but rbind() can still figure it out!
##   ID   First     Last
## 1  1   Alice Anderson
## 2  2     Bob   Barker
## 3  3 Charlie   Cooper
## 4  4  Daniel Davidson
## 5  5   Elise  Erikson
df5 <- tibble(First='Frank', ID=6, Last='Fredrick', dob=lubridate::ymd('1980-7-21'))
rbind(People, df5)   # throws an error
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
bind_rows( People, df4, df5) # Inserts NA values as appropriate.
##   ID   First     Last        dob
## 1  1   Alice Anderson       <NA>
## 2  2     Bob   Barker       <NA>
## 3  3 Charlie   Cooper       <NA>
## 4  4  Daniel Davidson       <NA>
## 5  5   Elise  Erikson       <NA>
## 6  6   Frank Fredrick 1980-07-21

In general, I find that rbind() and bind_rows() work really well and I use them quite often. However, cbind() and bind_cols() are less useful because I have to make sure that either I have rownames set up for each data set, or I have to be very careful with the ordering. Instead, it is safer to write code that relies on joins, which will be discussed later in this chapter.

13.2 Table Pivots

There is a common issue with obtaining data with many columns that you wish were organized as rows. For example, I might have data in a grade book that has several homework scores and I’d like to produce a nice graph that has assignment number on the x-axis and score on the y-axis. Unfortunately this is incredibly hard to do when the data is arranged in the following way:

grade.book <- rbind(
  data.frame(name='Alison.Anderson',  HW.1=8, HW.2=5, HW.3=8, HW.4=4),
  data.frame(name='Brandon.Babbage', HW.1=5, HW.2=3, HW.3=6, HW.4=9),
  data.frame(name='Charles.Collins', HW.1=9, HW.2=7, HW.3=9, HW.4=10))
grade.book
##              name HW.1 HW.2 HW.3 HW.4
## 1 Alison.Anderson    8    5    8    4
## 2 Brandon.Babbage    5    3    6    9
## 3 Charles.Collins    9    7    9   10

What we want to do is turn this data frame from a wide data frame into a long data frame. In MS Excel this is called pivoting. Essentially I’d like to create a data frame with three columns: name, assignment, and score. That is to say that each homework datum really has three pieces of information: who it came from, which homework it was, and what the score was. It doesn’t conceptually matter if I store it as 3 rows of 4 columns or 12 rows so long as there is a way to identify how a student scored on a particular homework. So we want to reshape the HW1 to HW4 columns into two columns (assignment and score).

This package was built by the same people that created dplyr and ggplot2 and there is a nice introduction at: [http://blog.rstudio.org/2014/07/22/introducing-tidyr/]

There have been several iterations of functions to convert data from a wide to long and back again. When Hadley first began thinking through these issues in the reshape package, he called these two functions melt and cast. The second iteration in these functions in the tidyr package were called gather and spread. I believe we are finally in the last which are simply named pivot_wider() and pivot_longer().

There are two main verbs to remember:

Function Description
pivot_longer() Gather multiple columns that are related into two columns that contain the original column name and the value. For example for columns HW1, HW2, HW3 we would gather them into two columns: Homework and Score. In this case, we refer to Homework as the names column and Score as the values column.
pivot_wider() This is the opposite function. This takes a names and values pair of columns and forms a new column for each level of the names column(s).
# first we gather the score columns into columns we'll name Homework and Score
tidy.scores <- grade.book %>% 
  pivot_longer( 
    HW.1:HW.4,                # which columns to apply this to
    names_to  = 'Homework',   # What should I call the column of old column names
    values_to = 'Score')      # What should I call the values column
tidy.scores
## # A tibble: 12 x 3
##    name            Homework Score
##    <chr>           <chr>    <dbl>
##  1 Alison.Anderson HW.1         8
##  2 Alison.Anderson HW.2         5
##  3 Alison.Anderson HW.3         8
##  4 Alison.Anderson HW.4         4
##  5 Brandon.Babbage HW.1         5
##  6 Brandon.Babbage HW.2         3
##  7 Brandon.Babbage HW.3         6
##  8 Brandon.Babbage HW.4         9
##  9 Charles.Collins HW.1         9
## 10 Charles.Collins HW.2         7
## 11 Charles.Collins HW.3         9
## 12 Charles.Collins HW.4        10

To spread the names values pairs out into a wide data frame, we use the pivot_wider() command.

# Turn the Homework/Score pair of columns into one column per factor level of Homework
tidy.scores %>% pivot_wider( names_from=Homework, values_from=Score )
## # A tibble: 3 x 5
##   name             HW.1  HW.2  HW.3  HW.4
##   <chr>           <dbl> <dbl> <dbl> <dbl>
## 1 Alison.Anderson     8     5     8     4
## 2 Brandon.Babbage     5     3     6     9
## 3 Charles.Collins     9     7     9    10

Often times, the long format of the data is most helpful for graphing or doing data analysis. Because of this, we often refer to this as the tidy form of the data. Hadley has a nice article about messy data vs tidy data and his article is well worth your time to read, although dplyr and tidyr have matured since he wrote this article. The main point can be summarized by:

  1. Each row in the data frame represents a single object.
  2. Each column represents an attribute that can be measured on every object.
  3. Every object type implicitly forms a table.

In our grade book example, we could consider each student as on object and store the data with 1 row per student and columns for each attribute (homework score). Alternatively, the long form is a Scores data set stores the data with a single row representing a single score. If I have multiple sections of the course and some homeworks are not assigned to one course but are assigned to another, then the long form Scores would be the appropriate storage.

Most of our graphing and data analysis procedures will expect the data to be in a long format, but for both data input and digital storage compression, we might have data given to us in a wide format.

13.3 Spreading a Single Column with separate

It is common for data columns to contain multiple attributes. For example, a name column might contain both first and last names or a file name might contain multiple bits of information broken up by underscores. In the string manipulation chapter, we saw how to use stringr::str_split() to split strings. Now we’ll apply this using the tidyr::separate() function.

grade.book %>% separate(name, sep='\\.', c('F.Name','L.Name'), remove = FALSE)
##              name  F.Name   L.Name HW.1 HW.2 HW.3 HW.4
## 1 Alison.Anderson  Alison Anderson    8    5    8    4
## 2 Brandon.Babbage Brandon  Babbage    5    3    6    9
## 3 Charles.Collins Charles  Collins    9    7    9   10

13.4 Storing Data in Multiple Tables

In many data sets it is common to store data across multiple tables, usually with the goal of minimizing memory used as well as providing minimal duplication of information so any change that must be made is only made in a single place. Recall the guiding idea that each type of object stored should result in its own table.

To see the rational why we might do this, consider building a data set of blood donations by a variety of donors across several years. For each blood donation, we will perform some assay and measure certain qualities about the blood and the patients health at the donation. But should we contain the donor’s address, phone number, and email address in the same data table that holds the information about the blood donated.

I would like to include additional information about the donor where that information doesn’t change overtime. For example we might want to have information about the donor’s birthdate, sex, blood type. However, I don’t want that information in every single donation line. Otherwise if I mistype a birthday and have to correct it, I would have to correct it everywhere. For information about the donor, should live in a donors table, while information about a particular donation should live in the donations table.

Furthermore, there are many Jeffs and Dereks in the world and to maintain a unique identifier (without using Social Security numbers) I will just create a Donor_ID code that will uniquely identify a person. Similarly I will create a Blood_ID that will uniquely identify a blood donation.

Donors
## # A tibble: 3 x 7
##   Donor_ID Name   Blood_Type Birthday   Street       City      State
##   <chr>    <chr>  <chr>      <chr>      <chr>        <chr>     <chr>
## 1 D1       Derek  O+         1976-09-17 7392 Willard Flagstaff AZ   
## 2 D2       Jeff   A-         1974-06-23 873 Vine     Bozeman   MT   
## 3 D3       Aubrey O+         1976-09-17 7392 Willard Flagstaff AZ
Blood_Donations 
## # A tibble: 5 x 6
##   Blood_ID Donor_ID Date       Hemoglobin Systolic Diastolic
##   <chr>    <chr>    <chr>           <dbl>    <dbl>     <dbl>
## 1 B_1      D1       2017-04-14       17.4      120        79
## 2 B_2      D1       2017-06-20       16.5      121        80
## 3 B_3      D2       2017-08-14       16.9      145       101
## 4 B_4      D1       2017-08-26       17.6      120        79
## 5 B_5      D3       2017-08-26       16.1      137        90

If we have a new donor walk in and give blood, then we’ll have to create a new entry in the donors table as well as a new entry in the donations table. If an experienced donor gives again, we just have to create a new entry in the donations table.

Given this data structure, we can now easily create new donations as well as store donor information. In the event that we need to change something about a donor, there is only one place to make that change.

However, having data spread across multiple tables is challenging because I often want that information squished back together. For example, if during routine testing we discover that a blood sample is HIV positive. Then we need to be able to join the blood donations table to the donors table in some sensible manner.

13.5 Table Joins

There are four different types of joins: outer, left, right, and inner joins. Consider the following example tables. Here we consider that there is some key column that is common to both tables.

Join Type Result
inner_join(A,B) Include rows if the key value is in both tables.
left_join(A,B) Include all rows of A, and if the match in B doesn’t exist, just insert a NA.
right_join(A,B) Include all rows of B, and if the match in A doesn’t exist, just insert a NA.
full_join(A,B) Include all rows of A and B and if the necessary match doesn’t exist, insert NA values.

For a practical example

A <- tribble(
  ~ID, ~x,
  'a', 34,    # Notice that A doesn't have ID = d
  'b', 36,
  'c', 38)
B <- tribble(
  ~ID, ~y,
  'b', 56,   # Notice that B doesn't have ID = a
  'c', 57,
  'd', 59)
# only include rows with IDs that are in both tables
inner_join(A,B)
## # A tibble: 2 x 3
##   ID        x     y
##   <chr> <dbl> <dbl>
## 1 b        36    56
## 2 c        38    57
# All the rows in table A, insert NA if the B info is missing
left_join(A,B)
## # A tibble: 3 x 3
##   ID        x     y
##   <chr> <dbl> <dbl>
## 1 a        34    NA
## 2 b        36    56
## 3 c        38    57
# All the rows in table B, insert NA if the A info is missing
right_join(A,B)
## # A tibble: 3 x 3
##   ID        x     y
##   <chr> <dbl> <dbl>
## 1 b        36    56
## 2 c        38    57
## 3 d        NA    59
# All the rows possible, insert NA if the matching info is missing
full_join(A,B)
## # A tibble: 4 x 3
##   ID        x     y
##   <chr> <dbl> <dbl>
## 1 a        34    NA
## 2 b        36    56
## 3 c        38    57
## 4 d        NA    59

Consider the case where we have a data frame of observations of fish and a separate data frame that contains information about lake (perhaps surface area, max depth, pH, etc). I want to store them as two separate tables so that when I have to record a lake level observation, I only input it one place. This decreases the chance that I make a copy/paste error.

To illustrate the different types of table joins, we’ll consider two different tables.

# tibbles are just data.frames that print a bit nicer and don't automatically
# convert character columns into factors.  They behave a bit more consistently
# in a wide variety of situations compared to data.frames.
Fish.Data <- tibble(
  Lake_ID = c('A','A','B','B','C','C'), 
  Fish.Weight=rnorm(6, mean=260, sd=25) ) # make up some data
Lake.Data <- tibble(
  Lake_ID = c(    'B','C','D'),   
  Lake_Name = c('Lake Elaine', 'Mormon Lake', 'Lake Mary'),   
  pH=c(6.5, 6.3, 6.1),
  area = c(40, 210, 240),
  avg_depth = c(8, 10, 38))
Fish.Data
## # A tibble: 6 x 2
##   Lake_ID Fish.Weight
##   <chr>         <dbl>
## 1 A              267.
## 2 A              270.
## 3 B              212.
## 4 B              280.
## 5 C              308.
## 6 C              282.
Lake.Data
## # A tibble: 3 x 5
##   Lake_ID Lake_Name      pH  area avg_depth
##   <chr>   <chr>       <dbl> <dbl>     <dbl>
## 1 B       Lake Elaine   6.5    40         8
## 2 C       Mormon Lake   6.3   210        10
## 3 D       Lake Mary     6.1   240        38

Notice that each of these tables has a column labeled Lake_ID. When we join these two tables, the row that describes lake A should be duplicated for each row in the Fish.Data that corresponds with fish caught from lake A.

full_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 7 x 6
##   Lake_ID Fish.Weight Lake_Name      pH  area avg_depth
##   <chr>         <dbl> <chr>       <dbl> <dbl>     <dbl>
## 1 A              267. <NA>         NA      NA        NA
## 2 A              270. <NA>         NA      NA        NA
## 3 B              212. Lake Elaine   6.5    40         8
## 4 B              280. Lake Elaine   6.5    40         8
## 5 C              308. Mormon Lake   6.3   210        10
## 6 C              282. Mormon Lake   6.3   210        10
## 7 D               NA  Lake Mary     6.1   240        38

Notice that because we didn’t have any fish caught in lake D and we don’t have any Lake information about lake A, when we join these two tables, we end up introducing missing observations into the resulting data frame.

left_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 6 x 6
##   Lake_ID Fish.Weight Lake_Name      pH  area avg_depth
##   <chr>         <dbl> <chr>       <dbl> <dbl>     <dbl>
## 1 A              267. <NA>         NA      NA        NA
## 2 A              270. <NA>         NA      NA        NA
## 3 B              212. Lake Elaine   6.5    40         8
## 4 B              280. Lake Elaine   6.5    40         8
## 5 C              308. Mormon Lake   6.3   210        10
## 6 C              282. Mormon Lake   6.3   210        10
inner_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 4 x 6
##   Lake_ID Fish.Weight Lake_Name      pH  area avg_depth
##   <chr>         <dbl> <chr>       <dbl> <dbl>     <dbl>
## 1 B              212. Lake Elaine   6.5    40         8
## 2 B              280. Lake Elaine   6.5    40         8
## 3 C              308. Mormon Lake   6.3   210        10
## 4 C              282. Mormon Lake   6.3   210        10

The above examples assumed that the column used to join the two tables was named the same in both tables. This is good practice to try to do, but sometimes you have to work with data where that isn’t the case. In that situation you can use the by=c("ColName.A"="ColName.B") syntax where ColName.A represents the name of the column in the first data frame and ColName.B is the equivalent column in the second data frame.

Filtering joins are a useful trick when programming to help discover which rows would be included. In particular, I often want to know which rows would be dropped or don’t have a corresponding match.

Join Type Result
semi_join(A,B) All the rows in A that have a match in B.
anti_join(A,B) All the rows in A that dont have a match in B.

13.6 Row-wise Calculations

Finally, the combination of pivot_longer and pivot_wider allows me to do some very complex calculations across many columns of a data set. For example, I might gather up a set of columns, calculate some summary statistics, and then join the result back to original data set.

grade.book %>%
  group_by(name) %>%
  pivot_longer(HW.1:HW.4, names_to='Homework', values_to='Score' ) %>%
  summarise( HW.avg = mean(Score) ) %>%
  left_join( grade.book, . )
## Joining, by = "name"
##              name HW.1 HW.2 HW.3 HW.4 HW.avg
## 1 Alison.Anderson    8    5    8    4   6.25
## 2 Brandon.Babbage    5    3    6    9   5.75
## 3 Charles.Collins    9    7    9   10   8.75

While this works, it is not very convenient. Instead, we’d like to have some mechanism for doing rowwise calculations. This is actually not as easy as you might expect because data frames are stored in memory to make column-wise operations fast.

13.6.1 Using base apply() function

One solution is to use the base R function apply(). The apply() function applies a function to each row or column (MARGIN=1 or MARGIN=2 respectively).

A <- matrix(1:6, ncol=3)  # Make a matrix of 2 rows, 3 columns
A
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
apply(A, MARGIN = 1, FUN=sum) # Apply sum() function to rows (MARGIN=1)
## [1]  9 12
apply(A, MARGIN = 2, FUN=sum) # Apply sum() function to columns (MARGIN=1)
## [1]  3  7 11

So to add a row calculation, we just need to put together select and apply statements.

#     new.col.name             columns              function
grade.book %>%
  mutate(., HW.avg = select(., HW.1:HW.4) %>% apply(1, mean) ) %>%
  print() # this print is just to show you can keep the pipeline going...
##              name HW.1 HW.2 HW.3 HW.4 HW.avg
## 1 Alison.Anderson    8    5    8    4   6.25
## 2 Brandon.Babbage    5    3    6    9   5.75
## 3 Charles.Collins    9    7    9   10   8.75
grade.book %>%
  select(., HW.1:HW.4) %>%
  apply(1, mean)
## [1] 6.25 5.75 8.75

But this pipeline inside a mutate command is a little cumbersome. The command dplyr::rowwise() causes subsequent actions to be performed rowwise instead of the default of columnwise. rowwise() is actually a special form of group_by() which creates a unique group for each row. The function dplyr::c_across() allows you to use all the select style tricks for picking columns.

# grade.book %>%
#   rowwise() %>%
#   mutate( HW.avg = mean( c(HW.1, HW.2, HW.3, HW.4) ) )  # List the columns to average
# 
# grade.book %>%
#   rowwise() %>%
#   mutate( HW.avg = mean( c_across(HW.1:HW.4) ) )     # Give a column range to average

grade.book %>%
  rowwise() %>%
  mutate( HW.avg = mean( c_across(starts_with('HW'))) )  # Average all HW... columns 
## # A tibble: 3 x 6
## # Rowwise: 
##   name             HW.1  HW.2  HW.3  HW.4 HW.avg
##   <chr>           <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Alison.Anderson     8     5     8     4   6.25
## 2 Brandon.Babbage     5     3     6     9   5.75
## 3 Charles.Collins     9     7     9    10   8.75

Because rowwise() is a special form of grouping, to exit the row-wise calculations, call ungroup().

Finally, there are some base R functions rowSums() and rowMeans() that calculate sums and means across rows. They are actually quite fast, but don’t allow the user to input an arbitrary function.

grade.book %>%
  mutate( HW.avg = rowMeans(across(starts_with('HW'))) )
##              name HW.1 HW.2 HW.3 HW.4 HW.avg
## 1 Alison.Anderson    8    5    8    4   6.25
## 2 Brandon.Babbage    5    3    6    9   5.75
## 3 Charles.Collins    9    7    9   10   8.75

Somewhat confusingly, in this statement we used across which applies a transformation to multiple columns, while the c_across() is designed to work with the rowwise() command.

I prefer the solution that uses the select(., cols) %>% apply(1, fun) chain inside the mutate command because I don’t have do an additional rowwise() ungroup() command in my pipeline.

13.7 Exercises

  1. A common task is to take a set of data that has multiple categorical variables and create a table of the number of cases for each combination. An introductory statistics textbook contains a dataset summarizing student surveys from several sections of an intro class. The two variables of interest for us are Gender and Year which are the students gender and year in college.

    1. Download the dataset and correctly order the Year variable using the following:

      Survey <- read.csv('https://www.lock5stat.com/datasets3e/StudentSurvey.csv', na.strings=c('',' ')) 
    2. Using some combination of dplyr functions, produce a data set with eight rows that contains the number of responses for each gender:year combination. Make sure your table orders the Year variable in the correct order of First Year, Sophmore, Junior, and then Senior. You might want to look at the following functions: dplyr::count and dplyr::drop_na.

    3. Using tidyr commands, produce a table of the number of responses in the following form:

      Gender First Year Sophmore Junior Senior
      Female
      Male
  2. From the book website, there is a .csv file of the daily maximum temperature in Flagstaff at the Pulliam Airport. The direction link is at: https://raw.githubusercontent.com/dereksonderegger/444/master/data-raw/FlagMaxTemp.csv

    1. Create a line graph that gives the daily maximum temperature for 2005. Make sure the x-axis is a date and covers the whole year.
    2. Create a line graph that gives the monthly average maximum temperature for 2013 - 2015. Again the x-axis should be the date and the axis spans 3 years.
  3. Challenging! We often are given data in a table format that is easy for a human to parse, but annoying a program. In the following example we have data of US government expenditures from 1962 to 2015. I downloaded this data from https://obamawhitehouse.archives.gov/omb/budget/Historicals (Table 3.2) on Sept 22, 2019. I separated the Function/Subfunction from a single column into two and removed the sub-sub functions for the military. (Look at the non-gentle version of the file to see what the original file looked like.) Our goal is to end up with a data frame with columns for Function, Subfunction, Year, and Amount. We’ll ignore the “On-budget” and “Off-budget” distinction.

    1. Download the data file, inspect it, and read in the data using the readxl package. Hint, your column names should be the the years.
    2. Remove any row with Total, Subtotal, On-budget or Off-budget. Also remove the row at the bottom that defines what NA means.
    3. For all of the NA values in the Department column, fill them in with the value above. Hint: the function tidyr::fill() will be helpful.
    4. Remove rows that corresponded to the Function name that have no data. Hint, you can just check if the 2015 column is NA.
    5. Reshape the data into four columns for Function, Subfunction, Year, and Amount.
    6. Remove rows that have Amount value of ........... Alternatively, we could have used this as one of the NA strings during the import stage.
    7. Make sure that Year and Amount are numeric. Hint: it is OK to get rid of the estimate rows for 2016+. Alternatively you could transform those
      • by transforming 2016 estimate to just 2016.*
    8. Make a line graph that compares spending for National Defense, Health, Medicare, Income Security, and Social Security for each of the years 2001 through 2015. Notice you’ll have to sum up the sub-functions within each function.
  4. For this problem we will consider two simple data sets.

    A <- tribble(
      ~Name, ~Car,
      'Alice', 'Ford F150',
      'Bob',   'Tesla Model III',
      'Charlie', 'VW Bug')
    
    B <- tribble(
      ~First.Name, ~Pet,
      'Bob',  'Cat',
      'Charlie', 'Dog',
      'Alice', 'Rabbit')
    1. Squish the data frames together to generate a data set with three rows and three columns. Do two ways: first using cbind and then using one of the dplyr join commands.
    2. It turns out that Alice also has a pet guinea pig. Add another row to the B data set. Do this using either the base function rbind, or either of the dplyr functions add_row or bind_rows.
    3. Squish the A and B data sets together to generate a data set with four rows and three columns. Do this two ways: first using cbind and then using one of the dplyr join commands. Which was easier to program? Which is more likely to have an error.
  5. Data table joins are extremely common because effective database design almost always involves having multiple tables for different types of objects. To illustrate both the table joins and the usefulness of multiple tables we will develop a set of data frames that will represent a credit card company’s customer data base. We will have tables for Customers, Retailers, Cards, and Transactions. Below is code that will create and populate these tables.

    Customers <- tribble(
      ~PersonID, ~Name, ~Street, ~City, ~State,
      1, 'Derek Sonderegger',  '231 River Run', 'Flagstaff', 'AZ',
      2, 'Aubrey Sonderegger', '231 River Run', 'Flagstaff', 'AZ',
      3, 'Robert Buscaglia', '754 Forest Heights', 'Flagstaff', 'AZ',
      4, 'Roy St Laurent', '845 Elk View', 'Flagstaff', 'AZ')
    
    Retailers <- tribble(
      ~RetailID, ~Name, ~Street, ~City, ~State,
      1, 'Kickstand Kafe', '719 N Humphreys St', 'Flagstaff', 'AZ',
      2, 'MartAnnes', '112 E Route 66', 'Flagstaff', 'AZ',
      3, 'REI', '323 S Windsor Ln', 'Flagstaff', 'AZ' )
    
    Cards <- tribble(
      ~CardID, ~PersonID, ~Issue_DateTime, ~Exp_DateTime,
      '9876768717278723',  1,  '2019-9-20 0:00:00', '2022-9-20 0:00:00',
      '5628927579821287',  2,  '2019-9-20 0:00:00', '2022-9-20 0:00:00',
      '7295825498122734',  3,  '2019-9-28 0:00:00', '2022-9-28 0:00:00',
      '8723768965231926',  4,  '2019-9-30 0:00:00', '2022-9-30 0:00:00' ) 
    
    Transactions <- tribble(
      ~CardID, ~RetailID, ~DateTime, ~Amount,
      '9876768717278723', 1, '2019-10-1 8:31:23',    5.68,
      '7295825498122734', 2, '2019-10-1 12:45:45',  25.67,
      '9876768717278723', 1, '2019-10-2 8:26:31',    5.68,
      '9876768717278723', 1, '2019-10-2 8:30:09',    9.23,
      '5628927579821287', 3, '2019-10-5 18:58:57',  68.54,
      '7295825498122734', 2, '2019-10-5 12:39:26',  31.84,
      '8723768965231926', 2, '2019-10-10 19:02:20', 42.83) 
    
    Cards <- Cards %>% 
      mutate( Issue_DateTime = lubridate::ymd_hms(Issue_DateTime),
              Exp_DateTime   = lubridate::ymd_hms(Exp_DateTime) )
    Transactions <- Transactions %>% 
      mutate( DateTime = lubridate::ymd_hms(DateTime))
    1. Create a table that gives the credit card statement for Derek. It should give all the transactions, the amounts, and the store name. Write your code as if the only initial information you have is the customer’s name. Hint: Do a bunch of table joins, and then filter for the desired customer name. To be efficient, do the filtering first and then do the table joins.

    2. Aubrey has lost her credit card on Oct 15, 2019. Close her credit card at 4:28:21 PM and issue her a new credit card in the Cards table. Hint: Using the Aubrey’s name, get necessary CardID and PersonID and save those as cardID and personID. Then update the Cards table row that corresponds to the cardID so that the expiration date is set to the time that the card is closed. Then insert a new row with the personID for Aubrey and a new CardID number that you make up.

    3. Aubrey is using her new card at Kickstand Kafe on Oct 16, 2019 at 2:30:21 PM for coffee with a charge of $4.98. Generate a new transaction for this action. Hint: create temporary variables card,retailid,datetime, and amount that contain the information for this transaction and then write your code to use those. This way in the next question you can just use the same code but modify the temporary variables. Alternatively, you could write a function that takes in these four values and manipulates the tables in the GLOBAL environment using the <<- command to assign a result to a variable defined in the global environment. The reason this is OK is that in a real situation, these data would be stored in a database and we would expect the function to update that database.

    4. On Oct 17, 2019, some nefarious person is trying to use her OLD credit card at REI. Make sure your code in part (c) first checks to see if the credit card is active before creating a new transaction. Using the same code, verify that the nefarious transaction at REI is denied. Hint: your check ought to look something like this:

      card <- '9876768717278723'
      retailid <- 2
      datetime <- ymd_hms('2019-10-16 14:30:21')
      amount <- 4.98
      
      # If the card is currently valid, this should return exactly 1 row.
      Valid_Cards <- Cards %>% 
        filter(CardID == card, Issue_DateTime <= datetime, datetime <= Exp_DateTime)
      
      # If the transaction is valid, insert the transaction into the table
      if( nrow(Valid_Cards) == 1){
        # Some code to insert the transaction
      }else{
        print('Card Denied')
      }
    5. Generate a table that gives the credit card statement for Aubrey. It should give all the transactions, amounts, and retailer name for both credit cards she had during this period.

  6. The package nycflights13 contains information about all the flights that arrived in or left from New York City in 2013. This package contains five data tables, but there are three data tables we will work with. The data table flights gives information about a particular flight, airports gives information about a particular airport, and airlines gives information about each airline. Create a table of all the flights on February 14th by Virgin America that has columns for the carrier, destination, departure time, and flight duration. Join this table with the airports information for the destination. Notice that because the column for the destination airport code doesn’t match up between flights and airports, you’ll have to use the by=c("TableA.Col"="TableB.Col") argument where you insert the correct names for TableA.Col and TableB.Col.