Chapter 13 Data Reshaping
# library(tidyr) # for the gather/spread commands
# library(dplyr) # for the join stuff
library(tidyverse) # dplyr, tidyr, ggplot2, etc.
I have a YouTube Video Lecture for this chapter.
Most of the time, our data is in the form of a data frame and we are interested in exploring the relationships. However, most procedures in R expect the data to show up in a 'long' format where each row is an observation and each column is a covariate. In practice, the data is often not stored like that and the data comes to us with repeated observations included on a single row. This is often done as a memory saving technique or because there is some structure in the data that makes the 'wide' format attractive. As a result, we need a way to convert data from 'wide' to 'long' and vice versa.
Next we need a way to squish two data frames together. It is often advantageous to store data that would otherwise be repeated in a separate table so that a particular piece of information lives in only one location. This makes the data easier to modify and more likely to maintain consistency. However, this practice requires that, when necessary, we can add information back to a table, and that might involve a lot of duplicated rows.
13.1 cbind & rbind
Base R has two functions for squishing two data frames together, but they assume that the data frames are aligned correctly. The c and r parts of cbind and rbind indicate whether we are pushing columns together or rows together.
# Define a tibble using a row-wise layout.
df1 <- tribble(
  ~ID, ~First,
  1, 'Alice',
  2, 'Bob',
  3, 'Charlie')

# Define a tibble by columns
df2 <- tibble( Last = c('Anderson', 'Barker', 'Cooper') )

# Squish the two tibbles together by columns - results in a data.frame
People <- cbind( df1, df2 )
People
## ID First Last
## 1 1 Alice Anderson
## 2 2 Bob Barker
## 3 3 Charlie Cooper
Squishing together by rows is easy as well.
df3 <- tibble( ID=4, First='Daniel', Last='Davidson' )
People <- rbind( People, df3 )
People
## ID First Last
## 1 1 Alice Anderson
## 2 2 Bob Barker
## 3 3 Charlie Cooper
## 4 4 Daniel Davidson
If both inputs have row or column names, then they don't need to be appropriately arranged because rbind and cbind can figure out how to order them.
There are equivalent functions in the dplyr package that do the same job, but they work more consistently. For example, if we try to bind a row of data that has more columns, bind_rows() will introduce a column of NA values to the smaller data set. Furthermore, if the column orders are mixed up, the rows can still be matched up by column name.
df4 <- tibble(First='Elise', ID=5, Last='Erikson')  # Notice the changed column order!
rbind(People, df4)                                  # but rbind() can still figure it out!
## ID First Last
## 1 1 Alice Anderson
## 2 2 Bob Barker
## 3 3 Charlie Cooper
## 4 4 Daniel Davidson
## 5 5 Elise Erikson
df5 <- tibble(First='Frank', ID=6, Last='Fredrick', dob=lubridate::ymd('1980-7-21'))
rbind(People, df5)  # throws an error
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
bind_rows( People, df4, df5) # Inserts NA values as appropriate.
## ID First Last dob
## 1 1 Alice Anderson <NA>
## 2 2 Bob Barker <NA>
## 3 3 Charlie Cooper <NA>
## 4 4 Daniel Davidson <NA>
## 5 5 Elise Erikson <NA>
## 6 6 Frank Fredrick 1980-07-21
In general, I find that rbind() and bind_rows() work really well and I use them quite often. However, cbind() and bind_cols() are less useful because I have to make sure that either I have rownames set up for each data set, or I have to be very careful with the ordering. Instead, it is safer to write code that relies on joins, which will be discussed later in this chapter.
13.2 Table Pivots
A common issue is obtaining data with many columns that you wish were organized as rows. For example, I might have data in a grade book that has several homework scores and I'd like to produce a nice graph that has assignment number on the x-axis and score on the y-axis. Unfortunately this is incredibly hard to do when the data is arranged in the following way:
grade.book <- rbind(
  data.frame(name='Alison.Anderson', HW.1=8, HW.2=5, HW.3=8, HW.4=4),
  data.frame(name='Brandon.Babbage', HW.1=5, HW.2=3, HW.3=6, HW.4=9),
  data.frame(name='Charles.Collins', HW.1=9, HW.2=7, HW.3=9, HW.4=10))
grade.book
## name HW.1 HW.2 HW.3 HW.4
## 1 Alison.Anderson 8 5 8 4
## 2 Brandon.Babbage 5 3 6 9
## 3 Charles.Collins 9 7 9 10
What we want to do is turn this data frame from a wide data frame into a long data frame. In MS Excel this is called pivoting. Essentially I'd like to create a data frame with three columns: name, assignment, and score. That is to say that each homework datum really has three pieces of information: who it came from, which homework it was, and what the score was. It doesn't conceptually matter if I store it as 3 rows of 4 columns or 12 rows of 3 columns, so long as there is a way to identify how a student scored on a particular homework. So we want to reshape the HW.1 to HW.4 columns into two columns (assignment and score).
The tidyr package was built by the same people that created dplyr and ggplot2, and there is a nice introduction at http://blog.rstudio.org/2014/07/22/introducing-tidyr/
There have been several iterations of functions to convert data from a wide format to a long format and back again. When Hadley first began thinking through these issues in the reshape package, he called these two functions melt and cast. The second iteration of these functions, in the tidyr package, were called gather and spread. I believe we are finally at the last iteration, where they are simply named pivot_longer() and pivot_wider().
There are two main verbs to remember:
Function | Description |
---|---|
pivot_longer() | Gather multiple columns that are related into two columns that contain the original column name and the value. For example, for columns HW1, HW2, HW3 we would gather them into two columns: Homework and Score. In this case, we refer to Homework as the names column and Score as the values column. |
pivot_wider() | This is the opposite function. This takes a names and values pair of columns and forms a new column for each level of the names column(s). |
# first we gather the score columns into columns we'll name Homework and Score
tidy.scores <- grade.book %>%
  pivot_longer(
    HW.1:HW.4,               # which columns to apply this to
    names_to = 'Homework',   # What should I call the column of old column names
    values_to = 'Score')     # What should I call the values column
tidy.scores
## # A tibble: 12 x 3
## name Homework Score
## <chr> <chr> <dbl>
## 1 Alison.Anderson HW.1 8
## 2 Alison.Anderson HW.2 5
## 3 Alison.Anderson HW.3 8
## 4 Alison.Anderson HW.4 4
## 5 Brandon.Babbage HW.1 5
## 6 Brandon.Babbage HW.2 3
## 7 Brandon.Babbage HW.3 6
## 8 Brandon.Babbage HW.4 9
## 9 Charles.Collins HW.1 9
## 10 Charles.Collins HW.2 7
## 11 Charles.Collins HW.3 9
## 12 Charles.Collins HW.4 10
To spread the names/values pairs out into a wide data frame, we use the pivot_wider() command.
# Turn the Homework/Score pair of columns into one column per factor level of Homework
tidy.scores %>% pivot_wider( names_from=Homework, values_from=Score )
## # A tibble: 3 x 5
## name HW.1 HW.2 HW.3 HW.4
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alison.Anderson 8 5 8 4
## 2 Brandon.Babbage 5 3 6 9
## 3 Charles.Collins 9 7 9 10
Oftentimes, the long format of the data is most helpful for graphing or doing data analysis. Because of this, we often refer to this as the tidy form of the data. Hadley has a nice article about messy data versus tidy data that is well worth your time to read, although dplyr and tidyr have matured since he wrote it. The main point can be summarized by:
- Each row in the data frame represents a single object.
- Each column represents an attribute that can be measured on every object.
- Every object type implicitly forms a table.
In our grade book example, we could consider each student as an object and store the data with one row per student and a column for each attribute (homework score). Alternatively, the long form is a Scores data set that stores the data with a single row representing a single score. If I have multiple sections of the course and some homeworks are assigned in one section but not in another, then the long form Scores data set would be the appropriate storage.
Most of our graphing and data analysis procedures will expect the data to be in a long format, but for both data input and digital storage compression, we might have data given to us in a wide format.
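To see why the long format is convenient for graphing, recall the goal from the start of this section: assignment on the x-axis and score on the y-axis. With tidy.scores this is nearly a one-liner; a minimal sketch:

# Each aesthetic maps directly onto a single column of the long data frame.
ggplot(tidy.scores, aes(x = Homework, y = Score, color = name)) +
  geom_point()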
13.3 Spreading a Single Column with separate
It is common for data columns to contain multiple attributes. For example, a name column might contain both first and last names, or a file name might contain multiple bits of information broken up by underscores. In the string manipulation chapter, we saw how to use stringr::str_split() to split strings. Now we'll apply this using the tidyr::separate() function.
grade.book %>% separate(name, into=c('F.Name','L.Name'), sep='\\.', remove = FALSE)
## name F.Name L.Name HW.1 HW.2 HW.3 HW.4
## 1 Alison.Anderson Alison Anderson 8 5 8 4
## 2 Brandon.Babbage Brandon Babbage 5 3 6 9
## 3 Charles.Collins Charles Collins 9 7 9 10
13.4 Storing Data in Multiple Tables
In many data sets it is common to store data across multiple tables, usually with the goal of minimizing memory used as well as providing minimal duplication of information so any change that must be made is only made in a single place. Recall the guiding idea that each type of object stored should result in its own table.
To see the rationale for why we might do this, consider building a data set of blood donations by a variety of donors across several years. For each blood donation, we will perform some assay and measure certain qualities about the blood and the patient's health at the donation. But should we store the donor's address, phone number, and email address in the same data table that holds the information about the blood donated?
I would like to include additional information about the donor where that information doesn't change over time. For example, we might want to have information about the donor's birthdate, sex, and blood type. However, I don't want that information in every single donation line. Otherwise, if I mistype a birthday and have to correct it, I would have to correct it everywhere. Information about the donor should live in a donors table, while information about a particular donation should live in the donations table.
Furthermore, there are many Jeffs and Dereks in the world, and to maintain a unique identifier (without using Social Security numbers) I will just create a Donor_ID code that will uniquely identify a person. Similarly, I will create a Blood_ID that will uniquely identify a blood donation.
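The Donors and Blood_Donations tables used below were created behind the scenes. A sketch of how they could be defined with tribble(), with the values copied from the printed output:

Donors <- tribble(
  ~Donor_ID, ~Name,    ~Blood_Type, ~Birthday,    ~Street,        ~City,       ~State,
  'D1',      'Derek',  'O+',        '1976-09-17', '7392 Willard', 'Flagstaff', 'AZ',
  'D2',      'Jeff',   'A-',        '1974-06-23', '873 Vine',     'Bozeman',   'MT',
  'D3',      'Aubrey', 'O+',        '1976-09-17', '7392 Willard', 'Flagstaff', 'AZ')

Blood_Donations <- tribble(
  ~Blood_ID, ~Donor_ID, ~Date,        ~Hemoglobin, ~Systolic, ~Diastolic,
  'B_1',     'D1',      '2017-04-14', 17.4,        120,        79,
  'B_2',     'D1',      '2017-06-20', 16.5,        121,        80,
  'B_3',     'D2',      '2017-08-14', 16.9,        145,       101,
  'B_4',     'D1',      '2017-08-26', 17.6,        120,        79,
  'B_5',     'D3',      '2017-08-26', 16.1,        137,        90)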
Donors
## # A tibble: 3 x 7
## Donor_ID Name Blood_Type Birthday Street City State
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 D1 Derek O+ 1976-09-17 7392 Willard Flagstaff AZ
## 2 D2 Jeff A- 1974-06-23 873 Vine Bozeman MT
## 3 D3 Aubrey O+ 1976-09-17 7392 Willard Flagstaff AZ
Blood_Donations
## # A tibble: 5 x 6
## Blood_ID Donor_ID Date Hemoglobin Systolic Diastolic
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 B_1 D1 2017-04-14 17.4 120 79
## 2 B_2 D1 2017-06-20 16.5 121 80
## 3 B_3 D2 2017-08-14 16.9 145 101
## 4 B_4 D1 2017-08-26 17.6 120 79
## 5 B_5 D3 2017-08-26 16.1 137 90
If we have a new donor walk in and give blood, then we'll have to create a new entry in the donors table as well as a new entry in the donations table. If an experienced donor gives again, we just have to create a new entry in the donations table.
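A sketch of those insertions using tibble::add_row(); the new donor D4 and all of her values here are invented purely for illustration:

# Hypothetical new donor walks in: one row in each table.
Donors <- Donors %>%
  add_row( Donor_ID='D4', Name='Casi', Blood_Type='AB+', Birthday='1985-01-01',
           Street='123 Main St', City='Flagstaff', State='AZ' )
Blood_Donations <- Blood_Donations %>%
  add_row( Blood_ID='B_6', Donor_ID='D4', Date='2017-09-01',
           Hemoglobin=16.8, Systolic=118, Diastolic=75 )

# Hypothetical repeat donation from an existing donor: only the donations table grows.
Blood_Donations <- Blood_Donations %>%
  add_row( Blood_ID='B_7', Donor_ID='D1', Date='2017-10-01',
           Hemoglobin=17.2, Systolic=119, Diastolic=78 )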
Given this data structure, we can now easily create new donations as well as store donor information. In the event that we need to change something about a donor, there is only one place to make that change.
However, having data spread across multiple tables is challenging because I often want that information squished back together. For example, if during routine testing we discover that a blood sample is HIV positive, then we need to be able to join the blood donations table to the donors table in some sensible manner.
13.5 Table Joins
There are four different types of joins: inner, left, right, and full (outer) joins. Consider the following example tables, where there is some key column that is common to both tables.
Join Type | Result |
---|---|
inner_join(A,B) | Include rows if the key value is in both tables. |
left_join(A,B) | Include all rows of A, and if the match in B doesn't exist, just insert a NA. |
right_join(A,B) | Include all rows of B, and if the match in A doesn't exist, just insert a NA. |
full_join(A,B) | Include all rows of A and B, and if the necessary match doesn't exist, insert NA values. |
For a practical example:
A <- tribble(
  ~ID, ~x,
  'a', 34,  # Notice that A doesn't have ID = d
  'b', 36,
  'c', 38)
B <- tribble(
  ~ID, ~y,
  'b', 56,  # Notice that B doesn't have ID = a
  'c', 57,
  'd', 59)
# only include rows with IDs that are in both tables
inner_join(A,B)
## # A tibble: 2 x 3
## ID x y
## <chr> <dbl> <dbl>
## 1 b 36 56
## 2 c 38 57
# All the rows in table A, insert NA if the B info is missing
left_join(A,B)
## # A tibble: 3 x 3
## ID x y
## <chr> <dbl> <dbl>
## 1 a 34 NA
## 2 b 36 56
## 3 c 38 57
# All the rows in table B, insert NA if the A info is missing
right_join(A,B)
## # A tibble: 3 x 3
## ID x y
## <chr> <dbl> <dbl>
## 1 b 36 56
## 2 c 38 57
## 3 d NA 59
# All the rows possible, insert NA if the matching info is missing
full_join(A,B)
## # A tibble: 4 x 3
## ID x y
## <chr> <dbl> <dbl>
## 1 a 34 NA
## 2 b 36 56
## 3 c 38 57
## 4 d NA 59
Consider the case where we have a data frame of observations of fish and a separate data frame that contains information about each lake (perhaps surface area, max depth, pH, etc.). I want to store them as two separate tables so that when I have to record a lake level observation, I only input it in one place. This decreases the chance that I make a copy/paste error.
To illustrate the different types of table joins, we’ll consider two different tables.
# tibbles are just data.frames that print a bit nicer and don't automatically
# convert character columns into factors. They behave a bit more consistently
# in a wide variety of situations compared to data.frames.
Fish.Data <- tibble(
  Lake_ID = c('A','A','B','B','C','C'),
  Fish.Weight = rnorm(6, mean=260, sd=25) )  # make up some data
Lake.Data <- tibble(
  Lake_ID = c( 'B','C','D'),
  Lake_Name = c('Lake Elaine', 'Mormon Lake', 'Lake Mary'),
  pH = c(6.5, 6.3, 6.1),
  area = c(40, 210, 240),
  avg_depth = c(8, 10, 38))
Fish.Data
## # A tibble: 6 x 2
## Lake_ID Fish.Weight
## <chr> <dbl>
## 1 A 267.
## 2 A 270.
## 3 B 212.
## 4 B 280.
## 5 C 308.
## 6 C 282.
Lake.Data
## # A tibble: 3 x 5
## Lake_ID Lake_Name pH area avg_depth
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 B Lake Elaine 6.5 40 8
## 2 C Mormon Lake 6.3 210 10
## 3 D Lake Mary 6.1 240 38
Notice that each of these tables has a column labeled Lake_ID. When we join these two tables, the row that describes lake B should be duplicated for each row in Fish.Data that corresponds with a fish caught from lake B.
full_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 7 x 6
## Lake_ID Fish.Weight Lake_Name pH area avg_depth
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 A 267. <NA> NA NA NA
## 2 A 270. <NA> NA NA NA
## 3 B 212. Lake Elaine 6.5 40 8
## 4 B 280. Lake Elaine 6.5 40 8
## 5 C 308. Mormon Lake 6.3 210 10
## 6 C 282. Mormon Lake 6.3 210 10
## 7 D NA Lake Mary 6.1 240 38
Notice that because we didn't have any fish caught in lake D and we don't have any lake information about lake A, when we join these two tables, we end up introducing missing observations into the resulting data frame.
left_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 6 x 6
## Lake_ID Fish.Weight Lake_Name pH area avg_depth
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 A 267. <NA> NA NA NA
## 2 A 270. <NA> NA NA NA
## 3 B 212. Lake Elaine 6.5 40 8
## 4 B 280. Lake Elaine 6.5 40 8
## 5 C 308. Mormon Lake 6.3 210 10
## 6 C 282. Mormon Lake 6.3 210 10
inner_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 4 x 6
## Lake_ID Fish.Weight Lake_Name pH area avg_depth
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 B 212. Lake Elaine 6.5 40 8
## 2 B 280. Lake Elaine 6.5 40 8
## 3 C 308. Mormon Lake 6.3 210 10
## 4 C 282. Mormon Lake 6.3 210 10
The above examples assumed that the column used to join the two tables was named the same in both tables. This is good practice to try to do, but sometimes you have to work with data where that isn't the case. In that situation you can use the by=c("ColName.A"="ColName.B") syntax, where ColName.A represents the name of the column in the first data frame and ColName.B is the equivalent column in the second data frame.
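As a quick sketch, suppose the lake table had instead named its key column LakeID (a hypothetical renaming, just for this example):

# Hypothetical: the key column has a different name in the second table.
Lake.Data2 <- Lake.Data %>% rename( LakeID = Lake_ID )
left_join( Fish.Data, Lake.Data2, by = c('Lake_ID' = 'LakeID') )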
Filtering joins are a useful trick when programming to help discover which rows would be included. In particular, I often want to know which rows would be dropped or don’t have a corresponding match.
Join Type | Result |
---|---|
semi_join(A,B) | All the rows in A that have a match in B. |
anti_join(A,B) | All the rows in A that don't have a match in B. |
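Using the A and B tables defined earlier, we can see which rows of A have, or lack, a match in B:

semi_join(A, B)  # rows of A with a matching ID in B:  b and c
anti_join(A, B)  # rows of A with no matching ID in B: just a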
13.6 Row-wise Calculations
Finally, the combination of pivot_longer and pivot_wider allows me to do some very complex calculations across many columns of a data set. For example, I might gather up a set of columns, calculate some summary statistics, and then join the result back to the original data set.
grade.book %>%
  group_by(name) %>%
  pivot_longer(HW.1:HW.4, names_to='Homework', values_to='Score' ) %>%
  summarise( HW.avg = mean(Score) ) %>%
  left_join( grade.book, . )
## Joining, by = "name"
## name HW.1 HW.2 HW.3 HW.4 HW.avg
## 1 Alison.Anderson 8 5 8 4 6.25
## 2 Brandon.Babbage 5 3 6 9 5.75
## 3 Charles.Collins 9 7 9 10 8.75
While this works, it is not very convenient. Instead, we’d like to have some mechanism for doing rowwise calculations. This is actually not as easy as you might expect because data frames are stored in memory to make column-wise operations fast.
13.6.1 Using the base apply() function
One solution is to use the base R function apply(). The apply() function applies a function to each row or column (MARGIN=1 or MARGIN=2, respectively).
A <- matrix(1:6, ncol=3)  # Make a matrix of 2 rows, 3 columns
A
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
apply(A, MARGIN = 1, FUN=sum) # Apply sum() function to rows (MARGIN=1)
## [1] 9 12
apply(A, MARGIN = 2, FUN=sum) # Apply sum() function to columns (MARGIN=2)
## [1] 3 7 11
So to add a row calculation, we just need to put together select and apply statements.
#                  new.col.name       columns              function
grade.book %>%
  mutate(., HW.avg = select(., HW.1:HW.4) %>% apply(1, mean) ) %>%
  print()  # this print is just to show you can keep the pipeline going...
## name HW.1 HW.2 HW.3 HW.4 HW.avg
## 1 Alison.Anderson 8 5 8 4 6.25
## 2 Brandon.Babbage 5 3 6 9 5.75
## 3 Charles.Collins 9 7 9 10 8.75
grade.book %>%
  select(., HW.1:HW.4) %>%
  apply(1, mean)
## [1] 6.25 5.75 8.75
But this pipeline inside a mutate command is a little cumbersome. The command dplyr::rowwise() causes subsequent actions to be performed row-wise instead of the default column-wise. rowwise() is actually a special form of group_by() which creates a unique group for each row. The function dplyr::c_across() allows you to use all the select style tricks for picking columns.
# grade.book %>%
# rowwise() %>%
# mutate( HW.avg = mean( c(HW.1, HW.2, HW.3, HW.4) ) ) # List the columns to average
#
# grade.book %>%
# rowwise() %>%
# mutate( HW.avg = mean( c_across(HW.1:HW.4) ) ) # Give a column range to average
grade.book %>%
  rowwise() %>%
  mutate( HW.avg = mean( c_across(starts_with('HW'))) )  # Average all HW... columns
## # A tibble: 3 x 6
## # Rowwise:
## name HW.1 HW.2 HW.3 HW.4 HW.avg
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alison.Anderson 8 5 8 4 6.25
## 2 Brandon.Babbage 5 3 6 9 5.75
## 3 Charles.Collins 9 7 9 10 8.75
Because rowwise() is a special form of grouping, to exit the row-wise calculations, call ungroup().
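For example, this sketch repeats the calculation above and then exits the row-wise mode so that later steps behave column-wise again:

grade.book %>%
  rowwise() %>%
  mutate( HW.avg = mean( c_across(starts_with('HW')) ) ) %>%
  ungroup()    # exit the row-wise mode before any further column-wise work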
Finally, there are the base R functions rowSums() and rowMeans() that calculate sums and means across rows. They are actually quite fast, but they don't allow the user to input an arbitrary function.
grade.book %>%
  mutate( HW.avg = rowMeans(across(starts_with('HW'))) )
## name HW.1 HW.2 HW.3 HW.4 HW.avg
## 1 Alison.Anderson 8 5 8 4 6.25
## 2 Brandon.Babbage 5 3 6 9 5.75
## 3 Charles.Collins 9 7 9 10 8.75
Somewhat confusingly, in this statement we used across(), which applies a transformation to multiple columns column-wise, while c_across() is designed to work row-wise with the rowwise() command.
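A short sketch contrasting the two; the rescaling and the HW.total column are arbitrary choices just for illustration:

# across(): column-wise transformation of several columns; no rowwise() needed.
grade.book %>%
  mutate( across(starts_with('HW'), ~ .x / 10) )   # rescale every HW column

# c_across(): collapses the selected columns within each row; requires rowwise().
grade.book %>%
  rowwise() %>%
  mutate( HW.total = sum( c_across(starts_with('HW')) ) ) %>%
  ungroup()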
I prefer the solution that uses the select(., cols) %>% apply(1, fun) chain inside the mutate command because I don't have to do an additional rowwise() and ungroup() pair of commands in my pipeline.
13.7 Exercises
1. A common task is to take a set of data that has multiple categorical variables and create a table of the number of cases for each combination. An introductory statistics textbook contains a dataset summarizing student surveys from several sections of an intro class. The two variables of interest for us are Gender and Year, which are the student's gender and year in college.

a) Download the dataset and correctly order the Year variable using the following:

Survey <- read.csv('https://www.lock5stat.com/datasets3e/StudentSurvey.csv',
                   na.strings=c('',' '))

b) Using some combination of dplyr functions, produce a data set with eight rows that contains the number of responses for each gender:year combination. Make sure your table orders the Year variable in the correct order of First Year, Sophmore, Junior, and then Senior. You might want to look at the following functions: dplyr::count and tidyr::drop_na.

c) Using tidyr commands, produce a table of the number of responses in the following form:

Gender | First Year | Sophmore | Junior | Senior |
---|---|---|---|---|
Female | | | | |
Male | | | | |
2. From the book website, there is a .csv file of the daily maximum temperature in Flagstaff at the Pulliam Airport. The direct link is: https://raw.githubusercontent.com/dereksonderegger/444/master/data-raw/FlagMaxTemp.csv

a) Create a line graph that gives the daily maximum temperature for 2005. Make sure the x-axis is a date and covers the whole year.

b) Create a line graph that gives the monthly average maximum temperature for 2013 - 2015. Again the x-axis should be the date and the axis spans 3 years.
3. Challenging! We often are given data in a table format that is easy for a human to parse, but annoying for a program. In the following example we have data of US government expenditures from 1962 to 2015. I downloaded this data from https://obamawhitehouse.archives.gov/omb/budget/Historicals (Table 3.2) on Sept 22, 2019. I separated the Function/Subfunction from a single column into two and removed the sub-sub functions for the military. (Look at the non-gentle version of the file to see what the original file looked like.) Our goal is to end up with a data frame with columns for Function, Subfunction, Year, and Amount. We'll ignore the "On-budget" and "Off-budget" distinction.

a) Download the data file, inspect it, and read in the data using the readxl package. Hint: your column names should be the years.

b) Remove any row with Total, Subtotal, On-budget or Off-budget. Also remove the row at the bottom that defines what NA means.

c) For all of the NA values in the Department column, fill them in with the value above. Hint: the function tidyr::fill() will be helpful.

d) Remove rows that correspond to a Function name that has no data. Hint: you can just check if the 2015 column is NA.

e) Reshape the data into four columns for Function, Subfunction, Year, and Amount.

f) Remove rows that have an Amount value of .......... . Alternatively, we could have used this as one of the NA strings during the import stage.

g) Make sure that Year and Amount are numeric. Hint: it is OK to get rid of the estimate rows for 2016+. Alternatively, you could keep those by transforming 2016 estimate to just 2016.

h) Make a line graph that compares spending for National Defense, Health, Medicare, Income Security, and Social Security for each of the years 2001 through 2015. Notice you'll have to sum up the sub-functions within each function.
4. For this problem we will consider two simple data sets.

A <- tribble(
  ~Name, ~Car,
  'Alice',   'Ford F150',
  'Bob',     'Tesla Model III',
  'Charlie', 'VW Bug')

B <- tribble(
  ~First.Name, ~Pet,
  'Bob',     'Cat',
  'Charlie', 'Dog',
  'Alice',   'Rabbit')

a) Squish the data frames together to generate a data set with three rows and three columns. Do this two ways: first using cbind and then using one of the dplyr join commands.

b) It turns out that Alice also has a pet guinea pig. Add another row to the B data set. Do this using either the base function rbind, or either of the dplyr functions add_row or bind_rows.

c) Squish the A and B data sets together to generate a data set with four rows and three columns. Do this two ways: first using cbind and then using one of the dplyr join commands. Which was easier to program? Which is more likely to have an error?
5. Data table joins are extremely common because effective database design almost always involves having multiple tables for different types of objects. To illustrate both the table joins and the usefulness of multiple tables, we will develop a set of data frames that will represent a credit card company's customer database. We will have tables for Customers, Retailers, Cards, and Transactions. Below is code that will create and populate these tables.
Customers <- tribble(
  ~PersonID, ~Name, ~Street, ~City, ~State,
  1, 'Derek Sonderegger',  '231 River Run',      'Flagstaff', 'AZ',
  2, 'Aubrey Sonderegger', '231 River Run',      'Flagstaff', 'AZ',
  3, 'Robert Buscaglia',   '754 Forest Heights', 'Flagstaff', 'AZ',
  4, 'Roy St Laurent',     '845 Elk View',       'Flagstaff', 'AZ')

Retailers <- tribble(
  ~RetailID, ~Name, ~Street, ~City, ~State,
  1, 'Kickstand Kafe', '719 N Humphreys St', 'Flagstaff', 'AZ',
  2, 'MartAnnes',      '112 E Route 66',     'Flagstaff', 'AZ',
  3, 'REI',            '323 S Windsor Ln',   'Flagstaff', 'AZ' )

Cards <- tribble(
  ~CardID, ~PersonID, ~Issue_DateTime, ~Exp_DateTime,
  '9876768717278723', 1, '2019-9-20 0:00:00', '2022-9-20 0:00:00',
  '5628927579821287', 2, '2019-9-20 0:00:00', '2022-9-20 0:00:00',
  '7295825498122734', 3, '2019-9-28 0:00:00', '2022-9-28 0:00:00',
  '8723768965231926', 4, '2019-9-30 0:00:00', '2022-9-30 0:00:00' )

Transactions <- tribble(
  ~CardID, ~RetailID, ~DateTime, ~Amount,
  '9876768717278723', 1, '2019-10-1 8:31:23',   5.68,
  '7295825498122734', 2, '2019-10-1 12:45:45', 25.67,
  '9876768717278723', 1, '2019-10-2 8:26:31',   5.68,
  '9876768717278723', 1, '2019-10-2 8:30:09',   9.23,
  '5628927579821287', 3, '2019-10-5 18:58:57', 68.54,
  '7295825498122734', 2, '2019-10-5 12:39:26', 31.84,
  '8723768965231926', 2, '2019-10-10 19:02:20', 42.83)

Cards <- Cards %>%
  mutate( Issue_DateTime = lubridate::ymd_hms(Issue_DateTime),
          Exp_DateTime   = lubridate::ymd_hms(Exp_DateTime) )
Transactions <- Transactions %>%
  mutate( DateTime = lubridate::ymd_hms(DateTime))
a) Create a table that gives the credit card statement for Derek. It should give all the transactions, the amounts, and the store name. Write your code as if the only initial information you have is the customer's name. Hint: Do a bunch of table joins, and then filter for the desired customer name. To be efficient, do the filtering first and then do the table joins.

b) Aubrey has lost her credit card on Oct 15, 2019. Close her credit card at 4:28:21 PM and issue her a new credit card in the Cards table. Hint: Using Aubrey's name, get the necessary CardID and PersonID and save those as cardID and personID. Then update the Cards table row that corresponds to the cardID so that the expiration date is set to the time that the card is closed. Then insert a new row with the personID for Aubrey and a new CardID number that you make up.

c) Aubrey is using her new card at Kickstand Kafe on Oct 16, 2019 at 2:30:21 PM for coffee with a charge of $4.98. Generate a new transaction for this action. Hint: create temporary variables card, retailid, datetime, and amount that contain the information for this transaction and then write your code to use those. This way in the next question you can just use the same code but modify the temporary variables. Alternatively, you could write a function that takes in these four values and manipulates the tables in the GLOBAL environment using the <<- command to assign a result to a variable defined in the global environment. The reason this is OK is that in a real situation, these data would be stored in a database and we would expect the function to update that database.

d) On Oct 17, 2019, some nefarious person is trying to use her OLD credit card at REI. Make sure your code in part (c) first checks to see if the credit card is active before creating a new transaction. Using the same code, verify that the nefarious transaction at REI is denied. Hint: your check ought to look something like this:

card     <- '9876768717278723'
retailid <- 2
datetime <- ymd_hms('2019-10-16 14:30:21')
amount   <- 4.98

# If the card is currently valid, this should return exactly 1 row.
Valid_Cards <- Cards %>%
  filter(CardID == card, Issue_DateTime <= datetime, datetime <= Exp_DateTime)

# If the transaction is valid, insert the transaction into the table
if( nrow(Valid_Cards) == 1){
  # Some code to insert the transaction
} else {
  print('Card Denied')
}

e) Generate a table that gives the credit card statement for Aubrey. It should give all the transactions, amounts, and retailer name for both credit cards she had during this period.
6. The package nycflights13 contains information about all the flights that arrived in or left from New York City in 2013. This package contains five data tables, but there are three data tables we will work with. The data table flights gives information about a particular flight, airports gives information about a particular airport, and airlines gives information about each airline. Create a table of all the flights on February 14th by Virgin America that has columns for the carrier, destination, departure time, and flight duration. Join this table with the airports information for the destination. Notice that because the column for the destination airport code doesn't match up between flights and airports, you'll have to use the by=c("TableA.Col"="TableB.Col") argument where you insert the correct names for TableA.Col and TableB.Col.