14 Data Manipulation (III)

In this and the following chapters, we discuss data frame manipulations. This chapter focuses on combining data frames by rows and by columns (merge/join).

14.1 Combining data frames

So far we have been working on our two demo datasets individually. The two datasets are certainly interesting on their own, but may generate more insights when combined together.

When should we combine data frames? There are a couple of scenarios where this may be necessary. As a motivating example, suppose we obtain datasets from an e-commerce company. The company stores information in different tables within their database, such as groceries, clothing, books, electronics, and so on. If we want to apply a model across all categories, we may need to combine all the tables into one.

In the second case, we receive datasets from the same e-commerce company, but they choose to store information for all categories in one table per day. If we’re using a model that requires historical data across time periods, we may need to combine all the tables into one again. In the first scenario, we combine data frames by columns, whereas in the second scenario, we combine them by rows.

Data frames can be combined vertically by rows or horizontally by columns.

  • rbind() and cbind() combine matrices by rows and columns.
  • merge() combines data frames by columns and is more flexible than cbind().
  • Additionally, dplyr supports mutating joins and filtering joins to link data frames using the database join language. We’ll discuss mutating joins in this chapter.

14.2 Binding data frames by rows

Let’s use two sample datasets to show how to append one data frame to another by rows. In this case, we are combining datasets from different years, and the variables are the same in two datasets.

file1
## # A tibble: 6 × 5
##      id  year family role   rating
##   <dbl> <dbl>  <dbl> <chr>  <chr> 
## 1     1  1995      1 child  A     
## 2     2  1995      1 child  B     
## 3     3  1995      1 mother B     
## 4     1  1995      2 father C     
## 5     2  1995      2 mother B     
## 6     3  1995      2 child  A
file2
## # A tibble: 6 × 5
##      id  year family role   rating
##   <dbl> <dbl>  <dbl> <chr>  <chr> 
## 1     1  2005      1 child  A     
## 2     2  2005      1 child  A     
## 3     3  2005      1 mother B     
## 4     1  2005      2 father B     
## 5     2  2005      2 mother C     
## 6     3  2005      2 child  B

The task is to combine file2 to file1 by rows vertically. This can be achieved with rbind().

rbind(file1, file2)
## # A tibble: 12 × 5
##       id  year family role   rating
##    <dbl> <dbl>  <dbl> <chr>  <chr> 
##  1     1  1995      1 child  A     
##  2     2  1995      1 child  B     
##  3     3  1995      1 mother B     
##  4     1  1995      2 father C     
##  5     2  1995      2 mother B     
##  6     3  1995      2 child  A     
##  7     1  2005      1 child  A     
##  8     2  2005      1 child  A     
##  9     3  2005      1 mother B     
## 10     1  2005      2 father B     
## 11     2  2005      2 mother C     
## 12     3  2005      2 child  B

Or using dplyr::bind_rows():

bind_rows(file1, file2)
## # A tibble: 12 × 5
##       id  year family role   rating
##    <dbl> <dbl>  <dbl> <chr>  <chr> 
##  1     1  1995      1 child  A     
##  2     2  1995      1 child  B     
##  3     3  1995      1 mother B     
##  4     1  1995      2 father C     
##  5     2  1995      2 mother B     
##  6     3  1995      2 child  A     
##  7     1  2005      1 child  A     
##  8     2  2005      1 child  A     
##  9     3  2005      1 mother B     
## 10     1  2005      2 father B     
## 11     2  2005      2 mother C     
## 12     3  2005      2 child  B

We usually bind datasets by rows when dealing with longitudinal data or datasets that record user behavior over a time period.

14.3 Binding data frames by columns

The terms we typically use to describe binding data frames by columns are merge and join. We discuss the base R merge() function and dplyr join functions in this section that perform merge or join operations.

There are several factors to consider when merging or joining two data frames by columns. Firstly, we need to identify a common variable within each data frame that links all the data frames (i.e. key in the database language). This variable should uniquely identifies each observation in each data frame. In some cases, a single variable is sufficient to identify each observation; in other cases, we may need a combination of multiple variables to create a group identifier.

Additionally, we need to decide how to match records across the tables, and whether to keep or discard them during the merge/join process.

merge()

merge() merges two data frames by common columns or row names.

  1. In the simplest case, here we have file1 and file5 to merge by the common column family. This could happen when we have a survey, where there is a questionnaire for the household (id: family) and another questionnaire for the family members (id: id).
file1
## # A tibble: 6 × 5
##      id  year family role   rating
##   <dbl> <dbl>  <dbl> <chr>  <chr> 
## 1     1  1995      1 child  A     
## 2     2  1995      1 child  B     
## 3     3  1995      1 mother B     
## 4     1  1995      2 father C     
## 5     2  1995      2 mother B     
## 6     3  1995      2 child  A
file5
## # A tibble: 3 × 3
##   family area    SES
##    <dbl> <chr> <dbl>
## 1      1 N         1
## 2      2 S         2
## 3      3 N         3

by sets the identifier.

merge(file1, file5, by = "family")
##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2
  1. There are cases where the same variables have different names in each data frame.
file5_2
## # A tibble: 3 × 3
##   Family Area    SES
##    <dbl> <chr> <dbl>
## 1      1 N         1
## 2      2 S         2
## 3      3 N         3

These columns need to be specified before merging.

merge(file1, file5_2, by.x = "family", by.y = "Family")
##   family id year   role rating Area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2
  1. Besides, there can be cases where variables with the same names have different meanings. We should always refer to the codebook before we start to merge data frames.

keeping and dropping unmatched cases

To decide if we would like to keep or drop the unmatched cases from either data frame, we may further specify the all argument.

There are three possible outcomes for a row in the data frame to the left-hand side:

  • If it doesn’t match anything, it’s dropped.
  • If it matches 1 row in the data frame to the right-hand side, it’s preserved.
  • If it matches more than 1 row in the data frame to the right-hand sidw, it’s duplicated once for each match.

These possibilities can be specified by the arguments all, all.x and all.y.

With all = TRUE we keep all observations from both files. Rows in file5 with no match in file1 are marked by NA in the columns from file1.

merge(file1, file5, by = "family", all = TRUE)
##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2
## 7      3 NA   NA   <NA>   <NA>    N   3

Without all = TRUE, unmatched cases will be dropped.

merge(file1, file5, by = "family")
##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2

If we set all.y to be TRUE, all non-missing information from file5 (y) is kept, while information from file1 (x) is missing.

merge(file1, file5, by = "family", all.y = TRUE)
##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2
## 7      3 NA   NA   <NA>   <NA>    N   3

Setting all.x = TRUE keeps only complete cases from file5.

merge(file1, file5, by = "family", all.x = TRUE)
##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2

merging using multiple identifiers

There are times when we need to merge data frames based on multiple common columns from the data frames to be merged.

For instance, in file3, each record is identified by id together with family. In other words, an individual with the same id can belong to different family.

file3
## # A tibble: 6 × 3
##      id family role  
##   <dbl>  <dbl> <chr> 
## 1     1      1 child 
## 2     2      1 child 
## 3     3      1 mother
## 4     1      2 father
## 5     2      2 mother
## 6     3      2 child

Same for file4, where each record is identified by id and family.

file4
## # A tibble: 6 × 4
##      id  year family rating
##   <dbl> <dbl>  <dbl> <chr> 
## 1     1  1995      1 A     
## 2     2  1995      1 B     
## 3     3  1995      1 B     
## 4     1  1995      2 C     
## 5     2  1995      2 B     
## 6     3  1995      2 A

To merge these two files, we use id and family to link the two data frames.

merge(file3, file4, by = c("id", "family"))
##   id family   role year rating
## 1  1      1  child 1995      A
## 2  1      2 father 1995      C
## 3  2      1  child 1995      B
## 4  2      2 mother 1995      B
## 5  3      1 mother 1995      B
## 6  3      2  child 1995      A

join operations

dplyr supports mutating joins that add columns from one data frame y to another data frame x and match observations based on the keys (identifiers).

inner_join() performs inner joins that only keeps observations from x that have a matching key in y. left_join(), right_join(), and full_join() perform outer joins that keep observations that appear in at least one of the data frames.

The sample data frames we have here are file6 and file7, which describe several companies in different industries and different years.

Note that these two data frames cannot be merged one-to-one or one-to-many, but need to be joined many-to-many.

file6
## # A tibble: 7 × 3
##   industry company  rank
##   <chr>    <chr>   <dbl>
## 1 A        a           1
## 2 A        b           2
## 3 A        c           3
## 4 B        l           4
## 5 B        m           5
## 6 B        n           6
## 7 C        o           7
file7
## # A tibble: 6 × 2
##   industry  year
##   <chr>    <dbl>
## 1 A         1995
## 2 A         2005
## 3 B         1995
## 4 B         2005
## 5 D         1995
## 6 D         2005

inner_join() returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

When a many-to-many relationship is expected, we should set multiple = "all".

file6 %>% inner_join(file7, by = "industry", multiple = "all")
## # A tibble: 12 × 4
##    industry company  rank  year
##    <chr>    <chr>   <dbl> <dbl>
##  1 A        a           1  1995
##  2 A        a           1  2005
##  3 A        b           2  1995
##  4 A        b           2  2005
##  5 A        c           3  1995
##  6 A        c           3  2005
##  7 B        l           4  1995
##  8 B        l           4  2005
##  9 B        m           5  1995
## 10 B        m           5  2005
## 11 B        n           6  1995
## 12 B        n           6  2005

left_join() returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

file6 %>% left_join(file7, by = "industry", multiple = "all")
## # A tibble: 13 × 4
##    industry company  rank  year
##    <chr>    <chr>   <dbl> <dbl>
##  1 A        a           1  1995
##  2 A        a           1  2005
##  3 A        b           2  1995
##  4 A        b           2  2005
##  5 A        c           3  1995
##  6 A        c           3  2005
##  7 B        l           4  1995
##  8 B        l           4  2005
##  9 B        m           5  1995
## 10 B        m           5  2005
## 11 B        n           6  1995
## 12 B        n           6  2005
## 13 C        o           7    NA

right_join() returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

file6 %>% right_join(file7, by = "industry", multiple = "all")
## # A tibble: 14 × 4
##    industry company  rank  year
##    <chr>    <chr>   <dbl> <dbl>
##  1 A        a           1  1995
##  2 A        a           1  2005
##  3 A        b           2  1995
##  4 A        b           2  2005
##  5 A        c           3  1995
##  6 A        c           3  2005
##  7 B        l           4  1995
##  8 B        l           4  2005
##  9 B        m           5  1995
## 10 B        m           5  2005
## 11 B        n           6  1995
## 12 B        n           6  2005
## 13 D        <NA>       NA  1995
## 14 D        <NA>       NA  2005

full_join() returns all rows and all columns from both x and y. Where there are unmatched values, it returns NA for the missing one.

file6 %>% full_join(file7, by = "industry", multiple = "all")
## # A tibble: 15 × 4
##    industry company  rank  year
##    <chr>    <chr>   <dbl> <dbl>
##  1 A        a           1  1995
##  2 A        a           1  2005
##  3 A        b           2  1995
##  4 A        b           2  2005
##  5 A        c           3  1995
##  6 A        c           3  2005
##  7 B        l           4  1995
##  8 B        l           4  2005
##  9 B        m           5  1995
## 10 B        m           5  2005
## 11 B        n           6  1995
## 12 B        n           6  2005
## 13 C        o           7    NA
## 14 D        <NA>       NA  1995
## 15 D        <NA>       NA  2005

merging two demo datasets

Now if we want to merge our demo datasets, we first need to look for the common identifier(s) in the two datasets. Here we only have one identifier, which is symbol.

sp500tickers:

names(sp500tickers)
## [1] "Symbol"                "Security"              "SEC filings"          
## [4] "GICS Sector"           "GICS Sub-Industry"     "Headquarters Location"
## [7] "Date first added"      "CIK"                   "Founded"

sp500stocks:

names(sp500stocks)
## [1] "symbol"   "date"     "open"     "high"     "low"      "close"    "volume"  
## [8] "adjusted"

However, the variable names are not consistent in two data sources. So let’s rename one of the symbols.

names(sp500tickers)[1] <- "symbol"

Next, we want to join the tickers dataset to the stocks dataset, and keep all cases in the stocks dataset.

sp500 <- sp500stocks %>% left_join(sp500tickers, by = "symbol")

We are not going to use all the variables, so we’ll drop some of them.

sp500 %>% select(symbol, `GICS Sector`, date, open, high, low, close)
## # A tibble: 10,100 × 7
##    symbol `GICS Sector` date        open  high   low close
##    <chr>  <chr>         <date>     <dbl> <dbl> <dbl> <dbl>
##  1 MMM    Industrials   2021-04-12  198.  199.  197.  198.
##  2 MMM    Industrials   2021-04-13  196.  197.  195.  196.
##  3 MMM    Industrials   2021-04-14  196.  197.  195.  197.
##  4 MMM    Industrials   2021-04-15  197.  198.  196.  197.
##  5 MMM    Industrials   2021-04-16  199.  200.  198.  199.
##  6 MMM    Industrials   2021-04-19  198.  199.  197.  199.
##  7 MMM    Industrials   2021-04-20  199.  199.  197.  198.
##  8 MMM    Industrials   2021-04-21  198.  202.  198.  201.
##  9 MMM    Industrials   2021-04-22  201.  202.  200.  201.
## 10 MMM    Industrials   2021-04-23  201   203.  201.  202.
## # … with 10,090 more rows

non-equality joins

We have focused on equality joins so far, where the rows match if the x key equals the y key.

dplyr also supports non-equality joins:

  • Cross joins match every pair of rows.
  • Inequality joins use <, <=, >, and >= instead of ==.
  • Rolling joins are similar to inequality joins but only find the closest match.
  • Overlap joins are a special type of inequality join designed to work with ranges.

We’ll return to inequality joins and rolling joins in the case study Tweets and Stock Price Impacts.