14 Data Manipulation (III)
In this and the following chapters, we discuss data frame manipulations. This chapter focuses on combining data frames by rows and by columns (merge/join).
14.1 Combining data frames
So far we have been working on our two demo datasets individually. The two datasets are certainly interesting on their own, but may generate more insights when combined together.
When should we combine data frames? There are a couple of scenarios where this may be necessary. As a motivating example, suppose we obtain datasets from an e-commerce company. The company stores information in different tables within their database, such as groceries, clothing, books, electronics, and so on. If we want to apply a model across all categories, we may need to combine all the tables into one.
In the second case, we receive datasets from the same e-commerce company, but they choose to store information for all categories in one table per day. If we’re using a model that requires historical data across time periods, we may need to combine all the tables into one again. In the first scenario, we combine data frames by columns, whereas in the second scenario, we combine them by rows.
Data frames can be combined vertically by rows or horizontally by columns.
rbind()
andcbind()
combine matrices by rows and columns.merge()
combines data frames by columns and is more flexible thancbind()
.- Additionally,
dplyr
supports mutating joins and filtering joins to link data frames using the database join language. We’ll discuss mutating joins in this chapter.
14.2 Binding data frames by rows
Let’s use two sample datasets to show how to append one data frame to another by rows. In this case, we are combining datasets from different years, and the variables are the same in two datasets.
## # A tibble: 6 × 5
## id year family role rating
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
## # A tibble: 6 × 5
## id year family role rating
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 2005 1 child A
## 2 2 2005 1 child A
## 3 3 2005 1 mother B
## 4 1 2005 2 father B
## 5 2 2005 2 mother C
## 6 3 2005 2 child B
The task is to combine file2
to file1
by rows vertically. This can be achieved with rbind()
.
## # A tibble: 12 × 5
## id year family role rating
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
## 7 1 2005 1 child A
## 8 2 2005 1 child A
## 9 3 2005 1 mother B
## 10 1 2005 2 father B
## 11 2 2005 2 mother C
## 12 3 2005 2 child B
Or using dplyr::bind_rows()
:
## # A tibble: 12 × 5
## id year family role rating
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
## 7 1 2005 1 child A
## 8 2 2005 1 child A
## 9 3 2005 1 mother B
## 10 1 2005 2 father B
## 11 2 2005 2 mother C
## 12 3 2005 2 child B
We usually bind datasets by rows when dealing with longitudinal data or datasets that record user behavior over a time period.
14.3 Binding data frames by columns
The terms we typically use to describe binding data frames by columns are merge and join. We discuss the base R merge()
function and dplyr
join functions in this section that perform merge or join operations.
There are several factors to consider when merging or joining two data frames by columns. Firstly, we need to identify a common variable within each data frame that links all the data frames (i.e. key in the database language). This variable should uniquely identifies each observation in each data frame. In some cases, a single variable is sufficient to identify each observation; in other cases, we may need a combination of multiple variables to create a group identifier.
Additionally, we need to decide how to match records across the tables, and whether to keep or discard them during the merge/join process.
merge()
merge()
merges two data frames by common columns or row names.
- In the simplest case, here we have
file1
andfile5
to merge by the common columnfamily
. This could happen when we have a survey, where there is a questionnaire for the household (id:family
) and another questionnaire for the family members (id:id
).
## # A tibble: 6 × 5
## id year family role rating
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
## # A tibble: 3 × 3
## family area SES
## <dbl> <chr> <dbl>
## 1 1 N 1
## 2 2 S 2
## 3 3 N 3
by
sets the identifier.
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
- There are cases where the same variables have different names in each data frame.
## # A tibble: 3 × 3
## Family Area SES
## <dbl> <chr> <dbl>
## 1 1 N 1
## 2 2 S 2
## 3 3 N 3
These columns need to be specified before merging.
## family id year role rating Area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
- Besides, there can be cases where variables with the same names have different meanings. We should always refer to the codebook before we start to merge data frames.
keeping and dropping unmatched cases
To decide if we would like to keep or drop the unmatched cases from either data frame, we may further specify the all
argument.
There are three possible outcomes for a row in the data frame to the left-hand side:
- If it doesn’t match anything, it’s dropped.
- If it matches 1 row in the data frame to the right-hand side, it’s preserved.
- If it matches more than 1 row in the data frame to the right-hand sidw, it’s duplicated once for each match.
These possibilities can be specified by the arguments all
, all.x
and all.y
.
With all = TRUE
we keep all observations from both files. Rows in file5
with no match in file1
are marked by NA
in the columns from file1
.
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
## 7 3 NA NA <NA> <NA> N 3
Without all = TRUE
, unmatched cases will be dropped.
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
If we set all.y
to be TRUE
, all non-missing information from file5
(y
) is kept, while information from file1
(x
) is missing.
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
## 7 3 NA NA <NA> <NA> N 3
Setting all.x = TRUE
keeps only complete cases from file5
.
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
merging using multiple identifiers
There are times when we need to merge data frames based on multiple common columns from the data frames to be merged.
For instance, in file3
, each record is identified by id
together with family
. In other words, an individual with the same id
can belong to different family
.
## # A tibble: 6 × 3
## id family role
## <dbl> <dbl> <chr>
## 1 1 1 child
## 2 2 1 child
## 3 3 1 mother
## 4 1 2 father
## 5 2 2 mother
## 6 3 2 child
Same for file4
, where each record is identified by id
and family
.
## # A tibble: 6 × 4
## id year family rating
## <dbl> <dbl> <dbl> <chr>
## 1 1 1995 1 A
## 2 2 1995 1 B
## 3 3 1995 1 B
## 4 1 1995 2 C
## 5 2 1995 2 B
## 6 3 1995 2 A
To merge these two files, we use id
and family
to link the two data frames.
## id family role year rating
## 1 1 1 child 1995 A
## 2 1 2 father 1995 C
## 3 2 1 child 1995 B
## 4 2 2 mother 1995 B
## 5 3 1 mother 1995 B
## 6 3 2 child 1995 A
join operations
dplyr
supports mutating joins that add columns from one data frame y
to another data frame x
and match observations based on the keys (identifiers).
inner_join()
performs inner joins that only keeps observations from x
that have a matching key in y
. left_join()
, right_join()
, and full_join()
perform outer joins that keep observations that appear in at least one of the data frames.
The sample data frames we have here are file6
and file7
, which describe several companies in different industries and different years.
Note that these two data frames cannot be merged one-to-one or one-to-many, but need to be joined many-to-many.
## # A tibble: 7 × 3
## industry company rank
## <chr> <chr> <dbl>
## 1 A a 1
## 2 A b 2
## 3 A c 3
## 4 B l 4
## 5 B m 5
## 6 B n 6
## 7 C o 7
## # A tibble: 6 × 2
## industry year
## <chr> <dbl>
## 1 A 1995
## 2 A 2005
## 3 B 1995
## 4 B 2005
## 5 D 1995
## 6 D 2005
inner_join()
returns all rows from x
where there are matching values in y
, and all columns from x
and y
. If there are multiple matches between x
and y
, all combinations of the matches are returned.
When a many-to-many relationship is expected, we should set multiple = "all"
.
## # A tibble: 12 × 4
## industry company rank year
## <chr> <chr> <dbl> <dbl>
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
left_join()
returns all rows from x
, and all columns from x
and y
. Rows in x
with no match in y
will have NA
values in the new columns. If there are multiple matches between x
and y
, all combinations of the matches are returned.
## # A tibble: 13 × 4
## industry company rank year
## <chr> <chr> <dbl> <dbl>
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
## 13 C o 7 NA
right_join()
returns all rows from y
, and all columns from x
and y
. Rows in y
with no match in x
will have NA
values in the new columns. If there are multiple matches between x
and y
, all combinations of the matches are returned.
## # A tibble: 14 × 4
## industry company rank year
## <chr> <chr> <dbl> <dbl>
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
## 13 D <NA> NA 1995
## 14 D <NA> NA 2005
full_join()
returns all rows and all columns from both x
and y
. Where there are unmatched values, it returns NA
for the missing one.
## # A tibble: 15 × 4
## industry company rank year
## <chr> <chr> <dbl> <dbl>
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
## 13 C o 7 NA
## 14 D <NA> NA 1995
## 15 D <NA> NA 2005
merging two demo datasets
Now if we want to merge our demo datasets, we first need to look for the common identifier(s) in the two datasets. Here we only have one identifier, which is symbol
.
sp500tickers
:
## [1] "Symbol" "Security" "SEC filings" "GICS Sector"
## [5] "GICS Sub-Industry" "Headquarters Location" "Date first added" "CIK"
## [9] "Founded"
sp500stocks
:
## [1] "symbol" "date" "open" "high" "low" "close" "volume" "adjusted"
However, the variable names are not consistent in two data sources. So let’s rename one of the symbol
s.
Next, we want to join the tickers dataset to the stocks dataset, and keep all cases in the stocks dataset.
We are not going to use all the variables, so we’ll drop some of them.
## # A tibble: 10,100 × 7
## symbol `GICS Sector` date open high low close
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
## 1 MMM Industrials 2021-04-12 198. 199. 197. 198.
## 2 MMM Industrials 2021-04-13 196. 197. 195. 196.
## 3 MMM Industrials 2021-04-14 196. 197. 195. 197.
## 4 MMM Industrials 2021-04-15 197. 198. 196. 197.
## 5 MMM Industrials 2021-04-16 199. 200. 198. 199.
## 6 MMM Industrials 2021-04-19 198. 199. 197. 199.
## 7 MMM Industrials 2021-04-20 199. 199. 197. 198.
## 8 MMM Industrials 2021-04-21 198. 202. 198. 201.
## 9 MMM Industrials 2021-04-22 201. 202. 200. 201.
## 10 MMM Industrials 2021-04-23 201 203. 201. 202.
## # ℹ 10,090 more rows
non-equality joins
We have focused on equality joins so far, where the rows match if the x
key equals the y
key.
dplyr
also supports non-equality joins:
- Cross joins match every pair of rows.
- Inequality joins use
<
,<=
,>
, and>=
instead of==
. - Rolling joins are similar to inequality joins but only find the closest match.
- Overlap joins are a special type of inequality join designed to work with ranges.
We’ll return to inequality joins and rolling joins in the case study Tweets and Stock Price Impacts.