4.6 Merge (join)
Merging, or joining, datasets is the process by which you combine datasets that have at least some of the same observations (rows) but different variables (columns). Typically, there is one variable in common, called the “key” variable, so that R knows which rows from one dataset match the rows of the other. It is also possible to have multiple key variables.
In a one-to-one merge, each row in one dataset matches at most one row in the other. Some rows may not have any match, and may or may not be included depending on what options you choose. In a one-to-many merge, each row in one dataset can match multiple rows in the other. This occurs, for example, when you have individual-level information in one dataset including what county they live in, and county-level information in another dataset. The row for a county in the county-level data will match with all individuals who live in that county.
NOTES:
- If you are sure the two datasets are already in the same order, and have exactly the same number of rows, you could use
cbind()
orbind_cols()
to bind columns together like you did for rows in the previous section, but it is safer to merge by matching on a key variable. Also, merging with a key variable allows merging datasets of different size. - Stack vs. Merge: When you stack two datasets, the result will have the same number of columns as the source datasets and the number of rows will be the sum of the numbers of rows from the source datasets. When you merge two datasets, the rows in the result depend on the type of merge (see below), and the columns in the result will be the key variable(s) + all the non-key columns from the first dataset + all the non-key columns from the second dataset.
Before merging, check the dimensions and names of the datasets, and make sure they have one or more names in common (using intersect()
) and that these common names represent the key variable(s) you want to merge on. You may have to rename (Section 3.8) or remove (Section 4.1) variables if this is not the case.
## [1] 530 7
## [1] 530 8
## [1] "ID" "Age" "AgeGp" "Sex" "Yrs_From_Dx"
## [6] "CDAI" "CDAI_YN"
## [1] "ID" "DAS_28" "DAS28_YN" "Steroids_GT_5" "DMARDs"
## [6] "Biologics" "sDMARDS" "OsteopScreen"
## [1] "ID"
## [1] 530 14
## [1] "ID" "Age" "AgeGp" "Sex" "Yrs_From_Dx"
## [6] "CDAI" "CDAI_YN" "DAS_28" "DAS28_YN" "Steroids_GT_5"
## [11] "DMARDs" "Biologics" "sDMARDS" "OsteopScreen"
By default, merge()
uses all variables that are common to the two datasets as keys to merge on. If this is not what you want, then modify your datasets ahead of time and/or use the by
option. See ?merge
for more information.
NOTE: In the example above, the two datasets each had 530 observations and the same set of 530 IDs. If there are rows in one that do not match any rows in the other, then merge()
will exclude those rows by default. You can choose to include these rows using the all
, all.x
, and all.y
options (examples below). If including rows that do not match, missing values will be assigned to the variables from the dataset that did not have that row.
Below are examples demonstrating four possible ways in base R of merging two datasets that do not have all rows in common:
# Create the datasets needed for this example
# (two dataset that have some but not all rows in common)
X <- dat_merge_1[ 1:400,]
Y <- dat_merge_2[201:500,]
# X has 400 rows
nrow(X)
## [1] 400
## [1] 300
# They have 200 rows in common (rows 201 to 400)
# By default, merge only keeps rows that are in both X and Y ("inner join")
Z <- merge(X, Y)
nrow(Z)
## [1] 200
# Keep all rows from X, but only those from Y that match ("left join")
Z <- merge(X, Y, all.x = T)
nrow(Z)
## [1] 400
# Keep all rows from Y, but only those from X that match ("right join")
Z <- merge(X, Y, all.y = T)
nrow(Z)
## [1] 300
## [1] 500
In tidyverse
, a “merge” is called a “join” which is the term more commonly used in relational databases. There are four main types of joins: left_join, right_join, inner_join, and full_join, which correspond to base R merge()
with various values of all.x
, all.y
, and all
as shown in the previous example. Let X
be the dataset on the left and Y
be the dataset on the right.
- inner_join() corresponds to
merge(X, Y)
– only retain rows that match betweenX
andY
. - left_join() corresponds to
merge(X, Y, all.x = T)
– retain all rows fromX
, whether or not they match inY
. - right_join() corresponds to
merge(X, Y, all.y = T)
– retain all rows fromY
, whether or not they match inX
. - full_join() corresponds to
merge(X, Y, all = T)
– retain all rows from bothX
andY
, whether or not they match.
Also, in tidyverse, you can use these functions in a pipe.
## [1] 530 14
## [1] "ID" "Age" "AgeGp" "Sex" "Yrs_From_Dx"
## [6] "CDAI" "CDAI_YN" "DAS_28" "DAS28_YN" "Steroids_GT_5"
## [11] "DMARDs" "Biologics" "sDMARDS" "OsteopScreen"
## [1] 530 14
## [1] "ID" "Age" "AgeGp" "Sex" "Yrs_From_Dx"
## [6] "CDAI" "CDAI_YN" "DAS_28" "DAS28_YN" "Steroids_GT_5"
## [11] "DMARDs" "Biologics" "sDMARDS" "OsteopScreen"
The following illustrates the four types of joins using two datasets that do not have all rows in common (the same X
and Y
as used above).
## [1] 400
## [1] 300
# They have 200 rows in common (rows 201 to 400)
# Only keep rows that are in both X and Y
Z <- inner_join(X, Y)
nrow(Z)
## [1] 200
## [1] 400
## [1] 300
## [1] 500
For more information on joining datasets using tidyverse
, see R for Data Science: Relational Data (Wickham and Grolemund 2017).
4.6.1 Example of a one-to-many merge
For this example, I randomly added a county-level geographic identifier (FIPS code) to the rheumatoid arthritis data. The arthritis data does not actually have any geographic information; this is an artificial example that is just being used to demonstrate a one-to-many merge. I then obtained the population for each U.S county from the USAFACTS COVID-19 data (Source: USAFacts https://usafacts.org/). Now we can merge the county-level population values into the arthritis data.
## [1] "ID" "Age" "AgeGp" "Sex" "Yrs_From_Dx"
## [6] "CDAI" "CDAI_YN" "DAS_28" "DAS28_YN" "Steroids_GT_5"
## [11] "DMARDs" "Biologics" "sDMARDS" "OsteopScreen" "FIPS"
## [1] "countyFIPS" "County Name" "State" "population"
## character(0)
The two datasets have no names are in common, but mydat
has FIPS
and county
has countyFIPS
. These contain the same information but have different names.
## [1] 3142
## [1] 530
## [1] 25
There are 3142 unique counties. mydat
has 530 observations but only 25 unique counties. Thus, when we merge in the population values, each of the 25 county populations will be matched with multiple observations in the arthritis data.
Use by.x
and by.y
to tell R that these contain the information to match on. Use all.x = T
since we want to retain all the rows in X
even if a row has missing or mis-spelled county information.
## [1] 530 15
## [1] 3142 4
## [1] 530 18
## [1] "FIPS" "ID" "Age" "AgeGp" "Sex"
## [6] "Yrs_From_Dx" "CDAI" "CDAI_YN" "DAS_28" "DAS28_YN"
## [11] "Steroids_GT_5" "DMARDs" "Biologics" "sDMARDS" "OsteopScreen"
## [16] "County Name" "State" "population"
FIPS
is retained, but not countyFIPS
. Since this is the key (by) variable, retaining both would be redundant.
Let’s examine more closely to see how this was a one-to-many merge. The following are the unique FIPS values in the dataset and how many observations correspond to each.
##
## 1043 5111 13031 13065 13237 17135 17153 18019 21043 21223 23023 26061 26149 28003
## 18 21 24 21 22 21 25 20 21 20 17 25 20 17
## 31131 35009 36091 37003 40025 41025 46109 47187 48427 51595 55073
## 20 17 26 19 21 25 26 25 22 20 17
Let’s pick the first two and examine the population values. You can see that all ID
values with the same FIPS have the same population value, and those with different FIPS have different values. In other words, “one” population value was merged with “many” individuals.
## ID FIPS population
## 1 216 1043 83768
## 2 398 1043 83768
## 3 232 1043 83768
## 4 103 1043 83768
## 5 65 1043 83768
## 6 489 1043 83768
## 7 478 1043 83768
## 8 557 1043 83768
## 9 218 1043 83768
## 10 483 1043 83768
## 11 140 1043 83768
## 12 157 1043 83768
## 13 45 1043 83768
## 14 173 1043 83768
## 15 292 1043 83768
## 16 355 1043 83768
## 17 221 1043 83768
## 18 499 1043 83768
## 19 199 5111 23528
## 20 541 5111 23528
## 21 434 5111 23528
## 22 267 5111 23528
## 23 38 5111 23528
## 24 50 5111 23528
## 25 490 5111 23528
## 26 383 5111 23528
## 27 354 5111 23528
## 28 36 5111 23528
## 29 402 5111 23528
## 30 320 5111 23528
## 31 55 5111 23528
## 32 445 5111 23528
## 33 407 5111 23528
## 34 375 5111 23528
## 35 547 5111 23528
## 36 117 5111 23528
## 37 465 5111 23528
## 38 302 5111 23528
## 39 182 5111 23528
In tidyverse
, use by
to tell R that these two variables contain the information to match on. Use left_join()
since we want to retain all the rows in mydat
even if a row has missing or mis-spelled county information.
mydat_tibble_pop <- mydat_tibble %>%
left_join(county,
by = c("FIPS" = "countyFIPS"))
dim(mydat_tibble)
dim(county)
dim(mydat_tibble_pop)
names(mydat_tibble_pop)
NOTE: The examples given used relatively clean data. If there are missing values for the key variable, you may get unexpected results. See the R help file ?inner_join
for options regarding missing values.