4.6 Merge (join)

Merging, or joining, datasets is the process by which you combine datasets that have at least some of the same observations (rows) but different variables (columns). Typically, there is one variable in common, called the “key” variable, so that R knows which rows from one dataset match the rows of the other. It is also possible to have multiple key variables.

In a one-to-one merge, each row in one dataset matches at most one row in the other. Some rows may not have any match, and may or may not be included depending on what options you choose. In a one-to-many merge, each row in one dataset can match multiple rows in the other. This occurs, for example, when you have individual-level information in one dataset including what county they live in, and county-level information in another dataset. The row for a county in the county-level data will match with all individuals who live in that county.

NOTES:

If you are sure the two datasets are already in the same order, and have exactly the same number of rows, you could use cbind() or bind_cols() to bind columns together like you did for rows in the previous section, but it is safer to merge by matching on a key variable. Also, merging with a key variable allows merging datasets of different size.
Stack vs. Merge: When you stack two datasets, the result will have the same number of columns as the source datasets and the number of rows will be the sum of the numbers of rows from the source datasets. When you merge two datasets, the rows in the result depend on the type of merge (see below), and the columns in the result will be the key variable(s) + all the non-key columns from the first dataset + all the non-key columns from the second dataset.

Before merging, check the dimensions and names of the datasets, and make sure they have one or more names in common (using intersect()) and that these common names represent the key variable(s) you want to merge on. You may have to rename (Section 3.8) or remove (Section 4.1) variables if this is not the case.

load("Data/dat_merge_1.RData")
load("Data/dat_merge_2.RData")

# n1 columns
dim(dat_merge_1)

## [1] 530   7

# n2 columns
dim(dat_merge_2)

## [1] 530   8

names(dat_merge_1)

## [1] "ID"          "Age"         "AgeGp"       "Sex"         "Yrs_From_Dx"
## [6] "CDAI"        "CDAI_YN"

names(dat_merge_2)

## [1] "ID"            "DAS_28"        "DAS28_YN"      "Steroids_GT_5" "DMARDs"       
## [6] "Biologics"     "sDMARDS"       "OsteopScreen"

# ID is the key variable
intersect(names(dat_merge_1),
          names(dat_merge_2))

## [1] "ID"

dat <- merge(dat_merge_1,
             dat_merge_2)

# Number of columns = (n1 - 1) + (n2 - 1)
dim(dat)

## [1] 530  14

names(dat)

##  [1] "ID"            "Age"           "AgeGp"         "Sex"           "Yrs_From_Dx"  
##  [6] "CDAI"          "CDAI_YN"       "DAS_28"        "DAS28_YN"      "Steroids_GT_5"
## [11] "DMARDs"        "Biologics"     "sDMARDS"       "OsteopScreen"

By default, merge() uses all variables that are common to the two datasets as keys to merge on. If this is not what you want, then modify your datasets ahead of time and/or use the by option. See ?merge for more information.

NOTE: In the example above, the two datasets each had 530 observations and the same set of 530 IDs. If there are rows in one that do not match any rows in the other, then merge() will exclude those rows by default. You can choose to include these rows using the all, all.x, and all.y options (examples below). If including rows that do not match, missing values will be assigned to the variables from the dataset that did not have that row.

Below are examples demonstrating four possible ways in base R of merging two datasets that do not have all rows in common:

# Create the datasets needed for this example
# (two dataset that have some but not all rows in common)
X <- dat_merge_1[  1:400,]
Y <- dat_merge_2[201:500,]

# X has 400 rows
nrow(X)

## [1] 400

# Y has 300 rows
nrow(Y)

## [1] 300

# They have 200 rows in common (rows 201 to 400)

# By default, merge only keeps rows that are in both X and Y ("inner join")
Z <- merge(X, Y)
nrow(Z)

## [1] 200

# Keep all rows from X, but only those from Y that match ("left join")
Z <- merge(X, Y, all.x = T)
nrow(Z)

## [1] 400

# Keep all rows from Y, but only those from X that match ("right join")
Z <- merge(X, Y, all.y = T)
nrow(Z)

## [1] 300

# Keep all rows from either X or Y ("full join")
Z <- merge(X, Y, all = T)
nrow(Z)

## [1] 500

In tidyverse, a “merge” is called a “join” which is the term more commonly used in relational databases. There are four main types of joins: left_join, right_join, inner_join, and full_join, which correspond to base R merge() with various values of all.x, all.y, and all as shown in the previous example. Let X be the dataset on the left and Y be the dataset on the right.

inner_join() corresponds to merge(X, Y) – only retain rows that match between X and Y.
left_join() corresponds to merge(X, Y, all.x = T) – retain all rows from X, whether or not they match in Y.
right_join() corresponds to merge(X, Y, all.y = T) – retain all rows from Y, whether or not they match in X.
full_join() corresponds to merge(X, Y, all = T) – retain all rows from both X and Y, whether or not they match.

Also, in tidyverse, you can use these functions in a pipe.

dat <- inner_join(dat_merge_1,
                  dat_merge_2)
dim(dat)

## [1] 530  14

names(dat)

##  [1] "ID"            "Age"           "AgeGp"         "Sex"           "Yrs_From_Dx"  
##  [6] "CDAI"          "CDAI_YN"       "DAS_28"        "DAS28_YN"      "Steroids_GT_5"
## [11] "DMARDs"        "Biologics"     "sDMARDS"       "OsteopScreen"

# Or, if in a pipe
dat <- dat_merge_1 %>% 
  inner_join(dat_merge_2)
dim(dat)

## [1] 530  14

names(dat)

##  [1] "ID"            "Age"           "AgeGp"         "Sex"           "Yrs_From_Dx"  
##  [6] "CDAI"          "CDAI_YN"       "DAS_28"        "DAS28_YN"      "Steroids_GT_5"
## [11] "DMARDs"        "Biologics"     "sDMARDS"       "OsteopScreen"

The following illustrates the four types of joins using two datasets that do not have all rows in common (the same X and Y as used above).

# X has 400 rows
nrow(X)

## [1] 400

# Y has 300 rows
nrow(Y)

## [1] 300

# They have 200 rows in common (rows 201 to 400)

# Only keep rows that are in both X and Y
Z <- inner_join(X, Y)
nrow(Z)

## [1] 200

# Keep all rows from X, but only those from Y that match
Z <- left_join(X, Y)
nrow(Z)

## [1] 400

# Keep all rows from Y, but only those from X that match
Z <- right_join(X, Y)
nrow(Z)

## [1] 300

# Keep all rows from either X or Y ("full join")
Z <- full_join(X, Y)
nrow(Z)

## [1] 500

For more information on joining datasets using tidyverse, see R for Data Science: Relational Data (Wickham and Grolemund 2017).

4.6.1 Example of a one-to-many merge

For this example, I randomly added a county-level geographic identifier (FIPS code) to the rheumatoid arthritis data. The arthritis data does not actually have any geographic information; this is an artificial example that is just being used to demonstrate a one-to-many merge. I then obtained the population for each U.S county from the USAFACTS COVID-19 data (Source: USAFacts https://usafacts.org/). Now we can merge the county-level population values into the arthritis data.

load("Data/county.RData")

names(mydat)

##  [1] "ID"            "Age"           "AgeGp"         "Sex"           "Yrs_From_Dx"  
##  [6] "CDAI"          "CDAI_YN"       "DAS_28"        "DAS28_YN"      "Steroids_GT_5"
## [11] "DMARDs"        "Biologics"     "sDMARDS"       "OsteopScreen"  "FIPS"

names(county)

## [1] "countyFIPS"  "County Name" "State"       "population"

intersect(names(mydat),
          names(county))

## character(0)

The two datasets have no names are in common, but mydat has FIPS and county has countyFIPS. These contain the same information but have different names.

length(unique(county$countyFIPS))

## [1] 3142

nrow(mydat)

## [1] 530

length(unique(mydat$FIPS))

## [1] 25

There are 3142 unique counties. mydat has 530 observations but only 25 unique counties. Thus, when we merge in the population values, each of the 25 county populations will be matched with multiple observations in the arthritis data.

Use by.x and by.y to tell R that these contain the information to match on. Use all.x = T since we want to retain all the rows in X even if a row has missing or mis-spelled county information.

mydat_pop <- merge(mydat,
                   county,
                   by.x = "FIPS",
                   by.y = "countyFIPS",
                   all.x = T)

dim(mydat)

## [1] 530  15

dim(county)

## [1] 3142    4

dim(mydat_pop)

## [1] 530  18

names(mydat_pop)

##  [1] "FIPS"          "ID"            "Age"           "AgeGp"         "Sex"          
##  [6] "Yrs_From_Dx"   "CDAI"          "CDAI_YN"       "DAS_28"        "DAS28_YN"     
## [11] "Steroids_GT_5" "DMARDs"        "Biologics"     "sDMARDS"       "OsteopScreen" 
## [16] "County Name"   "State"         "population"

FIPS is retained, but not countyFIPS. Since this is the key (by) variable, retaining both would be redundant.

Let’s examine more closely to see how this was a one-to-many merge. The following are the unique FIPS values in the dataset and how many observations correspond to each.

table(mydat_pop$FIPS)

## 
##  1043  5111 13031 13065 13237 17135 17153 18019 21043 21223 23023 26061 26149 28003 
##    18    21    24    21    22    21    25    20    21    20    17    25    20    17 
## 31131 35009 36091 37003 40025 41025 46109 47187 48427 51595 55073 
##    20    17    26    19    21    25    26    25    22    20    17

Let’s pick the first two and examine the population values. You can see that all ID values with the same FIPS have the same population value, and those with different FIPS have different values. In other words, “one” population value was merged with “many” individuals.

mydat_pop %>% 
  filter(FIPS %in% c(1043, 5111)) %>% 
  select(ID, FIPS, population)

##     ID FIPS population
## 1  216 1043      83768
## 2  398 1043      83768
## 3  232 1043      83768
## 4  103 1043      83768
## 5   65 1043      83768
## 6  489 1043      83768
## 7  478 1043      83768
## 8  557 1043      83768
## 9  218 1043      83768
## 10 483 1043      83768
## 11 140 1043      83768
## 12 157 1043      83768
## 13  45 1043      83768
## 14 173 1043      83768
## 15 292 1043      83768
## 16 355 1043      83768
## 17 221 1043      83768
## 18 499 1043      83768
## 19 199 5111      23528
## 20 541 5111      23528
## 21 434 5111      23528
## 22 267 5111      23528
## 23  38 5111      23528
## 24  50 5111      23528
## 25 490 5111      23528
## 26 383 5111      23528
## 27 354 5111      23528
## 28  36 5111      23528
## 29 402 5111      23528
## 30 320 5111      23528
## 31  55 5111      23528
## 32 445 5111      23528
## 33 407 5111      23528
## 34 375 5111      23528
## 35 547 5111      23528
## 36 117 5111      23528
## 37 465 5111      23528
## 38 302 5111      23528
## 39 182 5111      23528

In tidyverse, use by to tell R that these two variables contain the information to match on. Use left_join() since we want to retain all the rows in mydat even if a row has missing or mis-spelled county information.

mydat_tibble_pop <- mydat_tibble %>% 
  left_join(county,
            by = c("FIPS" = "countyFIPS"))

dim(mydat_tibble)
dim(county)
dim(mydat_tibble_pop)
names(mydat_tibble_pop)

NOTE: The examples given used relatively clean data. If there are missing values for the key variable, you may get unexpected results. See the R help file ?inner_join for options regarding missing values.

References

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st edition. Sebastopol, CA: O’Reilly Media.