Chapter 1 dplyr Merge Functions

The dplyr package has a suite of four functions for merging two datasets together. The syntax is borrowed from SQL joins.

1.1 Sample datasets, for illustration:

print(patient, row.names=F)
##  PT_ID Age Sex
##      1  69   M
##      2  54   F
##      3  70   F
##      4  64   M
print(treatment, row.names=F)
##  PT_ID     first_line_tx
##      1           Surgery
##      2           Surgery
##      4 Surgery+ Adjuvant
##      5         Radiation
print(toxicity, row.names=F)
##  PT_ID Time_since_dx toxicity_grade
##      1           151              2
##      1            46              1
##      1           262              3
##      2            89              1
##      2           277              4
##      3           192              2
##      4           193              1
##      4           195              1
##      5           124              3
##      5            84              1
print(healthutil, row.names=F)
##  PT_ID Time_since_dx   HUS
##      4            47 1.000
##      4           193 0.933
##      4           195 0.933
##      4           361 0.877
##      5            17 0.654
##      5            84 0.654
##      5           218 0.933
##      5           273 0.741

1.2 Full join: every observation from dataset A and B (the union of both sets)

full_join(patient, treatment, by="PT_ID")
##   PT_ID Age  Sex     first_line_tx
## 1     1  69    M           Surgery
## 2     2  54    F           Surgery
## 3     3  70    F              <NA>
## 4     4  64    M Surgery+ Adjuvant
## 5     5  NA <NA>         Radiation

1.3 Inner join: only observations in both A and B (the intersection of both sets)

inner_join(patient, treatment, by="PT_ID")
##   PT_ID Age Sex     first_line_tx
## 1     1  69   M           Surgery
## 2     2  54   F           Surgery
## 3     4  64   M Surgery+ Adjuvant

1.4 Left join: all observations from dataset A

left_join(patient, treatment, by="PT_ID")
##   PT_ID Age Sex     first_line_tx
## 1     1  69   M           Surgery
## 2     2  54   F           Surgery
## 3     3  70   F              <NA>
## 4     4  64   M Surgery+ Adjuvant

1.5 Right join: all observations from dataset B

right_join(patient, treatment, by="PT_ID")
##   PT_ID Age  Sex     first_line_tx
## 1     1  69    M           Surgery
## 2     2  54    F           Surgery
## 3     4  64    M Surgery+ Adjuvant
## 4     5  NA <NA>         Radiation