Chapter 3 Types of Data Merging

3.1 One-to-one merge

Each unique identifier corresponds to only one row in both datasets.

full_join(patient, treatment)
## Joining, by = "PT_ID"
##   PT_ID Age  Sex     first_line_tx
## 1     1  69    M           Surgery
## 2     2  54    F           Surgery
## 3     3  70    F              <NA>
## 4     4  64    M Surgery+ Adjuvant
## 5     5  NA <NA>         Radiation

3.2 One-to-many merge

Each unique identifier in dataset A corresponds to one row. Unique identifiers in dataset B may correspond to multiple rows.

full_join(x=patient, y=toxicity,  by="PT_ID")
##    PT_ID Age  Sex Time_since_dx toxicity_grade
## 1      1  69    M           151              2
## 2      1  69    M            46              1
## 3      1  69    M           262              3
## 4      2  54    F            89              1
## 5      2  54    F           277              4
## 6      3  70    F           192              2
## 7      4  64    M           193              1
## 8      4  64    M           195              1
## 9      5  NA <NA>           124              3
## 10     5  NA <NA>            84              1

3.3 Many-to-many

Identifiers can correspond to multiple rows in both datasets.

full_join(x=toxicity, y=healthutil, by="PT_ID")
##    PT_ID Time_since_dx.x toxicity_grade Time_since_dx.y   HUS
## 1      1             151              2              NA    NA
## 2      1              46              1              NA    NA
## 3      1             262              3              NA    NA
## 4      2              89              1              NA    NA
## 5      2             277              4              NA    NA
## 6      3             192              2              NA    NA
## 7      4             193              1              47 1.000
## 8      4             193              1             193 0.933
## 9      4             193              1             195 0.933
## 10     4             193              1             361 0.877
## 11     4             195              1              47 1.000
## 12     4             195              1             193 0.933
## 13     4             195              1             195 0.933
## 14     4             195              1             361 0.877
## 15     5             124              3              17 0.654
## 16     5             124              3              84 0.654
## 17     5             124              3             218 0.933
## 18     5             124              3             273 0.741
## 19     5              84              1              17 0.654
## 20     5              84              1              84 0.654
## 21     5              84              1             218 0.933
## 22     5              84              1             273 0.741

3.4 Merging on multiple identifiers

Often when merging two datasets with multiple rows per patient, we want to match on multiple factors to avoid duplicates.

  • Ex: merging toxicity and healthutil on patient identifier AND time since diagnosis will give us only the observations where toxicity and health utility were measured on the same day:
inner_join(toxicity, healthutil, by=c("PT_ID", "Time_since_dx"))
##   PT_ID Time_since_dx toxicity_grade   HUS
## 1     4           193              1 0.933
## 2     4           195              1 0.933
## 3     5            84              1 0.654

This is no longer a many-to-many merge; each row is unique with respect to both PT_ID and time in both datasets, so we’re merging one-to-one.