Chapter 3 Types of Data Merging
3.1 One-to-one merge
Each unique identifier corresponds to only one row in both datasets.
## Joining, by = "PT_ID"
## PT_ID Age Sex first_line_tx
## 1 1 69 M Surgery
## 2 2 54 F Surgery
## 3 3 70 F <NA>
## 4 4 64 M Surgery+ Adjuvant
## 5 5 NA <NA> Radiation
3.2 One-to-many merge
Each unique identifier in dataset A corresponds to one row. Unique identifiers in dataset B may correspond to multiple rows.
## PT_ID Age Sex Time_since_dx toxicity_grade
## 1 1 69 M 151 2
## 2 1 69 M 46 1
## 3 1 69 M 262 3
## 4 2 54 F 89 1
## 5 2 54 F 277 4
## 6 3 70 F 192 2
## 7 4 64 M 193 1
## 8 4 64 M 195 1
## 9 5 NA <NA> 124 3
## 10 5 NA <NA> 84 1
3.3 Many-to-many
Identifiers can correspond to multiple rows in both datasets.
## PT_ID Time_since_dx.x toxicity_grade Time_since_dx.y HUS
## 1 1 151 2 NA NA
## 2 1 46 1 NA NA
## 3 1 262 3 NA NA
## 4 2 89 1 NA NA
## 5 2 277 4 NA NA
## 6 3 192 2 NA NA
## 7 4 193 1 47 1.000
## 8 4 193 1 193 0.933
## 9 4 193 1 195 0.933
## 10 4 193 1 361 0.877
## 11 4 195 1 47 1.000
## 12 4 195 1 193 0.933
## 13 4 195 1 195 0.933
## 14 4 195 1 361 0.877
## 15 5 124 3 17 0.654
## 16 5 124 3 84 0.654
## 17 5 124 3 218 0.933
## 18 5 124 3 273 0.741
## 19 5 84 1 17 0.654
## 20 5 84 1 84 0.654
## 21 5 84 1 218 0.933
## 22 5 84 1 273 0.741
3.4 Merging on multiple identifiers
Often when merging two datasets with multiple rows per patient, we want to match on multiple factors to avoid duplicates.
- Ex: merging toxicity and healthutil on patient identifier AND time since diagnosis will give us only the observations where toxicity and health utility were measured on the same day:
## PT_ID Time_since_dx toxicity_grade HUS
## 1 4 193 1 0.933
## 2 4 195 1 0.933
## 3 5 84 1 0.654
This is no longer a many-to-many merge; each row is unique with respect to both PT_ID and time in both datasets, so we’re merging one-to-one.