Chapter 4 Quality Control
4.1 Merging using Indices
One way to check a merge is to add a row index to each dataset before merging, so that every row of the merged result can be traced back to the source row it came from. SAS can add the indices and merge the data in a single data step; in R, we add the indices beforehand.
# Creating a duplicate row in the "patient" dataset:
patient <- bind_rows(patient, patient[4,])
patient
## PT_ID Age Sex
## 1 1 69 M
## 2 2 54 F
## 3 3 70 F
## 4 4 64 M
## 5 4 64 M
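# Add a row index to each dataset so merged rows can be traced to their source
patient$IndexP <- seq_len(nrow(patient))
toxicity$IndexT <- seq_len(nrow(toxicity))
# Keep every row of "toxicity" and attach the matching patient information
pt_tox <- right_join(patient, toxicity, by = "PT_ID")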
pt_tox
## PT_ID Age Sex IndexP Time_since_dx toxicity_grade IndexT
## 1 1 69 M 1 151 2 1
## 2 1 69 M 1 46 1 2
## 3 1 69 M 1 262 3 3
## 4 2 54 F 2 89 1 4
## 5 2 54 F 2 277 4 5
## 6 3 70 F 3 192 2 6
## 7 4 64 M 4 193 1 7
## 8 4 64 M 4 195 1 8
## 9 4 64 M 5 193 1 7
## 10 4 64 M 5 195 1 8
## 11 5 NA <NA> NA 124 3 9
## 12 5 NA <NA> NA 84 1 10
4.2 Duplicates
We wanted a one-to-many merge, and we expected all patients in “toxicity” to also appear in “patient”. Since right_join() keeps every row of “toxicity”, the result should have exactly 10 observations (the same number of rows as the toxicity dataset). We’ve ended up with 12, which means some observations were duplicated and need to be resolved.
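A quick row-count check makes this expectation explicit. The sketch below rebuilds the two datasets from the output printed in section 4.1 so that it runs on its own; `extra_rows` and `repeated_tox` are names chosen here for illustration, not part of the original workflow.

```r
library(dplyr)

# Data reconstructed from the printed output in section 4.1
patient <- data.frame(
  PT_ID = c(1, 2, 3, 4, 4),
  Age   = c(69, 54, 70, 64, 64),
  Sex   = c("M", "F", "F", "M", "M")
)
toxicity <- data.frame(
  PT_ID          = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5),
  Time_since_dx  = c(151, 46, 262, 89, 277, 192, 193, 195, 124, 84),
  toxicity_grade = c(2, 1, 3, 1, 4, 2, 1, 1, 3, 1)
)
patient$IndexP  <- seq_len(nrow(patient))
toxicity$IndexT <- seq_len(nrow(toxicity))

pt_tox <- right_join(patient, toxicity, by = "PT_ID")

# Number of rows gained relative to the "many" side of the merge
extra_rows <- nrow(pt_tox) - nrow(toxicity)

# TRUE if any toxicity row (identified by IndexT) appears more than once
repeated_tox <- anyDuplicated(pt_tox$IndexT) > 0
```

With these data, `extra_rows` is 2 and `repeated_tox` is `TRUE`. dplyr 1.1.0 and later also warn about an unexpected many-to-many relationship when this join runs, which is a further hint that one of the keys is not unique.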
The janitor package has a function get_dupes() for finding duplicated observations:
- Merged rows that share an IndexP value, i.e. rows that came from the same row of ‘patient’ (expected in a one-to-many merge)
## IndexP dupe_count PT_ID Age Sex Time_since_dx toxicity_grade IndexT
## 1 1 3 1 69 M 151 2 1
## 2 1 3 1 69 M 46 1 2
## 3 1 3 1 69 M 262 3 3
## 4 2 2 2 54 F 89 1 4
## 5 2 2 2 54 F 277 4 5
## 6 4 2 4 64 M 193 1 7
## 7 4 2 4 64 M 195 1 8
## 8 5 2 4 64 M 193 1 7
## 9 5 2 4 64 M 195 1 8
## 10 NA 2 5 NA <NA> 124 3 9
## 11 NA 2 5 NA <NA> 84 1 10
- Merged rows that share an IndexT value, i.e. rows of ‘toxicity’ that were matched to more than one patient row (these are the unwanted duplicates)
## IndexT dupe_count PT_ID Age Sex IndexP Time_since_dx toxicity_grade
## 1 7 2 4 64 M 4 193 1
## 2 7 2 4 64 M 5 193 1
## 3 8 2 4 64 M 4 195 1
## 4 8 2 4 64 M 5 195 1
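The calls behind the two listings are not shown. A minimal sketch of how output of this shape could be produced with `get_dupes()` follows, with the datasets rebuilt from the printed output in section 4.1; the names `dupes_by_patient_row` and `dupes_by_toxicity_row` are illustrative assumptions.

```r
library(dplyr)
library(janitor)

# Data reconstructed from the printed output in section 4.1
patient <- data.frame(
  PT_ID = c(1, 2, 3, 4, 4),
  Age   = c(69, 54, 70, 64, 64),
  Sex   = c("M", "F", "F", "M", "M")
)
toxicity <- data.frame(
  PT_ID          = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5),
  Time_since_dx  = c(151, 46, 262, 89, 277, 192, 193, 195, 124, 84),
  toxicity_grade = c(2, 1, 3, 1, 4, 2, 1, 1, 3, 1)
)
patient$IndexP  <- seq_len(nrow(patient))
toxicity$IndexT <- seq_len(nrow(toxicity))
pt_tox <- right_join(patient, toxicity, by = "PT_ID")

# Merged rows whose IndexP occurs more than once
dupes_by_patient_row <- get_dupes(pt_tox, IndexP)

# Merged rows whose IndexT occurs more than once:
# each toxicity row should appear only once, so these need resolving
dupes_by_toxicity_row <- get_dupes(pt_tox, IndexT)
```

`get_dupes()` places the grouping column and a `dupe_count` column first, which is why the two listings order their columns differently.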
The repeated toxicity rows trace back to IndexP 4 and 5, so we can look back at ‘patient’, where PT_ID 4 appears twice, and remove the duplicated observation:
## PT_ID dupe_count Age Sex IndexP
## 1 4 2 64 M 4
## 2 4 2 64 M 5
After merging again, no toxicity row appears more than once:
## No duplicate combinations found of: IndexT
## [1] IndexT dupe_count PT_ID Age Sex
## [6] IndexP Time_since_dx toxicity_grade
## <0 rows> (or 0-length row.names)
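The code for dropping the duplicated patient row and merging again is not shown, so the following is only one possible sketch. It assumes `distinct()` from dplyr is an acceptable way to remove the repeat (it keeps the first of each set of identical patient records); `patient_clean` is a name introduced here for illustration.

```r
library(dplyr)
library(janitor)

# Data reconstructed from the printed output in section 4.1
patient <- data.frame(
  PT_ID = c(1, 2, 3, 4, 4),
  Age   = c(69, 54, 70, 64, 64),
  Sex   = c("M", "F", "F", "M", "M")
)
toxicity <- data.frame(
  PT_ID          = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5),
  Time_since_dx  = c(151, 46, 262, 89, 277, 192, 193, 195, 124, 84),
  toxicity_grade = c(2, 1, 3, 1, 4, 2, 1, 1, 3, 1)
)
patient$IndexP  <- seq_len(nrow(patient))
toxicity$IndexT <- seq_len(nrow(toxicity))

# Show which patient rows share a PT_ID (here IndexP 4 and 5)
get_dupes(patient, PT_ID)

# Keep the first row of each identical patient record;
# .keep_all = TRUE retains IndexP for the row that is kept
patient_clean <- distinct(patient, PT_ID, Age, Sex, .keep_all = TRUE)

# Merge again and confirm the one-to-many expectation now holds
pt_tox <- right_join(patient_clean, toxicity, by = "PT_ID")
remaining_dupes <- get_dupes(pt_tox, IndexT)
stopifnot(nrow(pt_tox) == nrow(toxicity))
```

If duplicated rows could differ in some column (for example, two different ages recorded for one PT_ID), `distinct()` on all columns would keep both, and the conflict would need to be resolved by hand.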
4.3 Dropped Observations
You can also use indices to keep track of the rows dropped during merging. anti_join(A, B) returns all rows in dataset A that have no match in dataset B. Order matters: anti_join(B, A) returns the unmatched rows of B instead.
Rows in ‘patient’ that do not appear in ‘treatment’:
## PT_ID Age Sex IndexP
## 1 3 70 F 3
Rows in ‘treatment’ that do not appear in ‘patient’:
## PT_ID first_line_tx IndexTr
## 1 5 Radiation 4
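The two listings above could come from a pair of `anti_join()` calls like the ones sketched below. The ‘treatment’ dataset is not printed in this section, so the table used here is a hypothetical stand-in: only the PT_ID 5 / Radiation / IndexTr 4 row comes from the output above, and the other three treatment values are placeholders.

```r
library(dplyr)

# De-duplicated patient data, as printed in section 4.1
patient <- data.frame(
  PT_ID = c(1, 2, 3, 4),
  Age   = c(69, 54, 70, 64),
  Sex   = c("M", "F", "F", "M")
)
patient$IndexP <- seq_len(nrow(patient))

# HYPOTHETICAL treatment table: rows 1-3 are placeholder values;
# only row 4 (PT_ID 5, "Radiation") is taken from the output above
treatment <- data.frame(
  PT_ID         = c(1, 2, 4, 5),
  first_line_tx = c("Chemotherapy", "Surgery", "Chemotherapy", "Radiation")
)
treatment$IndexTr <- seq_len(nrow(treatment))

# Rows of patient with no matching PT_ID in treatment
patient_only <- anti_join(patient, treatment, by = "PT_ID")

# Rows of treatment with no matching PT_ID in patient
treatment_only <- anti_join(treatment, patient, by = "PT_ID")
```

Because each result keeps its index column, `patient_only$IndexP` and `treatment_only$IndexTr` point directly to the source rows that an inner join of the two datasets would drop.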