Chapter 4 Quality Control

4.1 Merging using Indices

One way to check your merging is by adding indices to each dataset before you merge. SAS can do this and merge your data in a single data step; in R, we need to add indices beforehand.

# Creating a duplicate row in the "patient" dataset:
patient <- bind_rows(patient, patient[4,])
patient

##   PT_ID Age Sex
## 1     1  69   M
## 2     2  54   F
## 3     3  70   F
## 4     4  64   M
## 5     4  64   M

patient$IndexP <- 1:nrow(patient)
toxicity$IndexT <- 1:nrow(toxicity)

pt_tox <- right_join(patient, toxicity, by="PT_ID")
pt_tox

##    PT_ID Age  Sex IndexP Time_since_dx toxicity_grade IndexT
## 1      1  69    M      1           151              2      1
## 2      1  69    M      1            46              1      2
## 3      1  69    M      1           262              3      3
## 4      2  54    F      2            89              1      4
## 5      2  54    F      2           277              4      5
## 6      3  70    F      3           192              2      6
## 7      4  64    M      4           193              1      7
## 8      4  64    M      4           195              1      8
## 9      4  64    M      5           193              1      7
## 10     4  64    M      5           195              1      8
## 11     5  NA <NA>     NA           124              3      9
## 12     5  NA <NA>     NA            84              1     10

4.2 Duplicates

We wanted to merge one-to-many, and we knew all patients in “toxicity” should also appear in “patient”. This should mean we have only 10 observations (same number of rows as the toxicity dataset). We’ve ended up with 12, meaning we have to resolve some duplicated observations.

The janitor package has a function get_dupe() for finding duplicated observations:

Identifying all duplicated rows from ‘patient’

# We should have duplicates here:
get_dupes(pt_tox, IndexP)

##    IndexP dupe_count PT_ID Age  Sex Time_since_dx toxicity_grade IndexT
## 1       1          3     1  69    M           151              2      1
## 2       1          3     1  69    M            46              1      2
## 3       1          3     1  69    M           262              3      3
## 4       2          2     2  54    F            89              1      4
## 5       2          2     2  54    F           277              4      5
## 6       4          2     4  64    M           193              1      7
## 7       4          2     4  64    M           195              1      8
## 8       5          2     4  64    M           193              1      7
## 9       5          2     4  64    M           195              1      8
## 10     NA          2     5  NA <NA>           124              3      9
## 11     NA          2     5  NA <NA>            84              1     10

Identifying all duplicated rows from ‘toxicity’

# should not return duplicates ina  one-to-many merge:
get_dupes(pt_tox, IndexT)

##   IndexT dupe_count PT_ID Age Sex IndexP Time_since_dx toxicity_grade
## 1      7          2     4  64   M      4           193              1
## 2      7          2     4  64   M      5           193              1
## 3      8          2     4  64   M      4           195              1
## 4      8          2     4  64   M      5           195              1

From here, we can look back at ‘patient’ and remove the duplicated observations:

get_dupes(patient, PT_ID)

##   PT_ID dupe_count Age Sex IndexP
## 1     4          2  64   M      4
## 2     4          2  64   M      5

patient <- patient[!duplicated(patient$PT_ID),]

And merge again:

pt_tox <- right_join(patient, toxicity, by="PT_ID")
get_dupes(pt_tox, IndexT)

## No duplicate combinations found of: IndexT

## [1] IndexT         dupe_count     PT_ID          Age            Sex           
## [6] IndexP         Time_since_dx  toxicity_grade
## <0 rows> (or 0-length row.names)

4.3 Dropped Observations

You can also use indices to keep track of the rows you’ve dropped during merging. anti_join() returns all rows in dataset A without a match in dataset B. (note: order matters)

patient$IndexP <- 1:nrow(patient)
treatment$IndexTr <- 1:nrow(treatment)

Rows in ‘patient’ that do not appear in ‘treatment’

anti_join(patient, treatment, by="PT_ID")

##   PT_ID Age Sex IndexP
## 1     3  70   F      3

and from ‘treatment’:

anti_join(treatment, patient, by="PT_ID")

##   PT_ID first_line_tx IndexTr
## 1     5     Radiation       4