Chapter 2 Advantages of dplyr suite

2.1 Advantages

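  • intuitive naming conventions: the join type is in the function name (full_join(), left_join(), inner_join()) rather than in the all, all.x and all.y arguments of merge()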

  • doesn’t let you join on identifiers of different data classes, whereas base merge() silently coerces them (this can cause issues with medical record numbers, MRNs)

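# Store the treatment IDs as character to create a class mismatch with patient$PT_ID
treatment$PT_ID <- as.character(treatment$PT_ID)
# Base R merge() coerces the keys silently and joins anyway
merge(patient, treatment, by="PT_ID")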
##   PT_ID Age Sex     first_line_tx
## 1     1  69   M           Surgery
## 2     2  54   F           Surgery
## 3     4  64   M Surgery+ Adjuvant
# dplyr stops with an error instead of guessing how to reconcile the key types
full_join(patient, treatment, by="PT_ID")
## Error: Can't join on `x$PT_ID` x `y$PT_ID` because of incompatible types.
## i `x$PT_ID` is of type <double>>.
## i `y$PT_ID` is of type <character>>.
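
One way to get past this error is to make the key types match explicitly before joining. The following is a minimal sketch, assuming patient and treatment look like the tables printed in this chapter; the data frames are reconstructed from that printed output rather than taken from the original source files, and patient_chr and joined are illustrative names. The key is converted to character rather than numeric so that identifiers are kept as text.

```r
library(dplyr)

# Toy data reconstructed from the printed output in this chapter (an assumption).
patient <- data.frame(
  PT_ID = c(1, 2, 3, 4),                 # stored as double
  Age   = c(69, 54, 70, 64),
  Sex   = c("M", "F", "F", "M")
)
treatment <- data.frame(
  PT_ID         = c("1", "2", "4", "5"), # stored as character
  first_line_tx = c("Surgery", "Surgery", "Surgery+ Adjuvant", "Radiation")
)

# Convert the numeric key to character so both sides share a type, then join.
# The result is stored under a new name so `patient` itself is left unchanged.
patient_chr <- mutate(patient, PT_ID = as.character(PT_ID))
joined <- full_join(patient_chr, treatment, by = "PT_ID")
```

Because the conversion is written out, the choice of key type is visible in the script instead of being made implicitly by the join function.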

2.2 Disadvantages

  • doesn’t allow joining if one of your source datasets has multiple columns with the same name

# Add two unnamed columns to patient, then give both the same name
patient <- data.frame(patient,
                      c("Andrews","Benson","Cho","Doherty"),
                      c("","","","")
                      )
names(patient)[4:5] <- "Doctor"
print(patient)
##   PT_ID Age Sex  Doctor Doctor
## 1     1  69   M Andrews       
## 2     2  54   F  Benson       
## 3     3  70   F     Cho       
## 4     4  64   M Doherty
full_join(patient, treatment, by="PT_ID")
## Error: Input columns in `x` must be unique.
## x Problem with `Doctor`.

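However, we can clean this up using the janitor package, which makes the duplicated column names unique. The join below still requires PT_ID to have the same type in both tables, as shown in 2.1.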

library(janitor)
patient <- clean_names(patient) # note: this converts your variable names to lower snake_case and numbers duplicates (doctor, doctor_2)

full_join(patient, treatment, by=c("pt_id"="PT_ID"))
##   pt_id age  sex  doctor doctor_2     first_line_tx
## 1     1  69    M Andrews                    Surgery
## 2     2  54    F  Benson                    Surgery
## 3     3  70    F     Cho                       <NA>
## 4     4  64    M Doherty          Surgery+ Adjuvant
## 5     5  NA <NA>    <NA>     <NA>         Radiation
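
If adding a package is not an option, a similar repair can be sketched in base R with make.unique(), which keeps the first occurrence of a name and appends a numeric suffix to later ones. This is an alternative to the janitor approach above, not what this chapter's code uses, and the toy data frame (including the placeholder column names d1 and d2) is reconstructed from the printed output.

```r
# Toy data reconstructed from the printed output above (an assumption).
patient <- data.frame(
  PT_ID = c(1, 2, 3, 4),
  Age   = c(69, 54, 70, 64),
  Sex   = c("M", "F", "F", "M"),
  d1    = c("Andrews", "Benson", "Cho", "Doherty"),  # placeholder name
  d2    = c("", "", "", "")                          # placeholder name
)
names(patient)[4:5] <- "Doctor"   # recreate the duplicated column name

# make.unique() leaves the first "Doctor" alone and suffixes the second one.
names(patient) <- make.unique(names(patient), sep = "_")
names(patient)
```

Unlike clean_names(), this leaves the case of the names untouched, so the key is still called PT_ID and by="PT_ID" keeps working in the join.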