10 Compare datasets: Tidy R script file for data preparation can help you avoid many different versions of datasets

Data Wrangling Recipes in R: Hilary Watt

It is very valuable to have an R script file that prepares your dataset. It should start by reading in your data, then clean, check and create new variables as required. Variables that are not needed can be dropped. Then, if you find you need extra variables, you can add them to the script and rerun it.
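
As a rough illustration of that workflow, here is a minimal sketch of such a script. Everything in it is made up for the example: the data frame `patients`, its variable names, and the small temporary CSV file that stands in for the data file sent to you. It is not the anaemia dataset and not a prescribed layout, just one way the read, clean, check, create and drop steps might be ordered.

```r
# Sketch of a tidy data-preparation script (illustrative data only)

# 1. Read in the raw data; a temporary CSV stands in for the file sent to you
raw_file <- tempfile(fileext = ".csv")
writeLines(c("id,sex,weight_kg,height_cm,notes",
             "A1,female,60,165,ok",
             "A2,male,82,180,ok",
             "A3,male,75,,check height"),
           raw_file)
patients <- read.csv(raw_file, stringsAsFactors = FALSE)

# 2. Clean: store the categorical variable as a factor
patients$sex <- factor(patients$sex)

# 3. Check: look for missing or implausible values
summary(patients$height_cm)
table(patients$sex, useNA = "ifany")

# 4. Create new variables (BMI is missing wherever height is missing)
patients$bmi <- patients$weight_kg / (patients$height_cm / 100)^2

# 5. Drop variables that are not needed, but keep the ID variable
patients$notes <- NULL

str(patients)
```

Because every step lives in the script, adding one more variable later only means adding one more line and rerunning the whole file.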

Always remember to keep any ID variables in your dataset. They give you another way of adding in extra data later, by merging on the ID.
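
For example, if extra data arrive later, the ID variable lets you merge them in. The sketch below uses two small made-up data frames (`prepared` and `extra`, with a hypothetical `id` variable) and the base R function `merge()`.

```r
# Prepared dataset that kept its ID variable (hypothetical values)
prepared <- data.frame(id = c("A1", "A2", "A3"),
                       bmi = c(22.0, 25.3, 27.1))

# Extra data that arrive later, keyed on the same ID but in a different row order
extra <- data.frame(id = c("A3", "A1", "A2"),
                    smoker = c("no", "yes", "no"))

# Merge on the ID; all.x = TRUE keeps every row of the prepared data,
# even if an ID has no match in the extra data
combined <- merge(prepared, extra, by = "id", all.x = TRUE)
combined
```

The rows are matched on `id`, so it does not matter that the two data frames list the patients in a different order.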

With a tidy R script file, you can feel confident about overwriting your working dataset each time you rerun the script (whilst always keeping the original version that was sent to you). This avoids confusion over which version to use.

In case you do end up with several versions, here are some ways to compare data frames and see how similar they are.

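anaemia2$deathICU <- NULL             # drop a variable from anaemia2
anaemia2$test <- anaemia2$sex         # add a new variable to anaemia2
anaemia2$hb_pre <- anaemia2$hb_pre^2  # change the values of an existing variable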
intersect(names(anaemia), names(anaemia2)) # shows variables in common
##  [1] "id2"                             "sex"                            
##  [3] "operat"                          "hb_pre"                         
##  [5] "hb_post"                         "death"                          
##  [7] "return"                          "hpt"                            
##  [9] "weight"                          "height"                         
## [11] "FFP"                             "plts"                           
## [13] "rbc"                             "los_pre"                        
## [15] "los_post"                        "los"                            
## [17] "futime"                          "death_fu"                       
## [19] "date_operat"                     "datetime_end_fu"                
## [21] "wgt"                             "height2"                        
## [23] "factorx"                         "hb_diff"                        
## [25] "bmi"                             "log_los_post"                   
## [27] "los_post_0_5"                    "log_los_post_0_5"               
## [29] "height2.n"                       "height2.mm"                     
## [31] "factorx.n"                       "wgt2.n"                         
## [33] "plts.f"                          "plts2"                          
## [35] "plts.f.n"                        "operat.f"                       
## [37] "operat.n"                        "operat.f2"                      
## [39] "operat.n2"                       "urgent"                         
## [41] "sex3"                            "sex_first"                      
## [43] "long_post"                       "elective"                       
## [45] "BMI_grp"                         "ht_q5"                          
## [47] "death_fu2"                       "date_operat2"                   
## [49] "datetime_end_fu2"                "date_end_fu"                    
## [51] "followup_time_from5nov2022.days"
setdiff(names(anaemia), names(anaemia2)) # shows variables in anaemia alone
## [1] "deathICU" "dup"
setdiff(names(anaemia2), names(anaemia)) # shows variables in anaemia2 alone
## [1] "test"
all.equal(anaemia, anaemia2, check.names = FALSE) # compares variables
##  [1] "Attributes: < Component \"row.names\": Numeric: lengths (1040, 0) differ >"                           
##  [2] "Length mismatch: comparison on first 52 components"                                                   
##  [3] "Component \"id2\": Lengths (1040, 0) differ (string compare on first 0)"                              
##  [4] "Component \"sex\": Lengths: 1040, 0"                                                                  
##  [5] "Component \"sex\": Lengths (1040, 0) differ (string compare on first 0)"                              
##  [6] "Component \"operat\": Lengths: 1040, 0"                                                               
##  [7] "Component \"operat\": Lengths (1040, 0) differ (string compare on first 0)"                           
##  [8] "Component \"hb_pre\": Numeric: lengths (1040, 0) differ"                                              
##  [9] "Component \"hb_post\": Numeric: lengths (1040, 0) differ"                                             
## [10] "Component \"death\": Lengths: 1040, 0"                                                                
## [11] "Component \"death\": Lengths (1040, 0) differ (string compare on first 0)"                            
## [12] "Component 7: Lengths: 1040, 0"                                                                        
## [13] "Component 7: Attributes: < Component \"levels\": Lengths (4, 3) differ (string compare on first 3) >" 
## [14] "Component 7: Attributes: < Component \"levels\": 2 string mismatches >"                               
## [15] "Component 7: Lengths (1040, 0) differ (string compare on first 0)"                                    
## [16] "Component 8: Lengths: 1040, 0"                                                                        
## [17] "Component 8: Lengths (1011, 0) differ (string compare on first 0)"                                    
## [18] "Component 9: 'current' is not a factor"                                                               
## [19] "Component 10: Numeric: lengths (1040, 0) differ"                                                      
## [20] "Component 11: Numeric: lengths (1040, 0) differ"                                                      
## [21] "Component 12: Numeric: lengths (1040, 0) differ"                                                      
## [22] "Component 13: Numeric: lengths (1040, 0) differ"                                                      
## [23] "Component 14: Numeric: lengths (1040, 0) differ"                                                      
## [24] "Component 15: Numeric: lengths (1040, 0) differ"                                                      
## [25] "Component 16: Numeric: lengths (1040, 0) differ"                                                      
## [26] "Component 17: Numeric: lengths (1040, 0) differ"                                                      
## [27] "Component 18: Numeric: lengths (1040, 0) differ"                                                      
## [28] "Component 19: Modes: numeric, character"                                                              
## [29] "Component 19: Lengths: 1040, 0"                                                                       
## [30] "Component 19: target is numeric, current is character"                                                
## [31] "Component 20: Lengths (1040, 0) differ (string compare on first 0)"                                   
## [32] "Component 21: Lengths (1040, 0) differ (string compare on first 0)"                                   
## [33] "Component 22: Lengths (1040, 0) differ (string compare on first 0)"                                   
## [34] "Component 23: Lengths (1040, 0) differ (string compare on first 0)"                                   
## [35] "Component 24: Modes: character, numeric"                                                              
## [36] "Component 24: Lengths: 1040, 0"                                                                       
## [37] "Component 24: target is character, current is numeric"                                                
## [38] "Component 25: Numeric: lengths (1040, 0) differ"                                                      
## [39] "Component 26: Numeric: lengths (1040, 0) differ"                                                      
## [40] "Component 27: Numeric: lengths (1040, 0) differ"                                                      
## [41] "Component 28: Numeric: lengths (1040, 0) differ"                                                      
## [42] "Component 29: Numeric: lengths (1040, 0) differ"                                                      
## [43] "Component 30: Numeric: lengths (1040, 0) differ"                                                      
## [44] "Component 31: Numeric: lengths (1040, 0) differ"                                                      
## [45] "Component 32: Numeric: lengths (1040, 0) differ"                                                      
## [46] "Component 33: Lengths: 1040, 0"                                                                       
## [47] "Component 33: Attributes: < target is NULL, current is list >"                                        
## [48] "Component 33: target is numeric, current is factor"                                                   
## [49] "Component 34: 'current' is not a factor"                                                              
## [50] "Component 35: Numeric: lengths (1040, 0) differ"                                                      
## [51] "Component 36: Lengths: 1040, 0"                                                                       
## [52] "Component 36: Attributes: < target is NULL, current is list >"                                        
## [53] "Component 36: target is numeric, current is factor"                                                   
## [54] "Component 37: 'current' is not a factor"                                                              
## [55] "Component 38: Lengths: 1040, 0"                                                                       
## [56] "Component 38: Attributes: < target is NULL, current is list >"                                        
## [57] "Component 38: target is numeric, current is factor"                                                   
## [58] "Component 39: 'current' is not a factor"                                                              
## [59] "Component 40: Numeric: lengths (1040, 0) differ"                                                      
## [60] "Component 41: Lengths: 1040, 0"                                                                       
## [61] "Component 41: Attributes: < target is NULL, current is list >"                                        
## [62] "Component 41: target is numeric, current is factor"                                                   
## [63] "Component 42: Lengths: 1040, 0"                                                                       
## [64] "Component 42: Attributes: < Component \"levels\": Lengths (2, 3) differ (string compare on first 2) >"
## [65] "Component 42: Attributes: < Component \"levels\": 2 string mismatches >"                              
## [66] "Component 42: Lengths (1037, 0) differ (string compare on first 0)"                                   
## [67] "Component 43: 'current' is not a factor"                                                              
## [68] "Component 44: Numeric: lengths (1040, 0) differ"                                                      
## [69] "Component 45: Lengths: 1040, 0"                                                                       
## [70] "Component 45: Attributes: < target is NULL, current is list >"                                        
## [71] "Component 45: target is numeric, current is factor"                                                   
## [72] "Component 46: Lengths: 1040, 0"                                                                       
## [73] "Component 46: Attributes: < Component \"levels\": Lengths (4, 5) differ (string compare on first 4) >"
## [74] "Component 46: Attributes: < Component \"levels\": 4 string mismatches >"                              
## [75] "Component 46: Lengths (1021, 0) differ (string compare on first 0)"                                   
## [76] "Component 47: 'current' is not a factor"                                                              
## [77] "Component 48: Lengths: 1040, 0"                                                                       
## [78] "Component 48: Attributes: < target is NULL, current is list >"                                        
## [79] "Component 48: target is numeric, current is Date"                                                     
## [80] "Component 49: Lengths: 1040, 0"                                                                       
## [81] "Component 49: Attributes: < Length mismatch: comparison on first 1 components >"                      
## [82] "Component 49: Attributes: < Component \"class\": Lengths (1, 2) differ (string compare on first 1) >" 
## [83] "Component 49: Attributes: < Component \"class\": 1 string mismatch >"                                 
## [84] "Component 49: target is Date, current is POSIXct"                                                     
## [85] "Component 50: 'current' is not a POSIXt"                                                              
## [86] "Component 51: Lengths: 1040, 0"                                                                       
## [87] "Component 51: Attributes: < Modes: list, NULL >"                                                      
## [88] "Component 51: Attributes: < Lengths: 1, 0 >"                                                          
## [89] "Component 51: Attributes: < names for target but not for current >"                                   
## [90] "Component 51: Attributes: < current is not list-like >"                                               
## [91] "Component 51: target is Date, current is numeric"                                                     
## [92] "Component 52: Lengths: 1040, 0"                                                                       
## [93] "Component 52: Attributes: < target is NULL, current is list >"                                        
## [94] "Component 52: target is numeric, current is factor"
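
When the output of all.equal is this long, it may be easier to check the shared variables one at a time. The sketch below is one possible approach, shown on two small made-up data frames (`df_a` and `df_b`) rather than the anaemia data. It assumes that the rows are in the same order in both data frames.

```r
# Two small illustrative data frames (not the anaemia data)
df_a <- data.frame(id = c("A1", "A2", "A3"),
                   sex = c("female", "male", "male"),
                   hb_pre = c(120, 135, 150),
                   extra_a = 1:3)
df_b <- data.frame(id = c("A1", "A2", "A3"),
                   sex = c("female", "male", "male"),
                   hb_pre = c(120, 135, 150)^2,
                   extra_b = 3:1)

# Variables present in both data frames
common <- intersect(names(df_a), names(df_b))

# For each shared variable: TRUE if all.equal() finds no difference, FALSE otherwise
same <- sapply(common, function(v) isTRUE(all.equal(df_a[[v]], df_b[[v]])))
same
```

Here `same` is a named logical vector with one entry per shared variable, so the variables that differ (only `hb_pre` in this example) are easy to pick out.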

The waldo package might also be useful. Remember to install it once per computer, then use the library command in each R session in which you want to use it.

# install.packages("waldo")
library(waldo)
compare(anaemia2, anaemia)
## `old` is length 52
## `new` is length 53
## 
##     names(old) | names(new)     
## [4] "hb_pre"   | "hb_pre"   [4] 
## [5] "hb_post"  | "hb_post"  [5] 
## [6] "death"    | "death"    [6] 
##                - "deathICU" [7] 
## [7] "return"   | "return"   [8] 
## [8] "hpt"      | "hpt"      [9] 
## [9] "weight"   | "weight"   [10]
## 
## names(old)[49:52] vs names(new)[50:53]
##   "datetime_end_fu2"
##   "date_end_fu"
##   "followup_time_from5nov2022.days"
## - "test"
## + "dup"
## 
##     attr(old, 'row.names') | attr(new, 'row.names')                  
##                            - 392                    [1]              
##                            - 840                    [2]              
##                            - 429                    [3]              
##                            - 877                    [4]              
##                            - 794                    [5]              
##                            - 346                    [6]              
##                            - 617                    [7]              
##                            - 169                    [8]              
##                            - 12                     [9]              
##                            - 342                    [10]             
## ... ...                      ...                    and 1030 more ...
## 
##     old$id2 | new$id2                  
##             - "AB392" [1]              
##             - "AB840" [2]              
##             - "AB429" [3]              
##             - "AB877" [4]              
##             - "AB794" [5]              
##             - "AB346" [6]              
##             - "AB617" [7]              
##             - "AB169" [8]              
##             - "AB012" [9]              
##             - "AB342" [10]             
## ... ...       ...     and 1030 more ...
## 
##     old$sex | new$sex                   
##             - "female" [1]              
##             - "female" [2]              
##             - "female" [3]              
##             - "female" [4]              
##             - "male  " [5]              
##             - "male  " [6]              
##             - "male  " [7]              
##             - "male  " [8]              
##             - "male  " [9]              
##             - "male  " [10]             
## ... ...       ...      and 1030 more ...
## 
##     old$operat | new$operat                   
##                - "Elective " [1]              
##                - "Elective " [2]              
##                - "Elective " [3]              
##                - "Elective " [4]              
##                - "Urgent   " [5]              
##                - "Urgent   " [6]              
##                - "Elective " [7]              
##                - "Elective " [8]              
##                - "Emergency" [9]              
##                - "Urgent   " [10]             
## ... ...          ...         and 1030 more ...
## 
## `old$hb_pre` is a double vector ()
## `new$hb_pre` is an integer vector (185, 185, 178, 178, 175, ...)
## 
## `old$hb_post`:                                         and 1030 more...
## `new$hb_post`: 110 110 134 137 142 141 120 119 109 124              ...
## 
##     old$death | new$death                  
##               - "died "   [1]              
##               - "died "   [2]              
##               - "alive"   [3]              
##               - "alive"   [4]              
##               - "alive"   [5]              
##               - "alive"   [6]              
##               - "alive"   [7]              
##               - "alive"   [8]              
##               - "alive"   [9]              
##               - "alive"   [10]             
## ... ...         ...       and 1030 more ...
## 
## And 48 more differences ...

You can also compare by merging datasets. When you specify which variables to merge on, any other variable that is in both datasets appears in two versions (with .x and .y on the end of their names). You can then compare these two versions to see how similar they are.
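
A minimal sketch of this approach, using base R `merge()` on two small made-up data frames (`old_data` and `new_data`, with a hypothetical `id` variable) rather than the anaemia data:

```r
# Two versions of the same illustrative dataset; one hb_pre value differs
old_data <- data.frame(id = c("A1", "A2", "A3"),
                       hb_pre = c(120, 135, 150),
                       sex = c("female", "male", "male"))
new_data <- data.frame(id = c("A1", "A2", "A3"),
                       hb_pre = c(120, 140, 150),
                       sex = c("female", "male", "male"))

# Merge on the ID only; the other shared variables get .x (first dataset)
# and .y (second dataset) added to their names
both <- merge(old_data, new_data, by = "id")
names(both)

# Categorical variable: cross-tabulate the two versions
table(both$sex.x, both$sex.y, useNA = "ifany")

# Numeric variable: list the rows where the two versions differ
# (which() drops comparisons that are NA because a value is missing)
differs <- both[which(both$hb_pre.x != both$hb_pre.y),
                c("id", "hb_pre.x", "hb_pre.y")]
differs
```

In the cross-tabulation, any counts off the diagonal point to rows where the two versions disagree.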


The main dataset is called anaemia, available here: https://github.com/hcwatt/data_wrangling_open.

Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London.