3.9 Exercise

  1. Load RheumArth_Exercise_Ch3.RData. Once loaded, the dataset will be available in R as exercise_dat. County-of-residence population was randomly added to this dataset for the purpose of this exercise.

  2. Examine the dataset using the methods in Section 2.6 to find data anomalies.

  3. Clean the data using the methods in this chapter:

     1. Fix data anomalies.
     2. Deal with any missing value codes.
     3. Apply a natural log transformation to population. Look at a histogram before and after the transformation. Does the distribution look more symmetric after the transformation?
     4. Collapse Yrs_From_Dx into the following groups: “0 to < 5”, “5 to < 10”, “10 to < 25”, and “25 to 70”.
     5. Convert variables with just a few levels to factors. Use the labels shown in the “Codes” column of the codebook (RheumArth_Tx_AgeComparisons_Data Dictionary.pdf).
     6. If the categorical version of Yrs_From_Dx you created in sub-step 4 is already a factor with labels "[0,5)", "[5,10)", "[10,25)", "[25,70]", you do not need to convert it. Otherwise, convert it to a factor with those labels or similar labels of your choice.
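
As a reminder of the mechanics involved, here is a minimal sketch on a small made-up data frame, not on exercise_dat itself. The data frame toy, its column names, its values, and the missing-value code 999 are all hypothetical and chosen only for illustration; the codes and variables in the actual dataset may differ, so check the codebook.

```r
# Toy data; every value here, including the 999 missing-value code, is hypothetical
toy <- data.frame(
  yrs  = c(0, 4, 5, 9, 10, 24, 25, 70, 999),
  pop  = c(820, 4148, 7199, 10631, 25736, 34637, 79186, 159128, 3175692),
  flag = c(0, 1, 1, 0, 0, 1, 0, 1, NA)
)

# Recode the (assumed) missing-value code to NA
toy$yrs[toy$yrs %in% 999] <- NA

# Natural log transformation, stored as a new variable
toy$log_pop <- log(toy$pop)

# Histograms before and after the transformation (interactive sessions only)
if (interactive()) {
  par(mfrow = c(1, 2))
  hist(toy$pop)
  hist(toy$log_pop)
  par(mfrow = c(1, 1))
}

# Collapse into left-closed intervals; include.lowest = TRUE closes the
# last interval on the right so that a value of exactly 70 is included
toy$yrs_group <- cut(toy$yrs,
                     breaks = c(0, 5, 10, 25, 70),
                     right = FALSE,
                     include.lowest = TRUE)

# Convert a 0/1 variable to a labeled factor
toy$flag <- factor(toy$flag, levels = c(0, 1), labels = c("No", "Yes"))

levels(toy$yrs_group)
table(toy$yrs_group, useNA = "ifany")
```

With right = FALSE and include.lowest = TRUE, cut() returns a factor whose default labels are "[0,5)", "[5,10)", "[10,25)", "[25,70]", so in that case no further conversion is needed.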

So you can check your work, here is a summary of the dataset AFTER data cleaning:

> summary(exercise_dat)
       ID             Age                             AgeGp         Sex       Yrs_From_Dx  
 Min.   :  1.0   Min.   :42.00   40 to 70 years (control):459   Female:428   Min.   : 1.0  
 1st Qu.:162.2   1st Qu.:54.00   75 and older (elderly)  : 71   Male  :102   1st Qu.: 3.0  
 Median :294.5   Median :59.00                                               Median : 7.0  
 Mean   :290.6   Mean   :60.64                                               Mean   : 9.4  
 3rd Qu.:426.8   3rd Qu.:66.00                                               3rd Qu.:11.0  
 Max.   :559.0   Max.   :90.00                                               Max.   :70.0  
                 NA's   :1                                                   NA's   :15    

      CDAI      CDAI_YN       DAS_28       DAS28_YN  Steroids_GT_5     DMARDs      
 Min.   : 0.0   No :324   Min.   : 0.000   No :464   No  :405      Min.   :0.0000  
 1st Qu.: 6.0   Yes:206   1st Qu.: 1.825   Yes: 66   Yes :124      1st Qu.:0.0000  
 Median :10.0             Median : 2.500             NA's:  1      Median :1.0000  
 Mean   :13.1             Mean   : 2.923                           Mean   :0.7183  
 3rd Qu.:17.0             3rd Qu.: 3.310                           3rd Qu.:1.0000  
 Max.   :71.0             Max.   :23.000                           Max.   :1.0000  
 NA's   :324              NA's   :464                              NA's   :1       

 Biologics  sDMARDS    OsteopScreen      FIPS         population      Yrs_From_Dx_Group
 No  :332   No  :502   No  :216     Min.   : 1007   Min.   :    820   [0,5)  :188      
 Yes :197   Yes : 27   Yes :306     1st Qu.:19105   1st Qu.:  11214   [5,10) :142      
 NA's:  1   NA's:  1   NA's:  8     Median :29179   Median :  25736   [10,25):145      
                                    Mean   :30484   Mean   : 117983   [25,70]: 40      
                                    3rd Qu.:45779   3rd Qu.:  79186   NA's   : 15      
                                    Max.   :56019   Max.   :3175692                    
                                                                                       
 log_population  
 Min.   : 6.709  
 1st Qu.: 9.325  
 Median :10.156  
 Mean   :10.367  
 3rd Qu.:11.280  
 Max.   :14.971  
                 
> str(exercise_dat)
'data.frame':   530 obs. of  18 variables:
 $ ID               : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Age              : num  85 86 83 83 85 79 90 90 87 82 ...
 $ AgeGp            : Factor w/ 2 levels "40 to 70 years (control)",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Sex              : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 1 1 1 1 ...
 $ Yrs_From_Dx      : int  27 27 10 9 NA NA 51 11 36 4 ...
 $ CDAI             : num  NA 23 14.5 NA NA NA NA 40 6 NA ...
 $ CDAI_YN          : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 1 2 2 1 ...
 $ DAS_28           : num  NA NA NA NA NA NA NA NA NA NA ...
 $ DAS28_YN         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ Steroids_GT_5    : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 1 ...
 $ DMARDs           : int  1 1 1 1 0 0 1 0 0 1 ...
 $ Biologics        : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 2 1 2 1 ...
 $ sDMARDS          : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ OsteopScreen     : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 2 2 ...
 $ FIPS             : int  31129 17133 41023 26147 5109 18017 47185 21041 17151 37001 ...
 $ population       : int  4148 34637 7199 159128 10718 37689 27345 10631 4177 169509 ...
 $ Yrs_From_Dx_Group: Factor w/ 4 levels "[0,5)","[5,10)",..: 4 4 3 2 NA NA 4 3 4 1 ...
 $ log_population   : num  8.33 10.45 8.88 11.98 9.28 ...
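
The printed summary can also be used for a couple of quick arithmetic checks. The sketch below uses only numbers copied from the output above; the object names pop_min, pop_max, and group_counts are introduced here for illustration.

```r
# Min. and Max. of population, copied from the printed summary
pop_min <- 820
pop_max <- 3175692

# Their natural logs should match the Min. and Max. of log_population
round(log(c(pop_min, pop_max)), 3)
# 6.709 14.971

# Counts for Yrs_From_Dx_Group (including NA's), copied from the summary;
# they should add up to the 530 observations reported by str()
group_counts <- c("[0,5)" = 188, "[5,10)" = 142, "[10,25)" = 145,
                  "[25,70]" = 40, "NA" = 15)
sum(group_counts)
# 530
```

On your own cleaned data, a similar check is all.equal(exercise_dat$log_population, log(exercise_dat$population)), which should return TRUE if the natural log was used.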