3.9 Exercise
Load `RheumArth_Exercise_Ch3.RData`. The dataset in R will be called `exercise_dat`. Population for county of residence was randomly added to this dataset for the purpose of this exercise.
Examine the dataset using the methods in Section 2.6 to find data anomalies.
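As a starting point, loading and a first look might be sketched as follows. This assumes the file sits in the current working directory (the path is an assumption, so adjust it as needed), and it uses only `str()` and `summary()`, the two functions whose output appears later in this exercise; the methods of Section 2.6 may include others.

```r
# Assumption: the .RData file is in the current working directory
rdata_file <- "RheumArth_Exercise_Ch3.RData"

if (file.exists(rdata_file)) {
  # load() restores the saved objects and returns their names
  loaded <- load(rdata_file)
  print(loaded)                  # per the text, this should include "exercise_dat"

  str(exercise_dat)              # variable types and the first few values
  print(summary(exercise_dat))   # ranges, level counts, and NA totals
} else {
  message("File not found: ", rdata_file)
}
```

Implausible minimums or maximums, unexpected factor levels, and numeric codes standing in for missing values are the kinds of things to look for in the output.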
Clean the data using the methods in this Chapter.
- Fix data anomalies
- Deal with any missing value codes
- Use a log-transformation for `population`. Look at a histogram before and after the transformation. Does it look more symmetric after the transformation?
- Collapse `Yrs_From_Dx` into the following groups: “0 to < 5”, “5 to < 10”, “10 to < 25”, and “25 to 70”.
- Convert variables with just a few levels to factors. Use the labels shown in the “Codes” column of the codebook (RheumArth_Tx_AgeComparisons_Data Dictionary.pdf).
  - For the categorical version of `Yrs_From_Dx` you created in the previous step, if it is already a factor with labels `"[0,5)", "[5,10)", "[10,25)", "[25,70]"` then you do not need to convert it. Otherwise, convert it to a factor with those or similar labels of your choice.
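The mechanics of these steps can be sketched on a small toy data frame. Everything in `toy` is made up for illustration: the values, the `999` missing value code, and the 0/1 coding of `Biologics` are assumptions, not taken from `exercise_dat` or its codebook. This is one possible approach using base R, not necessarily the one used in this chapter.

```r
# Toy data only: values, the 999 missing value code, and the 0/1 coding
# are hypothetical and do not come from exercise_dat or the codebook
toy <- data.frame(
  Yrs_From_Dx = c(2, 5, 12, 70, 999, 9),
  population  = c(1000, 5000, 20000, 100000, 2500000, 4000),
  Biologics   = c(0, 1, 1, 0, 0, 1)
)

# Recode the (hypothetical) missing value code to NA
toy$Yrs_From_Dx[toy$Yrs_From_Dx == 999] <- NA

# Natural log transformation, with histograms before and after
toy$log_population <- log(toy$population)
par(mfrow = c(1, 2))
hist(toy$population,     main = "population")
hist(toy$log_population, main = "log(population)")
par(mfrow = c(1, 1))

# Collapse into [0,5), [5,10), [10,25), [25,70].
# right = FALSE makes intervals closed on the left and open on the right;
# include.lowest = TRUE closes the last interval so that 70 is included.
toy$Yrs_From_Dx_Group <- cut(toy$Yrs_From_Dx,
                             breaks = c(0, 5, 10, 25, 70),
                             right = FALSE,
                             include.lowest = TRUE)

# Convert a two-level numeric code to a labelled factor
toy$Biologics <- factor(toy$Biologics,
                        levels = c(0, 1),
                        labels = c("No", "Yes"))

# Check the grouping, including any missing values
print(table(toy$Yrs_From_Dx_Group, useNA = "ifany"))
```

Because `cut()` already returns a factor, the grouped variable needs no further conversion in this sketch.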
So you can check your work, here is a summary of the dataset AFTER data cleaning:
> summary(exercise_dat)
ID Age AgeGp Sex Yrs_From_Dx
Min. : 1.0 Min. :42.00 40 to 70 years (control):459 Female:428 Min. : 1.0
1st Qu.:162.2 1st Qu.:54.00 75 and older (elderly) : 71 Male :102 1st Qu.: 3.0
Median :294.5 Median :59.00 Median : 7.0
Mean :290.6 Mean :60.64 Mean : 9.4
3rd Qu.:426.8 3rd Qu.:66.00 3rd Qu.:11.0
Max. :559.0 Max. :90.00 Max. :70.0
NA's :1 NA's :15
CDAI CDAI_YN DAS_28 DAS28_YN Steroids_GT_5 DMARDs
Min. : 0.0 No :324 Min. : 0.000 No :464 No :405 Min. :0.0000
1st Qu.: 6.0 Yes:206 1st Qu.: 1.825 Yes: 66 Yes :124 1st Qu.:0.0000
Median :10.0 Median : 2.500 NA's: 1 Median :1.0000
Mean :13.1 Mean : 2.923 Mean :0.7183
3rd Qu.:17.0 3rd Qu.: 3.310 3rd Qu.:1.0000
Max. :71.0 Max. :23.000 Max. :1.0000
NA's :324 NA's :464 NA's :1
Biologics sDMARDS OsteopScreen FIPS population Yrs_From_Dx_Group
No :332 No :502 No :216 Min. : 1007 Min. : 820 [0,5) :188
Yes :197 Yes : 27 Yes :306 1st Qu.:19105 1st Qu.: 11214 [5,10) :142
NA's: 1 NA's: 1 NA's: 8 Median :29179 Median : 25736 [10,25):145
Mean :30484 Mean : 117983 [25,70]: 40
3rd Qu.:45779 3rd Qu.: 79186 NA's : 15
Max. :56019 Max. :3175692
log_population
Min. : 6.709
1st Qu.: 9.325
Median :10.156
Mean :10.367
3rd Qu.:11.280
Max. :14.971
> str(exercise_dat)
'data.frame': 530 obs. of 18 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Age : num 85 86 83 83 85 79 90 90 87 82 ...
$ AgeGp : Factor w/ 2 levels "40 to 70 years (control)",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 1 1 1 1 ...
$ Yrs_From_Dx : int 27 27 10 9 NA NA 51 11 36 4 ...
$ CDAI : num NA 23 14.5 NA NA NA NA 40 6 NA ...
$ CDAI_YN : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 1 2 2 1 ...
$ DAS_28 : num NA NA NA NA NA NA NA NA NA NA ...
$ DAS28_YN : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ Steroids_GT_5 : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 1 ...
$ DMARDs : int 1 1 1 1 0 0 1 0 0 1 ...
$ Biologics : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 2 1 2 1 ...
$ sDMARDS : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ OsteopScreen : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 2 2 ...
$ FIPS : int 31129 17133 41023 26147 5109 18017 47185 21041 17151 37001 ...
$ population : int 4148 34637 7199 159128 10718 37689 27345 10631 4177 169509 ...
$ Yrs_From_Dx_Group: Factor w/ 4 levels "[0,5)","[5,10)",..: 4 4 3 2 NA NA 4 3 4 1 ...
$ log_population : num 8.33 10.45 8.88 11.98 9.28 ...