Chapter 6 Day 2
6.1 Outline for Day 2
Disclaimer: This training focuses on how to implement common tasks in R rather than on the underlying statistical concepts, with a few stats sprinkles along the way. I will show you the quick and basic way of doing things and give you an idea of the fancier approaches, which give you more flexibility but also have a steeper learning curve. You can of course dive deeper into these topics or request follow-up training. :)
Today we will cover the following topics:
- plotting (base R, ggplot2)
- regression models
- testing regression assumptions
- significance tests (chi-squared test, t-test)
6.1.1 Take-home messages to start with
- Work on a copy of your data in R - do not touch the raw data
- Work in R projects
- Structure your code with sections (Ctrl + Shift + R in RStudio)
- help(x) or “?x” is your friend when trying to understand function x. Scroll down to “Examples” in the help pane that opens
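As a minimal sketch of these habits, the snippet below uses the built-in `mtcars` data as a stand-in for your own raw data. The object names `raw_data` and `work_data` are made up for illustration, and the section names are only suggestions.

```r
# Data import ----
# A comment that ends in four dashes becomes a foldable section in RStudio;
# Ctrl + Shift + R inserts such a line for you.
raw_data  <- mtcars      # stand-in for your raw data (built-in example data)
work_data <- raw_data    # work on a copy and leave raw_data untouched

# Data cleaning ----
work_data$cyl <- factor(work_data$cyl)   # changes the copy only

# Getting help ----
if (interactive()) {
  help(factor)   # opens the help page for factor()
  ?factor        # shortcut for the same page
}
```

After running this, `raw_data` still holds the original numeric `cyl` column, so you can always start over from it.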
# load all required packages (p_load installs any that are missing, then attaches them)
pacman::p_load(dplyr, readr, ggplot2, corrplot, NHANES, survey, broom, lmtest, car)
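The commands that printed the output below are not shown. The dimensions (20,293 rows, 78 columns) match the `NHANESraw` data set that ships with the NHANES package, so calls along the following lines would probably produce this kind of output. This is a sketch under that assumption, and the object name `nhanes` is made up for illustration.

```r
library(dplyr)   # provides as_tibble() and glimpse()

# NHANESraw ships with the NHANES package (assumed to be the data used here)
nhanes <- as_tibble(NHANES::NHANESraw)   # `nhanes` is a hypothetical name

head(nhanes)      # first six rows, printed as a tibble
glimpse(nhanes)   # one line per column: name, type, first few values
```

`head()` is handy for a quick look at the first rows, while `glimpse()` shows every column at once, which is easier to read for wide data like this.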
## # A tibble: 6 x 78
## ID SurveyYr Gender Age AgeMonths Race1 Race3 Education
## <int> <fct> <fct> <int> <int> <fct> <fct> <fct>
## 1 51624 2009_10 male 34 409 White <NA> High School
## 2 51625 2009_10 male 4 49 Other <NA> <NA>
## 3 51626 2009_10 male 16 202 Black <NA> <NA>
## 4 51627 2009_10 male 10 131 Black <NA> <NA>
## 5 51628 2009_10 female 60 722 Black <NA> High School
## 6 51629 2009_10 male 26 313 Mexican <NA> 9 - 11th Gra~
## # ... with 70 more variables: MaritalStatus <fct>, HHIncome <fct>,
## # HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
## # HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>,
## # HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
## # BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>,
## # BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
## # BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
## # Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>,
## # UrineVol1 <int>, UrineFlow1 <dbl>, UrineVol2 <int>,
## # UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
## # HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
## # LittleInterest <fct>, Depressed <fct>, nPregnancies <int>,
## # nBabies <int>, Age1stBaby <int>, SleepHrsNight <int>,
## # SleepTrouble <fct>, PhysActive <fct>, PhysActiveDays <int>,
## # TVHrsDay <fct>, CompHrsDay <fct>, TVHrsDayChild <int>,
## # CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
## # AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>,
## # Smoke100 <fct>, SmokeAge <int>, Marijuana <fct>,
## # AgeFirstMarij <int>, RegularMarij <fct>, AgeRegMarij <int>,
## # HardDrugs <fct>, SexEver <fct>, SexAge <int>,
## # SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
## # SexOrientation <fct>, WTINT2YR <dbl>, WTMEC2YR <dbl>,
## # SDMVPSU <int>, SDMVSTRA <int>, PregnantNow <fct>
## Rows: 20,293
## Columns: 78
## $ ID <int> 51624, 51625, 51626, 51627, 51628, 51629~
## $ SurveyYr <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009~
## $ Gender <fct> male, male, male, male, female, male, fe~
## $ Age <int> 34, 4, 16, 10, 60, 26, 49, 1, 10, 80, 10~
## $ AgeMonths <int> 409, 49, 202, 131, 722, 313, 596, 12, 12~
## $ Race1 <fct> White, Other, Black, Black, Black, Mexic~
## $ Race3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Education <fct> High School, NA, NA, NA, High School, 9 ~
## $ MaritalStatus <fct> Married, NA, NA, NA, Widowed, Married, L~
## $ HHIncome <fct> 25000-34999, 20000-24999, 45000-54999, 2~
## $ HHIncomeMid <int> 30000, 22500, 50000, 22500, 12500, 30000~
## $ Poverty <dbl> 1.36, 1.07, 2.27, 0.81, 0.69, 1.01, 1.91~
## $ HomeRooms <int> 6, 9, 5, 6, 6, 4, 5, 5, 7, 4, 5, 5, 7, N~
## $ HomeOwn <fct> Own, Own, Own, Rent, Rent, Rent, Rent, R~
## $ Work <fct> NotWorking, NA, NotWorking, NA, NotWorki~
## $ Weight <dbl> 87.4, 17.0, 72.3, 39.8, 116.8, 97.6, 86.~
## $ Length <dbl> NA, NA, NA, NA, NA, NA, NA, 75.7, NA, NA~
## $ HeadCirc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Height <dbl> 164.7, 105.4, 181.3, 147.8, 166.0, 173.0~
## $ BMI <dbl> 32.22, 15.30, 22.00, 18.22, 42.39, 32.61~
## $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ BMI_WHO <fct> 30.0_plus, 12.0_18.5, 18.5_to_24.9, 12.0~
## $ Pulse <int> 70, NA, 68, 68, 72, 72, 86, NA, 70, 88, ~
## $ BPSysAve <int> 113, NA, 109, 93, 150, 104, 112, NA, 108~
## $ BPDiaAve <int> 85, NA, 59, 41, 68, 49, 75, NA, 53, 43, ~
## $ BPSys1 <int> 114, NA, 112, 92, 154, 102, 118, NA, 106~
## $ BPDia1 <int> 88, NA, 62, 36, 70, 50, 82, NA, 60, 62, ~
## $ BPSys2 <int> 114, NA, 114, 94, 150, 104, 108, NA, 106~
## $ BPDia2 <int> 88, NA, 60, 44, 68, 48, 74, NA, 50, 46, ~
## $ BPSys3 <int> 112, NA, 104, 92, 150, 104, 116, NA, 110~
## $ BPDia3 <int> 82, NA, 58, 38, 68, 50, 76, NA, 56, 40, ~
## $ Testosterone <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ DirectChol <dbl> 1.29, NA, 1.55, 1.89, 1.16, 1.16, 1.16, ~
## $ TotChol <dbl> 3.49, NA, 4.97, 4.16, 5.22, 4.14, 6.70, ~
## $ UrineVol1 <int> 352, NA, 281, 139, 30, 202, 77, NA, 39, ~
## $ UrineFlow1 <dbl> NA, NA, 0.415, 1.078, 0.476, 0.563, 0.09~
## $ UrineVol2 <int> NA, NA, NA, NA, 246, NA, NA, NA, NA, NA,~
## $ UrineFlow2 <dbl> NA, NA, NA, NA, 2.51, NA, NA, NA, NA, NA~
## $ Diabetes <fct> No, No, No, No, Yes, No, No, No, No, No,~
## $ DiabetesAge <int> NA, NA, NA, NA, 56, NA, NA, NA, NA, NA, ~
## $ HealthGen <fct> Good, NA, Vgood, NA, Fair, Good, Good, N~
## $ DaysPhysHlthBad <int> 0, NA, 2, NA, 20, 2, 0, NA, NA, 0, NA, 0~
## $ DaysMentHlthBad <int> 15, NA, 0, NA, 25, 14, 10, NA, NA, 0, NA~
## $ LittleInterest <fct> Most, NA, NA, NA, Most, None, Several, N~
## $ Depressed <fct> Several, NA, NA, NA, Most, Most, Several~
## $ nPregnancies <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA, NA~
## $ nBabies <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA, NA~
## $ Age1stBaby <int> NA, NA, NA, NA, NA, NA, 27, NA, NA, NA, ~
## $ SleepHrsNight <int> 4, NA, 8, NA, 4, 4, 8, NA, NA, 6, NA, 9,~
## $ SleepTrouble <fct> Yes, NA, No, NA, No, No, Yes, NA, NA, No~
## $ PhysActive <fct> No, NA, Yes, NA, No, Yes, No, NA, NA, Ye~
## $ PhysActiveDays <int> NA, NA, 5, NA, NA, 2, NA, NA, NA, 4, NA,~
## $ TVHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CompHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ TVHrsDayChild <int> NA, 4, NA, 1, NA, NA, NA, NA, 1, NA, 3, ~
## $ CompHrsDayChild <int> NA, 1, NA, 1, NA, NA, NA, NA, 0, NA, 0, ~
## $ Alcohol12PlusYr <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, NA, Y~
## $ AlcoholDay <int> NA, NA, NA, NA, NA, 19, 2, NA, NA, 1, NA~
## $ AlcoholYear <int> 0, NA, NA, NA, 0, 48, 20, NA, NA, 52, NA~
## $ SmokeNow <fct> No, NA, NA, NA, Yes, No, Yes, NA, NA, No~
## $ Smoke100 <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, NA, ~
## $ SmokeAge <int> 18, NA, NA, NA, 16, 15, 38, NA, NA, 16, ~
## $ Marijuana <fct> Yes, NA, NA, NA, NA, Yes, Yes, NA, NA, N~
## $ AgeFirstMarij <int> 17, NA, NA, NA, NA, 10, 18, NA, NA, NA, ~
## $ RegularMarij <fct> No, NA, NA, NA, NA, Yes, No, NA, NA, NA,~
## $ AgeRegMarij <int> NA, NA, NA, NA, NA, 12, NA, NA, NA, NA, ~
## $ HardDrugs <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, NA, N~
## $ SexEver <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, NA, ~
## $ SexAge <int> 16, NA, NA, NA, 15, 9, 12, NA, NA, NA, N~
## $ SexNumPartnLife <int> 8, NA, NA, NA, 4, 10, 10, NA, NA, NA, NA~
## $ SexNumPartYear <int> 1, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA,~
## $ SameSex <fct> No, NA, NA, NA, No, No, Yes, NA, NA, NA,~
## $ SexOrientation <fct> Heterosexual, NA, NA, NA, NA, Heterosexu~
## $ WTINT2YR <dbl> 80100.544, 53901.104, 13953.078, 11664.8~
## $ WTMEC2YR <dbl> 81528.772, 56995.035, 14509.279, 12041.6~
## $ SDMVPSU <int> 1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2~
## $ SDMVSTRA <int> 83, 79, 84, 86, 75, 88, 85, 86, 88, 77, ~
## $ PregnantNow <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## [1] "None" "Several" "Most"
##
## None Several Most <NA>
## 7926 1774 814 9779
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12.40 19.79 24.92 25.65 30.10 84.87 2279
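The last three outputs above look like the results of `levels()`, `table()` and `summary()`. The sketch below shows how such calls could be written. It assumes the data come from `NHANESraw`, and the choice of `LittleInterest` and `BMI` is a guess based on the values shown (`Depressed` has the same categories, for example). The object name `nhanes` is again hypothetical.

```r
nhanes <- NHANES::NHANESraw   # assumed data source, hypothetical object name

# Categories of a factor, in their stored order
levels(nhanes$LittleInterest)

# Counts per category; useNA = "always" adds a column for missing values
table(nhanes$LittleInterest, useNA = "always")

# Min, quartiles, median, mean, max and the number of NAs for a numeric column
summary(nhanes$BMI)
```

Including the missing values in `table()` is worth the extra argument: here roughly half of the observations are NA, which a plain `table()` call would hide.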