Chapter 6 Day 2

6.1 Outline for Day 2

Disclaimer: This training focuses on how to implement common tasks in R rather than on the statistical concepts behind them, with a few stats sprinkles along the way. I'll show you the quick and basic way of doing things and give you an idea of the fancier approaches, which give you more flexibility but also have a steeper learning curve. You can of course dive deeper into those topics or request follow-up training. :)

Today we will cover the following topics:

  • plotting (base R, ggplot2)
  • regression models
  • testing regression assumptions
  • significance tests (chi-squared test, t-test)

6.1.1 Take home messages to start with

  1. Work on a copy of your data in R - do not touch the raw data
  2. Work in R projects
  3. Structure your code with sections (Ctrl + Shift + R in RStudio)
  4. help(x) or “?x” is your friend when trying to understand a function x. Scroll down to “Examples” in the help pane that opens
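Here is a minimal sketch of points 1, 3 and 4 in practice. It uses the built-in airquality data set as a stand-in for your own raw data; the object names raw_data and dat and the TempC column are made up for illustration. In RStudio, a comment line that ends in four or more dashes is treated as a section, which is what Ctrl + Shift + R inserts for you.

```r
# Load raw data ----
raw_data <- airquality   # built-in data set, stand-in for your own raw data

# Create a working copy ----
dat <- raw_data          # all changes happen on the copy, never on raw_data

# Add a derived variable to the copy ----
dat$TempC <- (dat$Temp - 32) * 5 / 9   # Fahrenheit to Celsius

# Check that the raw data are untouched ----
identical(raw_data, airquality)

# Get help on a function ----
if (interactive()) {
  help(mean)     # same as ?mean
  example(mean)  # runs the code from the "Examples" part of the help page
}
```

Because R copies an object once you modify it, adding TempC to dat leaves raw_data exactly as it was, so you can always go back to the original.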
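# load all required packages (p_load also installs any that are missing)
pacman::p_load(dplyr, readr, ggplot2, corrplot, NHANES, survey, broom, lmtest, car)
# work on a copy of the raw data
# (assumption: nhanes is a copy of NHANESraw from the NHANES package,
#  which matches the 20,293 rows x 78 columns shown below)
nhanes <- NHANESraw
# inspect data
head(nhanes)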
## # A tibble: 6 x 78
##      ID SurveyYr Gender   Age AgeMonths Race1   Race3 Education    
##   <int> <fct>    <fct>  <int>     <int> <fct>   <fct> <fct>        
## 1 51624 2009_10  male      34       409 White   <NA>  High School  
## 2 51625 2009_10  male       4        49 Other   <NA>  <NA>         
## 3 51626 2009_10  male      16       202 Black   <NA>  <NA>         
## 4 51627 2009_10  male      10       131 Black   <NA>  <NA>         
## 5 51628 2009_10  female    60       722 Black   <NA>  High School  
## 6 51629 2009_10  male      26       313 Mexican <NA>  9 - 11th Gra~
## # ... with 70 more variables: MaritalStatus <fct>, HHIncome <fct>,
## #   HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
## #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>,
## #   HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
## #   BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>,
## #   BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
## #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
## #   Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>,
## #   UrineVol1 <int>, UrineFlow1 <dbl>, UrineVol2 <int>,
## #   UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
## #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
## #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>,
## #   nBabies <int>, Age1stBaby <int>, SleepHrsNight <int>,
## #   SleepTrouble <fct>, PhysActive <fct>, PhysActiveDays <int>,
## #   TVHrsDay <fct>, CompHrsDay <fct>, TVHrsDayChild <int>,
## #   CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
## #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>,
## #   Smoke100 <fct>, SmokeAge <int>, Marijuana <fct>,
## #   AgeFirstMarij <int>, RegularMarij <fct>, AgeRegMarij <int>,
## #   HardDrugs <fct>, SexEver <fct>, SexAge <int>,
## #   SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
## #   SexOrientation <fct>, WTINT2YR <dbl>, WTMEC2YR <dbl>,
## #   SDMVPSU <int>, SDMVSTRA <int>, PregnantNow <fct>
dplyr::glimpse(nhanes) # if there are many columns (from dplyr package)
## Rows: 20,293
## Columns: 78
## $ ID               <int> 51624, 51625, 51626, 51627, 51628, 51629~
## $ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009~
## $ Gender           <fct> male, male, male, male, female, male, fe~
## $ Age              <int> 34, 4, 16, 10, 60, 26, 49, 1, 10, 80, 10~
## $ AgeMonths        <int> 409, 49, 202, 131, 722, 313, 596, 12, 12~
## $ Race1            <fct> White, Other, Black, Black, Black, Mexic~
## $ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Education        <fct> High School, NA, NA, NA, High School, 9 ~
## $ MaritalStatus    <fct> Married, NA, NA, NA, Widowed, Married, L~
## $ HHIncome         <fct> 25000-34999, 20000-24999, 45000-54999, 2~
## $ HHIncomeMid      <int> 30000, 22500, 50000, 22500, 12500, 30000~
## $ Poverty          <dbl> 1.36, 1.07, 2.27, 0.81, 0.69, 1.01, 1.91~
## $ HomeRooms        <int> 6, 9, 5, 6, 6, 4, 5, 5, 7, 4, 5, 5, 7, N~
## $ HomeOwn          <fct> Own, Own, Own, Rent, Rent, Rent, Rent, R~
## $ Work             <fct> NotWorking, NA, NotWorking, NA, NotWorki~
## $ Weight           <dbl> 87.4, 17.0, 72.3, 39.8, 116.8, 97.6, 86.~
## $ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, 75.7, NA, NA~
## $ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Height           <dbl> 164.7, 105.4, 181.3, 147.8, 166.0, 173.0~
## $ BMI              <dbl> 32.22, 15.30, 22.00, 18.22, 42.39, 32.61~
## $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ BMI_WHO          <fct> 30.0_plus, 12.0_18.5, 18.5_to_24.9, 12.0~
## $ Pulse            <int> 70, NA, 68, 68, 72, 72, 86, NA, 70, 88, ~
## $ BPSysAve         <int> 113, NA, 109, 93, 150, 104, 112, NA, 108~
## $ BPDiaAve         <int> 85, NA, 59, 41, 68, 49, 75, NA, 53, 43, ~
## $ BPSys1           <int> 114, NA, 112, 92, 154, 102, 118, NA, 106~
## $ BPDia1           <int> 88, NA, 62, 36, 70, 50, 82, NA, 60, 62, ~
## $ BPSys2           <int> 114, NA, 114, 94, 150, 104, 108, NA, 106~
## $ BPDia2           <int> 88, NA, 60, 44, 68, 48, 74, NA, 50, 46, ~
## $ BPSys3           <int> 112, NA, 104, 92, 150, 104, 116, NA, 110~
## $ BPDia3           <int> 82, NA, 58, 38, 68, 50, 76, NA, 56, 40, ~
## $ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ DirectChol       <dbl> 1.29, NA, 1.55, 1.89, 1.16, 1.16, 1.16, ~
## $ TotChol          <dbl> 3.49, NA, 4.97, 4.16, 5.22, 4.14, 6.70, ~
## $ UrineVol1        <int> 352, NA, 281, 139, 30, 202, 77, NA, 39, ~
## $ UrineFlow1       <dbl> NA, NA, 0.415, 1.078, 0.476, 0.563, 0.09~
## $ UrineVol2        <int> NA, NA, NA, NA, 246, NA, NA, NA, NA, NA,~
## $ UrineFlow2       <dbl> NA, NA, NA, NA, 2.51, NA, NA, NA, NA, NA~
## $ Diabetes         <fct> No, No, No, No, Yes, No, No, No, No, No,~
## $ DiabetesAge      <int> NA, NA, NA, NA, 56, NA, NA, NA, NA, NA, ~
## $ HealthGen        <fct> Good, NA, Vgood, NA, Fair, Good, Good, N~
## $ DaysPhysHlthBad  <int> 0, NA, 2, NA, 20, 2, 0, NA, NA, 0, NA, 0~
## $ DaysMentHlthBad  <int> 15, NA, 0, NA, 25, 14, 10, NA, NA, 0, NA~
## $ LittleInterest   <fct> Most, NA, NA, NA, Most, None, Several, N~
## $ Depressed        <fct> Several, NA, NA, NA, Most, Most, Several~
## $ nPregnancies     <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA, NA~
## $ nBabies          <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA, NA~
## $ Age1stBaby       <int> NA, NA, NA, NA, NA, NA, 27, NA, NA, NA, ~
## $ SleepHrsNight    <int> 4, NA, 8, NA, 4, 4, 8, NA, NA, 6, NA, 9,~
## $ SleepTrouble     <fct> Yes, NA, No, NA, No, No, Yes, NA, NA, No~
## $ PhysActive       <fct> No, NA, Yes, NA, No, Yes, No, NA, NA, Ye~
## $ PhysActiveDays   <int> NA, NA, 5, NA, NA, 2, NA, NA, NA, 4, NA,~
## $ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ TVHrsDayChild    <int> NA, 4, NA, 1, NA, NA, NA, NA, 1, NA, 3, ~
## $ CompHrsDayChild  <int> NA, 1, NA, 1, NA, NA, NA, NA, 0, NA, 0, ~
## $ Alcohol12PlusYr  <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, NA, Y~
## $ AlcoholDay       <int> NA, NA, NA, NA, NA, 19, 2, NA, NA, 1, NA~
## $ AlcoholYear      <int> 0, NA, NA, NA, 0, 48, 20, NA, NA, 52, NA~
## $ SmokeNow         <fct> No, NA, NA, NA, Yes, No, Yes, NA, NA, No~
## $ Smoke100         <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, NA, ~
## $ SmokeAge         <int> 18, NA, NA, NA, 16, 15, 38, NA, NA, 16, ~
## $ Marijuana        <fct> Yes, NA, NA, NA, NA, Yes, Yes, NA, NA, N~
## $ AgeFirstMarij    <int> 17, NA, NA, NA, NA, 10, 18, NA, NA, NA, ~
## $ RegularMarij     <fct> No, NA, NA, NA, NA, Yes, No, NA, NA, NA,~
## $ AgeRegMarij      <int> NA, NA, NA, NA, NA, 12, NA, NA, NA, NA, ~
## $ HardDrugs        <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, NA, N~
## $ SexEver          <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, NA, ~
## $ SexAge           <int> 16, NA, NA, NA, 15, 9, 12, NA, NA, NA, N~
## $ SexNumPartnLife  <int> 8, NA, NA, NA, 4, 10, 10, NA, NA, NA, NA~
## $ SexNumPartYear   <int> 1, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA,~
## $ SameSex          <fct> No, NA, NA, NA, No, No, Yes, NA, NA, NA,~
## $ SexOrientation   <fct> Heterosexual, NA, NA, NA, NA, Heterosexu~
## $ WTINT2YR         <dbl> 80100.544, 53901.104, 13953.078, 11664.8~
## $ WTMEC2YR         <dbl> 81528.772, 56995.035, 14509.279, 12041.6~
## $ SDMVPSU          <int> 1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2~
## $ SDMVSTRA         <int> 83, 79, 84, 86, 75, 88, 85, 86, 88, 77, ~
## $ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
# get a feeling for variables
# categorical
levels(nhanes$Depressed)
## [1] "None"    "Several" "Most"
table(nhanes$Depressed, useNA = "ifany")
## 
##    None Several    Most    <NA> 
##    7926    1774     814    9779
# numerical
summary(nhanes$BMI) # missing values are reported automatically as NA's
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12.40   19.79   24.92   25.65   30.10   84.87    2279
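If you want to go one step further than looking at one variable at a time, the same base R tools combine nicely. The sketch below uses a small made-up data frame (toy, with invented values) instead of the real data so that it runs without any packages; the column names only mimic the ones above.

```r
# toy data: hypothetical values, only the column names mimic the real data
toy <- data.frame(
  Depressed = factor(c("None", "Several", NA, "Most", "None"),
                     levels = c("None", "Several", "Most")),
  BMI       = c(32.2, 27.5, NA, 18.2, 42.4)
)

# categorical: shares instead of counts, missing values shown as their own group
dep_share <- prop.table(table(toy$Depressed, useNA = "ifany"))
dep_share

# all columns at once: number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# numerical by group: mean BMI within each level of Depressed
bmi_by_dep <- tapply(toy$BMI, toy$Depressed, mean, na.rm = TRUE)
bmi_by_dep
```

To try this on the real data, replace toy with nhanes. Note that tapply() silently drops rows where the grouping variable is missing.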