Chapter 9 Regression model

9.1 Simple linear regression

The basic regression, i.e. linear model (lm), formula in R is: lm(formula = y ~ x, data = data_frame_name).

Let’s run a basic model to model BMI as a function of the Poverty indicator:

# call model
lm(BMI ~ Poverty, data = nhanes)

## 
## Call:
## lm(formula = BMI ~ Poverty, data = nhanes)
## 
## Coefficients:
## (Intercept)      Poverty  
##     25.2187       0.1856

Always save the model object - it allows you to access more of the outputs that are produced automatically when running the regression. If you only need the coefficients, call the ‘coef’ function

model_bmi_pov <- lm(BMI ~ Poverty, data = nhanes)

coef(model_bmi_pov)

## (Intercept)     Poverty 
##  25.2186624   0.1856308

Call the ‘summary’ function on the model to see not only coefficients, but also the R-squared value, the associated standard error and p-value for each coefficient, the adjusted and the residual standard error.

summary(model_bmi_pov)

## 
## Call:
## lm(formula = BMI ~ Poverty, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.547  -5.896  -0.757   4.467  58.887 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.21866    0.10389 242.738  < 2e-16 ***
## Poverty      0.18563    0.03747   4.954 7.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.784 on 16426 degrees of freedom
##   (3865 observations deleted due to missingness)
## Multiple R-squared:  0.001492,   Adjusted R-squared:  0.001431 
## F-statistic: 24.54 on 1 and 16426 DF,  p-value: 7.36e-07

Inspecting model outputs: Once we have run a model, we often want to explore the fitted values and residuals (remember mean of resid = 0 and mean of fitted values = mean of actual y data). They are stored as part of the model object an can easily be accessed individually:

#resid
residuals(model_bmi_pov)[1:5]

##          1          2          3          4          5 
##   6.748880 -10.117287  -3.640044  -7.149023  17.043252

# fitted values
fitted.values(model_bmi_pov)[1:5]

##        1        2        3        4        5 
## 25.47112 25.41729 25.64004 25.36902 25.34675

Explore all that is saved with your model To have a look at all the different aspects that are saved as part of your model, scroll down to ‘Value’ section for list of all things part of your model:

?lm()

9.2 Multiple linear regression

Bulding on the simple linear regression formula, the mulitple linear regression follows suit: lm(formula = y ~ x1 + x2 + x3…, data = data_frame_name)

# 1. save model - allows you to access more of the outputs
model_mult <- lm(BMI ~ Poverty + PhysActiveDays + SleepHrsNight + Diabetes + SmokeNow + Gender, data = adults)

# 2. if you only need the coefficients, call the 'coef' function
coef(model_mult)

##    (Intercept)        Poverty PhysActiveDays  SleepHrsNight 
##     31.5376229     -0.1918165     -0.1567075     -0.2390670 
##    DiabetesYes    SmokeNowYes     Gendermale 
##      2.9612169     -1.3255010     -0.8920421

# 3. Or call 'summary' for a complete output
summary(model_mult)

## 
## Call:
## lm(formula = BMI ~ Poverty + PhysActiveDays + SleepHrsNight + 
##     Diabetes + SmokeNow + Gender, data = adults)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.682  -4.064  -0.695   3.216  33.250 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    31.53762    0.78469  40.191  < 2e-16 ***
## Poverty        -0.19182    0.08205  -2.338 0.019500 *  
## PhysActiveDays -0.15671    0.06908  -2.268 0.023412 *  
## SleepHrsNight  -0.23907    0.09518  -2.512 0.012091 *  
## DiabetesYes     2.96122    0.41912   7.065 2.22e-12 ***
## SmokeNowYes    -1.32550    0.28006  -4.733 2.37e-06 ***
## Gendermale     -0.89204    0.26941  -3.311 0.000946 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.803 on 1954 degrees of freedom
##   (10430 observations deleted due to missingness)
## Multiple R-squared:  0.04844,    Adjusted R-squared:  0.04552 
## F-statistic: 16.58 on 6 and 1954 DF,  p-value: < 2.2e-16

9.3 Lunch Break

Try to answer the following questions drawing on what you have learned about creating tables, plots and regression analysis.

Is the age in months variable actually the age in months, or just derived from age in years?
How is age related to sleep?
Which of these variables is most strongly associated with BMI in adults: age, poverty, or number of hours sleep? 3.b And how much of the variance in BMI can be explained through these variables?

If you’d like to go a bit further: Explore the dataset that we have provided to you and see if you can find variables that you are interested in. Try making a histogram, boxplot, scatter plot. Next pick a continuous variable that could contribute to overall reported XXX and run a linear model. Then plot the model (i.e. a scatter plot and a line of best fit). Also report how much of the variance in XXX is explained through your predictor variable.

9.4 ggplot cheat sheet reminder:

If you’re lost or want to explore more options for plotting charts, you can consult a ggplot cheat sheet, e.g. https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf

9.5 Regression output inspection (cont.)

In the ‘tidy’ universe of R there are a few functions that make your life easier & prettier (after a steeper learning curve). In the broom package there are a few helpful functions that give you objects that you can not only read, but transform or save as neat csv files.

The ‘tidy’ function gives you a data frame of the coefficients (& confidence intervals if you so choose) at the confidence level of your liking.

# 1. save model - allows you to access all model outputs
model_mult <- lm(BMI ~ Poverty + PhysActiveDays + SleepHrsNight + Diabetes + SmokeNow + Gender, data = adults)

# 2. Create tidy output
broom::tidy(model_mult, conf.int = FALSE, 
            conf.level = 0.95)

## # A tibble: 7 x 5
##   term           estimate std.error statistic   p.value
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      31.5      0.785      40.2  6.09e-258
## 2 Poverty          -0.192    0.0821     -2.34 1.95e-  2
## 3 PhysActiveDays   -0.157    0.0691     -2.27 2.34e-  2
## 4 SleepHrsNight    -0.239    0.0952     -2.51 1.21e-  2
## 5 DiabetesYes       2.96     0.419       7.07 2.22e- 12
## 6 SmokeNowYes      -1.33     0.280      -4.73 2.37e-  6
## 7 Gendermale       -0.892    0.269      -3.31 9.46e-  4

The ‘glance’ function gives you the model-level metrics:

# Create tidy output
broom::glance(model_mult)

## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>
## 1    0.0484        0.0455  5.80      16.6 1.00e-18     6 -6227.
## # ... with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
## #   df.residual <int>, nobs <int>

Now we can save the ‘tidy’ output to easily further manipulate it or save it as a csv file.

# Store tidy regression model output as object in R
model_mult_tidy <- broom::tidy(model_mult, conf.int = FALSE, 
                                  conf.level = 0.95)

# Save as a csv file at your chosen location (in "outputs" folder) and name in a meaningful way (e.g. BMI_model_output_p0.05.csv)

readr::write_csv(model_mult_tidy, file = "./outputs/BMI_model_output_p0.05.csv")

9.5.1 Try this for yourselves

Using either your last regression model from the lunch break or a new one, play around with the broom package and try writing the outputs that you are interested in. Try including and excluding the confidence intervals and using different significance levels.