By the end of this week you should be able to:
Understand and explain the effects of uncontrolled confounding, and the concept of its control by holding extraneous factors constant
Formulate a multiple linear regression model and interpret its parameters
Formulate and test hypotheses based on linear combinations of regression parameters
Use residuals to test multiple linear regression assumptions
This week’s learning activities include:
| Learning Activity | Learning objectives |
|---|---|
| Lecture 1 | 1 |
| Lecture 2 | 2 |
| Reading | 1, 2 |
| Lecture 3 | 3 |
| Independent exercises | 4 |
Until this week we have focussed on regression between an outcome and a single covariate, known as simple linear regression. This week we introduce the concept of multiple linear regression, where the outcome is regressed on more than one covariate. The motivating reason for multiple linear regression that we present here is the adjustment for confounding by other factors. However, as we will discover in subsequent weeks, multiple linear regression is a powerful tool that makes regression analysis much more adaptable and gives it greater predictive power.
This week begins with a brief recap on the topic of confounding in the lecture below.
In this video we introduce the multiple linear regression model, where multiple covariates are included. We also look at an example of this implemented in Stata and R and interpret the multiple linear regression output.
Stata instructions
R instructions
Just a general note for R users: care must be taken when using the anova command. If a multiple linear regression model has been fitted using the lm command, we cannot simply apply anova to that model, as it will use the wrong type of sums of squares and give the wrong p-values, e.g.
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
anova(lm.multiple)
In this situation, we would need to use the “Anova” command from the car package:
library(car)
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
Anova(lm.multiple, type="III")
and this will give you the correct result. The type = “III” argument refers to the types of sums of squares. (There are three types: Type I, Type II, Type III).
For models fitted using the lm command, you should only use anova with a lowercase “a” to compare two models, as in anova(model1, model2).
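For example, here is a minimal sketch of this legitimate use (assuming hers_subset has already been read in, as in the R code further down this page), testing whether the four physact coefficients are jointly zero:
# Fit nested models with and without physact; anova(model1, model2)
# then gives an F test for the coefficients that differ between them
# (here, the four physact coefficients)
lm.reduced <- lm(SBP ~ age + BMI + drinkany, data = hers_subset)
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
anova(lm.reduced, lm.multiple)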
This reading supplements the above two lectures by providing some examples of confounding, and how this is adjusted for in multiple linear regression.
This reading reinforces the content from Lecture 2 on the important output generated from multiple linear regressions including: the variance of regression coefficients, confidence intervals, and measures of goodness of fit with R squared.
This reading supplements the lecture video by describing how categorical variables are included in multiple linear regression - particularly when those categorical variables have more than 2 categories.
We frequently wish to make inferences on linear combinations of regression coefficients, particularly when dealing with categorical variables with multiple categories/levels. For categorical variables, the regression coefficients represent the mean difference between one of the groups and the reference category. In this section we learn how to make different comparisons (i.e., not just with the reference category) - any comparison that is a linear combination of the regression coefficients.
Let us return to the regression from the video above, using the hers_subset data from the Heart and Estrogen/progestin study (HERS). In the video we looked at a regression of systolic blood pressure (SBP) on age, BMI, alcohol consumption (drinkany), and physical activity (physact). We obtain the following regression results:
Stata code
"hers_subset.csv", clear
import delimited
encode drinkany, generate(drink_fact)
encode physact, generate(phys_fact)
codebook phys_fact
reg sbp age bmi i.drink_fact i.phys_fact
## (encoding automatically selected: ISO-8859-1)
## (38 vars, 276 obs)
##
##
##
##
## -------------------------------------------------------------------------------
## phys_fact (unlabeled)
## -------------------------------------------------------------------------------
##                   Type: Numeric (long)
##                  Label: phys_fact
##
##                  Range: [1,5]                 Units: 1
##          Unique values: 5                     Missing .: 0/276
##
##             Tabulation: Freq.   Numeric  Label
##                            91         1  about as active
##                            21         2  much less active
##                            35         3  much more active
##                            42         4  somewhat less active
##                            87         5  somewhat more active
##
##       Source |       SS           df       MS      Number of obs   =       276
## -------------+----------------------------------   F(7, 268)       =      2.72
##        Model |  7440.99084         7  1062.99869   Prob > F        =    0.0097
##     Residual |  104826.237       268  391.142677   R-squared       =    0.0663
## -------------+----------------------------------   Adj R-squared   =    0.0419
##        Total |  112267.228       275  408.244466   Root MSE        =    19.777
##
## ------------------------------------------------------------------------------
## sbp | Coefficient Std. err. t P>|t| [95% conf. interval]
## -------------+----------------------------------------------------------------
## age | .6807476 .1903908 3.58 0.000 .3058957 1.0556
## bmi | .3459596 .2241932 1.54 0.124 -.0954443 .7873636
## |
## drink_fact |
## yes | -3.717148 2.453773 -1.51 0.131 -8.548272 1.113975
## |
## phys_fact |
##  much less.. |  -3.092056   4.861608    -0.64   0.525    -12.66386    6.479747
##  much more.. |  -5.483479   3.954334    -1.39   0.167    -13.26899    2.302031
##  somewhat .. |  -1.574858   3.734678    -0.42   0.674    -8.927898    5.778182
##  somewhat .. |   -.081981   3.004154    -0.03   0.978    -5.996725    5.832763
##              |
##        _cons |   84.16584   14.86302     5.66   0.000     54.90271     113.429
## ------------------------------------------------------------------------------
R code
hers_subset <- read.csv("hers_subset.csv", stringsAsFactors = T)
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
summary(lm.multiple)
##
## Call:
## lm(formula = SBP ~ age + BMI + drinkany + physact, data = hers_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.190 -14.050 -2.257 13.858 61.159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.16584 14.86302 5.663 3.83e-08 ***
## age 0.68075 0.19039 3.576 0.000415 ***
## BMI 0.34596 0.22419 1.543 0.123979
## drinkanyyes -3.71715 2.45377 -1.515 0.130984
## physactmuch less active -3.09206 4.86161 -0.636 0.525309
## physactmuch more active -5.48348 3.95433 -1.387 0.166685
## physactsomewhat less active -1.57486 3.73468 -0.422 0.673593
## physactsomewhat more active -0.08198 3.00415 -0.027 0.978249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.78 on 268 degrees of freedom
## Multiple R-squared: 0.06628, Adjusted R-squared: 0.04189
## F-statistic: 2.718 on 7 and 268 DF, p-value: 0.009724
confint(lm.multiple)
## 2.5 % 97.5 %
## (Intercept) 54.9027060 113.4289776
## age 0.3058957 1.0555995
## BMI -0.0954443 0.7873636
## drinkanyyes -8.5482722 1.1139755
## physactmuch less active -12.6638586 6.4797472
## physactmuch more active -13.2689900 2.3020311
## physactsomewhat less active -8.9278976 5.7781819
## physactsomewhat more active -5.9967253 5.8327632
Note that the physical activity variable consists of 5 categories (About as active, Much less active, Much more active, Somewhat less active, Somewhat more active). Here our reference category (in both Stata and R) is the “About as active” category. The regression coefficients for each of the other 4 physical activity categories represent the mean difference from that category (e.g. “Much less active”) to the reference category (“About as active”), after adjusting for other covariates.
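As an aside, if a different reference category would be more convenient, in R you can re-level the factor before fitting the model. A minimal sketch, assuming physact was read in as a factor and that the level name below matches the data exactly:
# Hypothetical re-levelling: make "much more active" the reference category
hers_subset$physact <- relevel(hers_subset$physact, ref = "much more active")
lm.relevel <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
summary(lm.relevel)  # coefficients now contrast each group with "much more active"
In Stata, the equivalent is the ib#. base-level prefix, e.g. ib3.phys_fact.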
Now, suppose we wish to make some specific pairwise comparisons that are not captured by the comparison with the reference category. For example, perhaps we wish to compare the difference between the means of the “much less” category and the “much more” category of the physical activity (physact) variable. Let’s think about what this comparison would be in terms of the regression coefficients of this regression equation, using the following acronyms:
MLA - Much less active
MMA - Much more active
SLA - Somewhat less active
SMA - Somewhat more active
Here our regression equation is:
\[ E[SBP] = \beta_0 + \beta_1 \text{age} + \beta_2 \text{BMI} + \beta_3 \text{drinkany} + \beta_4 \text{MLA} + \beta_5 \text{MMA} + \beta_6 \text{SLA} + \beta_7 \text{SMA}\]
Now, given that
\[\beta_4 = \text{mean of MLA} - \text{mean of reference category}\]
and
\[\beta_5 = \text{mean of MMA} - \text{mean of reference category}\]
then it follows that
\[\beta_4 - \beta_5 = \text{mean of MLA} - \text{mean of MMA}\]
Therefore, calculating \(\beta_4 - \beta_5\) will give us the desired mean difference between the much less active group and the much more active group, after adjusting for age, BMI and alcohol consumption. We can of course do this manually from the regression output; however, we save a lot of time by doing it in Stata or R, as those packages will also automatically calculate P-values and confidence intervals for the associated comparison.
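For example, reading the estimates off the regression output above:
\[\hat{\beta}_4 - \hat{\beta}_5 = -3.092 - (-5.483) = 2.391 \text{ mmHg}\]
which matches the estimate produced by lincom and glht below.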
Stata code
In Stata, we do this with the “lincom” command, specifying each level of the physical activity variable that we wish to compare as its level number followed by a “.”. For the comparison above, the “much less active” group is level 2 of the physical activity variable and the “much more active” group is level 3 (you can check the encoding using the “codebook phys_fact” command). So the Stata code and output would be:
"hers_subset.csv", clear
import delimited encode drinkany, generate(drink_fact)
encode physact, generate(phys_fact)
reg sbp age bmi i.drink_fact i.phys_fact
lincom 2.phys_fact - 3.phys_fact
## (encoding automatically selected: ISO-8859-1)
## (38 vars, 276 obs)
##
##
##
##       Source |       SS           df       MS      Number of obs   =       276
## -------------+----------------------------------   F(7, 268)       =      2.72
##        Model |  7440.99084         7  1062.99869   Prob > F        =    0.0097
##     Residual |  104826.237       268  391.142677   R-squared       =    0.0663
## -------------+----------------------------------   Adj R-squared   =    0.0419
##        Total |  112267.228       275  408.244466   Root MSE        =    19.777
##
## ------------------------------------------------------------------------------
## sbp | Coefficient Std. err. t P>|t| [95% conf. interval]
## -------------+----------------------------------------------------------------
## age | .6807476 .1903908 3.58 0.000 .3058957 1.0556
## bmi | .3459596 .2241932 1.54 0.124 -.0954443 .7873636
## |
## drink_fact |
## yes | -3.717148 2.453773 -1.51 0.131 -8.548272 1.113975
## |
## phys_fact |
##  much less.. |  -3.092056   4.861608    -0.64   0.525    -12.66386    6.479747
##  much more.. |  -5.483479   3.954334    -1.39   0.167    -13.26899    2.302031
##  somewhat .. |  -1.574858   3.734678    -0.42   0.674    -8.927898    5.778182
##  somewhat .. |   -.081981   3.004154    -0.03   0.978    -5.996725    5.832763
##              |
##        _cons |   84.16584   14.86302     5.66   0.000     54.90271     113.429
##
## ------------------------------------------------------------------------------
##
##
## ( 1) 2.phys_fact - 3.phys_fact = 0
##
## ------------------------------------------------------------------------------
## sbp | Coefficient Std. err. t P>|t| [95% conf. interval]
## -------------+----------------------------------------------------------------
##          (1) |   2.391424   5.577128     0.43   0.668    -8.589133    13.37198
## ------------------------------------------------------------------------------
R code
In R, we do this calculation by first specifying a matrix which designates the comparison we would like to make (this is known as the “contrast matrix”). The matrix here must have the same number of columns as the number of regression coefficients in our regression equation - in this example there are 8 (\(\beta_0\) to \(\beta_7\)). We would like to make a subtraction between \(\beta_4\) and \(\beta_5\) corresponding to the fifth and sixth regression coefficients (i.e., the 5th and 6th columns of our matrix).
So our comparison matrix is defined as comparison <- matrix(c(0,0,0,0,1,-1,0,0), nrow=1) (the 1 and -1 mark the columns we are interested in, with the minus sign representing the subtraction). We then use the glht command from the multcomp library to calculate this linear combination, and use the summary and confint commands to output the P-value and confidence interval.
library(multcomp) #for the glht() function
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
hers_subset <- read.csv("hers_subset.csv", stringsAsFactors = T)
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
comparison <- matrix(c(0,0,0,0,1,-1,0,0), nrow=1)
lincom <- glht(lm.multiple, linfct = comparison)
summary(lincom)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = SBP ~ age + BMI + drinkany + physact, data = hers_subset)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 2.391 5.577 0.429 0.668
## (Adjusted p values reported -- single-step method)
confint(lincom)
##
## Simultaneous Confidence Intervals
##
## Fit: lm(formula = SBP ~ age + BMI + drinkany + physact, data = hers_subset)
##
## Quantile = 1.9689
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## 1 == 0 2.3914 -8.5891 13.3720
In both Stata and R, we observe that the “much less” active group has a mean SBP that is 2.39 mmHg greater than the “much more” active group (95% CI 8.59 mmHg lower to 13.37 mmHg higher), corresponding to no evidence of a difference (P = 0.67). (Note that the 95% CI goes from negative to positive in the output.)
Alternatively, you could interpret this as the “much more” active group having a mean SBP that is 2.39 mmHg lower than the “much less” active group (95% CI 13.37 mmHg lower to 8.59 mmHg higher), again corresponding to no evidence of a difference (P = 0.67).
In Week 2 we investigated how residuals can be used to check assumptions 1-3 of linear regression. These checks carry over to multiple linear regression largely unchanged, as described below (a short R sketch of these diagnostic plots follows the list).
Linearity. A residual versus fitted plot is still an excellent tool for assessing linearity in multiple linear regression. If there are concerns with this plot, the assumption can be further investigated with a residual versus predictor plot for each covariate in the regression. Remember, this is only useful for continuous covariates and does not need checking for categorical covariates.
Homoscedasticity (constant variance). A residual versus fitted plot is still an excellent tool for assessing homoscedasticity in multiple linear regression. If there are concerns with this plot, the assumption can be further investigated with a residual versus predictor plot for each covariate in the regression. For continuous covariates, this is best shown with a scatter plot. For categorical variables, a boxplot is best for comparing the variance across categories.
Normality. A normal quantile plot of the residuals, or a histogram of the residuals, is still an excellent tool for assessing the normality of the residuals.
Independence of observations. Remember that this is usually investigated by reviewing the study design.
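As a concrete illustration, here is a minimal R sketch of these residual checks for the model fitted above (the covariates plotted are just examples):
mf <- model.frame(lm.multiple)  # the data actually used in the fit

# Residual versus fitted plot: assesses linearity and homoscedasticity
plot(fitted(lm.multiple), resid(lm.multiple),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Residual versus predictor plot for a continuous covariate (e.g. BMI)
plot(mf$BMI, resid(lm.multiple), xlab = "BMI", ylab = "Residuals")

# Boxplot of residuals by a categorical covariate: compares variance across groups
boxplot(resid(lm.multiple) ~ mf$physact, xlab = "physact", ylab = "Residuals")

# Normality: normal quantile plot and histogram of the residuals
qqnorm(resid(lm.multiple)); qqline(resid(lm.multiple))
hist(resid(lm.multiple), xlab = "Residuals", main = "")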
Continuing on with the hers_subset example: write down the regression equation for a regression with an outcome of body mass index (BMI), and with age and physical activity (physact) as covariates. Interpret each parameter in this equation.
Carry out this regression and report on the key findings. Make sure you write a paragraph interpreting the regression coefficients, give the 95%CI and p-values (use the “Introduction to multiple linear regression” Stata or R demonstration to help with the interpretation).
Finally, express the following comparisons in terms of the regression coefficients of your equation above, and calculate them using Stata or R:
The mean difference between much more active, and much less active
The mean difference between much more active, and somewhat more active
[Challenge question] The mean difference between the more active groups (somewhat and much more active combined), and the less active groups (somewhat and much less active combined). Hint: we would like you to use contrasts here (rather than creating any new variables for physact).
The P-value for a group of regression coefficients can be calculated with an F test or likelihood ratio test. This is important for categorical variables with more than 2 categories, as it provides a single P-value for that variable that does not depend on the choice of reference category.
Linear combinations of regression coefficients can be calculated with the lincom command in Stata and the glht command from the multcomp package in R (using the contrast matrix).
Checking model assumptions can be done in largely the same way for multiple linear regression as for simple linear regression. Residual versus predictor plots are an additional plot you can use to investigate deviations from linearity or homoscedasticity.