1. Understand and explain the effects of uncontrolled confounding, and the concept of its control by holding extraneous factors constant
2. Formulate a multiple linear regression model and interpret its parameters
3. Formulate and test hypotheses based on linear combinations of regression parameters
4. Use residuals to test multiple linear regression assumptions
Learning activities
This week’s learning activities include:
Learning Activity | Learning objectives
Lecture 1 | 1
Lecture 2 | 2
Reading | 1, 2
Lecture 3 | 3
Independent exercises | 4
Introduction to confounding
Until this week we have focussed on regression between an outcome and a single covariate, known as simple linear regression. This week we introduce the concept of multiple linear regression, where the outcome is regressed on more than one covariate. The motivating reason for multiple linear regression that we present here is the adjustment for confounding by other factors. However, as we will discover in subsequent weeks, multiple linear regression is a powerful tool that makes regression analysis much more adaptable and gives it greater predictive power.
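To see what this adjustment achieves, here is a minimal simulated sketch in R (hypothetical data, not from the course datasets): age confounds the association between exercise and systolic blood pressure, and including age in the model removes the spurious association.

# Simulated confounding: age drives both exercise and SBP
set.seed(42)
n        <- 500
age      <- rnorm(n, mean = 65, sd = 8)
exercise <- 10 - 0.1 * age + rnorm(n)            # older people exercise less
sbp      <- 100 + 0.8 * age + rnorm(n, sd = 5)   # exercise has no true effect on SBP

coef(lm(sbp ~ exercise))        # unadjusted: spurious association induced by age
coef(lm(sbp ~ exercise + age))  # adjusted: exercise coefficient close to its true value, 0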
This week begins with a brief recap on the topic of confounding in the lecture below.
Introduction to multiple linear regression
In this video we introduce the multiple linear regression model, where multiple covariates are included. We also look at an example of this implemented in Stata and R and interpret the multiple linear regression output.
Stata instructions
R instructions
Book Chapter 4. Linear regression to 4.2.1.1 (pages 69-73).
This reading supplements the above two lectures by providing some examples of confounding, and how this is adjusted for in multiple linear regression.
Book Chapter 4. 4.2.2 to 4.2.3 (pages 73-75).
This reading reinforces the content from Lecture 2 on the important output generated from multiple linear regressions including: the variance of regression coefficients, confidence intervals, and measures of goodness of fit with R squared.
Book Chapter 4. 4.3 to 4.3.2 (pages 76-81).
This reading supplements the lecture video by describing how categorical variables are included in multiple linear regression - particularly when those categorical variables have more than 2 categories.
Linear combinations of regression coefficients
We frequently wish to make inferences about linear combinations of regression coefficients, particularly when dealing with categorical variables. For categorical variables, the regression coefficients represent the mean difference between each group and the reference category. In this section we learn how to make other comparisons, not just with the reference category - any comparison that can be expressed as a linear combination of the regression coefficients.
Let us return to the regression from the video above, using the hers_subset data from the Heart and Estrogen/progestin Study (HERS). In the video we looked at a regression of systolic blood pressure (SBP) on age, BMI, alcohol consumption (drinkany), and physical activity (physact). After encoding the string variables (an explicit step in Stata; in R, reading them in as factors achieves the same), we obtain the regression results with the following R code:
hers_subset <- read.csv("hers_subset.csv", stringsAsFactors = TRUE)
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
summary(lm.multiple)
confint(lm.multiple)
Note that the physical activity variable consists of 5 categories (About as active, Much less active, Much more active, Somewhat less active, Somewhat more active). Here our reference category (in both Stata and R) is the “About as active” category. The regression coefficient for each of the other 4 physical activity categories represents the mean difference between that category (e.g. “Much less active”) and the reference category (“About as active”), after adjusting for the other covariates.
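If you want to check (or change) which category R treats as the reference, here is a minimal sketch, assuming hers_subset has been read in with stringsAsFactors = TRUE as above and that the category labels match those listed in the text:

# Check the reference category of physact
levels(hers_subset$physact)   # the first level listed is the reference ("About as active" here)

# To use a different reference you could relevel, e.g.:
# hers_subset$physact <- relevel(hers_subset$physact, ref = "Much more active")
# (left commented out so the examples below keep "About as active" as the reference)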
Now, suppose we wish to make some specific pairwise comparisons that are not captured by the comparison with the reference category. For example, perhaps we wish to compare the means of the “much less active” and “much more active” categories of the physical activity (physact) variable. Let’s think about what this comparison would be in terms of the regression coefficients, using the following acronyms:
MLA - Much less active
MMA - Much more active
SLA - Somewhat less active
SMA - Somewhat more active
Here our regression equation is:

$$\text{SBP} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{BMI} + \beta_3\,\text{drinkany} + \beta_4\,\text{MLA} + \beta_5\,\text{MMA} + \beta_6\,\text{SLA} + \beta_7\,\text{SMA}$$

Now, given that

$$E[\text{SBP} \mid \text{MLA}] = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{BMI} + \beta_3\,\text{drinkany} + \beta_4$$

and

$$E[\text{SBP} \mid \text{MMA}] = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{BMI} + \beta_3\,\text{drinkany} + \beta_5$$

then it follows that

$$E[\text{SBP} \mid \text{MLA}] - E[\text{SBP} \mid \text{MMA}] = \beta_4 - \beta_5$$

Therefore, a calculation of $\beta_4 - \beta_5$ will give us the desired mean difference between the much less active group and the much more active group, after adjusting for age, BMI and alcohol consumption. We can of course do this manually from the regression output, but we save a lot of time by doing it in Stata or R, as those packages will also automatically calculate P-values and confidence intervals for the associated comparison.
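For the manual route, here is a minimal sketch in R, assuming the lm.multiple fit from above (the coefficient positions follow the equation order, so elements 5 and 6 are $\beta_4$ and $\beta_5$):

b <- coef(lm.multiple)
unname(b[5] - b[6])   # point estimate of beta4 - beta5 (much less vs much more active)

# standard error of the linear combination c'b, with c = (0,0,0,0,1,-1,0,0)
cvec <- c(0, 0, 0, 0, 1, -1, 0, 0)
sqrt(t(cvec) %*% vcov(lm.multiple) %*% cvec)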
Stata code
In Stata, we do this with the “lincom” command, specifying the levels of the physical activity category that we wish to compare with a numeral followed by a “.”. For the comparison above, the “much less active” group is level 2 of the physical activity variable, and the “much more active” group is level 3. (You can check the encoding using the “codebook physact” command.) So the Stata code would be along the lines of the sketch below.
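A sketch of the commands (variable names assumed to match the R code above; the original output is not reproduced here, but it matches the R results below):

* fit the regression, then the linear combination of physact levels 2 and 3
regress SBP age BMI drinkany i.physact
lincom 2.physact - 3.physact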
R code
In R, we do this calculation by first specifying a matrix which designates the comparison we would like to make (this is known as the “contrast matrix”). The matrix must have the same number of columns as the number of regression coefficients in our regression equation - in this example there are 8 ($\beta_0$ to $\beta_7$). We would like to make a subtraction between $\beta_4$ and $\beta_5$, corresponding to the fifth and sixth regression coefficients (i.e., the 5th and 6th columns of our matrix).
So our comparison is defined as comparison <- matrix(c(0,0,0,0,1,-1,0,0), nrow=1) (the 1 and -1 mark the coefficients we are interested in, with the minus sign giving the subtraction). We then use the glht command from the multcomp library to calculate this linear combination, and use the summary and confint commands to output the P-value and confidence interval.
library(multcomp) # for the glht() function
hers_subset <- read.csv("hers_subset.csv", stringsAsFactors = TRUE)
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
comparison <- matrix(c(0, 0, 0, 0, 1, -1, 0, 0), nrow = 1)
lincom <- glht(lm.multiple, linfct = comparison)
summary(lincom)

##   Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = SBP ~ age + BMI + drinkany + physact, data = hers_subset)
##
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)
## 1 == 0    2.391      5.577   0.429    0.668
## (Adjusted p values reported -- single-step method)

confint(lincom)

##   Simultaneous Confidence Intervals
##
## Fit: lm(formula = SBP ~ age + BMI + drinkany + physact, data = hers_subset)
##
## Quantile = 1.9689
## 95% family-wise confidence level
##
## Linear Hypotheses:
##        Estimate lwr     upr
## 1 == 0   2.3914  -8.5891 13.3720
In both Stata and R, we observe that the “much less active” group has a mean SBP that is 2.39 mmHg greater than the “much more active” group (95% CI 8.59 mmHg lower to 13.37 mmHg higher), corresponding to no evidence for a difference (P = 0.67). (Note that the 95% CI goes from negative to positive in the output.)
Alternatively, you could interpret this as the “much more active” group having a mean SBP that is 2.39 mmHg lower than the “much less active” group (95% CI 13.37 mmHg lower to 8.59 mmHg higher), corresponding to no evidence for a difference (P = 0.67).
Model checking for multiple linear regression
In Week 2 we investigated how residuals can be used to check assumptions 1-3 of linear regression. These checks remain fit for purpose for multiple linear regression, as described below (an R sketch of the diagnostic plots follows this list).
Linearity. A residual versus fitted plot is still an excellent tool for assessing linearity in multiple linear regression. If there are concerns with this plot, the assumption can be further investigated with a residual versus predictor plot for each covariate in the regression. Remember, this is only useful for continuous covariates and does not need checking for categorical covariates.
Homoscedasticity (constant variance). A residual versus fitted plot is still an excellent tool for assessing homoscedasticity in multiple linear regression. If there are concerns with this plot, the assumption can be further investigated with a residual versus predictor plot for each covariate in the regression. For continuous covariates, this is best shown with a scatter plot. For categorical variables, a boxplot is best for comparing the variance across categories.
Normality. A normal quantile plot of the residuals, or a histogram of the residuals, is still an excellent tool for assessing the normality of the residuals.
Independence of observations. Remember that this is usually investigated by reviewing the study design.
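Here is a minimal sketch of these diagnostics in R, assuming the lm.multiple fit from above (the choice of predictors to plot is illustrative):

# Residual diagnostics for the fitted model
res <- resid(lm.multiple)
fit <- fitted(lm.multiple)
mf  <- model.frame(lm.multiple)   # the rows actually used in the fit

plot(fit, res, xlab = "Fitted values", ylab = "Residuals")       # linearity and constant variance
abline(h = 0, lty = 2)

plot(mf$BMI, res, xlab = "BMI", ylab = "Residuals")              # residual vs continuous predictor
boxplot(res ~ mf$physact, xlab = "physact", ylab = "Residuals")  # variance across categories

qqnorm(res); qqline(res)                                         # normality of residuals
hist(res, main = "", xlab = "Residuals")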
Independent exercise
Continuing with the hers_subset example, write down the regression equation for a regression with body mass index (BMI) as the outcome, and age and physical activity (physact) as covariates. Interpret each parameter in this equation.
Carry out this regression and report on the key findings. Make sure you write a paragraph interpreting the regression coefficients, giving the 95% CIs and P-values (use the “Introduction to multiple linear regression” Stata or R demonstration to help with the interpretation).
Finally, express the following comparisons in terms of the regression coefficients of your equation above, and calculate them using Stata or R:
The mean difference between much more active and much less active
The mean difference between much more active and somewhat more active
[Challenge question] The mean difference between the more active groups (somewhat more and much more active combined), and the less active groups (somewhat less and much less active combined). Hint: we would like you to use contrasts here (rather than creating any new variables for physact).
Summary
This week’s key concepts are:
Multiple linear regression is the natural extension to simple linear regression where more than one covariate is included as an independent variable. The formula for a multiple linear regression is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

$\beta_0$ still represents the intercept, the estimated mean outcome when all covariates ($x$’s) equal zero.
$\beta_j$ still represents the mean change in the outcome for a one unit increase in $x_j$. For categorical variables, this represents the mean change from group $j$ to the reference category.
The P-value for a group of regression coefficients can be calculated with an F test or likelihood ratio test. This is important for categorical variables with more than 2 categories, as it provides a single P-value for that variable that does not depend on the choice of reference category (a sketch of this test follows this summary).
Linear combinations of regression coefficients can be calculated with the lincom command in Stata and the glht command from the multcomp package in R (using the contrast matrix).
Checking model assumptions can be done in largely the same way for multiple linear regression as for simple linear regression. Residual versus predictor plots are an additional plot you can use to investigate deviations from linearity or homoscedasticity.
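A minimal sketch of the F test in R, assuming the hers_subset objects from above (if any covariate has missing values, refit both models on the same rows first so that they are comparable):

# F test for the physact coefficients as a group: compare models with and without physact
lm.reduced <- lm(SBP ~ age + BMI + drinkany, data = hers_subset)
anova(lm.reduced, lm.multiple)   # one P-value for physact, independent of the reference category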