By the end of this week you should be able to:
Understand and explain the effects of uncontrolled confounding, and the concept of its control by holding extraneous factors constant
Formulate a multiple linear regression model and interpret its parameters
Formulate and test hypotheses based on linear combinations of regression parameters
Use residuals to test multiple linear regression assumptions
This week’s learning activities include:
| Learning Activity | Learning objectives |
|---|---|
| Lecture 1 | 1 |
| Lecture 2 | 2 |
| Reading | 1, 2 |
| Lecture 3 | 3 |
| Independent exercises | 4 |
Until this week we have focussed on regression between an outcome and a single covariate, known as simple linear regression. This week we introduce the concept of multiple linear regression, where the outcome is regressed on more than one covariate. The motivating reason for multiple linear regression that we present here is the adjustment for confounding by other factors. However, as we will discover in subsequent weeks, multiple linear regression is a powerful tool that makes regression analysis much more adaptable and gives it greater predictive power.
This week begins with a brief recap on the topic of confounding in the lecture below.
In this video we introduce the multiple linear regression model, where multiple covariates are included. We also look at an example of this implemented in Stata and R and interpret the multiple linear regression output.
Stata instructions
R instructions
Just a general note for R users: care must be taken when using the anova command. If a multiple linear regression model has been fitted using the lm command, we cannot simply apply anova to that model, as it will use the wrong type of sums of squares and give the wrong p-values, e.g.
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
anova(lm.multiple)
In this situation, we would need to use the “Anova” command from the car package:
library(car)
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
Anova(lm.multiple, type="III")
and this will give you the correct result. The type = “III” argument refers to the types of sums of squares. (There are three types: Type I, Type II, Type III).
For models fitted using the lm command, you should only use anova with a lowercase “a” to compare two models, as in anova(model1, model2).
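For example, here is a minimal sketch of this legitimate use (assuming hers_subset has already been read in, as in the R code further down this page), testing whether the four physact coefficients are jointly zero:
# Fit nested models with and without physact; anova(model1, model2)
# then gives an F test for the coefficients that differ between them
# (here, the four physact coefficients)
lm.reduced <- lm(SBP ~ age + BMI + drinkany, data = hers_subset)
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
anova(lm.reduced, lm.multiple)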
This reading supplements the above two lectures by providing some examples of confounding, and how this is adjusted for in multiple linear regression.
This reading reinforces the content from Lecture 2 on the important output generated from multiple linear regressions including: the variance of regression coefficients, confidence intervals, and measures of goodness of fit with R squared.
This reading supplements the lecture video by describing how categorical variables are included in multiple linear regression - particularly when those categorical variables have more than 2 categories.
We frequently wish to make inferences on linear combinations of regression coefficients, particularly when dealing with categorical variables with multiple categories/levels. For categorical variables, the regression coefficients represent the mean difference between one of the groups and the reference category. In this section we learn how to make different comparisons (i.e., not just with the reference category) - any comparison that is a linear combination of the regression coefficients.
Let us return to the regression from the video above, using the hers_subset data from the Heart and Estrogen/progestin study (HERS). In the video we looked at a regression of systolic blood pressure (SBP) on age, BMI, alcohol consumption (drinkany), and physical activity (physact). We obtain the following regression results:
Stata code
"hers_subset.csv", clear
import delimited
encode drinkany, generate(drink_fact)
encode physact, generate(phys_fact)
codebook phys_fact
reg sbp age bmi i.drink_fact i.phys_fact
## (encoding automatically selected: ISO-8859-1)
## (38 vars, 276 obs)
##
##
##
##
## -------------------------------------------------------------------------------
## phys_fact (unlabeled)
## -------------------------------------------------------------------------------
##                   Type: Numeric (long)
##                  Label: phys_fact
##
##                  Range: [1,5]                 Units: 1
##          Unique values: 5                     Missing .: 0/276
##
##             Tabulation: Freq.   Numeric  Label
##                            91         1  about as active
##                            21         2  much less active
##                            35         3  much more active
##                            42         4  somewhat less active
##                            87         5  somewhat more active
##
##       Source |       SS           df       MS      Number of obs   =       276
## -------------+----------------------------------   F(7, 268)       =      2.72
##        Model |  7440.99084         7  1062.99869   Prob > F        =    0.0097
##     Residual |  104826.237       268  391.142677   R-squared       =    0.0663
## -------------+----------------------------------   Adj R-squared   =    0.0419
##        Total |  112267.228       275  408.244466   Root MSE        =    19.777
##
## ------------------------------------------------------------------------------
## sbp | Coefficient Std. err. t P>|t| [95% conf. interval]
## -------------+----------------------------------------------------------------
## age | .6807476 .1903908 3.58 0.000 .3058957 1.0556
## bmi | .3459596 .2241932 1.54 0.124 -.0954443 .7873636
## |
## drink_fact |
## yes | -3.717148 2.453773 -1.51 0.131 -8.548272 1.113975
## |
## phys_fact |
##  much less.. |  -3.092056   4.861608    -0.64   0.525    -12.66386    6.479747
##  much more.. |  -5.483479   3.954334    -1.39   0.167    -13.26899    2.302031
##  somewhat .. |  -1.574858   3.734678    -0.42   0.674    -8.927898    5.778182
##  somewhat .. |   -.081981   3.004154    -0.03   0.978    -5.996725    5.832763
##              |
##        _cons |   84.16584   14.86302     5.66   0.000     54.90271     113.429
## ------------------------------------------------------------------------------
R code
hers_subset <- read.csv("hers_subset.csv", stringsAsFactors = T)
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
summary(lm.multiple)
##
## Call:
## lm(formula = SBP ~ age + BMI + drinkany + physact, data = hers_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.190 -14.050 -2.257 13.858 61.159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.16584 14.86302 5.663 3.83e-08 ***
## age 0.68075 0.19039 3.576 0.000415 ***
## BMI 0.34596 0.22419 1.543 0.123979
## drinkanyyes -3.71715 2.45377 -1.515 0.130984
## physactmuch less active -3.09206 4.86161 -0.636 0.525309
## physactmuch more active -5.48348 3.95433 -1.387 0.166685
## physactsomewhat less active -1.57486 3.73468 -0.422 0.673593
## physactsomewhat more active -0.08198 3.00415 -0.027 0.978249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.78 on 268 degrees of freedom
## Multiple R-squared: 0.06628, Adjusted R-squared: 0.04189
## F-statistic: 2.718 on 7 and 268 DF, p-value: 0.009724
confint(lm.multiple)
## 2.5 % 97.5 %
## (Intercept) 54.9027060 113.4289776
## age 0.3058957 1.0555995
## BMI -0.0954443 0.7873636
## drinkanyyes -8.5482722 1.1139755
## physactmuch less active -12.6638586 6.4797472
## physactmuch more active -13.2689900 2.3020311
## physactsomewhat less active -8.9278976 5.7781819
## physactsomewhat more active -5.9967253 5.8327632
Note that the physical activity variable consists of 5 categories (About as active, Much less active, Much more active, Somewhat less active, Somewhat more active). Here our reference category (in both Stata and R) is the “About as active” category. The regression coefficients for each of the other 4 physical activity categories represent the mean difference from that category (e.g. “Much less active”) to the reference category (“About as active”), after adjusting for other covariates.
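As an aside, if a different reference category would be more convenient, in R you can re-level the factor before fitting the model. A minimal sketch, assuming physact was read in as a factor and that the level name below matches the data exactly:
# Hypothetical re-levelling: make "much more active" the reference category
hers_subset$physact <- relevel(hers_subset$physact, ref = "much more active")
lm.relevel <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
summary(lm.relevel)  # coefficients now contrast each group with "much more active"
In Stata, the equivalent is the ib#. base-level prefix, e.g. ib3.phys_fact.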
Now, suppose we wish to make some specific pairwise comparisons that are not captured by the comparison with the reference category. For example, perhaps we wish to compare the difference between the means of the “much less” category and the “much more” category of the physical activity (physact) variable. Let’s think about what this comparison would be in terms of the regression coefficients of this regression equation, using the following acronyms:
MLA - Much less active
MMA - Much more active
SLA - Somewhat less active
SMA - Somewhat more active
Here our regression equation is:
\[ E[SBP] = \beta_0 + \beta_1 \text{age} + \beta_2 \text{BMI} + \beta_3 \text{drinkany} + \beta_4 \text{MLA} + \beta_5 \text{MMA} + \beta_6 \text{SLA} + \beta_7 \text{SMA}\]
Now, given that
\[\beta_4 = \text{mean of MLA} - \text{mean of reference category}\]
and
\[\beta_5 = \text{mean of MMA} - \text{mean of reference category}\]
then it follows that
\[\beta_4 - \beta_5 = \text{mean of MLA} - \text{mean of MMA}\]
Therefore, calculating \(\beta_4 - \beta_5\) will give us the desired mean difference between the much less active group and the much more active group, after adjusting for age, BMI and alcohol consumption. We can of course do this manually from the regression output; however, we save a lot of time by doing it in Stata or R, as those packages will also automatically calculate P-values and confidence intervals for the associated comparison.
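For example, reading the estimates off the regression output above:
\[\hat{\beta}_4 - \hat{\beta}_5 = -3.092 - (-5.483) = 2.391 \text{ mmHg}\]
which matches the estimate produced by lincom and glht below.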
Stata code
In Stata, we do this with the “lincom” command, specifying each level of the physical activity variable that we wish to compare as its level number followed by a “.”. For the comparison above, the “much less active” group is level 2 of the physical activity variable and the “much more active” group is level 3 (you can check the encoding using the “codebook phys_fact” command). So the Stata code and output would be:
"hers_subset.csv", clear
import delimited encode drinkany, generate(drink_fact)
encode physact, generate(phys_fact)
reg sbp age bmi i.drink_fact i.phys_fact
lincom 2.phys_fact - 3.phys_fact
## (encoding automatically selected: ISO-8859-1)
## (38 vars, 276 obs)
##
##
##
##       Source |       SS           df       MS      Number of obs   =       276
## -------------+----------------------------------   F(7, 268)       =      2.72
##        Model |  7440.99084         7  1062.99869   Prob > F        =    0.0097
##     Residual |  104826.237       268  391.142677   R-squared       =    0.0663
## -------------+----------------------------------   Adj R-squared   =    0.0419
##        Total |  112267.228       275  408.244466   Root MSE        =    19.777
##
## ------------------------------------------------------------------------------
## sbp | Coefficient Std. err. t P>|t| [95% conf. interval]
## -------------+----------------------------------------------------------------
## age | .6807476 .1903908 3.58 0.000 .3058957 1.0556
## bmi | .3459596 .2241932 1.54 0.124 -.0954443 .7873636
## |
## drink_fact |
## yes | -3.717148 2.453773 -1.51 0.131 -8.548272 1.113975
## |
## phys_fact |
##  much less.. |  -3.092056   4.861608    -0.64   0.525    -12.66386    6.479747
##  much more.. |  -5.483479   3.954334    -1.39   0.167    -13.26899    2.302031
##  somewhat .. |  -1.574858   3.734678    -0.42   0.674    -8.927898    5.778182
##  somewhat .. |   -.081981   3.004154    -0.03   0.978    -5.996725    5.832763
##              |
##        _cons |   84.16584   14.86302     5.66   0.000     54.90271     113.429
##
## ------------------------------------------------------------------------------
##
##
## ( 1) 2.phys_fact - 3.phys_fact = 0
##
## ------------------------------------------------------------------------------
## sbp | Coefficient Std. err. t P>|t| [95% conf. interval]
## -------------+----------------------------------------------------------------
##          (1) |   2.391424   5.577128     0.43   0.668    -8.589133    13.37198
## ------------------------------------------------------------------------------
R code
In R, we do this calculation by first specifying a matrix which designates the comparison we would like to make (this is known as the “contrast matrix”). The matrix here must have the same number of columns as the number of regression coefficients in our regression equation - in this example there are 8 (\(\beta_0\) to \(\beta_7\)). We would like to make a subtraction between \(\beta_4\) and \(\beta_5\) corresponding to the fifth and sixth regression coefficients (i.e., the 5th and 6th columns of our matrix).
So our comparison matrix is defined as comparison <- matrix(c(0,0,0,0,1,-1,0,0), nrow=1) (the 1 and -1 mark the columns we are interested in, with the minus sign representing the subtraction). We then use the glht command from the multcomp library to calculate this linear combination, and use the summary and confint commands to output the P-value and confidence interval.
library(multcomp) #for the glht() function
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
hers_subset <- read.csv("hers_subset.csv", stringsAsFactors = T)
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
comparison <- matrix(c(0,0,0,0,1,-1,0,0), nrow=1)
lincom <- glht(lm.multiple, linfct = comparison)
summary(lincom)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = SBP ~ age + BMI + drinkany + physact, data = hers_subset)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 2.391 5.577 0.429 0.668
## (Adjusted p values reported -- single-step method)
confint(lincom)
##
## Simultaneous Confidence Intervals
##
## Fit: lm(formula = SBP ~ age + BMI + drinkany + physact, data = hers_subset)
##
## Quantile = 1.9689
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## 1 == 0 2.3914 -8.5891 13.3720
In both Stata and R, we observe that the “much less” active group has a mean SBP that is 2.39 mmHg greater than the “much more” active group (95% CI 8.59 mmHg lower to 13.37 mmHg higher), corresponding to no evidence of a difference (P = 0.67). (Note that the 95% CI goes from negative to positive in the output.)
Alternatively, you could interpret this as the “much more” active group having a mean SBP that is 2.39 mmHg lower than the “much less” active group (95% CI 13.37 mmHg lower to 8.59 mmHg higher), again corresponding to no evidence of a difference (P = 0.67).
In Week 2 we investigated how residuals can be used to check assumptions 1-3 of linear regression. These checks carry over to multiple linear regression largely unchanged, as described below (a short R sketch of these diagnostic plots follows the list).
Linearity. A residual versus fitted plot is still an excellent tool for assessing linearity in multiple linear regression. If there are concerns with this plot, the assumption can be further investigated with a residual versus predictor plot for each covariate in the regression. Remember, this is only useful for continuous covariates and does not need checking for categorical covariates.
Homoscedasticity (constant variance). A residual versus fitted plot is still an excellent tool for assessing homoscedasticity in multiple linear regression. If there are concerns with this plot, the assumption can be further investigated with a residual versus predictor plot for each covariate in the regression. For continuous covariates, this is best shown with a scatter plot. For categorical variables, a boxplot is best for comparing the variance across categories.
Normality. A normal quantile plot of the residuals, or a histogram of the residuals, is still an excellent tool for assessing the normality of the residuals.
Independence of observations. Remember that this is usually investigated by reviewing the study design.
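As a concrete illustration, here is a minimal R sketch of these residual checks for the model fitted above (the covariates plotted are just examples):
mf <- model.frame(lm.multiple)  # the data actually used in the fit

# Residual versus fitted plot: assesses linearity and homoscedasticity
plot(fitted(lm.multiple), resid(lm.multiple),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Residual versus predictor plot for a continuous covariate (e.g. BMI)
plot(mf$BMI, resid(lm.multiple), xlab = "BMI", ylab = "Residuals")

# Boxplot of residuals by a categorical covariate: compares variance across groups
boxplot(resid(lm.multiple) ~ mf$physact, xlab = "physact", ylab = "Residuals")

# Normality: normal quantile plot and histogram of the residuals
qqnorm(resid(lm.multiple)); qqline(resid(lm.multiple))
hist(resid(lm.multiple), xlab = "Residuals", main = "")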
Continuing on with the hers_subset example: write down the regression equation for a regression with an outcome of body mass index (BMI), and with age and physical activity (physact) as covariates. Interpret each parameter in this equation.
Carry out this regression and report on the key findings. Make sure you write a paragraph interpreting the regression coefficients, give the 95%CI and p-values (use the “Introduction to multiple linear regression” Stata or R demonstration to help with the interpretation).
Finally, express the following comparisons in terms of the regression coefficients of your equation above, and calculate them using Stata or R:
The mean difference between much more active, and much less active
The mean difference between much more active, and somewhat more active
[Challenge question] The mean difference between the more active groups (somewhat and much more active combined), and the less active groups (somewhat and much less active combined). Hint: we would like you to use contrasts here (rather than creating any new variables for physact).
The P-value for a group of regression coefficients can be calculated with an F test or likelihood ratio test. This is important for categorical variables with more than 2 categories, as it provides a single P-value for that variable that does not depend on the choice of reference category.
Linear combinations of regression coefficients can be calculated with the lincom command in Stata and the glht command from the multcomp package in R (using the contrast matrix).
Checking model assumptions can be done in largely the same way for multiple linear regression as for simple linear regression. Residual versus predictor plots are an additional plot you can use to investigate deviations from linearity or homoscedasticity.