1. Understand and explain the effects of uncontrolled confounding, and the concept of its control by holding extraneous factors constant
2. Formulate a multiple linear regression model and interpret its parameters
3. Formulate and test hypotheses based on linear combinations of regression parameters
4. Use residuals to test multiple linear regression assumptions
Learning activities
This week’s learning activities include:
| Learning Activity | Learning objectives |
|---|---|
| Lecture 1 | 1 |
| Lecture 2 | 2 |
| Reading | 1, 2 |
| Lecture 3 | 3 |
| Independent exercises | 4 |
Introduction to confounding
Until this week we have focussed on regression between an outcome and a single covariate, known as simple linear regression. This week we introduce multiple linear regression, where the outcome is regressed on more than one covariate. The motivating reason for multiple linear regression presented here is the adjustment for confounding by other factors. However, as we will discover in subsequent weeks, the inclusion of multiple covariates is a powerful tool that makes regression analysis much more adaptable and gives it greater predictive power.
This week begins with a brief recap on confounding in the lecture below.
In this video we introduce the multiple linear regression model, where multiple covariates are included. We also look at an example implemented in Stata and R, and interpret the multiple linear regression output.
Chapter 4. Linear regression to 4.2.1.1 (pages 69-73).
This reading supplements the above two lectures by providing some examples of confounding and how it is adjusted for in multiple linear regression.
Chapter 4. 4.2.2 to 4.2.3 (pages 73-75).
This reading reinforces the content from lecture 2 on important output generated from multiple linear regressions, including the variance of regression coefficients, confidence intervals, and measures of goodness of fit such as R-squared.
Chapter 4. 4.3 to 4.3.2 (pages 76-81).
This reading supplements the lecture video by describing how categorical variables are included in multiple linear regression, particularly when those categorical variables have more than two categories.
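As a brief illustration of the idea, the sketch below (in R, assuming the hers_subset data used in the example that follows) shows how a categorical variable with more than two categories is expanded into a set of indicator variables, one for each non-reference category:

```r
# Inspect the design matrix R builds for a categorical covariate:
# physact (5 categories) becomes 4 indicator (dummy) columns, with the
# first level acting as the reference category
hers_subset <- read.csv("hers_subset.csv")
head(model.matrix(~ physact, data = hers_subset))
```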
Linear combinations of regression coefficients
Particularly with categorical variables of more than two categories, we frequently wish to make inferences about linear combinations of regression coefficients. For example, with categorical variables, the regression coefficients represent the mean difference between one of the groups and the reference category. In this section we learn how to make other comparisons: any comparison that is a linear combination of the regression coefficients. Let us return to the regression from the recorded video earlier, using the hers_subset data from the Heart and Estrogen/progestin Replacement Study (HERS). In the video we looked at a regression of systolic blood pressure (SBP) on age, BMI, alcohol consumption (drinkany), and physical activity (physact), with the following regression results:
Stata code
```stata
use hersdata, clear
set seed 90896
sample 10
reg SBP age BMI i.drinkany i.physact
```

```
. use hersdata, clear

. set seed 90896

. sample 10
(2,487 observations deleted)

. reg SBP age BMI i.drinkany i.physact

      Source |       SS           df       MS      Number of obs   =       276
-------------+----------------------------------   F(7, 268)       =      2.72
       Model |  7440.99084         7  1062.99869   Prob > F        =    0.0097
    Residual |  104826.237       268  391.142677   R-squared       =    0.0663
-------------+----------------------------------   Adj R-squared   =    0.0419
       Total |  112267.228       275  408.244466   Root MSE        =    19.777

---------------------------------------------------------------------------------------
                  SBP | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
----------------------+----------------------------------------------------------------
                  age |   .6807476   .1903908     3.58   0.000     .3058957      1.0556
                  BMI |   .3459596   .2241932     1.54   0.124    -.0954443    .7873636
                      |
             drinkany |
                  yes |  -3.717148   2.453773    -1.51   0.131    -8.548272    1.113975
                      |
              physact |
 somewhat less active |   1.517198    5.29391     0.29   0.775    -8.905744    11.94014
      about as active |   3.092056   4.861608     0.64   0.525    -6.479747    12.66386
 somewhat more active |   3.010075   4.977285     0.60   0.546    -6.789479    12.80963
     much more active |  -2.391424   5.577128    -0.43   0.668    -13.37198    8.589133
                      |
                _cons |   81.07379    15.1853     5.34   0.000     51.17613    110.9714
---------------------------------------------------------------------------------------
```
R code
```r
hers_subset <- read.csv("hers_subset.csv")
lm.multiple <- lm(SBP ~ age + BMI + drinkany + physact, data = hers_subset)
summary(lm.multiple)
confint(lm.multiple)
```

```
## 
## Call:
## lm(formula = SBP ~ age + BMI + drinkany + physact, data = hers_subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.190 -14.050  -2.257  13.858  61.159 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 84.16584   14.86302   5.663 3.83e-08 ***
## age                          0.68075    0.19039   3.576 0.000415 ***
## BMI                          0.34596    0.22419   1.543 0.123979    
## drinkanyyes                 -3.71715    2.45377  -1.515 0.130984    
## physactmuch less active     -3.09206    4.86161  -0.636 0.525309    
## physactmuch more active     -5.48348    3.95433  -1.387 0.166685    
## physactsomewhat less active -1.57486    3.73468  -0.422 0.673593    
## physactsomewhat more active -0.08198    3.00415  -0.027 0.978249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.78 on 268 degrees of freedom
## Multiple R-squared:  0.06628, Adjusted R-squared:  0.04189 
## F-statistic: 2.718 on 7 and 268 DF,  p-value: 0.009724
## 
##                                    2.5 %      97.5 %
## (Intercept)                   54.9027060 113.4289776
## age                            0.3058957   1.0555995
## BMI                           -0.0954443   0.7873636
## drinkanyyes                   -8.5482722   1.1139755
## physactmuch less active      -12.6638586   6.4797472
## physactmuch more active      -13.2689900   2.3020311
## physactsomewhat less active   -8.9278976   5.7781819
## physactsomewhat more active   -5.9967253   5.8327632
```
Now suppose we wish to make some specific pairwise comparisons that are not captured by comparisons with the reference category. For example, perhaps we wish to compare the difference between the means of the “much less active” category and the “much more active” category. Let’s think about what this comparison would be in terms of the regression coefficients of this regression equation, using the acronyms MLA for the much less active group and MMA for the much more active group. If
\[\beta_4 = \text{mean of MLA} - \text{mean of reference category}\]
and
\[\beta_5 = \text{mean of MMA} - \text{mean of reference category}\]
then it follows that
\[\beta_4 - \beta_5 = \text{mean of MLA} - \text{mean of MMA}\]
Therefore calculating \(\beta_4 - \beta_5\) will give us the desired mean difference between the much less active group and the much more active group, after adjusting for age, BMI, and alcohol consumption. We can of course do this manually from the regression output; however, we save a lot of time if we do this in Stata or R, as those packages will also automatically calculate P-values and confidence intervals for the associated comparison.
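As a manual check using the estimates from the R output above (where the reference category is “about as active”, so \(\hat{\beta}_4\) and \(\hat{\beta}_5\) are the coefficients for the much less active and much more active groups):
\[\hat{\beta}_4 - \hat{\beta}_5 = -3.092 - (-5.483) = 2.391 \text{ mmHg}\]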
In Stata, we do this with the lincom command, specifying each level of the physical activity variable with its numeral followed by a “.”. For the comparison above, the much less active group is level 1 of the physical activity variable (the reference category), and the much more active group is level 5. So the Stata code and output would be:
```stata
use hersdata, clear
set seed 90896
sample 10
reg SBP age BMI i.drinkany i.physact
lincom 1.physact - 5.physact
```

```
. lincom 1.physact - 5.physact

 ( 1)  1b.physact - 5.physact = 0

------------------------------------------------------------------------------
         SBP | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   2.391424   5.577128     0.43   0.668    -8.589133    13.37198
------------------------------------------------------------------------------
```
In R, we do this calculation by first specifying a matrix that designates the comparison we would like to make. The matrix must have the same number of columns as the number of regression coefficients in our regression equation - in this example 8 (\(\beta_0\) to \(\beta_7\)). We would like to subtract \(\beta_5\) from \(\beta_4\), corresponding to the fifth and sixth regression coefficients, so our comparison matrix is defined as comparison <- matrix(c(0,0,0,0,1,-1,0,0), nrow=1). We then use the glht function from the multcomp package to calculate this linear combination, and the summary and confint functions to output the P-value and confidence interval.
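The following is a minimal sketch of this calculation, assuming the lm.multiple model object fitted above (the name lincom.result is illustrative only):

```r
# Linear combinations of coefficients with glht() from multcomp
library(multcomp)

# Columns correspond to the 8 coefficients of lm.multiple: (Intercept),
# age, BMI, drinkanyyes, then physact (much less, much more, somewhat
# less, somewhat more). This row computes beta4 - beta5, i.e. the mean
# difference between the much less active and much more active groups.
comparison <- matrix(c(0, 0, 0, 0, 1, -1, 0, 0), nrow = 1)

lincom.result <- glht(lm.multiple, linfct = comparison)
summary(lincom.result)   # estimate, standard error, and P-value
confint(lincom.result)   # 95% confidence interval
```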
In both Stata and R, we observe that the much more active group has a mean SBP 2.39 mmHg lower than the much less active group (95% CI: 8.59 mmHg higher to 13.37 mmHg lower), corresponding to no evidence of a difference (P = 0.67).
Model checking for multiple linear regression
In week 2 we investigated how residuals can be used to check assumptions 1-3 of linear regression. These checks remain fit for purpose for multiple linear regression, as described below.
Linearity. A residual versus fitted plot is still an excellent tool for assessing linearity in multiple linear regression. If there are concerns with this plot, the assumption can be further investigated with a residual versus predictor plot for each covariate in the regression. Remember, this is only useful for continuous covariates and does not need checking for categorical covariates.
Homoscedasticity (constant variance). A residual versus fitted plot is still an excellent tool for assessing homoscedasticity in multiple linear regression. If there are concerns with this plot, the assumption can be further investigated with a residual versus predictor plot for each covariate in the regression. For continuous covariates, this is best shown with a scatter plot. For categorical variables, a boxplot is best for comparing the variance across categories.
Normality. A normal quantile plot of the residuals, or a histogram of the residuals, is still an excellent tool for assessing the normality of the residuals. A sketch of these checks in R is shown below.
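The following is a minimal sketch of these checks in R, assuming the lm.multiple model fitted in this week's example (the names mf and r are illustrative only):

```r
# Residual-based checks of the multiple linear regression assumptions
mf <- model.frame(lm.multiple)   # the rows actually used in the fit
r  <- resid(lm.multiple)

par(mfrow = c(2, 2))

# Linearity and homoscedasticity: residuals versus fitted values
plot(fitted(lm.multiple), r, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Residual versus predictor plot for a continuous covariate
plot(mf$age, r, xlab = "Age", ylab = "Residuals")

# Variance across the categories of a categorical covariate
boxplot(r ~ mf$physact, xlab = "Physical activity", ylab = "Residuals")

# Normality: normal quantile plot of the residuals
qqnorm(r); qqline(r)
```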
Independent exercise
Continuing with the hers_subset example, write down the regression equation for a regression with an outcome of body mass index, and with age and physical activity (physact) as covariates. Interpret each parameter in this equation.
Carry out this regression and report on the key findings.
Finally, express the following comparisons in terms of the regression coefficients of your equation above, and calculate them using Stata or R:
The mean difference between much more active and much less active
The mean difference between much more active and somewhat more active
[Challenge question] The mean difference between the more active groups (somewhat more and much more active combined) and the less active groups (somewhat less and much less active combined)
Summary
This week's key concepts are:
Multiple linear regression is the natural extension of simple linear regression where more than one covariate is included as an independent variable. The formula for a multiple linear regression is \[Y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \dots + \epsilon_i\]
\(\beta_0\) still represents the intercept, the estimated mean outcome when all covariates (\(x\)’s) equal zero.
\(\beta_i\) still represents the mean change in the outcome for a one unit increase in \(x_i\), holding the other covariates constant. For categorical variables, this represents the mean difference between group \(x_i\) and the reference category
The statistical evidence (P-value) for a group of regression coefficients can be calculated with an F test or likelihood ratio test. This is important for categorical variables with more than two categories, as it provides a single P-value for that categorical variable that does not depend on the choice of reference category (see the sketch after this list)
Linear combinations of regression coefficients can be calculated with the lincom command in Stata and the glht command from the multcomp package in R.
Checking model assumptions can be done in largely the same way for multiple linear regression as for simple linear regression. Residual versus predictor plots are an additional plot you can use to investigate deviations from linearity or homoscedasticity.
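For the F test mentioned above, a minimal sketch in R (assuming the lm.multiple fit from this week's example, with no missing values so both models use the same observations) is to compare nested models with and without the categorical variable:

```r
# Single P-value for physact as a whole: F test comparing nested models
lm.reduced <- lm(SBP ~ age + BMI + drinkany, data = hers_subset)
anova(lm.reduced, lm.multiple)
```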