5.5 Fitting the MLR model
This section demonstrates how to fit a MLR model in R. For comparison, we first look at the unadjusted models we fit in Chapter 4 (each only had one predictor variable). Second, we will fit a model adjusted for confounding due to other predictors. Since, in this chapter, we removed cases with missing values on any of a set of variables, the unadjusted results here differ slightly from those in Chapter 4.
5.5.1 Unadjusted
The unadjusted results are shown in Tables 5.1 and 5.2.
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
(Intercept) | 3.204 | 2.574, 3.833 | <0.001 |
Waist Circumference (cm) | 0.0288 | 0.0227, 0.0350 | <0.001 |
1 CI = Confidence Interval |
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
(Intercept) | 5.96 | 5.81, 6.10 | <0.001 |
Smoking Status | 0.002 | ||
Never | — | — | |
Past | 0.451 | 0.192, 0.711 | <0.001 |
Current | 0.240 | -0.072, 0.552 | 0.132 |
1 CI = Confidence Interval |
If just using the unadjusted models, we would conclude the following:
- There is a significant positive unadjusted association between fasting glucose and waist circumference (B = 0.029; 95% CI = 0.023, 0.035; p <.001).
- There is a significant association between fasting glucose and smoking status (p = .002). Past Smokers had significantly greater fasting glucose than Never Smokers (B = 0.451; 95% CI = 0.192, 0.711; p <.001). See Section 4.4 for how to re-level smoking in order to compare Past vs. Current.
However, as this is observational data, we should be wary of drawing conclusions from unadjusted models.
5.5.2 Adjusted
To fit an adjusted model, include multiple predictors in lm()
separated by +
signs. Then, just as with a SLR model, use summary()
, confint()
, and car::Anova(, type = 3)
to summarize the results.
fit.ex5.1 <- lm(LBDGLUSI ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDR + race_eth + income, data = nhanesf.complete)
summary(fit.ex5.1)
##
## Call:
## lm(formula = LBDGLUSI ~ BMXWAIST + smoker + RIDAGEYR + RIAGENDR +
## race_eth + income, data = nhanesf.complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.694 -0.749 -0.234 0.290 13.033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.97301 0.37665 7.89 0.0000000000000091 ***
## BMXWAIST 0.02451 0.00306 8.00 0.0000000000000040 ***
## smokerPast 0.21121 0.12515 1.69 0.09185 .
## smokerCurrent 0.09776 0.15063 0.65 0.51650
## RIDAGEYR 0.02505 0.00331 7.58 0.0000000000000926 ***
## RIAGENDRFemale -0.32795 0.10514 -3.12 0.00188 **
## race_ethNon-Hispanic White -0.50864 0.14524 -3.50 0.00049 ***
## race_ethNon-Hispanic Black -0.24764 0.19919 -1.24 0.21411
## race_ethNon-Hispanic Other 0.01644 0.21618 0.08 0.93938
## income$25,000 to <$55,000 -0.10038 0.16273 -0.62 0.53753
## income$55,000+ -0.09381 0.14724 -0.64 0.52421
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.51 on 846 degrees of freedom
## Multiple R-squared: 0.171, Adjusted R-squared: 0.161
## F-statistic: 17.4 on 10 and 846 DF, p-value: <0.0000000000000002
## 2.5 % 97.5 %
## (Intercept) 2.23373 3.71228
## BMXWAIST 0.01850 0.03052
## smokerPast -0.03443 0.45685
## smokerCurrent -0.19789 0.39342
## RIDAGEYR 0.01856 0.03154
## RIAGENDRFemale -0.53431 -0.12158
## race_ethNon-Hispanic White -0.79371 -0.22356
## race_ethNon-Hispanic Black -0.63860 0.14331
## race_ethNon-Hispanic Other -0.40786 0.44075
## income$25,000 to <$55,000 -0.41978 0.21903
## income$55,000+ -0.38281 0.19519
## Anova Table (Type III tests)
##
## Response: LBDGLUSI
## Sum Sq Df F value Pr(>F)
## (Intercept) 142 1 62.30 0.0000000000000091 ***
## BMXWAIST 146 1 64.02 0.0000000000000040 ***
## smoker 7 2 1.44 0.23638
## RIDAGEYR 131 1 57.42 0.0000000000000926 ***
## RIAGENDR 22 1 9.73 0.00188 **
## race_eth 39 3 5.68 0.00075 ***
## income 1 2 0.23 0.79082
## Residuals 1929 846
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 5.3 combines all of the output into a single table.
library(gtsummary)
TABLE <- fit.ex5.1 %>%
tbl_regression(intercept = T,
estimate_fun = function(x) style_sigfig(x, digits = 3),
pvalue_fun = function(x) style_pvalue(x, digits = 3),
label = list(BMXWAIST ~ "Waist Circumference (cm)",
smoker ~ "Smoking Status",
RIDAGEYR ~ "Age (years)",
RIAGENDR ~ "Gender",
race_eth ~ "Race/Ethnicity",
income ~ "Annual Income")) %>%
add_global_p(keep = T) %>%
modify_caption("Linear regression results for fasting glucose (mmol/L) vs.
waist circumference (cm) and smoking, adjusted for demographics")
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
(Intercept) | 2.97 | 2.23, 3.71 | <0.001 |
Waist Circumference (cm) | 0.025 | 0.018, 0.031 | <0.001 |
Smoking Status | 0.236 | ||
Never | — | — | |
Past | 0.211 | -0.034, 0.457 | 0.092 |
Current | 0.098 | -0.198, 0.393 | 0.516 |
Age (years) | 0.025 | 0.019, 0.032 | <0.001 |
Gender | 0.002 | ||
Male | — | — | |
Female | -0.328 | -0.534, -0.122 | 0.002 |
Race/Ethnicity | <0.001 | ||
Hispanic | — | — | |
Non-Hispanic White | -0.509 | -0.794, -0.224 | <0.001 |
Non-Hispanic Black | -0.248 | -0.639, 0.143 | 0.214 |
Non-Hispanic Other | 0.016 | -0.408, 0.441 | 0.939 |
Annual Income | 0.791 | ||
<$25,000 | — | — | |
$25,000 to <$55,000 | -0.100 | -0.420, 0.219 | 0.538 |
$55,000+ | -0.094 | -0.383, 0.195 | 0.524 |
1 CI = Confidence Interval |
How do we interpret these regression coefficients? The regression equation is as follows.
\[\begin{array}{rcl} \textrm{FG} & = & \beta_0 \\ & + & \beta_1 \textrm{WC} \\ & + & \beta_2 I(\textrm{Smoker = Past}) + \beta_3 I(\textrm{Smoker = Current}) \\ & + & \beta_4 \textrm{Age} \\ & + & \beta_5 I(\textrm{Gender = Female}) \\ & + & \beta_6 I(\textrm{Race = Non-Hispanic White}) + \beta_7 I(\textrm{Race = Non-Hispanic Black}) \\ & + & \beta_8 I(\textrm{Race = Non-Hispanic Other}) \\ & + & \beta_9 I(\textrm{Income = \$25,000 to <\$55,000}) + \beta_{10} I(\textrm{Income = \$55,000+}) + \epsilon \end{array}\]
The intercept is the mean outcome when all continuous predictors are 0 and all categorical predictors are at their reference level.
- In this example, the intercept (2.97 mmol/L) corresponds to the estimated fasting glucose for individuals with a waist circumference of 0 cm who have never smoked, are age = 0 years, male, Hispanic, and have <$25,000 annual household income. Waist circumference = 0 cm is, of course, not possible; also, the model was fit to data on adults so age = 0 years is far beyond the range of ages analyzed. Had we centered waist circumference and age, then the intercept would be more interpretable (see Section 4.3.4). The lack of centering here does not invalidate the model, it only makes it so the intercept does not correspond to a prediction for a plausible combination of predictor values.
For a continuous predictor, the regression coefficient is the estimated difference in the outcome associated with a 1-unit difference in the predictor, when holding all other predictors fixed.
- When holding all other predictors fixed, individuals with 1 cm greater waist circumference are estimated to have, on average, 0.025 mmol/L greater fasting glucose. “When holding all other predictors fixed” means that this statement is the same for any combination of the other predictors; it does not matter at what values you hold the other predictors fixed. Another way to say this is that individuals with the same smoking status, age, gender, race/ethnicity, and household income who differ in waist circumference by 1 cm differ in average fasting glucose by 0.025 mmol/L.
- The estimated difference in the outcome associated with an other than 1-unit difference is obtained by multiplying the coefficient by the units. For example, when holding all other predictors fixed, individuals with 5 cm greater waist circumference are estimated to have, on average, 5 \(\times\) 0.025 = 0.125 mmol/L greater fasting glucose.
For a categorical predictor with \(L\) levels, there will be \(L-1\) regression coefficients, each representing the difference in mean outcome between individuals at that level and individuals at the reference level, when holding all other predictors fixed.
- Comparing individuals who have the same values for all predictors except smoking status, past smokers have an average fasting glucose that is 0.211 mmol/L greater than never smokers.
The summaries below compare the results for each of waist circumference and smoking status before and after adjusting for the other predictors.
Both before (B = 0.029; 95% CI = 0.023, 0.035; p <.001) and after adjusting for confounding (B = 0.025; 95% CI = 0.018, 0.031; p <.001), we found a significant association between fasting glucose and waist circumference. Additionally, the magnitude of the association did not change much, indicating that there was not much confounding.
Before adjusting for confounding, we found a significant association between fasting glucose and smoking status (p = .002). However, after adjusting for waist circumference, age, gender, race/ethnicity, and income, there is no significant association between fasting glucose and smoking status (p = .236). Additionally, after adjusting for confounding, the magnitude of the regression coefficients were strongly attenuated (much closer to 0). Thus, for smoking status, failure to adjust for confounding was strongly biasing the results.