5.2 Notation and interpretation

The data consist of $n$ independent sets of observed values of an outcome $Y$ and predictors $X_1, X_2, \ldots, X_K$. The data for the $i$th case (or individual, or observation) are denoted $(y_i, x_{i1}, x_{i2}, \ldots, x_{iK})$; each case has an associated outcome value and set of predictor values. Equation (5.1) describes the multiple linear regression model.

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K + \epsilon \tag{5.1}$$

$\beta_0$ is the intercept, the other $\beta$ terms are the effects of the predictors, and $\epsilon$ is the error term, which is assumed to have a normal distribution with the same variance at all predictor values. We denote the assumption about the error term using the notation $\epsilon \sim N(0, \sigma^2)$, which is read “epsilon has a normal distribution with a mean of zero and a variance of sigma squared”. The fact that $\sigma^2$ does not have a subscript $i$ means that all individuals have the same error variance: individuals’ observed values vary about the values predicted by the regression, and how much they vary is assumed not to depend on the characteristics of any individual. The model described by Equation (5.1) assumes that the relationship between $Y$ and each $X_k$ $(k = 1, \ldots, K)$ is linear when all the other predictors are held constant, and that the error term captures random variation between individuals as well as outcome measurement error.
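As an illustration of these assumptions, the following minimal Python sketch simulates data from a model of the form in Equation (5.1) with $K = 2$ and then recovers the coefficients by least squares. Python with NumPy is used purely for illustration. The sample size, coefficient values, error standard deviation, and predictor distributions are all made up and do not come from any dataset in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the sketch is reproducible

n = 5000                            # hypothetical number of independent cases
beta = np.array([2.0, 0.5, -1.0])   # hypothetical beta_0, beta_1, beta_2
sigma = 1.5                         # hypothetical error standard deviation

# Two continuous predictors with arbitrary, made-up distributions
X = np.column_stack([rng.normal(10, 2, n), rng.uniform(0, 5, n)])

# Error term: normal, mean zero, and the SAME sigma for every case
# (no subscript i on sigma, i.e. constant variance)
eps = rng.normal(0.0, sigma, n)

# Outcome generated exactly as the model states
y = beta[0] + X @ beta[1:] + eps

# Least-squares estimates of the coefficients
design = np.column_stack([np.ones(n), X])   # column of ones for the intercept
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# Estimate sigma from the residuals, using n - (K + 1) degrees of freedom
resid = y - design @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - design.shape[1]))

print("estimated coefficients:", beta_hat)
print("estimated sigma:", sigma_hat)
```

With a sample this large, the estimates should land close to the values used to generate the data. This holds only because the simulated data satisfy the model's assumptions by construction.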

For a continuous predictor $X_k$, the corresponding regression coefficient $\beta_k$ is a slope, interpreted as the difference in the mean $Y$ associated with a one-unit difference in $X_k$ when holding all other predictors fixed (or “controlling for” or “adjusted for” the other predictors).

As discussed in Chapter 4, if $X_k$ is a categorical predictor with $L$ levels, then instead of just one corresponding $X$ term in the model, there are $L - 1$ indicator variables. Each of the $L - 1$ corresponding regression coefficients is interpreted as the difference between the mean $Y$ at that level of $X_k$ and the mean $Y$ at the reference level, when holding all other predictors fixed (or “controlling for” or “adjusted for” the other predictors).
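To make the indicator coding concrete, here is a small Python sketch that builds the $L - 1$ indicator columns for a three-level smoking status predictor, with the first level as the reference. The helper `indicator_columns` and the four data values are hypothetical and written only for this illustration. Statistical software normally creates these columns automatically.

```python
def indicator_columns(values, levels):
    """Return L - 1 indicator (0/1) columns for a categorical predictor.

    `levels` lists all L levels; the first one is treated as the reference
    level and gets no column of its own.
    """
    non_reference = levels[1:]
    return {
        level: [1 if v == level else 0 for v in values]
        for level in non_reference
    }


smoker_levels = ["Never", "Past", "Current"]     # L = 3, reference = "Never"
smoker = ["Never", "Current", "Past", "Never"]   # made-up values for four cases

cols = indicator_columns(smoker, smoker_levels)
for level, column in cols.items():
    print(f"I(Smoker = {level}):", column)
# I(Smoker = Past): [0, 0, 1, 0]
# I(Smoker = Current): [0, 1, 0, 0]
```

A case at the reference level has zeros in every indicator column, which is why each coefficient compares its level to the reference level.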

NOTE: The interpretations of the MLR regression coefficients are the same as in SLR except for the addition of “when holding all other predictors fixed”. Imagine two groups of cases that have the same characteristics in all but one respect; they have the same predictor values for all $X$s other than $X_k$ and they differ in $X_k$ by one unit. The model assumes that the mean outcome in these two groups differs by $\beta_k$. If the predictor in which they differ is categorical, then “differ by one unit” corresponds to “one group is at the reference level of $X_k$ and the other is at a non-reference level”.
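The two-group comparison can be checked with simple arithmetic. The Python sketch below uses made-up coefficient values (not estimates from any data) for a model with two continuous predictors and one indicator variable, and computes the model-implied mean outcome for pairs of groups that differ in a single predictor.

```python
import numpy as np

# Hypothetical coefficients: intercept, X1 (continuous), X2 (continuous),
# and an indicator for a non-reference level of a categorical predictor
beta = np.array([3.0, 0.04, 0.02, 0.5])


def mean_outcome(x1, x2, non_reference):
    """Mean of Y implied by the model at the given predictor values."""
    return beta @ np.array([1.0, x1, x2, float(non_reference)])


# Two groups identical except that X1 differs by one unit
diff_x1 = mean_outcome(91.0, 50.0, 0) - mean_outcome(90.0, 50.0, 0)

# The same one-unit comparison at different fixed values of the other predictors
diff_x1_other = mean_outcome(101.0, 30.0, 1) - mean_outcome(100.0, 30.0, 1)

# Two groups identical except one is at the reference level and the other is not
diff_cat = mean_outcome(90.0, 50.0, 1) - mean_outcome(90.0, 50.0, 0)

print(diff_x1, diff_x1_other, diff_cat)   # approximately 0.04, 0.04, 0.5
```

The difference for a one-unit change in the first predictor equals its coefficient no matter where the other predictors are held, because the model in Equation (5.1) contains no interaction terms.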

Example 5.1: In the previous chapter, we estimated the relationship between fasting glucose (FG; mmol/L) (LBDGLUSI) and each of waist circumference (WC; BMXWAIST) and smoking status (smoker; Never, Past, Current) among a subset of adult participants in NHANES 2017-2018. We found that each was significantly associated with fasting glucose. However, we did not adjust for potential confounders of these relationships. In this chapter, we will use MLR to adjust each for confounding due to the other, as well as due to age (RIDAGEYR), gender (RIAGENDR; Male, Female), race/ethnicity (RIDRETH3; Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian, Other/Multi), and income (<$25,000, $25,000 to <$55,000, $55,000+).

Based on Equation (5.1), the MLR model for Example 5.1 is written as follows.

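$$
\begin{aligned}
\text{FG} = {} & \beta_0 + \beta_1 \text{WC} + \beta_2 I(\text{Smoker = Past}) + \beta_3 I(\text{Smoker = Current}) \\
& + \beta_4 \text{Age} + \beta_5 I(\text{Gender = Female}) \\
& + \beta_6 I(\text{Race = Other Hispanic}) + \beta_7 I(\text{Race = Non-Hispanic White}) \\
& + \beta_8 I(\text{Race = Non-Hispanic Black}) + \beta_9 I(\text{Race = Non-Hispanic Asian}) \\
& + \beta_{10} I(\text{Race = Other/Multi}) \\
& + \beta_{11} I(\text{Income = \$25,000 to $<$\$55,000}) + \beta_{12} I(\text{Income = \$55,000+}) + \epsilon
\end{aligned}
$$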

For each categorical predictor, the first level is left out, as it is the reference level.
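As a bookkeeping check on the model for Example 5.1, the following Python sketch lists the model terms implied by the predictors and levels named in the example, dropping the first (reference) level of each categorical predictor. The data structure and term labels are an illustrative convention, not output from any statistical package.

```python
# Predictors in the order they appear in the model for Example 5.1.
# None marks a continuous predictor; a list gives the levels of a
# categorical predictor, with the first level as the reference.
predictors = [
    ("WC", None),
    ("Smoker", ["Never", "Past", "Current"]),
    ("Age", None),
    ("Gender", ["Male", "Female"]),
    ("Race", ["Mexican American", "Other Hispanic", "Non-Hispanic White",
              "Non-Hispanic Black", "Non-Hispanic Asian", "Other/Multi"]),
    ("Income", ["<$25,000", "$25,000 to <$55,000", "$55,000+"]),
]

terms = ["Intercept"]
for name, levels in predictors:
    if levels is None:
        terms.append(name)                      # one slope per continuous predictor
    else:
        # L - 1 indicator terms: every level except the reference
        terms.extend(f"I({name} = {level})" for level in levels[1:])

for index, term in enumerate(terms):
    print(f"beta_{index}: {term}")

print("number of regression coefficients:", len(terms))   # 13
```

The count of 13 matches the coefficients $\beta_0$ through $\beta_{12}$ in the model: one intercept, two slopes for the continuous predictors, and $2 + 1 + 5 + 2 = 10$ indicator terms.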