Chapter 6 Multiple linear regression

6.1 Introduction

Multiple linear regression (MLR) extends simple linear regression (SLR) to allow more than one predictor in the model. In an observational study, MLR is how you estimate the association between an exposure and outcome adjusted for confounding due to other variables.

Some users of regression treat it like a “black box”: identify an outcome, a predictor, and some confounders, throw them into a regression model, and the computer outputs the results. It is, unfortunately, far from that simple. Even with just one predictor, you have to check the assumptions and look for outliers and influential observations. With multiple predictors it becomes even more complicated:

  • You still have to check assumptions and look for outliers and influential observations,
  • the predictors can impact each other in unexpected ways, and
  • the interpretation of the results is more complex.

Knowing enough to be dangerous

Knowing the basics of regression can give a researcher a false sense of confidence in their regression results, also known as “knowing enough to be dangerous.” If you have no idea how to use regression, you will seek help. If you know the basics but do not realize that there is a lot more, you may have incorrect or misleading results but not know it and fail to seek help. To become highly proficient at using regression requires more than what we can cover in this course. The goal here is to introduce you to the various important concepts and the tools available in R. Until you have a lot of practice using regression, it is wise to consult with a statistician who can guide you and give you feedback; but you will be able to carry out the analyses yourself, in many cases, using the tools you learn here. You will know a lot more than you did, you will know your limits, and you will have the foundation to expand those limits over time with experience and further study.

6.2 Notation and interpretation

MLR is an extension of SLR to the case of multiple predictors. We use the word “predictor” here generically… there might be one primary predictor while the rest are confounders, but we can use the term “predictor” for all of them.

The data are \(n\) sets of observed values of the outcome \(Y\) and predictors \(X_1, X_2, \ldots, X_p\). The data vector for the \(i^{th}\) case is denoted \((y_i, x_{1i}, x_{2i}, \ldots, x_{pi})\). The multiple linear regression model is written

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon \]

where \(\beta_0\) is the intercept, the other \(\beta\) terms are the effects (slopes) for each of the predictors, and \(\epsilon\) is the error term, or residual, which is assumed to have a normal distribution with mean zero and the same variance at all predictor values. We denote this assumption about the error term using the notation \(\epsilon \sim N(0, \sigma^2)\). This model assumes that the relationship between \(Y\) and each \(X_j\) \((j = 1, \ldots, p)\) is linear when all the other variables are held constant, and that the error term captures random measurement error and model misspecification (e.g., if there really should be additional predictors in the model or a predictor should be modeled using a curve).

For a continuous predictor \(X_j\), the corresponding regression coefficient \(\beta_j\) is interpreted as the difference in the mean \(Y\) associated with a one-unit difference in \(X_j\) when holding all other variables fixed (or “controlling for” or “adjusted for” the other variables).

As discussed in Chapter 5, if predictor \(X_j\) is categorical with \(K\) levels, then the corresponding \(\beta_j X_j\) term would be replaced by \(K-1\) terms with indicator variables.

For a categorical predictor \(X_j\) with \(K\) levels, each of the \(K-1\) associated regression coefficients are interpreted as the difference between the mean \(Y\) at that level of \(X_j\) and the mean \(Y\) at the reference level, when holding all other variables fixed (or “controlling for” or “adjusted for” the other variables).

Notice that the interpretations of the MLR regression coefficients are the same as in SLR, except for the addition of “when holding all other variables fixed.”
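
In R, such a model is fit with lm() by listing the predictors on the right-hand side of the formula. The sketch below is schematic only, with hypothetical names y, x1, x2, x3, and dat; the actual NHANES model is fit in Section 6.5.

# Schematic only: y is the outcome, x1-x3 are predictors in data frame dat
fit <- lm(y ~ x1 + x2 + x3, data = dat)
summary(fit)   # the Estimate column corresponds to the beta terms above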

Example 2 (continued): In the previous chapter, we estimated the relationship between systolic blood pressure (BPXSY1) and smoking status (smoker) among adult participants in NHANES 2017-2018. However, we did not adjust for potential confounders of this relationship. Let’s use MLR to adjust for confounding due to age (RIDAGEYR), gender (RIAGENDR), race (RIDRETH3), education (DMDEDUC2), and income (income). We will use this example in the following sections.

Before working through all the steps to get to the final model, let’s look at the final regression results to see what we’re heading for and how to interpret the results. In Table 1 below, the “B” column contains the estimated regression coefficients.

Table 1. Multiple linear regression results for systolic blood pressure (SBP, mmHg) vs. smoking status (N = 814), adjusted for age, gender, race, education, and income

How do we interpret the regression coefficients?

The intercept is the mean outcome when all continuous predictors are 0 and all categorical predictors are at their reference level.

  • In this example, the intercept (99.5 mmHg) corresponds to the estimated SBP for individuals who are never smokers, age = 0 years, male, Mexican American, have < 9th grade education, and < $25,000 household income. The model was fit to data on adults so age = 0 years is not possible. Had we centered age then the intercept would be more interpretable.
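
As a sketch of that idea (hypothetical variable name cRIDAGEYR; centering is demonstrated in Section 6.7), centering age at, say, 40 years would make the intercept the estimated SBP for a 40-year-old with all categorical predictors at their reference levels:

# Hypothetical sketch: center age at 40 years so the intercept is estimated
# at age 40 rather than the impossible age 0
nhanes$cRIDAGEYR <- nhanes$RIDAGEYR - 40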

For continuous predictors, the regression coefficient is the estimated difference in the outcome associated with a 1-unit difference in the predictor, when holding all other variables fixed.

  • In this example, age is the only continuous predictor. When holding all other variables constant, a 1-year age difference is associated with a 0.53 mmHg difference in SBP. This would be true for any combination of the other variables; it does not matter at what values you hold the other variables constant. Another way to say this is that among individuals with the same smoking status, gender, race, education level, and household income, but who differ in age by 1 year, their average SBP is expected to differ by 0.53 mmHg.

For a categorical predictor with K levels, there will be K-1 regression coefficients, each representing the difference in mean outcome between individuals at that level and individuals at the reference level, when holding all other variables fixed.

  • In this example, consider “Smoker.” Comparing individuals who have the same values for all variables except smoking status, past smokers have an average SBP that is 1.99 mmHg lower than never smokers.

NOTE: The interpretation of regression coefficients in MLR is the same as for SLR, except that each one is adjusted for the other variables in the model. They describe differences in mean outcome between individuals who differ in ONLY that variable.

6.3 Examine the data

Before fitting a linear regression model, it is a good idea to look at all the variables individually. See Chapter 3 for numerical and visual summaries of each variable, one at a time.

  • Outcome: Create a numerical summary and plot a histogram of the outcome. Later, after you have learned more about regression diagnostics, you will be able to identify skew and outliers in these plots and may immediately realize that a transformation is needed. You will investigate these potential problems more closely when checking the model assumptions in Section 6.10, but sometimes you can save time by noticing the need for a transformation before fitting the model.

  • Predictors:

    • Continuous: Create a numerical summary and plot a histogram for each continuous predictor. Skew and outliers in predictor histograms may lead to influential observations. Consider if a transformation is needed. Again, we will go into more detail in Section 6.10.
    • Categorical: Create a frequency table for each categorical predictor. Look for levels that only have a few observations as these may need to be collapsed with other levels. The outcome cannot be reliably estimated at levels with very small sample sizes. I usually use 30 as a cutoff, but that is subjective.
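
A minimal sketch of these checks using the NHANES variables from this example, assuming the data frame nhanes has been loaded (as in Section 6.4); see Chapter 3 for more complete summaries:

# Outcome: numerical summary and histogram
summary(nhanes$BPXSY1)
hist(nhanes$BPXSY1)
# Continuous predictor: numerical summary and histogram
summary(nhanes$RIDAGEYR)
hist(nhanes$RIDAGEYR)
# Categorical predictors: frequency tables (look for levels with < 30 cases)
table(nhanes$smoker)
table(nhanes$DMDEDUC2)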

Referring back to the summaries in Chapter 3, we conclude the following:

  • The outcome (SBP) may be skewed, but it is not drastic, so it may be OK to leave it as is for now and wait until the assumption checking step. What matters are the residuals, and after adjusting for all the other variables the skew may lessen.
  • Age does not have any skew so there is no need for a transformation at this point.
  • None of the categorical variables have levels with small sample sizes.
  • SBP, education, and income have missing values (NA). Before creating descriptive statistics for the variables in a multiple linear regression model, remove all cases with any missing data. That way, the sample for your descriptive table will match your sample for your regression. Alternatively, use multiple imputation to handle missing data (see Chapter 11).

NOTE: After removing missing data, you should re-examine the categorical variables, since levels that were not sparse before (cells with ≥ 30 observations) may now be sparse (< 30 observations) and need to be grouped with other levels.
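
A short sketch of that re-check, assuming the complete-case data frame complete.dat created in Section 6.4 below:

# Re-examine categorical frequencies after removing rows with missing data;
# look again for levels with < 30 observations
table(complete.dat$smoker)
table(complete.dat$RIDRETH3)
table(complete.dat$DMDEDUC2)
table(complete.dat$income)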

6.4 Visualize the unadjusted relationships

As in Chapter 5, we want to visualize the association between the outcome and each predictor. These plots will not correspond exactly to the MLR model since they will not be adjusted for any other variables, but they are a useful place to start. Later in Section 6.10 we will examine plots that are adjusted for other variables.

Example 2 (continued): Let’s plot the relationship between systolic blood pressure (BPXSY1) and each of smoking status (smoker), age (RIDAGEYR), gender (RIAGENDR), race (RIDRETH3), education (DMDEDUC2), and income (income).

You could use the code found in 5.3, or you can make use of the following functions which allow you to easily create plots for continuous predictors (contplot()) or categorical predictors (catplot()). In the function for continuous variables, we have added a “smoother” which allows a quick check of the linearity assumption; if the dashed line (the smoother, which is not constrained to be linear) is pretty close to the solid line then the linearity assumption is reasonable.

In this example, we will use the data after removing all cases with missing values so each of the plots is using the same sample.

# Since we will be plotting relationships often, let's write functions to plot the relationships:
# one for categorical variables and one for continuous variables

catplot <- function(OUTCOME, PREDICTOR, DAT, ...) {
  # Plotting OUTCOME vs. a categorical predictor
  # OUTCOME is a string indicating the outcome
  # PREDICTOR is a string indicating the predictor
  # DAT is the dataframe containing this data
  # ... allows you to pass additional arguments to plot below

  # Assign to y and x
  if(tibble::is_tibble(DAT)) DAT <- as.data.frame(DAT)
  y <- DAT[, OUTCOME]
  x <- DAT[, PREDICTOR]
  # Plot
  plot.default(x, y, las = 1, pch = 20, font.lab = 2, font.axis = 2, xaxt = "n", ...)
  LEVELS <- levels(x)
  NX     <- length(LEVELS)
  axis(1, at = 1:NX, labels = LEVELS, font = 2)
  # Compute means
  MEANS <- tapply(y, x, mean, na.rm = T)
  # Add means to the plot
  points(1:NX, MEANS, col = "red", pch = 20, cex = 3)
  lines( 1:NX, MEANS, col = "red")  
}

contplot <- function(OUTCOME, PREDICTOR, DAT, COL = "gray", ...) {
  # Plotting OUTCOME vs. a continuous predictor
  # OUTCOME is a string indicating the outcome
  # PREDICTOR is a string indicating the predictor
  # DAT is the dataframe containing this data
  # ... allows you to pass additional arguments to plot below
  
  # Assign to y and x
  if(tibble::is_tibble(DAT)) DAT <- as.data.frame(DAT)
  y <- DAT[, OUTCOME]
  x <- DAT[, PREDICTOR]
  
  # Scatterplot with regression line and smoother
  plot(x, y, las = 1, pch = 20, font.lab = 2, font.axis = 2, col=COL, ...)
  abline(lm(y ~ x), col = "red", lwd = 2)
  # Remove missing data to use smooth.spline
  SUB <- !is.na(y) &  !is.na(x)
  lines(smooth.spline(x[SUB], y[SUB]), col = "blue", lwd = 2, lty = 2)
}

load("Data/NHANES-2017-2018-sub20.Rdata")

library(tidyverse)
complete.dat <- nhanes %>% 
  select(BPXSY1, smoker, RIDAGEYR, RIAGENDR, RIDRETH3, DMDEDUC2, income) %>% 
  drop_na()

# Change the margins, which helps with a grid of plots
# Default par(mar=c(5, 4, 4, 2) + 0.1) for c(bottom, left, top, right)
par(mar=c(4, 2, 2, 1))
par(mfrow=c(3,2))
catplot("BPXSY1",  "smoker",   complete.dat, ylab = "Systolic BP (mmHg)", xlab = "Smoking Status")
contplot("BPXSY1", "RIDAGEYR", complete.dat, ylab = "Systolic BP (mmHg)", xlab = "Age (years)")
catplot("BPXSY1",  "RIAGENDR", complete.dat, ylab = "Systolic BP (mmHg)", xlab = "Gender")
catplot("BPXSY1",  "RIDRETH3", complete.dat, ylab = "Systolic BP (mmHg)", xlab = "Race/Ethnicity")
catplot("BPXSY1",  "DMDEDUC2", complete.dat, ylab = "Systolic BP (mmHg)", xlab = "Education")
catplot("BPXSY1",  "income",   complete.dat, ylab = "Systolic BP (mmHg)", xlab = "Income")

par(mfrow=c(1,1))
par(mar=c(5, 4, 4, 2) + 0.1)

NOTE: When initially looking at the data, look for obvious model violations (non-linearity, non-normal residuals). You can choose to deal with them at this point, or wait until the assumption checking steps. Sometimes problems you see here go away after all the variables are adjusted for each other in the multiple linear regression model.

6.5 Fitting the MLR model

The section below demonstrates how to fit a multiple regression model in R. We will fit both an “unadjusted” model that only has the primary predictor of interest (in this example, smoking status) and an “adjusted model” which adjusts for confounding due to other variables. In order to have both analyses be based on the same sample size, we will first remove cases with missing values on any of the analysis variables (done already above, complete.dat).

# Unadjusted
fit.ex2.unadj <- lm(BPXSY1 ~ smoker, data = complete.dat)
summary(fit.ex2.unadj)
## 
## Call:
## lm(formula = BPXSY1 ~ smoker, data = complete.dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.39 -14.39  -2.46   9.61  85.61 
## 
## Coefficients:
##               Estimate Std. Error t value
## (Intercept)    126.462      0.947  133.58
## smokerPast       3.931      1.645    2.39
## smokerCurrent   -0.919      1.867   -0.49
##                          Pr(>|t|)    
## (Intercept)   <0.0000000000000002 ***
## smokerPast                  0.017 *  
## smokerCurrent               0.623    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.9 on 811 degrees of freedom
## Multiple R-squared:  0.00887,    Adjusted R-squared:  0.00643 
## F-statistic: 3.63 on 2 and 811 DF,  p-value: 0.0269
confint(fit.ex2.unadj)
##                  2.5 %  97.5 %
## (Intercept)   124.6032 128.320
## smokerPast      0.7026   7.160
## smokerCurrent  -4.5838   2.746
car::Anova(fit.ex2.unadj, type = 3)
## Anova Table (Type III tests)
## 
## Response: BPXSY1
##              Sum Sq  Df  F value              Pr(>F)
## (Intercept) 7068694   1 17842.46 <0.0000000000000002
## smoker         2876   2     3.63               0.027
## Residuals    321296 811                             
##                
## (Intercept) ***
## smoker      *  
## Residuals      
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the unadjusted model there was a significant association between SBP and smoking status (F[2, 811] = 3.63, p = .027), as shown in the “Anova Table (Type III tests)” above. Past Smokers had significantly greater SBP than Never Smokers (B = 3.93; 95% CI = 0.70, 7.16; p = .017). See Section 5.4.2 for how to re-level smoking in order to compare Past vs. Current.
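
As a sketch of that re-leveling approach (see Section 5.4.2 for details; the object name fit.ex2.relevel is hypothetical), you could refit with “Past” as the reference level so the output compares Never and Current vs. Past:

# Sketch: make "Past" the reference level of smoker and refit
fit.ex2.relevel <- lm(BPXSY1 ~ relevel(smoker, ref = "Past"), data = complete.dat)
summary(fit.ex2.relevel)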

# Adjusted
fit.ex2.adj <- lm(BPXSY1 ~ smoker + RIDAGEYR + RIAGENDR + RIDRETH3 + DMDEDUC2 + income,
           data = complete.dat)
summary(fit.ex2.adj)
## 
## Call:
## lm(formula = BPXSY1 ~ smoker + RIDAGEYR + RIAGENDR + RIDRETH3 + 
##     DMDEDUC2 + income, data = complete.dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40.95 -10.63  -1.81   9.46  76.24 
## 
## Coefficients:
##                            Estimate Std. Error t value
## (Intercept)                 99.4677     3.3694   29.52
## smokerPast                  -1.9944     1.5282   -1.31
## smokerCurrent               -2.1385     1.7204   -1.24
## RIDAGEYR                     0.5261     0.0366   14.37
## RIAGENDRFemale              -1.6290     1.2490   -1.30
## RIDRETH3Other Hispanic       0.9629     2.6900    0.36
## RIDRETH3Non-Hispanic White  -0.8733     2.0259   -0.43
## RIDRETH3Non-Hispanic Black   7.5025     2.1781    3.44
## RIDRETH3Non-Hispanic Asian  -2.9613     2.4231   -1.22
## RIDRETH3Other/Multi          1.7860     3.0238    0.59
## DMDEDUC29-11                 6.7844     2.9917    2.27
## DMDEDUC2HS/GED               2.3818     2.7111    0.88
## DMDEDUC2Some college/AA      3.1514     2.7005    1.17
## DMDEDUC2College grad         1.1160     2.9130    0.38
## income$25,000 to < $55,000  -0.3080     1.6106   -0.19
## income$55,000+              -3.2407     1.6510   -1.96
##                                       Pr(>|t|)    
## (Intercept)                <0.0000000000000002 ***
## smokerPast                              0.1923    
## smokerCurrent                           0.2142    
## RIDAGEYR                   <0.0000000000000002 ***
## RIAGENDRFemale                          0.1925    
## RIDRETH3Other Hispanic                  0.7205    
## RIDRETH3Non-Hispanic White              0.6665    
## RIDRETH3Non-Hispanic Black              0.0006 ***
## RIDRETH3Non-Hispanic Asian              0.2220    
## RIDRETH3Other/Multi                     0.5549    
## DMDEDUC29-11                            0.0236 *  
## DMDEDUC2HS/GED                          0.3799    
## DMDEDUC2Some college/AA                 0.2436    
## DMDEDUC2College grad                    0.7017    
## income$25,000 to < $55,000              0.8484    
## income$55,000+                          0.0500 .  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.1 on 798 degrees of freedom
## Multiple R-squared:  0.277,  Adjusted R-squared:  0.264 
## F-statistic: 20.4 on 15 and 798 DF,  p-value: <0.0000000000000002
confint(fit.ex2.adj)
##                              2.5 %     97.5 %
## (Intercept)                92.8537 106.081649
## smokerPast                 -4.9942   1.005449
## smokerCurrent              -5.5155   1.238504
## RIDAGEYR                    0.4542   0.597948
## RIAGENDRFemale             -4.0808   0.822705
## RIDRETH3Other Hispanic     -4.3173   6.243171
## RIDRETH3Non-Hispanic White -4.8499   3.103371
## RIDRETH3Non-Hispanic Black  3.2270  11.777986
## RIDRETH3Non-Hispanic Asian -7.7177   1.795133
## RIDRETH3Other/Multi        -4.1494   7.721508
## DMDEDUC29-11                0.9119  12.656993
## DMDEDUC2HS/GED             -2.9399   7.703526
## DMDEDUC2Some college/AA    -2.1494   8.452327
## DMDEDUC2College grad       -4.6020   6.834050
## income$25,000 to < $55,000 -3.4695   2.853533
## income$55,000+             -6.4815   0.000141
car::Anova(fit.ex2.adj, type = 3)
## Anova Table (Type III tests)
## 
## Response: BPXSY1
##             Sum Sq  Df F value               Pr(>F)
## (Intercept) 255828   1  871.47 < 0.0000000000000002
## smoker         723   2    1.23                0.293
## RIDAGEYR     60630   1  206.53 < 0.0000000000000002
## RIAGENDR       499   1    1.70                0.193
## RIDRETH3      9953   5    6.78            0.0000033
## DMDEDUC2      2306   4    1.96                0.098
## income        1543   2    2.63                0.073
## Residuals   234260 798                             
##                
## (Intercept) ***
## smoker         
## RIDAGEYR    ***
## RIAGENDR       
## RIDRETH3    ***
## DMDEDUC2    .  
## income      .  
## Residuals      
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After adjusting for confounding, there was no significant association between SBP and smoking status (F[2, 798] = 1.23, p = .293), as shown in the “Anova Table (Type III tests)” above.

Each of the predictors, continuous or categorical, is interpreted as discussed in Chapter 5, except with the addition of “adjusted for all other variables in the model.” For example, we could conclude that, after adjusting for all other variables, SBP significantly increases with age (B = 0.53; 95% CI = 0.45, 0.60; p < .001).

6.6 Visualizing the adjusted relationships

Earlier, we plotted the outcome vs. each predictor to visualize the unadjusted relationships. How do we visualize the adjusted relationships? Remember that now the regression coefficients are adjusted for all other terms in the model. So the relationship we would like to see is the relationship between the outcome and each predictor after removing the effect of all the other predictors. We can do this using an added variable plot, using the following steps:

  • Denote the outcome as \(Y\), the predictor of interest as \(X\), and the other predictors as \(Z_1, Z_2, ...\) (collectively denoted as \(\mathbf{Z}\))
  • Regress \(Y\) on \(\mathbf{Z}\) and store the residuals \((R_{yz})\) - This gives us the part of the outcome not explained by the other predictors.
  • Regress \(X\) on \(\mathbf{Z}\) and store the residuals \((R_{xz})\) - This gives us the part of the predictor of interest not explained by the other predictors.
  • Plot \(R_{yz}\) vs. \(R_{xz}\).
  • The correlation between \(R_{yz}\) and \(R_{xz}\) is called the partial correlation and is interpreted as the correlation between \(Y\) and \(X\) adjusted for \(\mathbf{Z}\).

Fortunately, you do not have to actually do these steps yourself! They are automatically carried out by the car::avPlots function (Fox and Weisberg 2019). For example, using just a subset of the predictors we have been using:

fit.av <- lm(BPXSY1 ~ RIDAGEYR + BMXBMI + BMXHT + smoker, data = nhanes)
car::avPlots(fit.av, ask = F)

Notice the axis labels… The vertical axis is labeled “BPXSY1 | others” which corresponds to the outcome given all the predictors other than the one on the horizontal axis (in our notation from above, \(Y\) given \(\mathbf{Z}\)). The horizontal axis on each plot tells you to which predictor each added variable plot corresponds. For example, “RIDAGEYR | others” corresponds to age adjusted for all the other covariates. For the categorical variable smoker, you get a plot for each non-reference level. Each non-reference level is represented by an indicator variable with two possible values (0 or 1), so its added variable plot will generally have two clouds of points, one for cases where that indicator is 0 and one for cases where it is 1. The slope of each line in each panel is the regression coefficient for that term in the model. Thus, if the solid line is horizontal then there is no association between \(Y\) and \(X\) after adjusting for the other predictors.
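
If you want to verify the construction, here is a minimal sketch of building the added variable plot for age by hand, following the steps above and using the rows actually used to fit fit.av:

# Manual added variable plot for RIDAGEYR (a sketch; compare to avPlots above)
mf <- model.frame(fit.av)                 # rows used in fit.av (no missing data)
R_yz <- resid(lm(BPXSY1   ~ BMXBMI + BMXHT + smoker, data = mf))  # Y given Z
R_xz <- resid(lm(RIDAGEYR ~ BMXBMI + BMXHT + smoker, data = mf))  # X given Z
plot(R_xz, R_yz, pch = 20, col = "gray",
     xlab = "RIDAGEYR | others", ylab = "BPXSY1 | others")
abline(lm(R_yz ~ R_xz), col = "red", lwd = 2)
# The slope of this line equals the RIDAGEYR coefficient in fit.av, and
# cor(R_yz, R_xz) is the partial correlation between SBP and age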

6.7 Interactions

In a MLR model, each regression coefficient is interpreted as the effect of a predictor while holding other predictors at fixed values. That also implies that the effect does not depend on at which specific values you hold the other predictors. However, when there is an interaction (effect modification), the effect of a predictor differs depending on the level of another variable (and vice versa).

NOTE: Interaction is not the same as correlation. Correlation has to do with how two variables are related to each other whereas interaction has to do with how each of two variables impacts the effect of the other on the outcome.

When including an interaction between two predictors in a model, you always include each of the predictors individually (the main effects) as well as their interaction. The syntax in R is lm(Y ~ X + Z + X:Z), where X:Z is the interaction term (the shorthand lm(Y ~ X*Z) expands to the same three terms).

Example 3: In the regression of SBP on body mass index (BMI, kg/m2), does the effect of BMI on SBP depend on gender? Center BMI at 25 kg/m2 before fitting the model (you will see why that’s important later).

# Center BMI at 25 – create a new variable called "cBMXBMI"
nhanes <- nhanes %>% mutate(cBMXBMI = BMXBMI - 25)
# Fit with no interaction (for comparison)
fit1 <- lm(BPXSY1 ~ cBMXBMI + RIAGENDR, data = nhanes)
# Fit with interaction
fit2 <- lm(BPXSY1 ~ cBMXBMI + RIAGENDR + cBMXBMI:RIAGENDR, data = nhanes)

Before explaining the output from the model with an interaction, let’s review the interpretation of the output of the model without an interaction.

Regression equation WITHOUT interaction

The model with no interaction can be written as:

\[ SBP = \beta_0 + \beta_1 (BMI - 25) + \beta_2 I(Gender = Female) + \epsilon \]

round(summary(fit1)$coef, 4)
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    125.9226     0.9578 131.472   0.0000
## cBMXBMI          0.3735     0.0873   4.277   0.0000
## RIAGENDRFemale  -2.4695     1.2355  -1.999   0.0459
confint(fit1)
##                   2.5 %    97.5 %
## (Intercept)    124.0430 127.80221
## cBMXBMI          0.2021   0.54492
## RIAGENDRFemale  -4.8941  -0.04488

From the output above, we see that the equation for the estimated mean is (after rounding):

\[ E(SBP | BMI, Gender) = 125.9 +0.37 \times (BMI - 25) -2.47 \times I(Gender = Female) \]

Notice how, when we wrote the estimated model, we dropped the error term on the right and replaced \(Y\) with \(E(Y|predictors)\) on the left.

The interpretation of these coefficients is as follows:

  • Intercept: The mean SBP among males with BMI = 25 is 125.9 mmHg. [The intercept is the mean outcome when all the other variables are 0 or at their reference level. In this example, (BMI – 25) = 0 when BMI = 25 and I(Gender = Female) = 0 when Gender = Male, which is the reference level. The reason we centered BMI at 25 is so that the intercept would represent a plausible outcome value – at BMI = 25 kg/m2 rather than at 0 kg/m2 which is an impossible value.]
  • Coefficient for BMI: A 1 kg/m2 difference in BMI is associated with a 0.37 mmHg difference in SBP. This model assumes this effect is the same for males and females.
  • Coefficient for Gender: Females have an average SBP that is 2.47 mmHg lower (since the coefficient is negative) than males. This model assumes this gender difference is the same at all BMI values.
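
For example, using the rounded coefficients above, the estimated mean SBP for a female with BMI = 30 kg/m2 is computed as:

# Worked example with the rounded coefficients from the text
125.9 + 0.37*(30 - 25) - 2.47*1   # = 125.28 mmHg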

Regression equation WITH interaction

The model with an interaction can be written as:

\[ SBP = \beta_0 + \beta_1 (BMI - 25) + \beta_2 I(Gender = Female) + \beta_3 (BMI - 25) I(Gender = Female) + \epsilon \]

round(summary(fit2)$coef, 4)
##                        Estimate Std. Error  t value
## (Intercept)            126.9632     1.0917 116.2941
## cBMXBMI                  0.1419     0.1461   0.9714
## RIAGENDRFemale          -4.1391     1.4952  -2.7682
## cBMXBMI:RIAGENDRFemale   0.3598     0.1821   1.9762
##                        Pr(>|t|)
## (Intercept)              0.0000
## cBMXBMI                  0.3316
## RIAGENDRFemale           0.0057
## cBMXBMI:RIAGENDRFemale   0.0484
confint(fit2)
##                            2.5 %   97.5 %
## (Intercept)            124.82077 129.1057
## cBMXBMI                 -0.14477   0.4286
## RIAGENDRFemale          -7.07340  -1.2048
## cBMXBMI:RIAGENDRFemale   0.00251   0.7172
car::Anova(fit2, type = 3)
## Anova Table (Type III tests)
## 
## Response: BPXSY1
##                   Sum Sq  Df  F value
## (Intercept)      5002725   1 13524.32
## cBMXBMI              349   1     0.94
## RIAGENDR            2835   1     7.66
## cBMXBMI:RIAGENDR    1445   1     3.91
## Residuals         358439 969         
##                               Pr(>F)    
## (Intercept)      <0.0000000000000002 ***
## cBMXBMI                       0.3316    
## RIAGENDR                      0.0057 ** 
## cBMXBMI:RIAGENDR              0.0484 *  
## Residuals                               
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output above, we see that the equation for the estimated mean is (after rounding):

\[\begin{array}{rcl} E(SBP | BMI, Gender) & = & 127.0 +0.14 \times (BMI - 25) -4.14 \times I(Gender = Female) \\ & & +0.36 \times (BMI - 25) \times I(Gender = Female) \end{array}\]

The interpretation of these coefficients is as follows. Notice in particular the qualifying phrases (“among males,” “among those with BMI = 25”).

  • Intercept: The mean SBP among males with BMI = 25 is 127.0 mmHg.
  • Coefficient for BMI (“main effect”): Among males, a 1 kg/m2 difference in BMI is associated with a 0.14 mmHg difference in SBP. This model allows the BMI effect to differ between males and females. See below for an explanation of why this is the effect for males, and how to derive the effect for females.
  • Coefficient for Gender (“main effect”): Among those with BMI = 25, females have an average SBP that is 4.14 mmHg lower (since the coefficient is negative) than males. This model allows the gender difference to be different at different BMI levels. See below for an explanation of why this effect is for those with BMI = 25 and how to derive the effect at other BMI levels.
  • Coefficient for BMI × Gender (“interaction effect”): This term has two interpretations. First, the BMI effect on SBP is 0.36 greater for females than for males. Second, the gender difference in mean SBP increases by 0.36 mmHg for each 1 kg/m2 above BMI = 25, and decreases by 0.36 for each 1 kg/m2 below BMI = 25.

NOTE: When there is an interaction in the model, be very careful when interpreting the main effects – they are NOT the overall effects of each variable since each variable involved in an interaction does not have a single overall effect. The main effects are the effects of each variable at a particular level of the other variable. In this example, the answer to “what is the effect of BMI on SBP” is “it depends on gender” (and vice versa). There is no single main effect for BMI or for gender – each depends on the other.

Let’s break down the regression equation by gender to make this clearer:

For males, \(I(Gender=Female)=0\), so the above regression equation reduces to

\[ SBP = \beta_0 + \beta_1 (BMI - 25) + \epsilon \]

which is estimated as

\[ E(SBP | BMI, Gender) = 127.0 +0.14 \times (BMI - 25) \].

For females, \(I(Gender=Female)=1\), so the regression equation reduces to

\[\begin{array}{rcl} SBP & = & \beta_0 + \beta_1 (BMI - 25) + \beta_2 + \beta_3 (BMI - 25) + \epsilon \\ & = & (\beta_0 + \beta_2) + (\beta_1 + \beta_3) (BMI - 25) + \epsilon \end{array}\]

which is estimated as

\[\begin{array}{rcl} E(SBP | BMI, Gender) & = & (127.0 -4.14) + (0.14 +0.36) \times (BMI - 25) \\ & = & 122.8 +0.50 \times (BMI - 25) \end{array}\]

For males:

  • Intercept: The mean SBP among males with BMI = 25 is 127.0 mmHg.
  • Coefficient for BMI: A 1 kg/m2 difference in BMI is associated with a 0.14 mmHg difference in SBP.

For females:

  • Intercept: The mean SBP among females with BMI = 25 is 122.8 mmHg.
  • Coefficient for BMI: A 1 kg/m2 difference in BMI is associated with a 0.50 mmHg difference in SBP.
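
These gender-specific intercepts and slopes can be recovered directly from the fitted model; a sketch using the coefficient names from the summary(fit2) output above:

# Extract coefficients by name
b <- coef(fit2)
# Males: intercept and BMI slope
b["(Intercept)"]
b["cBMXBMI"]
# Females: add the gender main effect and the interaction term
b["(Intercept)"] + b["RIAGENDRFemale"]
b["cBMXBMI"]     + b["cBMXBMI:RIAGENDRFemale"]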

We can visualize the impact of including an interaction by plotting the regression lines for the models without and with an interaction. As we did in Section 5.5, let’s plot the data, and then use predict() to get the points that fall on the regression lines over a range of predictor values.

# Get predicted values over a range for plotting
BMI <- seq(min(nhanes$BMXBMI, na.rm = T), max(nhanes$BMXBMI, na.rm = T), by = 1)
# Center X just like the predictor
cBMI <- BMI - 25
# Data frame of values at which to predict
preddat <- data.frame(cBMXBMI  = rep(cBMI, 2), # Repeat for Male, Female
                      RIAGENDR = c(rep("Male",   length(cBMI)),
                                   rep("Female", length(cBMI))))
# Add columns for the predicted values
preddat$pred_noint <- predict(fit1, preddat)
preddat$pred_int   <- predict(fit2, preddat)
# Examine the prediction data frame
head(preddat)
##   cBMXBMI RIAGENDR pred_noint pred_int
## 1   -10.2     Male      122.1    125.5
## 2    -9.2     Male      122.5    125.7
## 3    -8.2     Male      122.9    125.8
## 4    -7.2     Male      123.2    125.9
## 5    -6.2     Male      123.6    126.1
## 6    -5.2     Male      124.0    126.2
par(mfrow=c(1,2))
# Plot the raw X variable, not the centered one!
plot(nhanes$BMXBMI, nhanes$BPXSY1, col = "darkgray",
     xlab = expression(BMI (kg/m^2)),
     ylab = "SBP (mmHg)",
     las = 1, pch = 20, font.lab = 2, font.axis = 2)
# Add lines to the plot
SUB1 <- preddat$RIAGENDR == "Male"
SUB2 <- preddat$RIAGENDR == "Female"
lines(BMI, preddat$pred_noint[SUB1], lwd = 2, lty = 1, col = "blue")
lines(BMI, preddat$pred_noint[SUB2], lwd = 2, lty = 2, col = "magenta")
legend(40, 220, c("Female", "Male"), col=c("magenta", "blue"), lwd=c(2,2), lty=c(2,1), bty="n")

plot(nhanes$BMXBMI, nhanes$BPXSY1, col = "darkgray",
     xlab = expression(BMI (kg/m^2)),
     ylab = "SBP (mmHg)",
     las = 1, pch = 20, font.lab = 2, font.axis = 2)
lines(BMI, preddat$pred_int[SUB1], lwd = 2, lty = 1, col = "blue")
lines(BMI, preddat$pred_int[SUB2], lwd = 2, lty = 2, col = "magenta")
legend(40, 220, c("Female", "Male"), col=c("magenta", "blue"), lwd=c(2,2), lty=c(2,1), bty="n")

par(mfrow=c(1,1))

Model with no interaction: In the left panel, the BMI effect (slope) for males and females is assumed to be the same (and the estimate is 0.37), and the female line is slightly below the male line by an amount equal to the gender effect (-2.47). The assumption that the BMI effect is the same for males and females is equivalent to assuming that the lines are parallel.

Model with an interaction: In the right panel, including an interaction term has the effect of fitting separate regression lines for males and females. The BMI slope for females (0.50) is much larger than for males (0.14), indicating that BMI has a greater effect on SBP for females than for males.

The BMI effect is 0.50 for females and 0.14 for males. Is this difference statistically significant? This is the same as asking “are the lines in the plot significantly different from being parallel?” and is exactly what the p-value for the interaction is testing. The test of the null hypothesis that \(\beta_3 = 0\) is the test of interaction.

From the output, the p-value for the test of interaction is .048, which is < 0.05, so we conclude that there is a significant difference in the BMI effect on SBP between females and males (and that there is a significant difference in the gender effect on SBP between those with differing BMI). In this case, since the interaction was between a continuous variable and a binary variable, the interaction is just one term, so the test can be obtained either from the regression coefficient table or from the Type III tests table. In Section 6.7.3 we will discuss what to do when the interaction involves a categorical variable with more than two levels.
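
Because the interaction here adds only a single term, the same test can also be obtained by comparing the models with and without the interaction; a sketch:

# F test comparing the nested models (equivalent to the interaction test here)
anova(fit1, fit2)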

We just showed that the BMI effect is significantly greater for females than for males. What if you want to test the significance of the BMI effect for each gender individually (testing whether the line for females is significantly different from a horizontal line, testing whether the line for males is significantly different from a horizontal line)?

Null hypothesis for the test of the BMI effect among males:

\[H_0: \beta_1=0\]

For males, you can read this right off the output from the cBMXBMI row since that corresponds to \(\beta_1\). Conclusion: The BMI effect for males is not statistically significant (B = 0.14; 95% CI = -0.14, 0.43; p = .332).

How do we conduct this test for females? You could refit the model after setting Female to be the reference level instead of Male; however, there is a more generally applicable method that is useful to learn and does not involve refitting the model.

Null hypothesis for the test of the BMI effect among females:

\[H_0: \beta_1 + \beta_3=0\]

The BMI effect for females (\(\beta_1 + \beta_3\)) is a linear combination of regression coefficients. The function gmodels::estimable (Warnes et al. 2018) allows us to estimate and test such linear combinations. To use gmodels::estimable, you must specify a vector \((a, b, c, d)\) to estimate \(a\beta_0 + b\beta_1 + c\beta_2 + d\beta_3\) and test the null hypothesis that this sum is 0. For females, we want to test \(\beta_1 + \beta_3\) so we need to specify the vector \((0, 1, 0, 1)\). For males, the vector would be \((0, 1, 0, 0)\). To make sure you put the 1’s in the right places, look at the order of the coefficients in the lm() output.

# Testing the slope for males
gmodels::estimable(fit2, c(0, 1, 0, 0), conf.int = 0.95)
##           Estimate Std. Error t value  DF Pr(>|t|)
## (0 1 0 0)   0.1419     0.1461  0.9714 969   0.3316
##           Lower.CI Upper.CI
## (0 1 0 0)  -0.1448   0.4286
# Testing the slope for females
gmodels::estimable(fit2, c(0, 1, 0, 1), conf.int = 0.95)
##           Estimate Std. Error t value  DF    Pr(>|t|)
## (0 1 0 1)   0.5018     0.1087   4.616 969 0.000004435
##           Lower.CI Upper.CI
## (0 1 0 1)   0.2885   0.7151

Conclusion: The BMI effect for males is not statistically significant (Est = 0.14; 95% CI = -0.14, 0.43; p = .332). The BMI effect for females is statistically significant (Est = 0.50; 95% CI = 0.29, 0.72; p < .001).

The interaction term has two interpretations

Above, we saw how the BMI effect differed by gender. We can also break down the regression equation by BMI to see how the gender effect differs by BMI. BMI \(\ge\) 25 kg/m2 is considered “overweight” and BMI \(\ge\) 30 kg/m2 is considered “obese.” Let’s see what the equations look like at BMI = 23, 28, and 33 kg/m2 (in the normal, overweight, and obese ranges, respectively).

At BMI = 23 kg/m2:

\[ SBP = \beta_0 + \beta_1 (23-25) + \beta_2 I(Gender = Female) + \beta_3 (23-25) \times I(Gender = Female) + \epsilon \]

which is estimated as

\[\begin{array}{rcl} E(SBP | Gender) & = & 127.0 +0.14 \times (23-25) -4.14 \times I(Gender = Female) \\ & & +0.36 \times (23-25) \times I(Gender = Female) \end{array}\]

and reduces to

\[ E(SBP | Gender) = 126.7 -4.86 \times I(Gender=Female) \]

Skipping the math for the other cases, we get

\[\begin{array}{lrcl} \text{At BMI = 23:} & E(SBP | Gender) & = & 126.7 -4.86 \times I(Gender=Female) \\ \text{At BMI = 28:} & E(SBP | Gender) & = & 127.4 -3.06 \times I(Gender=Female) \\ \text{At BMI = 33:} & E(SBP | Gender) & = & 128.1 -1.26 \times I(Gender=Female) \end{array}\]

As BMI increases from 23 to 28, it increases by 5 units, which results in the gender effect increasing by \(5 \times 0.36=1.80\) (going from -4.86 to -3.06).
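
A quick arithmetic check of these gender effects, using the rounded coefficients from the text:

# Gender effect (Female - Male) at BMI = 23, 28, and 33 kg/m2
-4.14 + 0.36 * (c(23, 28, 33) - 25)   # = -4.86, -3.06, -1.26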

We can visualize this in the plot by looking at how far apart the Male and Female regression lines are at different BMI values:

plot(nhanes$BMXBMI, nhanes$BPXSY1, col = "darkgray",
     xlab = expression(BMI (kg/m^2)),
     ylab = "SBP (mmHg)",
     las = 1, pch = 20, font.lab = 2, font.axis = 2)
lines(BMI, preddat$pred_int[SUB1], lwd = 2, lty = 1, col = "blue")
lines(BMI, preddat$pred_int[SUB2], lwd = 2, lty = 2, col = "magenta")
legend(40, 220, c("Female", "Male"), col=c("magenta", "blue"), lwd=c(2,2), lty=c(2,1), bty="n")
abline(v = c(23, 28, 33), col = "brown", lty = 2, lwd = 2)

Conclusion: The magnitude of the association between BMI and SBP differs between males and females (p = .048) (this is the p-value for the interaction). The BMI effect for males is not statistically significant (Est = 0.14; 95% CI = -0.14, 0.43; p = .332) but the BMI effect for females is statistically significant (Est = 0.50; 95% CI = 0.29, 0.72; p < .001) (we computed these effect sizes, CIs, and p-values using gmodels::estimable).

NOTES:

  • When including an interaction with a continuous variable, it is helpful to center the variable first. It does not affect how well the model fits, but it does change the numeric values of some parameters and makes them more interpretable. For example, if we had not centered BMI in the example above, everywhere where we have “at BMI = 25” would be “at BMI = 0” which is an impossible value.
  • Centering is only needed for continuous variables, not categorical. Also, for a continuous variable \(X\), centering is only needed if \(X=0\) is not plausible, or when you want to see the value of the intercept or a main effect at some specific value of \(X\) other than 0.
  • Failure to center a continuous predictor does NOT invalidate the regression results… it just changes how you interpret them. The results are valid with or without centering.
  • You can ignore collinearity between a variable and an interaction including that variable. Just compute the VIFs without the interaction terms. (See section 6.10.4).
  • It’s OK if you feel confused at this point! Interactions are tricky, but you will learn and get used to them with experience. They are easy to misunderstand and misinterpret, so if you feel confused that is actually a good thing… it means you know enough to be careful and ask for help until you are comfortable dealing with interactions on your own.

6.7.1 Types of interactions

In regression, you can have the following different kinds of interactions:

  • Interaction between a continuous variable and a categorical variable
  • Interaction between two categorical variables
  • Interaction between two continuous variables

6.7.1.1 Continuous \(\times\) categorical

The previous sections covered continuous × categorical interactions (e.g., BMI \(\times\) gender).

6.7.1.2 Categorical \(\times\) categorical

An interaction between two categorical variables cannot be visualized as a pair of regression lines. For example, if you have a categorical variable with 2 levels interacted with a categorical variable with 3 levels, you do not have two lines but rather 2 \(\times\) 3 = 6 means.

Example 4: Regress SBP on the categorical variables gender and smoking status, along with their interaction, and interpret the regression coefficients.

fit.cat.cat <- lm(BPXSY1 ~ RIAGENDR + smoker + RIAGENDR:smoker, data = nhanes)
round(summary(fit.cat.cat)$coef, 4)
##                              Estimate Std. Error
## (Intercept)                  127.2863      1.308
## RIAGENDRFemale                -3.2745      1.692
## smokerPast                     1.9137      2.034
## smokerCurrent                 -0.1462      2.311
## RIAGENDRFemale:smokerPast      7.6598      3.166
## RIAGENDRFemale:smokerCurrent   0.5290      3.406
##                              t value Pr(>|t|)
## (Intercept)                  97.3182   0.0000
## RIAGENDRFemale               -1.9352   0.0533
## smokerPast                    0.9408   0.3471
## smokerCurrent                -0.0632   0.9496
## RIAGENDRFemale:smokerPast     2.4191   0.0157
## RIAGENDRFemale:smokerCurrent  0.1553   0.8766

Including the intercept, there are 6 coefficients above. These can be combined to estimate mean SBP at the 6 possible combinations of gender × smoking status.

Let’s compute the mean at each level. We could use gmodels::estimable, but for estimating the mean outcome at levels of categorical variables (as opposed to the slopes of continuous variables), using predict is much easier.

# Create a data.frame to store the predictions
preddat <- data.frame(
  RIAGENDR = c("Male",  "Male", "Male",    "Female", "Female", "Female"),
  smoker   = c("Never", "Past", "Current", "Never",  "Past",   "Current")
  )
# Add the predictions to preddat
pred.cat.cat <- cbind(preddat, predict(fit.cat.cat, newdata = preddat, interval = "confidence"))
pred.cat.cat
##   RIAGENDR  smoker   fit   lwr   upr
## 1     Male   Never 127.3 124.7 129.9
## 2     Male    Past 129.2 126.1 132.3
## 3     Male Current 127.1 123.4 130.9
## 4   Female   Never 124.0 121.9 126.1
## 5   Female    Past 133.6 129.3 137.9
## 6   Female Current 124.4 120.0 128.8

We can also visualize these means as follows.

# Plot SBP vs. smoking status, with separate means by gender
plot.default(nhanes$smoker, nhanes$BPXSY1, col = "darkgray",
             xlab = "Smoking Status", ylab = "Systolic BP (mm Hg)",
             las = 1, pch = 20,
             font.lab = 2, font.axis = 2, xaxt = "n",
             main = "SBP vs. Smoking by Gender")
axis(1, at = 1:3, labels = levels(nhanes$smoker), font = 2)
# Add means to the plot for males
points(1:3, pred.cat.cat$fit[preddat$RIAGENDR == "Male"], col = "brown", pch = 19, cex = 2)
lines( 1:3, pred.cat.cat$fit[preddat$RIAGENDR == "Male"], col = "brown")
# Add means to the plot for females
points(1:3, pred.cat.cat$fit[preddat$RIAGENDR == "Female"], col = "green", pch = 17, cex = 2)
lines( 1:3, pred.cat.cat$fit[preddat$RIAGENDR == "Female"], col = "green")
legend(1.1, 200, c("Male", "Female"), title = "Gender",
       lty = 1, col = c("brown", "green"),
       pch = c(19, 17), pt.cex = 2, seg.len = 5, bty = "n")

Is the interaction statistically significant? When you have an interaction between categorical variables with M and N levels, there will be (M-1) × (N-1) interaction terms. If this product is > 1, then you cannot get this test from the default lm() output since it is a multiple degree of freedom test (you are testing more than one coefficient simultaneously). But you can get it using car::Anova().

car::Anova(fit.cat.cat, type = 3)
## Anova Table (Type III tests)
## 
## Response: BPXSY1
##                  Sum Sq  Df F value
## (Intercept)     3677812   1 9470.83
## RIAGENDR           1454   1    3.75
## smoker              417   2    0.54
## RIAGENDR:smoker    2372   2    3.05
## Residuals        381729 983        
##                              Pr(>F)    
## (Intercept)     <0.0000000000000002 ***
## RIAGENDR                      0.053 .  
## smoker                        0.584    
## RIAGENDR:smoker               0.048 *  
## Residuals                              
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a statistically significant interaction between gender and smoking status (p = .048). From the plot, we can see that males have a higher mean SBP among never and current smokers, but females have a higher mean SBP among past smokers.

NOTE: You could also plot SBP vs. Gender by Smoking Status to visualize how the smoking status effect differs between males and females.
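
A sketch of that alternative plot, reusing pred.cat.cat from above (this assumes “Male” is the first level of RIAGENDR, consistent with Male being the reference level in the output above):

# Plot SBP vs. gender, with separate means by smoking status
plot.default(nhanes$RIAGENDR, nhanes$BPXSY1, col = "darkgray",
             xlab = "Gender", ylab = "Systolic BP (mm Hg)",
             las = 1, pch = 20, font.lab = 2, font.axis = 2, xaxt = "n",
             main = "SBP vs. Gender by Smoking Status")
axis(1, at = 1:2, labels = levels(nhanes$RIAGENDR), font = 2)
# Add the model-based means, one line per smoking status
points(1:2, pred.cat.cat$fit[pred.cat.cat$smoker == "Never"],   col = "blue",  pch = 19, cex = 2)
lines( 1:2, pred.cat.cat$fit[pred.cat.cat$smoker == "Never"],   col = "blue")
points(1:2, pred.cat.cat$fit[pred.cat.cat$smoker == "Past"],    col = "brown", pch = 17, cex = 2)
lines( 1:2, pred.cat.cat$fit[pred.cat.cat$smoker == "Past"],    col = "brown")
points(1:2, pred.cat.cat$fit[pred.cat.cat$smoker == "Current"], col = "green", pch = 15, cex = 2)
lines( 1:2, pred.cat.cat$fit[pred.cat.cat$smoker == "Current"], col = "green")
legend(1.1, 200, c("Never", "Past", "Current"), title = "Smoking Status",
       lty = 1, col = c("blue", "brown", "green"), pch = c(19, 17, 15), bty = "n")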

What if you want to estimate the difference between specific levels of one variable at a certain level of the other? For example, what is the difference in mean SBP between females and males among past smokers?

First, let’s write out the regression equation and be sure we know which coefficients correspond to each level. Second, we will use gmodels::estimable to estimate the contrast of interest.

\[\begin{array}{ccccl} E(SBP) & = & \beta_0 & + & \beta_1 I(Gender=Female) \\ & & & + & \beta_2 I(Smoker=Past) \\ & & & + & \beta_3 I(Smoker=Current) \\ & & & + & \beta_4 I(Gender=Female) \times I(Smoker=Past) \\ & & & + & \beta_5 I(Gender=Female) \times I(Smoker=Current) \end{array}\]

We would like to use gmodels::estimable to get an estimate, confidence interval, and p-value for our comparison of interest. The easiest way to figure out which vector of 0s and 1s is needed for a contrast is to first make a table for each gender \(\times\) smoking status combination.

Gender   Smoker    Coefficients                                   Vector
Male     Never     \(\beta_0\)                                    c(1, 0, 0, 0, 0, 0)
Male     Past      \(\beta_0 + \beta_2\)                          c(1, 0, 1, 0, 0, 0)
Male     Current   \(\beta_0 + \beta_3\)                          c(1, 0, 0, 1, 0, 0)
Female   Never     \(\beta_0 + \beta_1\)                          c(1, 1, 0, 0, 0, 0)
Female   Past      \(\beta_0 + \beta_1 + \beta_2 + \beta_4\)      c(1, 1, 1, 0, 1, 0)
Female   Current   \(\beta_0 + \beta_1 + \beta_3 + \beta_5\)      c(1, 1, 0, 1, 0, 1)

For example, to get the “Female, Never” row, plug \(I(Gender=Female) = 1\), \(I(Smoker=Past) = 0\) and \(I(Smoker=Current) = 0\) into the equation. The result is \(\beta_0 + \beta_1\), so we need a 1 in the first two slots in the coefficient vector.

NOTE: We number the \(\beta\)’s starting at \(\beta_0\) so be careful when creating the vector for gmodels::estimable. The slot in the vector for say, \(\beta_2\), is the 3rd slot, not the 2nd.

To check that the table is correct, we can compare the results to what we obtained previously using predict.

# Check that our table is correct... This should match preddat above
rbind(gmodels::estimable(fit.cat.cat, c(1, 0, 0, 0, 0, 0), conf.int = 0.95), # Male, Never
      gmodels::estimable(fit.cat.cat, c(1, 0, 1, 0, 0, 0), conf.int = 0.95), # Male, Past
      gmodels::estimable(fit.cat.cat, c(1, 0, 0, 1, 0, 0), conf.int = 0.95), # Male, Current
      gmodels::estimable(fit.cat.cat, c(1, 1, 0, 0, 0, 0), conf.int = 0.95), # Female, Never
      gmodels::estimable(fit.cat.cat, c(1, 1, 1, 0, 1, 0), conf.int = 0.95), # Female, Past
      gmodels::estimable(fit.cat.cat, c(1, 1, 0, 1, 0, 1), conf.int = 0.95)) # Female, Current
##               Estimate Std. Error t value  DF Pr(>|t|)
## (1 0 0 0 0 0)    127.3      1.308   97.32 983        0
## (1 0 1 0 0 0)    129.2      1.558   82.93 983        0
## (1 0 0 1 0 0)    127.1      1.905   66.74 983        0
## (1 1 0 0 0 0)    124.0      1.073  115.53 983        0
## (1 1 1 0 1 0)    133.6      2.176   61.39 983        0
## (1 1 0 1 0 1)    124.4      2.260   55.03 983        0
##               Lower.CI Upper.CI
## (1 0 0 0 0 0)    124.7    129.9
## (1 0 1 0 0 0)    126.1    132.3
## (1 0 0 1 0 0)    123.4    130.9
## (1 1 0 0 0 0)    121.9    126.1
## (1 1 1 0 1 0)    129.3    137.9
## (1 1 0 1 0 1)    120.0    128.8

It matches, so we can move forward with estimating additional contrasts of interest. To compare two levels, copy and paste their “Vector” values from the table above into gmodels::estimable. For example, to compare females and males among past smokers, subtract the vectors in the corresponding rows.

# Female Past - Male Past
gmodels::estimable(fit.cat.cat, c(1, 1, 1, 0, 1, 0) - c(1, 0, 1, 0, 0, 0))
##               Estimate Std. Error t value  DF Pr(>|t|)
## (0 1 0 0 1 0)    4.385      2.676   1.639 983   0.1016

Thus, we see that although the interaction was significant overall, this specific contrast is not statistically significant (Est = 4.39; p = .102).

6.7.1.3 Continuous \(\times\) continuous

An interaction between two continuous variables also cannot be completely visualized as a pair of regression lines. Since each variable is continuous, the complete visualization would actually be a response surface. But you can plot a regression line for one variable at a set of specific values of the other. Unlike with a continuous \(\times\) categorical interaction, there are an infinite number of specific values to choose from, so we will just choose a few across a range of possible values.

Example 5: Among females in NHANES, regress SBP on the continuous variables BMI and age, along with their interaction, and interpret the regression coefficients. Center age at 40 years and BMI at 30 kg/m2.

# Center variables, filter to females
females <- nhanes %>% 
  mutate(cRIDAGEYR = RIDAGEYR - 40,
         cBMXBMI   = BMXBMI   - 30) %>% 
  filter(RIAGENDR == "Female")

# Fit model
fit.cont.cont <- lm(BPXSY1 ~ cRIDAGEYR + cBMXBMI + cRIDAGEYR:cBMXBMI, data = females)
round(summary(fit.cont.cont)$coef, 4)
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)       119.3985     0.8254 144.652    0.000
## cRIDAGEYR           0.6718     0.0400  16.798    0.000
## cBMXBMI             0.4905     0.0984   4.984    0.000
## cRIDAGEYR:cBMXBMI  -0.0142     0.0054  -2.622    0.009

The regression equation is:

\[SBP = \beta_0 + \beta_1 (Age-40) + \beta_2 (BMI-30) + \beta_3 (Age-40) \times (BMI-30) + \epsilon\] and the interpretation of each parameter is as follows:

  • \(\beta_0\) is the mean SBP among those Age = 40 years with BMI = 30 kg/m2. (Recall, the intercept is the mean outcome when all the predictors are at 0).
  • \(\beta_1\) is the age effect among those with BMI = 30 kg/m2. Because there is an Age \(\times\) BMI interaction, the “age effect” varies with BMI.
  • \(\beta_2\) is the BMI effect among those age = 40 years. Because there is an Age \(\times\) BMI interaction, the “BMI effect” varies with age.
  • \(\beta_3\) is the interaction effect. It represents the difference in the age effect for every unit difference in BMI away from 30 kg/m2, and the difference in the BMI effect for every year different in age away from 40 years.
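
As a numeric sketch of how these effects vary, using the coefficient names from the fit.cont.cont output above:

# Age slope at BMI = 23, 28, and 33 kg/m2: beta1 + beta3 * (BMI - 30)
b <- coef(fit.cont.cont)
b["cRIDAGEYR"] + b["cRIDAGEYR:cBMXBMI"] * (c(23, 28, 33) - 30)
# BMI slope at ages 30, 40, and 50 years: beta2 + beta3 * (Age - 40)
b["cBMXBMI"]   + b["cRIDAGEYR:cBMXBMI"] * (c(30, 40, 50) - 40)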

To visualize this fully would require a 3-dimensional plot. However, we can pick a few specific values of one variable and plot the regression lines for the other variable at those levels. This is what we did for continuous \(\times\) categorical interactions but in that case there was a finite number of categorical levels; here, there are an infinite number of values you can pick from – just pick a few interesting levels. Let’s do the following:

  • Plot regression lines for SBP vs. BMI at ages 30, 40 and 50 years
  • Plot regression lines for SBP vs. Age at BMI 23, 28, and 33 kg/m2

Once again, we can use predict to get the fitted values over a range for use in plotting.

# Create "preddat1"
# Regression lines for SBP vs. BMI at ages 30, 40 and 50 years
# Get predicted values over a range for plotting
BMI <- seq(min(females$BMXBMI, na.rm = T), max(females$BMXBMI, na.rm = T), by = 1)
# Center X just like the predictor
cBMI <- BMI - 30
# Data frame of values at which to predict
preddat1 <- data.frame(cBMXBMI   = rep(cBMI, 3), # Repeat for 3 ages
                       # Each age (30,40,50) is centered at 40
                       cRIDAGEYR = c(rep(30 - 40, length(cBMI)),
                                     rep(40 - 40, length(cBMI)),
                                     rep(50 - 40, length(cBMI))))
# Add columns for the predicted values
preddat1$pred <- predict(fit.cont.cont, preddat1)

# Create "preddat2"
# Regression lines for SBP vs. Age at BMI 23, 28, and 33 kg/m2
# Get predicted values over a range for plotting
AGE <- seq(min(females$RIDAGEYR, na.rm = T), max(females$RIDAGEYR, na.rm = T), by = 1)
# Center X just like the predictor
cAGE <- AGE - 40
# Data frame of values at which to predict
preddat2 <- data.frame(cRIDAGEYR   = rep(cAGE, 3), # Repeat for 3 BMIs
                       # Each BMI (23,28,33) is centered at 30
                       cBMXBMI = c(rep(23 - 30, length(cAGE)),
                                   rep(28 - 30, length(cAGE)),
                                   rep(33 - 30, length(cAGE))))
# Add columns for the predicted values
preddat2$pred <- predict(fit.cont.cont, preddat2)
# Regression lines for SBP vs. BMI at ages 30, 40 and 50 years
# Plot the raw X variable, not the centered one!
plot(nhanes$BMXBMI, nhanes$BPXSY1, col = "lightgray",
     xlab = expression(BMI (kg/m^2)),
     ylab = "SBP (mmHg)",
     las = 1, pch = 20, font.lab = 2, font.axis = 2)
# Add lines to the plot
SUB1 <- preddat1$cRIDAGEYR == (30 - 40)
SUB2 <- preddat1$cRIDAGEYR == (40 - 40)
SUB3 <- preddat1$cRIDAGEYR == (50 - 40)
lines(BMI, preddat1$pred[SUB1], lwd = 2, lty = 1, col = "blue")
lines(BMI, preddat1$pred[SUB2], lwd = 2, lty = 2, col = "magenta")
lines(BMI, preddat1$pred[SUB3], lwd = 2, lty = 3, col = "brown")
# After looking at the plot, make the order of col and lty match the order
# of the lines from top to bottom
legend(50, 220, c("50", "40", "30"), title = "Age (years)", cex = 0.9,
       lwd=2, col=c("brown", "magenta", "blue"), lty=c(3,2,1), bty="n")

# Regression lines for SBP vs. Age at BMI 23, 28, and 33 kg/m2
# Plot the raw X variable, not the centered one!
plot(nhanes$RIDAGEYR, nhanes$BPXSY1, col = "lightgray",
     xlab = "Age (years)",
     ylab = "SBP (mmHg)",
     las = 1, pch = 20, font.lab = 2, font.axis = 2)
# Add lines to the plot
SUB1 <- preddat2$cBMXBMI == (23 - 30)
SUB2 <- preddat2$cBMXBMI == (28 - 30)
SUB3 <- preddat2$cBMXBMI == (33 - 30)
lines(AGE, preddat2$pred[SUB1], lwd = 2, lty = 1, col = "blue")
lines(AGE, preddat2$pred[SUB2], lwd = 2, lty = 2, col = "magenta")
lines(AGE, preddat2$pred[SUB3], lwd = 2, lty = 3, col = "brown")
# After looking at the plot, make the order of col and lty match the order
# of the lines from top to bottom
legend(20, 220, c("33", "28", "23"), title = expression(BMI (kg/m^2)), cex = 0.9,
       lwd=2, col=c("brown", "magenta", "blue"), lty=c(3,2,1), bty="n")

Recall that the regression coefficient table was:

round(summary(fit.cont.cont)$coef, 4)
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)       119.3985     0.8254 144.652    0.000
## cRIDAGEYR           0.6718     0.0400  16.798    0.000
## cBMXBMI             0.4905     0.0984   4.984    0.000
## cRIDAGEYR:cBMXBMI  -0.0142     0.0054  -2.622    0.009

We can re-interpret the numbers in this table in light of the figures above.

  • The coefficient for cBMXBMI is 0.4905. This is the slope for SBP vs. BMI when Age = 40 years (the value at which we centered age) and corresponds to the slope of the dashed (middle) line in the left panel.
  • The interaction term is negative (-0.0142), which means that the slope is smaller among those who are older, as can be seen by the fact that the lines in the left panel become less steep as age increases.
  • The coefficient for cRIDAGEYR is 0.6718. This is the slope for SBP vs. Age when BMI = 30 kg/m2 (the value at which we centered BMI). This does not correspond to any of the lines in the right panel because we did not plot a line at BMI = 30, but if we did, it would fall between the dashed (BMI = 28) and dotted (BMI = 33) lines.
  • The negative interaction term implies that the slope is smaller among those who have higher BMI, as can be seen by the fact that the lines corresponding to higher BMI values are less steep.
  • The interaction term is statistically significant (p = .009). Thus, we can reject the null hypothesis that the lines are parallel, and conclude that the age effect varies with BMI, and the BMI effect varies with age.
  • Differences in SBP between those of different ages are greater among those with lower BMI (the lines in the left panel are further apart for smaller BMI values).
  • Differences in SBP between those with different BMI are greater among those who are younger (the lines in the right panel are further apart for younger ages).

6.7.2 When to include an interaction

There may be quite a large number of possible interactions you could include in a regression model. If there are 5 predictors, for example, there are 10 possible two-way interactions! How do you know whether or not to include them? The best approach is to use subject-matter knowledge to a priori decide which interactions to include and then leave them in the model regardless of their statistical significance.

You can spend (waste) a lot of time investigating possible moderators by testing the significance of many possible interaction terms. To avoid wasting time and obtaining spurious results due to data dredging, think about possible interactions before you fit the model and include those that make sense from a subject-matter perspective. Typically, you will only need to include interactions between pairs of variables (two-way interactions). Three-way and higher-order interactions become very complex to interpret and should only be used if they answer a pre-specified question of interest to the researcher (Harrell 2015, p37).

6.7.3 Test of interaction

When the interaction involves just one term in the model (one row in the regression coefficient table) you can get the p-value for the test of significance for the interaction by looking at the regression table. When at least one of the terms in the interaction is a categorical variable with more than two levels, the interaction will consist of multiple terms. For example:

fit.mult.df.int <- lm(BPXSY1 ~ cBMXBMI + smoker + cBMXBMI:smoker, data = nhanes)
round(summary(fit.mult.df.int)$coef, 4)
##                       Estimate Std. Error t value
## (Intercept)           122.8381     0.9705 126.567
## cBMXBMI                 0.4910     0.1185   4.144
## smokerPast              6.3461     1.8780   3.379
## smokerCurrent           2.3190     1.8916   1.226
## cBMXBMI:smokerPast     -0.3079     0.2121  -1.452
## cBMXBMI:smokerCurrent  -0.3325     0.2245  -1.481
##                       Pr(>|t|)
## (Intercept)             0.0000
## cBMXBMI                 0.0000
## smokerPast              0.0008
## smokerCurrent           0.2205
## cBMXBMI:smokerPast      0.1469
## cBMXBMI:smokerCurrent   0.1388

Recall that to test the main effect for a categorical variable with more than two levels, we used car::Anova() to carry out the multiple df test. We can do the same for a multiple df interaction:

car::Anova(fit.mult.df.int, type = 3)
## Anova Table (Type III tests)
## 
## Response: BPXSY1
##                 Sum Sq  Df  F value
## (Intercept)    5903315   1 16019.16
## cBMXBMI           6328   1    17.17
## smoker            4251   2     5.77
## cBMXBMI:smoker    1224   2     1.66
## Residuals       356355 967         
##                              Pr(>F)    
## (Intercept)    < 0.0000000000000002 ***
## cBMXBMI                    0.000037 ***
## smoker                       0.0032 ** 
## cBMXBMI:smoker               0.1904    
## Residuals                              
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We conclude for this example that there is not a significant BMI \(\times\) smoking interaction.

6.7.4 Overall test of a predictor involved in an interaction

Above, we tested the significance of the interaction, which is a test of the null hypothesis that the predictor effect does not vary with the other variable. A test of interaction answers the question “does the effect of BMI on SBP depend on gender?” But what if we want to answer the question “Is BMI significantly associated with SBP?” In this case, we would like to compare the full model (with an interaction) to the model that does not have BMI at all - that is, the model that excludes both the BMI main effect and the BMI \(\times\) gender interaction. We can accomplish this using anova, which allows us to compare any two nested models (nested models are those for which the predictors in one are a subset of the predictors in the other).

If we fit the full model with lm, then we need to fit the reduced model (making sure that it is fit to the same set of cases) before we use anova. Since the full model excluded all cases with missing SBP, BMI, or gender, and the reduced model does not have BMI in it, we need to explicitly exclude those with missing BMI using a subset statement.

# Fit reduced model, excluding cases with missing BMI
# so fit0 and fit2 are fit to the same set of cases
fit0 <- lm(BPXSY1 ~  RIAGENDR, data = nhanes, subset = !is.na(cBMXBMI))
# Fit full model (fit2 fit previously)
# Compare
anova(fit0, fit2)
## Analysis of Variance Table
## 
## Model 1: BPXSY1 ~ RIAGENDR
## Model 2: BPXSY1 ~ cBMXBMI + RIAGENDR + cBMXBMI:RIAGENDR
##   Res.Df    RSS Df Sum of Sq    F   Pr(>F)    
## 1    971 366670                               
## 2    969 358439  2      8231 11.1 0.000017 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus, we conclude that the model with both BMI and a BMI \(\times\) gender interaction is significantly better than the model with just gender (F[2,969] = 11.13, p < .001).

6.8 Predictions

How do we compute predictions from an MLR model? The same way we did for SLR – plug the predictor values into the equation – except that in MLR there are more predictor values you have to specify. Going back to Example 2, let’s estimate the predicted mean SBP at the following values of smoking status, age, gender, race, education, and income:

  • Smoker = Current
  • Age = 40 years
  • Gender = Male
  • Race = Non-Hispanic Black
  • Education = College graduate
  • Income = $55,000+

The predicted value would be the intercept + coefficient for “Current” + coefficient for “Age” × 40 + coefficient for Male (which is 0 because it is the reference level) + coefficient for “Non-Hispanic Black” + coefficient for “College graduate” + coefficient for “$55,000+.” Using predict(), we get the following:

# The function requires a dataframe with values of each predictor
# in the correct form for that predictor. If needed, for a factor variable x,
# see levels(x) to get the right spelling of each level.
predict(fit.ex2.adj, data.frame(smoker   = "Current",
                         RIDAGEYR = 40,
                         RIAGENDR = "Male",
                         RIDRETH3 = "Non-Hispanic Black",
                         DMDEDUC2 = "College grad",
                         income   = "$55,000+"),
        interval = "confidence")
##     fit   lwr   upr
## 1 123.8 119.1 128.4
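As a check, the same fitted value can be reproduced by hand from the estimated coefficients using coef(). This is only a sketch; the coefficient names are the ones R assigns for this model (they also appear in the confint() output in the next section).

b <- coef(fit.ex2.adj)
b["(Intercept)"] + b["smokerCurrent"] + 40 * b["RIDAGEYR"] +
  b["RIDRETH3Non-Hispanic Black"] + b["DMDEDUC2College grad"] + b["income$55,000+"]
# The sum should reproduce the fitted value from predict() above (123.8)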

6.9 Confidence intervals, confidence bands, prediction intervals

As with SLR, there are three kinds of confidence intervals we might be interested in:

  • Confidence intervals for the regression coefficients (CI for \(\beta\))
  • Confidence interval for the mean outcome (CI for \(E(Y|X=x)\))
  • Confidence interval for a future observation (prediction interval, or CI for \(Y|X=x\))

The syntax for these is very similar to SLR, just with more predictors.

# CI for regression coefficients
confint(fit.ex2.adj)
##                              2.5 %     97.5 %
## (Intercept)                92.8537 106.081649
## smokerPast                 -4.9942   1.005449
## smokerCurrent              -5.5155   1.238504
## RIDAGEYR                    0.4542   0.597948
## RIAGENDRFemale             -4.0808   0.822705
## RIDRETH3Other Hispanic     -4.3173   6.243171
## RIDRETH3Non-Hispanic White -4.8499   3.103371
## RIDRETH3Non-Hispanic Black  3.2270  11.777986
## RIDRETH3Non-Hispanic Asian -7.7177   1.795133
## RIDRETH3Other/Multi        -4.1494   7.721508
## DMDEDUC29-11                0.9119  12.656993
## DMDEDUC2HS/GED             -2.9399   7.703526
## DMDEDUC2Some college/AA    -2.1494   8.452327
## DMDEDUC2College grad       -4.6020   6.834050
## income$25,000 to < $55,000 -3.4695   2.853533
## income$55,000+             -6.4815   0.000141
# CI for the mean outcome
predict(fit.ex2.adj, data.frame(smoker   = "Current",
                                RIDAGEYR = 40,
                                RIAGENDR = "Male",
                                RIDRETH3 = "Non-Hispanic Black",
                                DMDEDUC2 = "College grad",
                                income   = "$55,000+"),
        interval = "confidence")
##     fit   lwr   upr
## 1 123.8 119.1 128.4
# CI for outcome for an individual
predict(fit.ex2.adj, data.frame(smoker   = "Current",
                                RIDAGEYR = 40,
                                RIAGENDR = "Male",
                                RIDRETH3 = "Non-Hispanic Black",
                                DMDEDUC2 = "College grad",
                                income   = "$55,000+"),
        interval = "prediction")
##     fit  lwr   upr
## 1 123.8 89.8 157.7

As we discussed previously, the predicted value is the same whether we are estimating the mean outcome or an individual’s outcome, but the interval for an individual’s outcome will always be wider than the CI for the mean outcome. The prediction interval is the interval in which we expect a future single observation to lie, and single observations are more variable than the mean of many observations.

6.10 Diagnostics

As briefly mentioned in Section 5.8, the validity of the regression coefficient estimates, confidence intervals, and significance tests based on a regression model depends on a number of assumptions.

Think about where the regression results are assumed to come from. We assume we have a data set of independent cases, each with an outcome and predictor values. We hypothesize that there is a specific relationship between the outcome and the predictors described by the linear regression equation - the outcome is approximately the sum of the predictors each multiplied by some number. Approximately, but not exactly - that is where the error term \((\epsilon)\) comes into play. It captures the influence of all the other variables that we have not been able to include in our model, either because we do not have the data or because we do not know they need to be in the model, as well as the fact that there is random variation in the outcome even among cases with all the same predictor values.

We assumed that this error term has a normal distribution with a variance that does not vary between cases. All the subsequent results - the estimate of the regression coefficient for each predictor, its confidence interval, its p-value - are derived based on formulas that make use of these assumptions. So if the assumptions are not valid, our conclusions may be invalid.

To make matters worse, even when the assumptions are generally met, there might be a few individual pesky observations that do not want to play by the rules. They might be outliers, points that fall far from the regression line, or influential observations, cases that, if removed, would result in large changes in the regression coefficient estimates. It’s not encouraging to find out that our results change drastically when one or a few observations are removed! So, in addition to checking assumptions, we need diagnostic tools to look for outliers and influential observations.

Finally, there is an issue that is not present in SLR that we have to consider in MLR - collinearity. This issue has to do with how correlated the predictors are with each other. In the extreme case, if two predictors are completely redundant, the equation that is being solved when fitting the model has no unique solution - you cannot estimate a coefficient defined as “the effect of this predictor while holding all other predictors fixed” if there is another predictor that is perfectly correlated with it! Even when the redundancy is only partial, correlations between predictors can cause our regression results to be unstable.

For each of these issues, we will define the problem, describe how to diagnose the problem, and provide possible solutions if the diagnosis is not favorable.

6.10.1 Residuals

Recall that the regression model,

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon \]

has an error term, \(\epsilon\), at the end. When we fit the model, we get the following formula for the mean outcome, or the expected value of the outcome:

\[ E(Y|X=x) = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \ldots + \hat\beta_p x_p .\]

The \(\hat{ }\) on top of each \(\beta\) signifies that these are estimates of the population regression coefficients. These are our best guess given the data. Notice that there is no error term in this second equation. Every individual with the same set of predictor values has the same prediction for the mean outcome. But, of course, not every individual has the same observed value of the outcome. For each individual, the difference between the observed outcome and the predicted mean outcome is called the residual and is the observed value of the error term for that individual. In SLR, the residual can be visualized as the vertical distance from an observation to the line.

More formally, the prediction for the \(i^{th}\) observation, also called the fitted value, is denoted by \(\hat{y_i}\). The unstandardized residual is the difference between the observed outcome and the fitted value. For example, if \(\hat\beta_0\) and \(\hat\beta_1\) are the estimates of the intercept \(\beta_0\) and slope \(\beta_1\) in a simple linear regression with a continuous predictor, then the residual \(e_i\) for the \(i^{th}\) observation is

\[e_i = y_i - \hat{y_i} = y_i - (\hat\beta_0 + \hat\beta_1 x_i).\] If the model is correct, the residuals are realizations of the random error term \(\epsilon\) and are \(N(0, \sigma^2)\) (normally distributed with mean 0 and variance \(\sigma^2\)). Visually, the residual is the vertical distance from the observed value to the regression line. The figure below displays the residuals for two observations. Points below the line have a negative residual, and those above the line have a positive residual.

Figure 6.1: What is a residual?

6.10.2 Computing residuals

Many of the diagnostic tools we will use will automatically compute residuals for you, but sometimes you need to compute them yourself. The basic definition of a residual given above is for what are called unstandardized residuals - these are the raw vertical distances from the observed values to the regression line, in the case of SLR, or more generally the difference between the observed outcome and the predicted value for each case. Unstandardized residuals are appropriate if you want to examine the residuals on the same scale as the outcome. However, if you want to compare their magnitude to some objective standard, then you need to standardize them.

Standardized residuals are the residuals divided by an estimate of their standard deviation, the result being that they have a standard deviation very close to 1. Studentized residuals are similar to standardized residuals except that for each case the residual is divided by the standard deviation estimated from the regression with that case removed (Fox and Weisberg 2019, 387).
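For reference, one standard way of writing these (a sketch using notation not otherwise used in this chapter: \(h_{ii}\) is the leverage of the \(i^{th}\) case and \(\hat\sigma_{(i)}\) is the residual standard deviation estimated with the \(i^{th}\) case removed) is

\[ r_i = \frac{e_i}{\hat\sigma \sqrt{1 - h_{ii}}} \;\;\text{(standardized)} \qquad \text{and} \qquad t_i = \frac{e_i}{\hat\sigma_{(i)} \sqrt{1 - h_{ii}}} \;\;\text{(Studentized)}. \]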

The residuals() (or resid()), rstandard(), and rstudent() functions compute unstandardized, standardized, and Studentized residuals, respectively. For example,

r1  <- resid(fit.ex2.adj)
r2  <- rstandard(fit.ex2.adj)
r3  <- rstudent(fit.ex2.adj)
# Compare means and standard deviations
COMPARE <- round(data.frame(unstandardized = c(mean(r1), sd(r1)),
                            standardized   = c(mean(r2), sd(r2)),
                            Studentized    = c(mean(r3), sd(r3))), 5)
rownames(COMPARE) <- c("mean", "sd")
COMPARE
##      unstandardized standardized Studentized
## mean           0.00       0.0001     0.00055
## sd            16.97       1.0007     1.00279

All three types of residuals have a mean of approximately 0 (for unstandardized residuals, it will be exactly 0), and standardized and Studentized residuals have a standard deviation that is approximately 1. This makes standardized and Studentized residuals each comparable to an objective standard. As shown below, the distribution of these 3 types of residuals will have approximately the same shape.

# Set margins to make the histograms fill the space better
# Default par(mar=c(5, 4, 4, 2) + 0.1) for c(bottom, left, top, right)
par(mar=c(5, 4, 2.2, 2)) # c(bottom, left, top, right)
par(mfrow=c(3,1))
hist(r1, probability = T, main = "Unstandardized")
hist(r2, probability = T, main = "Standardized")
hist(r3, probability = T, main = "Studentized")

par(mfrow=c(1,1))
par(mar=c(5, 4, 4, 2) + 0.1) # Reset to default

6.10.3 Checking assumptions

6.10.3.1 Independent observations

A linear regression model assumes that each case, or \((x,y)\) pair, is independent of the others. If the cases were drawn from a population using a simple random sample, then this assumption is met. An example where this is violated is if the observations are clustered, such as if there are repeated measures from the same individual, as in longitudinal data, or if households were first sampled followed by sampling individuals within households.

6.10.3.1.1 Impact

A violation of the assumption of independence results in parameter estimates that may have the wrong estimated variance, leading to incorrect confidence intervals and p-values (although they may still be unbiased in some cases) (K. Y. Liang and Zeger 1993).

6.10.3.1.2 Diagnosis

Think about how the data were collected. Are there clusters? For example, are there data from individual patients clustered within hospitals? Or individuals clustered in families or neighborhoods? Are there repeated measures from the same individual? Those are all examples of correlated, dependent, data.

6.10.3.1.3 Potential Solutions

There are a number of methods of handling correlated data. If the correlation is the result of a clustered sampling method, such as is utilized in NHANES (https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx, Accessed July 30, 2021), it is possible to adjust for this feature of the data using methods discussed in Chapter 10. In some cases, the clusters are a simple random sample, but we have multiple observations within a cluster. An example of this is longitudinal data, where individuals are sampled and then measured repeatedly over time. Methods for handling longitudinal (or similarly clustered) data include linear mixed models (Laird and Ware 1982; Fitzmaurice, Laird, and Ware 2011) and generalized estimating equations (Kung-Yee Liang and Zeger 1986; Zeger and Liang 1986; Diggle et al. 2002).

6.10.3.2 Linearity

When the predictor is continuous, the model assumes that the average outcome is linearly related to each predictor when holding all others fixed. That is why the plotted simple linear regression line is a straight line - the model assumes it is straight. The underlying population relationship might not be linear, however, as we saw in the analysis of nation-level mortality rate in Section 5.5.

NOTE: The linearity assumption only applies to continuous predictors.

6.10.3.2.1 Impact

If the true model is not linear but you fit a line, then the predictions will be biased. You may indeed get an estimate of the best fitting line, but a line is not the correct model.

6.10.3.2.2 Diagnosis

If there is only one predictor, you can diagnose linearity by looking at the plot of the data with the regression line (see Section 6.4), along with a smoother that relaxes the linearity assumption.

contplot("BPXSY1", "RIDAGEYR", nhanes, ylab = "Systolic BP (mmHg)", xlab = "Age (years)")
Figure 6.2: Scatterplot to check linearity assumption

Sometimes, however, it is easier to judge non-linearity by looking at a plot of the residuals vs. the predictor, which is the same plot except rotated so the regression line is now a horizontal line at 0, using car::residualPlots (Fox and Weisberg 2019).

fit.rp <- lm(BPXSY1 ~ RIDAGEYR, data = nhanes)
car::residualPlots(fit.rp, fitted = F, pch=20, col="gray")
Figure 6.3: Residual vs. predictor to check linearity assumption

##          Test stat Pr(>|Test stat|)   
## RIDAGEYR      2.85           0.0045 **
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Non-linearity can be seen by looking at the shape of the cloud of points around the regression line in Figure 6.2 or, more easily, by looking at the shape of the smoother in Figure 6.3. If linearity is a good assumption, the cloud should not appear to curve around the regression line and the smoother should be close to a horizontal line at 0.

The car::residualPlots output also includes a test of curvature. The row with the same name as the predictor is a test of the significance of a quadratic term being added to the model, which is equivalent to the following:

round(summary(lm(BPXSY1 ~ RIDAGEYR + I(RIDAGEYR^2), data = nhanes))$coef, 6)
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)   110.596612   3.902521 28.3398 0.000000
## RIDAGEYR        0.045937   0.170727  0.2691 0.787935
## I(RIDAGEYR^2)   0.004829   0.001695  2.8500 0.004463

In general, such a test suffers from the same problem as many goodness-of-fit tests: in a large sample, even a small, unimportant deviation from linearity may be statistically significant, while in a small sample the test may lack the power to detect an important deviation from linearity. Whenever possible, it is a good idea to also look at a visualization and not simply rely on a test.

When there are multiple predictors, the linearity assumption requires that the \(Y\) vs. \(X\) relationship be linear when holding all other variables fixed. Thus, we would like a residual plot that takes that into account. We can use the car::crPlots function (Fox and Weisberg 2019) which creates a component + residual plot (CR plot).

As an example, let’s assess the linearity assumption for the continuous predictors in the regression of SBP on age, BMI, height, and smoking status.
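The code below is a sketch of the model fit and CR plots being described here; the object name fit.diagnostics matches its use later in the constant variance section, and the model formula follows the description above.

fit.diagnostics <- lm(BPXSY1 ~ RIDAGEYR + BMXBMI + BMXHT + smoker, data = nhanes)
car::crPlots(fit.diagnostics, pch=20, col="gray")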

For each continuous predictor, the horizontal axis represents the value of the predictor. The vertical axis represents the partial residual which is the unstandardized residual + the contribution to the mean outcome for that individual’s predictor value (\(\hat\beta_j X_{ij}\)). Compare the dashed line (which assumes linearity) to the smooth curve (does not assume linearity) in each continuous predictor’s CR plot. If linearity is a good assumption, the curve should fall near the line.

The plot for smoker can be ignored because there is no linearity assumption for categorical variables. For the others, the solid line (relaxing the linearity assumption) is only a little bit different than the dashed line (linearity assumption), so it may be reasonable to conclude that the linearity assumption is acceptable.

The deviation for age seems the largest… what would we do if we wanted to relax the linearity assumption? A good option when the curvature is U-shaped (or upside-down-U-shaped) is to add a quadratic term for that variable and see how the component+residual plot changes. Note the use of terms to specify that we only want plots for the continuous predictors (smoker was left out of the list).

# Try a quadratic for age
fit.diagnostics2 <- lm(BPXSY1 ~ RIDAGEYR + I(RIDAGEYR^2) + BMXBMI + BMXHT + smoker, data = nhanes)
car::crPlots(fit.diagnostics2, pch=20, col="gray",
             terms = ~ RIDAGEYR + I(RIDAGEYR^2) + BMXBMI + BMXHT)

Notice that now there are two plots for age, one for the linear term and one for the quadratic term. Although it seems counterintuitive since age is now modeled as a quadratic, you still need to check the “linearity” assumption for each age term. Although it does not make sense to interpret each of these terms as the effect holding the others constant (you cannot vary age while holding age\(^2\) constant!), this still works mathematically, and it’s the mathematical assumption of linearity that matters here.

NOTE: If the model contains an interaction, car::crPlots will not run, and if at least one of the variables in the interaction is continuous, the linearity assumption becomes more complicated to check. In that case, diagnose linearity using an added variable (AV) plot (car::avPlots). This is not ideal because that function does not add a smooth line that relaxes the linearity assumption. However, using this method as an approximation, you can still see if the cloud of points looks curved in any way. For example,

fit.diagnostics.int <- lm(BPXSY1 ~ RIDAGEYR + BMXHT + smoker + RIDAGEYR:smoker, data = nhanes)
try(car::crPlots(fit.diagnostics.int, pch=20, col="gray"))
## Error in crPlots.default(fit.diagnostics.int, pch = 20, col = "gray") : 
##   C+R plots not available for models with interactions.
car::avPlots(fit.diagnostics.int, pch=20, col="gray")

6.10.3.2.3 Potential solutions
  • Fit a polynomial curve (see Section 5.5).
  • Transform \(X\) and/or \(Y\) to obtain a linear relationship. Common transformations are logarithm, square root, and inverse (Weisberg 1985, 134).
  • Fit a spline function to fit a smooth curve (not discussed here, but see (Harrell 2015) for more information).
  • Transform \(X\) into a categorical variable (e.g., using tertiles or quartiles of \(X\) as cutoffs). It is generally better to not categorize a continuous predictor as you lose information and the cutoffs are arbitrary, but this method is useful just to see what the trend looks like.
  • Non-linearity may co-occur with non-constant variance and/or non-normality. Sometimes solving one problem fixes others.

Examples (from simulated data) of non-linearity and how to resolve it.

When the \(X\) values are highly skewed with a large range, which often happens with lab measurements or variables such as income or population size, then a log-transformation often works well.

The plots on the top row are scatterplots of the outcome vs. the predictor, before (left panel) and after (right panel) the transformation. The plots on the bottom row are the corresponding CR plots that allow us to assess linearity. In this artificial example, the log transformation perfectly took care of the non-linearity. Despite being simulated data, this illustrates the sort of \(Y\) vs. \(X\) relationship where a log transformation may help.
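A minimal simulated sketch of this situation (the variable names and parameter values below are illustrative only, not from the text):

# Skewed predictor with a large range; outcome is linear in log(x)
set.seed(1)
d <- data.frame(x = exp(rnorm(200, mean = 2, sd = 1)))
d$y <- 3 + 2 * log(d$x) + rnorm(200)
# CR plot before the transformation: the smoother curves away from the line
car::crPlots(lm(y ~ x, data = d), pch = 20, col = "gray")
# CR plot after the transformation: the smoother is close to the line
car::crPlots(lm(y ~ log(x), data = d), pch = 20, col = "gray")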

When the \(X\) values are skewed with a moderate range a square-root transformation often works well.

When the \(X\) values are bunched up near zero but with some larger values, as well, then an inverse transformation often works well.

NOTE: When using a transformation, you may inadvertently create some missing values! For example, the logarithmic function is only defined for positive values. If a variable has negative values, do not use a log-transformation. If the only non-positive values are zeros, it is common to add 1 before transforming. For example, \(\ln(X+1)\) or \(\ln(Y+1)\). If the values of the raw variable are all very small or large, then instead of adding 1, add a number on their scale (e.g., 0.01, 0.10, 10, 100). Similarly, do not use a square root transformation if there are negative values and avoid the inverse transformation if there are zero values or a mix of positive and negative numbers (the inverse is defined in this case but the resulting values will no longer be in the same rank ordering).
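As a quick illustration of why non-positive values are a problem (a minimal sketch):

log(c(2, 0, -1))
# Returns 0.693..., -Inf, and NaN (with a warning); -Inf and NaN values either
# cause lm() to fail or are dropped from the model fit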

NOTE: The “linearity assumption” is not the same as the term “linear” in “linear regression.” There are many applications of SLR where you can fit something other than a straight line, as we did in Section 5.5 when we fit a polynomial, or as we do when we transform the variables in other ways. The “linear” in “linear regression” refers to the outcome being linear in the parameters, the \(\beta\)s. The mathematics behind fitting a regression involves taking the derivative of the sum of squared residuals with respect to the parameters; when the model is linear in the parameters, the resulting equations can be solved exactly. When the model is non-linear in the parameters, the computation requires more complex, iterative algorithms.

When we used a polynomial curve for a single predictor, we could still use “linear” regression because the polynomial is a sum of terms, each a coefficient multiplied by a function of \(X\). In general, you can use linear regression to fit any non-linear function of a predictor \(X\) that can be expressed as a sum of coefficients multiplied by functions of \(X\), as long as no coefficients appear inside those functions. For example,

\[\begin{array}{lrcl} \text{Linear regression:} & Y & = & \beta_0 + \beta_1 (X-\bar{X}) + \beta_2 (X-\bar{X})^2 + \epsilon \\ \text{Linear regression:} & Y & = & \beta_0 + \beta_1 \log(X) + \epsilon \\ \text{Not linear regression:} & Y & = & \beta_0 + \frac{\beta_1 X}{\beta_2 + \beta_3 (X+1)} + \epsilon \\ \text{Not linear regression:} & Y & = & \beta_0 e^{\beta_1 X} \epsilon \end{array}\]

To fit the third model above, you would have to use non-linear regression (CITATION). Some non-linear models can be converted into linear models; for example, a log-transformation of \(Y\) will turn the fourth model into the linear model \(Y^* = \beta_0^* + \beta_1 X + \epsilon^*\) where \(Y^* = \log(Y)\), \(\beta_0^* = \log(\beta_0)\) and \(\epsilon^* = \log(\epsilon)\).
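For the fourth model, the conversion is just a matter of taking the log of both sides:

\[ \log(Y) = \log\left(\beta_0 e^{\beta_1 X} \epsilon\right) = \log(\beta_0) + \beta_1 X + \log(\epsilon) = \beta_0^* + \beta_1 X + \epsilon^*. \]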

6.10.3.3 Normality of residuals

Earlier in this chapter we stated that \(\epsilon \sim N(0, \sigma^2)\) which is, in part, shorthand for “the residuals are normally distributed around the mean outcome and the average of the residuals is zero.” This implies that at any given \(X\) the distribution of \(Y\) given \(X\) should be normal. Let’s look at this in more detail.

6.10.3.3.1 Impact

A violation of the assumption of normality of residuals results in incorrect inferences. Both confidence intervals and p-values rely on the normality assumption, so if it is not valid then these will be inaccurate.

6.10.3.3.2 Diagnosis

We learned the definition of a residual using Figure 6.1 which showed two individual residuals. If we collect all the residuals for all the observations, we can plot them in a histogram. The distribution of these residuals should be approximately normal.

RESID <- resid(fit_resid_defn)
# Histogram of residuals
hist(RESID, xlab = "Residuals", main = "", probability = T)
# Superimpose empirical density (no normality assumption, for comparison)
lines(density(RESID, na.rm=T), lwd = 2, col = "red")
# Superimpose best fitting normal curve
curve(dnorm(x, mean = mean(RESID, na.rm=T), sd = sd(RESID, na.rm=T)),
      lty = 2, lwd = 2, add = TRUE, col = "blue")

The solid line provides a smoothed estimate of the distribution of the observed residuals. The dashed line is the best fit normal curve. The closer these two curves are to each other, the better the normality assumption is met.

NOTE: A regression model does not assume that either the outcome (\(Y\)) or predictor (\(X\)) are normally distributed. Rather, it assumes that the residuals are normally distributed. It is often the case, however, that the normality or non-normality of the outcome (\(Y\)) does correspond to that of the residuals.

6.10.3.3.3 Potential solutions
  • Transform the outcome \(Y\). Common transformations are logarithm, square root, and inverse (Weisberg 1985, 134).
  • If the non-normality is due to outliers (see Section 6.10.5), then perform a sensitivity analysis (fit the model with and without the outliers and see what changes). We will discuss this more in Chapter 9.
  • Bootstrap (see Chapter 9)
  • Non-normality may co-occur with non-constant variance and/or non-linearity. Sometimes solving one problem fixes others.

Public Health Examples: Continuous variables that are highly skewed are common in public health. Any variable derived as the sum of multiple indicators, for example, will have some zero values with the remaining values positive, which often leads to a skewed distribution. In (Rogers et al. 2021), researchers log-transformed the skewed outcome variable frailty index (FI), derived as the sum of 34 indicators across a number of health domains, before fitting a linear regression studying the association between childhood socioeconomic status and frailty. Due to the presence of zero values, they used the transformation \(\log(FI + 0.01)\). Other examples of skewed variables include population size, income, and costs. For example, in (Sakai-Bizmark et al. 2021), researchers log-transformed cost of hospitalization in a linear regression investigating the association between homelessness and health outcomes in youth.

Below, we will check the normality assumption for the regression of fasting insulin on age and try an alternative to see if it improves the fit.

fit_insulin <- lm(LBXIN ~ RIDAGEYR, data = nhanes)
RESID <- resid(fit_insulin)
hist(RESID, xlab = "Residuals", main = "", probability = T)
lines(density(RESID, na.rm=T), lwd = 2, col = "red")
curve(dnorm(x, mean = mean(RESID, na.rm=T), sd = sd(RESID, na.rm=T)),
      lty = 2, lwd = 2, add = TRUE, col = "blue")

These residuals do not look normally distributed. They are skewed to the right and very often a natural log transformation works in this situation to correct the non-normality. Before using a log transformation, let’s check for zeros and negative numbers.

# Check for zeros
summary(nhanes$LBXIN)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.7     6.4    10.6    17.1    17.6   267.2 
##    NA's 
##     689

All the values are positive, so we can transform the outcome using \(\log(Y)\) rather than \(\log(Y+1)\).

fit_insulin_log <- lm(log(LBXIN) ~ RIDAGEYR, data = nhanes)
RESID <- resid(fit_insulin_log)
hist(RESID, xlab = "Residuals", main = "", probability = T)
lines(density(RESID, na.rm=T), lwd = 2, col = "red")
curve(dnorm(x, mean = mean(RESID, na.rm=T), sd = sd(RESID, na.rm=T)),
      lty = 2, lwd = 2, add = TRUE, col = "blue")

The observed distribution of the residuals (solid line) matches the normal density (dashed line) much better after the transformation.

6.10.3.4 Constant variance (homoscedasticity)

Not only does the assumption \(\epsilon \sim N(0, \sigma^2)\) imply normality of residuals, it also implies that the variance of the residuals (\(\sigma^2\)) does not depend on any of the predictors. In SLR, the residuals should have about the same amount of variation everywhere around the regression line. In MLR, you can visualize this as the residuals having about the same amount of variation around each regression line in the added variable plots introduced earlier in this chapter. This is the “constant variance” or “homoscedasticity” assumption.

6.10.3.4.1 Impact

A violation of the assumption of constant variance results in incorrect inferences. Both confidence intervals and p-values rely on the constant variance assumption, so if it is not valid then these will be inaccurate.

6.10.3.4.2 Diagnosis

Non-constant variance can be spotted in plots of residuals vs. predictors or residuals vs. fitted values. The spread of the points in the vertical direction should not vary much as you move in the horizontal direction.

Let’s examine the constant variance assumption for the model we fit when we checked linearity.

# This will show all the residuals vs. predictor and residual vs. fitted plots
car::residualPlots(fit.diagnostics, fitted = T, tests = F, pch=20, col="darkgray")

# Just vs. predictors
# car::residualPlots(fit.diagnostics, fitted = F, tests = F, pch=20, col="darkgray")
# Just vs. fitted
# car::residualPlots(fit.diagnostics, fitted = T, terms = ~ 1, tests = F, pch=20, col="darkgray")

If the constant variance assumption is met, the spread of the points in the vertical direction should not vary much as you move in the horizontal direction. Looking at the plot vs. fitted values, it looks like the variance is larger for observations with larger fitted values (larger estimated mean outcome).

6.10.3.4.3 Potential solutions
  • Transform \(Y\) (“variance stabilizing transformation”). Common transformations are logarithm, square root, and inverse (Weisberg 1985, 134).
  • Weighted least squares
  • Bootstrap (see Chapter 9)
  • Non-constant variance may co-occur with non-linearity and/or non-normality. Sometimes solving one problem fixes others.

Below, we will check the constant variance assumption for the regression of fasting insulin on age and see if a log transformation improves the fit. This is the same solution we used to correct for non-normality.

car::residualPlots(fit_insulin,     main = "Insulin ~ Age", tests = F)

car::residualPlots(fit_insulin_log, main = "log(Insulin) ~ Age", tests = F)

Before the transformation, the variance of insulin seems to increase with age. Also, the insulin values are very skewed, with many small values and fewer very large values, and this typically leads to skewed residuals and non-normality. This is a situation where a log-transformation often works to stabilize the variance (as well as take care of non-normality of residuals).

Below are some examples (from simulated data) of non-constant variance and how to resolve it.

When the \(Y\) values are highly skewed with a large range, which often happens with measurement data or variables such as income or population, then a log-transformation often works well to stabilize the variance.

The plots on the top row are scatterplots of the outcome vs. the predictor, before (left panel) and after (right panel) the transformation. The plots on the bottom row are the corresponding residual vs. fitted plots that allow us to assess constant variance. In this artificial example, the log transformation perfectly took care of the non-constant variance. Despite being simulated data, this illustrates the sort of \(Y\) vs. \(X\) relationship where a log transformation may help.
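A minimal simulated sketch of this situation (illustrative variable names and parameter values, not from the text); note that, as mentioned in the list of potential solutions above, the non-constant variance here co-occurs with non-linearity, and the log transformation addresses both:

# Outcome whose spread grows with x
set.seed(2)
d <- data.frame(x = runif(200, 1, 10))
d$y <- exp(0.3 * d$x + rnorm(200, sd = 0.4))
# Before: residuals fan out as the fitted values increase (and show curvature)
car::residualPlots(lm(y ~ x, data = d), tests = F)
# After log-transforming the outcome: roughly constant spread
car::residualPlots(lm(log(y) ~ x, data = d), tests = F)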

When the variance of \(Y\) increases with \(X\) (the points are like a cone opening to the right), a square-root transformation may stabilize the variance.

When the variance of \(Y\) decreases with \(X\) (the points are like a cone opening to the left), try an inverse transformation.

6.10.4 Collinearity

An important step in the process of fitting a MLR is checking for collinearity between predictor variables. This step was not needed in SLR because there was only one predictor.

Here’s an extreme example which helps illustrate the concept of collinearity. Suppose you want to know the relationship between systolic blood pressure and weight, adjusted for age, and you fit a regression model that includes both weight in kg and weight in pounds in the model. Clearly, these two variables are redundant. With either one in the regression model, the other adds no new information. That is called perfect collinearity and it is mathematically impossible to include both in the model – you cannot estimate the regression coefficients for both.

But you can also have approximate collinearity. Suppose instead of weight in kg and weight in pounds, you have weight in kg and body mass index (BMI = weight/height\(^2\), where weight is in kg and height is in meters). While weight and BMI are not exactly redundant, they are highly related. In this case, it is mathematically possible to include both in the model, but their regression coefficients are difficult to estimate accurately (they will have large variances) and difficult to interpret. The regression coefficient for weight would be interpreted as the effect of a 1-unit difference in weight while holding BMI constant, which could only happen if height also changed - so the weight effect becomes entangled with a height effect! In general, when two or more variables are collinear, the interpretation of each while holding the others fixed becomes questionable.

NOTE: We are not talking about the outcome \(Y\) here. We are only concerned with collinearity among predictors.

Two predictors are perfectly collinear if their correlation is -1 or 1. The term “collinear” comes from the fact that one predictor (\(X_1\)) can be written as a linear function of the other (\(X_2\)) as \(X_1 = a + b X_2\). If you were to plot the pairs of points \((x_{1i}, x_{2i})\) for all the cases, they would fall exactly on a line.

Examples of perfect collinearity include:

  • If two variables are exactly the same, then \(X_1 = X_2\).
  • If one variable is twice the other, then \(X_1 = 2 X_2\).
  • If one is 3 less than twice the other, then \(X_1 = -3 + 2 X_2\).

If two predictors are perfectly collinear, then they are completely redundant for the purposes of linear regression. You could put either one in the model and the fit would be identical. Unless they are exactly the same (\(X_1 = X_2\)), you will get a different slope estimate, but their p-values will be identical. If you put them both in the regression model at the same time, however, R will estimate a regression coefficient for just one of them.

# To illustrate...
# Generate random data from a standard normal distribution
set.seed(2937402)
y <- rnorm(100)
x <- rnorm(100)
z <- -3 + 2*x
lm(y ~ x + z)
## 
## Call:
## lm(formula = y ~ x + z)
## 
## Coefficients:
## (Intercept)            x            z  
##     -0.0776       0.1418           NA

The problem is actually worse, however, for approximate collinearity. If you put both predictors in the model, R will go ahead and estimate a regression coefficient for each, but it will have trouble estimating their unique contributions, and you could get strange results: very large \(\beta\)s, one large negative \(\beta\) paired with one large positive \(\beta\), or \(\beta\)s that change drastically depending on the sample, when a few cases are removed, or when one of the collinear variables is removed from the model. Remember, two variables that are approximately collinear are essentially measuring the same thing and so should have about the same association with the outcome. But when each is adjusted for the other, the results for each can be quite unstable and quite different from each other. In the example below, the second variable is identical to the first except for having some random noise added.

# Generate random data but make the collinearity approximate
set.seed(2937402)
y <- rnorm(100)
x <- rnorm(100)
z <- x + rnorm(100, sd=0.05)
# Each predictor alone
round(summary(lm(y ~ x))$coef, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0776     0.1020 -0.7601   0.4490
## x             0.1418     0.1087  1.3045   0.1951
round(summary(lm(y ~ z))$coef, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0783     0.1021  -0.767   0.4449
## z             0.1394     0.1079   1.292   0.1995
# Adjusted for each other
round(summary(lm(y ~ x + z))$coef, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0742     0.1034 -0.7176   0.4747
## x             0.7256     2.3995  0.3024   0.7630
## z            -0.5799     2.3811 -0.2436   0.8081

Each predictor individually has a regression slope of about 0.14 with a standard error of about 0.11. However, when attempting to adjust for each other, their slopes are now in opposite directions and their standard errors are around 2.4! A large standard error means that if you happened to have drawn a slightly different sample, you might end up with a very different regression coefficient estimate - the results are not stable.

Going back to our example of weight and BMI, we can see that in the NHANES sample these two variables are highly collinear.

The points do not fall exactly on a line, but they are generally close. Thus, weight and BMI seem to be approximately collinear. This makes sense because BMI is a measure of body mass computed by normalizing weight by height.
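A quick numeric check (a sketch; output not shown here) is the pairwise correlation between the two predictors:

cor(nhanes$BMXWT, nhanes$BMXBMI, use = "complete.obs")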

NOTE: All of the above examples were of pairwise collinearity. You can also have collinearity between groups of variables, where some linear combination of one group of variables is approximately equal to a linear combination of another group.

How much collinearity is a problem? We will cover that in the subsequent sections. Fortunately, you do not have to do the math yourself; R will do it for you. But it is important to understand what the numbers mean and how to correct collinearity issues.

6.10.4.1 Visual and tabular diagnostics

In practice, you will likely just use a numerical statistic (variance inflation factors, described later) to diagnose collinearity, but it can also be useful to use some preliminary visual or tabular checks to inspect pairs of variables for collinearity. These checks are also useful at this point for the purpose of understanding what collinearity is all about.

The visual or tabular diagnostic method used depends on if the variables are continuous or categorical. These tools are not definitive, but can help when you are trying to figure out how to deal with collinearity.

Continuous vs. continuous: Scatterplot – if you hold one fixed, can you vary the other?

Let’s look again at the scatterplot of BMI vs. weight.

plot(nhanes$BMXWT, nhanes$BMXBMI, xlab = "Weight (kg)", ylab = "BMI (kg/m2)", pch=20, col="darkgray")
abline(lm(nhanes$BMXBMI ~ nhanes$BMXWT), col = "red", lwd = 2, lty = 2)
abline(h= 40, col = "blue",  lwd = 2)
abline(v=100, col = "green", lwd = 2)

As mentioned previously, weight and BMI are clearly correlated. What we can also assess here, visually, is whether you can hold one variable fixed and still vary the other very much. For example, holding BMI at 40 kg/m2 (horizontal line), the possible values of weight are more limited compared to the full range. Similarly, if we hold weight fixed at 100 kg (vertical line), the range of BMI values is more limited.

Categorical vs. categorical: Two-way table – frequencies near 0?

For two categorical variables, a tabular collinearity diagnostic is to create a two-way table and then look for a pattern, especially cells with very few observations. Similar to the scatterplot above, can you hold one variable fixed and vary the other?

For example, comparing race/ethnicity to income, we can hold either variable fixed and vary the other across all levels. So, even though these two variables may be associated, their association does not hinder interpretation of one while holding the other fixed.

table(nhanes$RIDRETH3, nhanes$income)
##                     
##                      < $25,000 $25,000 to < $55,000
##   Mexican American          41                   47
##   Other Hispanic            31                   27
##   Non-Hispanic White        96                  123
##   Non-Hispanic Black        66                   62
##   Non-Hispanic Asian        18                   33
##   Other/Multi               19                   27
##                     
##                      $55,000+
##   Mexican American         52
##   Other Hispanic           26
##   Non-Hispanic White      142
##   Non-Hispanic Black       83
##   Non-Hispanic Asian       85
##   Other/Multi              20

What would it look like in the case of collinearity? Below is an example of two hypothetical categorical variables that exhibit a strong relationship:
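(A sketch of one way such a table of hypothetical counts could be constructed; the numbers are simply those displayed below.)

matrix(c(50, 8, 1, 5, 60, 7, 2, 10, 80), nrow = 3)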

##      [,1] [,2] [,3]
## [1,]   50    5    2
## [2,]    8   60   10
## [3,]    1    7   80

In this \(3 \times 3\) table, if you hold either the row or the column fixed at one level, the other variable does not vary much - almost every observation falls in a single cell.

Categorical vs. continuous: Side-by-side boxplots – do they overlap?

Here is an example of a continuous variable (BMI) and a categorical variable (smoking status) that do not have a problem with collinearity:

boxplot(nhanes$BMXBMI ~ nhanes$smoker, xlab = "Smoking Status", ylab = "BMI (kg/m2)")

Below is an example of a continuous and categorical variable that do have a problem with collinearity. Since the boxplots barely overlap, if you know the value of the continuous variable, you have a pretty good idea what the categorical variable value is, indicating that these are somewhat redundant.

# Hypothetical example of highly collinear continuous x categorical pair
set.seed(753247)
Group <- rep(1:3, rep(100, 3))
Y     <- c(rnorm(100, 50, 10),
           rnorm(100, 70, 10),
           rnorm(100, 90, 10))
boxplot(Y ~ Group)

6.10.4.2 Variance inflation factors

The primary statistic we will use to diagnose collinearity is the variance inflation factor (VIF). For each predictor, VIF measures how much the variance of the regression coefficient estimate is inflated due to correlation between that predictor and the others. We can compute VIFs using the car::vif function in R (Fox and Weisberg 2019).

Let’s use VIFs to measure the collinearity in the regression model of the outcome SBP on the predictors age, weight, BMI, and height. First, fit the regression model. Second, call car::vif with the regression fit as the argument.

fit_vif <- lm(BPXSY1 ~ RIDAGEYR + BMXWT + BMXBMI + BMXHT, data = nhanes)
car::vif(fit_vif)
## RIDAGEYR    BMXWT   BMXBMI    BMXHT 
##    1.021   91.602   70.002   18.897

How do we interpret the magnitude of VIFs? How large is “too large?” Unfortunately, there is no clear-cut answer to that question. We do know that a value of 1 is ideal - that would indicate that the variance of that predictor’s regression coefficient estimate is not being inflated at all by the presence of the other predictors. Values above 2.5 may be of concern (Allison 2012), and values above 5 or 10 indicate a more serious problem. In this example, height, weight, and BMI have serious collinearity issues.
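For intuition, the VIF for a given predictor equals \(1/(1-R_j^2)\), where \(R_j^2\) is the R-squared from regressing that predictor on all the other predictors. A rough by-hand check for weight is sketched below; the result may differ slightly from car::vif() because this regression is not restricted to cases with non-missing SBP.

r2 <- summary(lm(BMXWT ~ RIDAGEYR + BMXBMI + BMXHT, data = nhanes))$r.squared
1 / (1 - r2)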

All the predictors in the previous example were continuous. How do we compute the VIF for a categorical variable? Remember that a categorical variable with \(K\) levels will be entered into a model as \(K-1\) dummy variables (\(K-1\) vectors of 1’s and 0’s). Why \(K-1\)? Because if you included all \(K\) of them, the vectors would sum to a vector of all 1’s (since every observation falls in exactly one category), which is perfectly collinear with the intercept! But the VIF will differ depending on which dummy variable you leave out. In addition to this inconsistency, we are not interested in the VIF for each individual level of a categorical variable, but rather in the VIF of the variable as a single entity. Fortunately, car::vif automatically gives us what we want as long as the categorical variable is entered as a factor.

# To check if a predictor is a factor variable use is.factor
is.factor(nhanes$smoker)
## [1] TRUE
car::vif(lm(BPXSY1 ~ RIDAGEYR + BMXBMI + BMXHT + smoker, data = nhanes))
##           GVIF Df GVIF^(1/(2*Df))
## RIDAGEYR 1.124  1           1.060
## BMXBMI   1.011  1           1.006
## BMXHT    1.070  1           1.034
## smoker   1.149  2           1.035

Use the values in the “GVIF” column (first column). Thus, the (generalized) VIF for the 3-level variable “smoker” is 1.15. In this example, the variables do not exhibit strong collinearity (all the VIFs are near 1).

6.10.4.3 Impact of collinearity

When there is perfect collinearity, R will drop out variables until that is no longer the case and so there is no impact on the final regression results. When there is approximate collinearity, however, the result is inflated standard errors and difficulty interpreting coefficients. For example, in the regression of SBP on age, weight, BMI, and height, let’s compare the regression coefficients, their standard errors, and their p-values before and after dropping weight from the model.

                 With Weight                   Without Weight
Term            B         SE       P           B         SE       P
Intercept      -8.71      9.48     .359        5.71      2.30     .013
Age             0.0015    0.0071   .827        0.0014    0.0071   .842
Weight         -0.0851    0.0543   .117
BMI             0.3305    0.1508   .029        0.0958    0.0181   < .001
Height          0.0539    0.0570   .344       -0.0330    0.0130   .011

Age was not very collinear with Weight so its numbers do not change much. But look at the rows for BMI and Height. After removing Weight from the model, their estimates change drastically, with the effect for Height even changing direction (going from positive to negative)! Also, their standard errors decrease dramatically, indicating that they are more precisely estimated. What is actually happening is that the meaning of the BMI and Height coefficients changes after removing Weight, with the old quantities (the effects of BMI and Height when holding Weight constant) being much harder to estimate accurately (and harder to interpret). Finally, whereas “Height when holding Weight, BMI, and Age constant” was not statistically significant (p = .344), “Height when holding BMI and Age constant” is (p = .011).

Collinearity caused by an interaction

Collinearity between terms involved in an interaction can be safely ignored (Allison 2012). This applies also to collinearity between terms that together define a polynomial curve (e.g., \(X\), \(X^2\), etc.). In general, centering continuous variables will reduce the collinearity and, as we saw in previous sections, centering is often a good idea for the purpose of interpretability. However, the sort of collinearity that is removed by centering actually has no effect on the fit of the model, only on the interpretation of the intercept and the main effects of factor variables involved in the interaction.

For example, consider the following model which contains an interaction between age and smoking status. We will fit it with and without centering age and see how that affects the variance inflation factors and the model fit.

# Uncentered age
fit_uncentered <- lm(BPXSY1 ~ RIDAGEYR + BMXBMI + BMXHT + smoker + RIDAGEYR:smoker, data = nhanes)
# Center age
tmp <- nhanes
tmp$cage <- tmp$RIDAGEYR - mean(tmp$RIDAGEYR, na.rm=T)
fit_centered <- lm(BPXSY1 ~ cage + BMXBMI + BMXHT + smoker + cage:smoker, data = tmp)
car::vif(fit_uncentered)
##                    GVIF Df GVIF^(1/(2*Df))
## RIDAGEYR          1.731  1           1.316
## BMXBMI            1.014  1           1.007
## BMXHT             1.072  1           1.036
## smoker          118.244  2           3.298
## RIDAGEYR:smoker 134.533  2           3.406
car::vif(fit_centered)
##              GVIF Df GVIF^(1/(2*Df))
## cage        1.731  1           1.316
## BMXBMI      1.014  1           1.007
## BMXHT       1.072  1           1.036
## smoker      1.334  2           1.075
## cage:smoker 1.940  2           1.180

The uncentered model appears to have a collinearity problem, and centering appears to have removed that problem. However, if we look at the fit of these two models, we find they are identical.

summary(fit_uncentered)
## 
## Call:
## lm(formula = BPXSY1 ~ RIDAGEYR + BMXBMI + BMXHT + smoker + RIDAGEYR:smoker, 
##     data = nhanes)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.68 -10.80  -1.73   8.56  68.49 
## 
## Coefficients:
##                        Estimate Std. Error t value
## (Intercept)             81.8857     9.8675    8.30
## RIDAGEYR                 0.5361     0.0378   14.18
## BMXBMI                   0.3108     0.0766    4.06
## BMXHT                    0.0536     0.0561    0.96
## smokerPast               2.6311     4.4224    0.59
## smokerCurrent           -2.7916     4.4893   -0.62
## RIDAGEYR:smokerPast     -0.0802     0.0747   -1.07
## RIDAGEYR:smokerCurrent   0.0655     0.0897    0.73
##                                    Pr(>|t|)    
## (Intercept)             0.00000000000000035 ***
## RIDAGEYR               < 0.0000000000000002 ***
## BMXBMI                  0.00005311521113012 ***
## BMXHT                                  0.34    
## smokerPast                             0.55    
## smokerCurrent                          0.53    
## RIDAGEYR:smokerPast                    0.28    
## RIDAGEYR:smokerCurrent                 0.47    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 965 degrees of freedom
##   (198 observations deleted due to missingness)
## Multiple R-squared:  0.262,  Adjusted R-squared:  0.257 
## F-statistic:   49 on 7 and 965 DF,  p-value: <0.0000000000000002
summary(fit_centered)
## 
## Call:
## lm(formula = BPXSY1 ~ cage + BMXBMI + BMXHT + smoker + cage:smoker, 
##     data = tmp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.68 -10.80  -1.73   8.56  68.49 
## 
## Coefficients:
##                    Estimate Std. Error t value
## (Intercept)        108.9546     9.4208   11.57
## cage                 0.5361     0.0378   14.18
## BMXBMI               0.3108     0.0766    4.06
## BMXHT                0.0536     0.0561    0.96
## smokerPast          -1.4167     1.4526   -0.98
## smokerCurrent        0.5158     1.4824    0.35
## cage:smokerPast     -0.0802     0.0747   -1.07
## cage:smokerCurrent   0.0655     0.0897    0.73
##                                Pr(>|t|)    
## (Intercept)        < 0.0000000000000002 ***
## cage               < 0.0000000000000002 ***
## BMXBMI                         0.000053 ***
## BMXHT                              0.34    
## smokerPast                         0.33    
## smokerCurrent                      0.73    
## cage:smokerPast                    0.28    
## cage:smokerCurrent                 0.47    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 965 degrees of freedom
##   (198 observations deleted due to missingness)
## Multiple R-squared:  0.262,  Adjusted R-squared:  0.257 
## F-statistic:   49 on 7 and 965 DF,  p-value: <0.0000000000000002

As you can see, the two models fit identically; only the interpretation of the intercept and the smoking status main effects changes. The intercept is the mean outcome when all variables are at 0 or at their reference level. In the centered model, “0” for centered age corresponds to the mean age; in the uncentered model, “0” for age corresponds to an age of 0 years, so the interpretation has changed. Similarly, the smoking status main effects represent the differences in mean outcome between each level of smoking status and the reference level when age is at its “0” value, so their interpretation has changed as well.

In summary, centering a variable involved in an interaction changes the interpretation of some terms in the model but does not impact the model fit, so collinearity involving such terms can be ignored. Additionally, centering is not an option if both variables involved in the interaction are categorical. In that case, just compute the variance inflation factors for the model without the interaction to see if any of the other terms are collinear, as sketched below.
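As a minimal sketch of this suggestion, suppose the model of interest included an interaction between two categorical predictors, say smoking status and a second factor. The variable gender below is hypothetical and stands in for whatever second categorical variable you have; collinearity would be assessed using the model without the interaction.

# A minimal sketch (the factor `gender` is hypothetical; substitute any
# second categorical predictor from your own data)
fit_cat_int   <- lm(BPXSY1 ~ RIDAGEYR + smoker + gender + smoker:gender, data = nhanes)
fit_cat_noint <- lm(BPXSY1 ~ RIDAGEYR + smoker + gender,                 data = nhanes)
# Check collinearity using the model WITHOUT the interaction
car::vif(fit_cat_noint)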

6.10.4.4 Potential solutions

There are multiple ways to deal with collinearity, including using a biased regression technique such as ridge regression (beyond the scope of this text but see, for example, the R glmnet package). The two methods we will focus on here are:

  • Remove one or more variables from the model.
  • Combine variables (e.g., average, sum, difference).

Removing variables

Ultimately, you want a set of variables that are not redundant. A good place to start is to remove variables suspected of being problematic, one at a time, and see if the VIFs become smaller.

# Drop weight
fit_nowt  <- lm(BPXSY1 ~ RIDAGEYR         + BMXBMI + BMXHT, data = nhanes)
# Drop BMI
fit_nobmi <- lm(BPXSY1 ~ RIDAGEYR + BMXWT          + BMXHT, data = nhanes)
# Drop height
fit_noht  <- lm(BPXSY1 ~ RIDAGEYR + BMXWT + BMXBMI        , data = nhanes)
car::vif(fit_nowt)
## RIDAGEYR   BMXBMI    BMXHT 
##    1.020    1.005    1.021
car::vif(fit_nobmi)
## RIDAGEYR    BMXWT    BMXHT 
##    1.020    1.316    1.338
car::vif(fit_noht)
## RIDAGEYR    BMXWT   BMXBMI 
##    1.018    4.951    4.958

It looks like weight and BMI are the problem variables, and dropping weight helps the most. This makes sense since BMI is computed as \(\text{weight}/\text{height}^2\). In this example, removing the variable Weight from the model resolved the collinearity issue. Sometimes you will have to remove more than one variable. Regardless, simply removing variables is often the best solution.

Combining variables

At other times, you have multiple variables and you are genuinely interested in their collective effect. In that case, combining them in some way can work. For example, suppose you have \(X_1\) = father’s age and \(X_2\) = mother’s age. These are typically highly correlated. However, replacing the pair with their average \(Z_1 = (X_1 + X_2)/2\) and their difference \(Z_2 = X_1 - X_2\) often results in a pair of much less correlated variables. The result is “average parental age” and “age difference”: you replace two collinear variables with two that are (hopefully) much less collinear.
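Here is a minimal sketch of this recoding, assuming a hypothetical data frame dat with columns father_age and mother_age:

# A minimal sketch (hypothetical data frame `dat` with columns `father_age`
# and `mother_age`): replace the collinear pair with its average and difference
dat$parent_age_avg  <- (dat$father_age + dat$mother_age) / 2
dat$parent_age_diff <-  dat$father_age - dat$mother_age
# The recoded pair is usually much less correlated than the original pair
cor(dat$father_age,     dat$mother_age,      use = "complete.obs")
cor(dat$parent_age_avg, dat$parent_age_diff, use = "complete.obs")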

In other cases, you might combine the collinear variables into a single variable by summing or averaging, replacing multiple collinear variables with a single summary variable. For example, if you have a set of questionnaire items that are all asking about the same underlying construct, each on a scale from, say, 1 to 5 (e.g., Likert-scale items), you may be able to add them together to produce a single summary variable. If taking this approach, be sure all the items are on the same scale; if they are not, standardize them first. Also, make sure the ordering of responses has the same meaning… sometimes one item asks about agreement with the underlying construct and another asks about agreement with its opposite, so a low score on one item has the same meaning as a high score on another. If some items are in the opposite direction, reverse-code them so that all items point in the same direction. You may also want to check the validity of such a sum using Cronbach’s alpha and/or factor analysis (beyond the scope of this text but see … ).

[ADD REFERENCE TO Cronbach’s and factor analysis for summing Likert items]

ADD DATA EXAMPLE?
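As a minimal sketch of the reverse-coding and summing described above, assume a hypothetical data frame dat with 1-to-5 Likert items q1 through q4, where q4 is worded in the opposite direction from the others:

# A minimal sketch (hypothetical 1-5 items q1-q4, with q4 worded in the
# opposite direction): reverse-code q4, then sum the items
dat$q4_rev <- 6 - dat$q4                            # 1 <-> 5, 2 <-> 4, etc.
dat$score  <- dat$q1 + dat$q2 + dat$q3 + dat$q4_rev
# If the items were on different scales, standardize before combining, e.g.,
# dat$score <- rowSums(scale(dat[, c("q1", "q2", "q3", "q4_rev")]))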

In general, you should take time to think about how your candidate predictors are related to each other and to the outcome. Are some completely or partially redundant (measuring the same underlying concept)? Such redundancy will show up in the check for collinearity, and you can then remove or combine variables in a way that aligns with your analysis goals.
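One simple way to start is to look at the pairwise correlations among the continuous candidate predictors. This is only a rough screen (the VIF remains the primary collinearity diagnostic), but it can flag obviously redundant variables early.

# Pairwise correlations among the continuous candidate predictors
# (a rough screen only; VIFs remain the primary collinearity diagnostic)
round(cor(nhanes[, c("RIDAGEYR", "BMXWT", "BMXBMI", "BMXHT")],
          use = "pairwise.complete.obs"), 2)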

6.10.5 Outliers

While collinearity diagnostics look for problems in the regression model due to relationships between variables (predictors), outlier (this section) and influence (next section) diagnostics look for problems due to individual observations (cases).

6.10.5.1 Impact

An outlier is an individual observation with a very large residual. Outliers are not necessarily extreme in either the outcome or any of the predictors. What makes an observation an outlier in regression is that its observed outcome is far from its predicted value from the regression. For example, in the left panel below, the filled-in point is extreme in the \(Y\) direction (its \(Y\) value is much larger than any other \(Y\) value) but is close to the line, so its residual is small and it is not an outlier for the regression. In the right panel, however, the filled-in point is in the middle of the \(Y\) distribution but is very far from the line and so has a large residual, making it an outlier.

The presence of outliers can impact the validity of certain assumptions, in particular normality and constant variance, resulting in invalid confidence intervals and p-values for regression coefficients.

6.10.5.2 Diagnosis

The above illustrated an outlier in SLR. In MLR, instead of \(Y\) vs. \(X\) we plot residuals vs. fitted values since we have many predictors. In such a plot, outliers are observations that are far from the horizontal zero-line. But how far from the line is far enough to be considered an outlier? Recall that Studentized residuals put the residuals on a standard normal scale. The cutoff for “large” is arbitrary, but we know that values that are more than 3 or 4 standard deviations from the mean are very rare in a standard normal distribution. We’ll use 3.5 as our arbitrary cutoff for the purpose of visualization, and we will also use a statistical test to determine what is and is not an outlier.

To visualize and diagnose outliers, we will (a) compute Studentized residuals, (b) use car::residualPlots to visualize outliers in the plot of residuals vs. fitted values, (c) create a boxplot of the Studentized residuals, and (d) conduct a statistical test for outliers.

For example, let’s look for outliers in the regression of total cholesterol on age in NHANES. We use the type = "rstudent" option in car::residualPlots to plot the Studentized residuals.

fit_outliers <- lm(LBXTC ~ RIDAGEYR, data = nhanes)
# Compute the Studentized residuals
RSTUDENT <- rstudent(fit_outliers)
# Identify "large" residuals
SUB <- abs(RSTUDENT) > 3.5
# na.rm = T is included here because if you use the na.action = na.exclude
# option (not discussed) then SUB may have missing values
sum(SUB, na.rm = T)
## [1] 6
# Plot residuals vs. fitted
# terms = ~ 1 tells R to not plot vs. any predictors
# tests = F suppresses the test of non-linearity
car::residualPlots(fit_outliers, terms = ~ 1, tests = F, fitted = T,
                   type = "rstudent", pch = 20, col = "gray")
# Add horizontal reference lines at the +/- 3.5 cutoff
abline(h = c(-3.5, 3.5), lty = 3, lwd = 2)
# Highlight the outliers
points(fitted(fit_outliers)[SUB], RSTUDENT[SUB], pch = 20, cex = 1.5, col = "black")

A boxplot of the Studentized residuals can be used as well. Points beyond a horizontal dashed line have Studentized residuals larger than 3.5 in absolute value (the line at -3.5 does not appear in this plot because no negative residual is that large, so the plotting range does not extend that far).

boxplot(rstudent(fit_outliers))
abline(h = c(-3.5, 3.5), lty = 3)

Finally, we conduct a statistical test for outliers using car::outlierTest (Fox and Weisberg 2019). This tests each residual to see how likely such an extreme value would be if the Studentized residuals were truly normally distributed. A Bonferroni adjustment is used to account for the increased Type I error due to doing one test for every observation. The n.max option tells the function to report at most that many observations (here, 10). It turns out only 3 observations were flagged by this test as having an unusually large residual (compared to the 6 we identified above with our arbitrary cutoff). After adjusting for multiple testing, these three points had Bonferroni p-values (raw p-values multiplied by the number of tests) < .05, indicating that they are unlikely to have arisen from a normal distribution; they are unusually far from the mean.

# Outlier test
# The default for n.max is 10. Increase it if you think
# you may have more outliers
car::outlierTest(fit_outliers, n.max = 10)
##      rstudent unadjusted p-value Bonferroni p
## 3319    4.924       0.0000009893     0.001015
## 3531    4.756       0.0000022527     0.002311
## 5693    4.241       0.0000243200     0.024953
# Look at the rows in the dataset corresponding to these outliers
nhanes[c("3319", "3531", "5693"), c("LBXTC", "RIDAGEYR")]
##      LBXTC RIDAGEYR
## 3319   384       35
## 3531   376       30
## 5693   365       67
# nhanes[c(3319, 3531, 5693), ]
# Not correct: these numbers are not row numbers but row labels
# carried over from the original NHANES dataset before we took
# a 20% subset, so the quoted row names above must be used instead

6.10.5.3 Potential solutions

Once you have identified which observations are outliers, check whether they are data entry errors (if you have access to the raw data). If that is not possible, or they are not data entry errors, then you can try a few other options. If a point is an outlier based on the diagnostics, that alone does not justify its removal! That would artificially alter your conclusions. Never simply remove an observation from the data just because it might be an outlier. Instead, try the following:

  • Transformation: If the \(Y\) distribution is very skewed, that can lead to outlying residuals. A transformation (e.g., log, square-root, inverse) may solve this problem.
  • Perform a sensitivity analysis: fit the model with and without the outliers and see what changes (see the sketch after this list). We will discuss this topic more in Section 9.
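A minimal sketch of both options, continuing with fit_outliers and the logical vector SUB defined above (the log transformation is shown only for illustration; whether it is appropriate depends on the residual diagnostics):

# Option 1 (illustration only): refit using a transformed outcome if Y is skewed
fit_log <- lm(log(LBXTC) ~ RIDAGEYR, data = nhanes)
# Option 2: sensitivity analysis - refit without the flagged observations
# (identified by their row labels) and compare the coefficients
flagged  <- names(RSTUDENT)[which(SUB)]
fit_sens <- lm(LBXTC ~ RIDAGEYR, data = nhanes[!(rownames(nhanes) %in% flagged), ])
round(cbind("All cases"        = coef(fit_outliers),
            "Outliers removed" = coef(fit_sens)), 3)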

Finally, rather than simply being a problem, outliers may actually be some of the more interesting observations in the data. Observations that do not fit well with the regression model may warrant further investigation and may lead to new insights and hypotheses.

6.10.6 Influential observations

An influential observation is an individual observation whose presence has a large impact on the estimated regression coefficients. If you were to remove an influential observation and refit the model, the intercept and/or slope would be meaningfully different.

As with outliers, we need to treat influential observations with care. Rather than just removing them, we will assess their impact and then decide how best to report our results. We will discuss this topic more in Section 9.
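One common starting point for assessing influence, sketched here with the fit_outliers model from the previous section, is Cook's distance together with car::influencePlot (Fox and Weisberg 2019), which plots Studentized residuals against hat values (leverage) with circle area proportional to Cook's distance:

# A sketch of two standard influence diagnostics (not the only options)
cd <- cooks.distance(fit_outliers)
head(sort(cd, decreasing = TRUE))   # observations with the largest Cook's distance
car::influencePlot(fit_outliers)    # circle area proportional to Cook's distance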

Note: added-variable plots (car::avPlots) can show leverage for each variable better than marginal plots (see Fox and Weisberg 2019, p. 394).

6.11 Model selection

6.11.1 Confounders, mediators, and moderators

6.11.2 Confirmatory analysis

6.11.3 Exploratory analysis

6.12 Multiple testing

6.13 Writing it up

6.14 Steps of linear regression

References

Allison, Paul. 2012. “When Can You Safely Ignore Multicollinearity?, Accessed 8/23/2021.” https://statisticalhorizons.com/multicollinearity.
Diggle, Peter J., Patrick Heagerty, Kung-Yee Liang, and Scott L. Zeger. 2002. Analysis of Longitudinal Data. Oxford: Oxford University Press.
Fitzmaurice, Garrett, Nan M. Laird, and James H. Ware. 2011. Applied Longitudinal Analysis. Second edition. Hoboken: John Wiley; Sons, Inc.
Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. Third edition. Los Angeles: SAGE Publications, Inc.
Harrell, Frank E., Jr. 2015. Regression Modeling Strategies. Second edition. Switzerland: Springer International Publishing.
Laird, Nan M., and James H. Ware. 1982. “Random-Effects Models for Longitudinal Data.” Biometrics 38 (4): 963–74. https://doi.org/10.2307/2529876.
Liang, K Y, and S L Zeger. 1993. “Regression Analysis for Correlated Data.” Annual Review of Public Health 14 (1): 43–68. https://doi.org/10.1146/annurev.pu.14.050193.000355.
Liang, Kung-Yee, and Scott L. Zeger. 1986. “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika 73 (1): 13–22. https://doi.org/10.2307/2336267.
Rogers, Nina T, Joanna M Blodgett, Samuel D Searle, Rachel Cooper, Daniel H J Davis, and Snehal M Pinto Pereira. 2021. “Early-Life Socioeconomic Position and the Accumulation of Health-Related Deficits by Midlife in the 1958 British Birth Cohort Study.” American Journal of Epidemiology 190 (8): 1550–60. https://doi.org/10.1093/aje/kwab038.
Sakai-Bizmark, Rie, Hiraku Kumamaru, Dennys Estevez, Emily H Marr, Edith Haghnazarian, Lauren E M Bedel, Laurie A Mena, and Mark S Kaplan. 2021. “Health-Care Utilization Due to Suicide Attempts Among Homeless Youth in New York State.” American Journal of Epidemiology 190 (8): 1582–91. https://doi.org/10.1093/aje/kwab037.
Warnes, Gregory R., Ben Bolker, Thomas Lumley, and Randall C. Johnson. 2018. Gmodels: Various R Programming Tools for Model Fitting. https://CRAN.R-project.org/package=gmodels.
Weisberg, Sanford. 1985. Applied Linear Regression. Second edition. New York: John Wiley; Sons.
Zeger, Scott L., and Kung-Yee Liang. 1986. “Longitudinal Data Analysis for Discrete and Continuous Outcomes.” Biometrics 42 (1): 121–30. http://www.jstor.org/stable/2531248.