5.13 Overview of regression diagnostics: Assumptions, outliers, influence, and collinearity

As briefly mentioned at the close of the previous chapter, the validity of regression coefficient estimates, confidence intervals, and significance tests depends on a number of assumptions. A standard linear regression model makes strong assumptions about how the data arose. It assumes the data come from a set of independent cases, each with an outcome and a set of predictor values, and it assumes a specific relationship between them, described by the regression equation (5.1), reproduced here.

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_K X_K + \epsilon \]

Thus, the model assumes the outcome is approximately the sum of the predictors, each multiplied by some number. This implies a linear relationship between the outcome and each predictor when holding the other predictors constant. The relationship implied by the \(\beta X\) parts of the equation is not assumed to be exact, only true on average. That is where the error term comes into play – \(\epsilon\) captures the influence of all the predictors we were unable to include in the model, either because we have no measures of them or because we do not know they belong in the model, as well as random variation in the outcome even among cases with identical predictor values. The model assumes \(\epsilon \sim N(0, \sigma^2)\), that is, that the error term has a normal distribution with mean zero and a variance \(\sigma^2\) that is the same for every case.
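
To make the model concrete, here is a minimal sketch in Python that simulates data from a two-predictor version of equation (5.1) and fits it by ordinary least squares. The choice of language, the statsmodels library, and the particular coefficient values are illustrative assumptions, not part of the text.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                  # two predictors, X1 and X2
beta = np.array([1.0, 0.5, -0.3])            # illustrative beta0, beta1, beta2
eps = rng.normal(scale=1.0, size=n)          # epsilon ~ N(0, sigma^2), same sigma for every case
y = beta[0] + X @ beta[1:] + eps             # outcome = linear predictor + error

model = sm.OLS(y, sm.add_constant(X)).fit()  # ordinary least squares fit
print(model.params)                          # estimated coefficients
print(model.conf_int())                      # confidence intervals (valid if the assumptions hold)
```

Because these data are simulated to satisfy the model's assumptions, the reported intervals and p-values can be trusted; the diagnostics discussed below are about checking whether real data behave this way.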

All the subsequent results – the estimates of regression coefficients, their confidence intervals, their p-values – are derived from formulas that rely on these assumptions (a quick visual check of several of them appears in the sketch after this list):

  • Independence
  • Normality
  • Linearity
  • Constant variance
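
A first look at normality, linearity, and constant variance usually comes from residual plots. The sketch below is one possible approach rather than a prescribed procedure; it continues with the fitted `model` object from the simulation above and assumes matplotlib and statsmodels are available.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# `model` is the fitted OLS results object from the earlier sketch.
fitted = model.fittedvalues
resid = model.resid

# Residuals vs. fitted values: curvature suggests nonlinearity,
# a funnel shape suggests non-constant variance.
plt.scatter(fitted, resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Normal Q-Q plot of the residuals: points far from the line suggest non-normal errors.
sm.qqplot(resid, line="45", fit=True)
plt.show()
```

Independence typically has to be judged from how the data were collected rather than from a plot.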

If the assumptions are not valid, the results might not be correct. To make matters worse, a few individual observations may cause problems. There may be outliers, observations with outcomes very different from what the regression model predicts, which can lead to problems with the normality or constant variance assumptions. There may also be influential observations, observations whose removal would result in large changes in the regression coefficient estimates. It is not encouraging to find that the results depend heavily on just a few observations!
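
As a preview of the diagnostics developed later, the sketch below computes two common case-level measures: externally studentized residuals, which flag potential outliers, and Cook's distance, which flags influential observations. It again continues from the fitted `model` object above, with statsmodels as an assumed tool.

```python
import numpy as np

# `model` is the fitted OLS results object from the earlier sketch.
influence = model.get_influence()

student_resid = influence.resid_studentized_external  # large |values| flag potential outliers
cooks_d = influence.cooks_distance[0]                 # large values flag influential observations

print(np.argsort(np.abs(student_resid))[-5:])         # indices of the 5 most extreme residuals
print(np.argsort(cooks_d)[-5:])                       # indices of the 5 most influential cases
```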

Finally, collinearity is an issue that does not arise in SLR but must be considered in MLR. Collinearity has to do with how correlated the predictors are with each other. In the extreme case, if two predictors are completely redundant, the equations solved when fitting the regression model have no unique solution – you cannot estimate a coefficient defined as “the effect of this predictor while holding all other predictors fixed” if another predictor is perfectly correlated with it, since you cannot vary one while holding the other fixed. Even when the redundancy is only partial, correlations between predictors can cause regression results to be unstable.
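
One common way to quantify collinearity is the variance inflation factor (VIF), which measures how much the variance of each coefficient estimate is inflated by correlation with the other predictors. A minimal sketch, again assuming the predictor matrix `X` from the simulation above and statsmodels:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_design = sm.add_constant(X)   # same design matrix used to fit the model

# VIF for each predictor; values well above 1 (common rules of thumb use 5 or 10)
# indicate that the predictor is strongly correlated with the others.
for j in range(1, X_design.shape[1]):
    print(f"X{j}: VIF = {variance_inflation_factor(X_design, j):.2f}")
```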

The following sections go into detail on each of these issues, describing its impact, how to diagnose it, and potential solutions when the diagnosis is not favorable.