2.4 Checking assumptions

When carrying out a simple linear regression analysis, it is important to check whether or not any of the assumptions of the model have been violated. For the SLR model described in Section 2.2, three such assumptions are:

Simple linear regression model assumptions:

  1. The model is linear. Is the linear model proposed suitable for the data, or might there be another kind of model that would fit the data more accurately?
  2. The errors have constant variance. As well as having an expected value \(\text{E}(\epsilon) = 0\), the errors are assumed to have constant variance, \(\text{Var}(\epsilon) = \sigma^2\). It follows that we would not expect the variance of the errors to be larger or smaller depending on the value of \(x\).
  3. The errors are normally distributed with mean 0 and variance \(\sigma^2\), such that \(\epsilon \sim N(0, \sigma^2)\). Given this assumption, we would expect the residuals to look like data that has been sampled from a normal distribution.
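
As a compact restatement, the three assumptions can be combined into a single model statement. This sketch assumes the notation \(\beta_0\) and \(\beta_1\) for the intercept and slope, and the subscript \(i\) for the \(i\)th observation; Section 2.2 may use different symbols.

\[
y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)
\]

Here the straight-line term \(\beta_0 + \beta_1 x_i\) corresponds to Assumption 1, the fact that \(\sigma^2\) does not depend on \(x_i\) corresponds to Assumption 2, and the normal distribution of \(\epsilon_i\) corresponds to Assumption 3.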

Two plots are very useful for checking the above assumptions: the residuals versus fits plot and the Normal Q-Q plot. Both plots are shown below for the happiness versus income example:

2.4.1 What to look for when checking the plots

Let's start with the plot that is already familiar to us: the Normal Q-Q plot. We can use this plot to check Assumption 3 (normality). Aside from one outlier (observation 11), the dots follow the line very closely. This means there are no obvious concerns regarding the normality of the errors.
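
To make the construction of a Normal Q-Q plot concrete, the following is a minimal sketch of how its points could be computed using only the Python standard library. The function name `normal_qq_points`, the plotting positions \((i - 0.5)/n\), the simple standardisation by the sample standard deviation, and the residual values are all assumptions made for illustration; statistical software may use slightly different conventions, and the values are not taken from the happiness versus income example.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(residuals):
    """Return (theoretical quantile, sample quantile) pairs for a Normal Q-Q plot.

    Assumption: plotting positions (i - 0.5) / n and a simple
    standardisation by the sample mean and standard deviation.
    """
    n = len(residuals)
    avg, scale = mean(residuals), stdev(residuals)
    # Sample quantiles: the standardised residuals in increasing order.
    sample = sorted((r - avg) / scale for r in residuals)
    # Theoretical quantiles: what a standard normal sample would look like.
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))


# Hypothetical residuals, including one unusually large value.
example_residuals = [-1.2, 0.4, 0.1, -0.3, 2.9, -0.6, 0.5, -0.2, 0.8, -0.9]

qq_points = normal_qq_points(example_residuals)
for theoretical_q, sample_q in qq_points:
    print(f"theoretical = {theoretical_q:+.3f}   sample = {sample_q:+.3f}")
```

If the errors are normally distributed, the pairs should lie close to the line where the sample quantile equals the theoretical quantile. A point that sits well away from that line, such as the largest value in this made-up data, is the kind of observation that would be flagged as a potential outlier.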

The residuals versus fits plot shows the residuals on the \(y\)-axis against the fitted values on the \(x\)-axis. We can use this plot to check Assumption 1 (linearity) and Assumption 2 (constant variance). Firstly, if the model is linear, then we expect to see a random scatter of points with no patterns in the residuals versus fits plot. If the residuals follow some sort of pattern, a different type of model (for example, a quadratic or exponential model) may be more appropriate. Secondly, if the errors have constant variance, then we would expect the spread of the residuals to remain fairly constant across the entirety of the plot. If instead the spread of the residuals becomes larger or smaller as the fitted values change, the constant variance assumption may have been violated. This pattern is commonly referred to as 'fanning'.
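
The quantities plotted in a residuals versus fits plot can be computed directly from a least-squares fit. The following is a minimal sketch using only the Python standard library; the function names `fit_slr` and `fits_and_residuals` are hypothetical, and the data are made up for illustration rather than taken from the happiness versus income example.

```python
from statistics import mean


def fit_slr(x, y):
    """Least-squares intercept and slope for a straight-line fit of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def fits_and_residuals(x, y):
    """Return the fitted values and the residuals (observed minus fitted)."""
    intercept, slope = fit_slr(x, y)
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals


# Made-up illustrative data (NOT the happiness versus income data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

fitted, residuals = fits_and_residuals(x, y)
for f, r in zip(fitted, residuals):
    # Each (fitted, residual) pair is one point on the plot.
    print(f"fitted = {f:6.3f}   residual = {r:+.3f}")
```

Each printed pair gives the horizontal and vertical position of one point, so the pairs can be passed to any plotting tool. The residuals from a least-squares fit with an intercept always sum to zero, which is why the points scatter around a horizontal line at zero.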

Referring to the above residuals versus fits plot, we can observe the following:

  1. There are no obvious patterns, which suggests the linearity assumption has been met.
  2. The spread of the residuals remains generally constant as the fitted values change. That is, there are no signs of fanning. This suggests the constant variance assumption has also been met.

We do, however, observe the presence of one potential outlier (observation 11), so this observation may warrant further scrutiny.

For a better idea of what to look for when checking residuals versus fits plots, consider the examples below:

  1. In the plot on the left, we see a random scatter, which is exactly what we want to see in a residuals versus fits plot. There are no signs that either the linearity or the constant variance assumption has been violated.
  2. In the middle plot, we see an obvious pattern: the residuals show signs of curvature. This is an indication that the linearity assumption has been violated.
  3. In the plot on the right, we see obvious signs of fanning. This is an indication that the constant variance assumption has been violated.
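
The three situations described above can be imitated with simulated data. The following is a rough sketch, using only the Python standard library, that generates one data set of each kind, fits a straight line, and prints two informal summaries of the residuals. The function names, the coefficients, the seed, and both summaries (`spread_ratio` and `curvature_score`) are assumptions made for illustration; they are crude stand-ins for inspecting the plot by eye, not formal tests.

```python
import random
from statistics import mean, stdev


def fits_and_residuals(x, y):
    """Least-squares straight-line fit; returns fitted values and residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    return fitted, [yi - fi for yi, fi in zip(y, fitted)]


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    num = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    den = (sum((ai - a_bar) ** 2 for ai in a)
           * sum((bi - b_bar) ** 2 for bi in b)) ** 0.5
    return num / den


def spread_ratio(fitted, residuals):
    """SD of residuals for the larger fitted values divided by the SD for
    the smaller fitted values. Values far from 1 hint at fanning."""
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    return stdev(ordered[half:]) / stdev(ordered[:half])


def curvature_score(fitted, residuals):
    """Correlation between the residuals and the squared distance of each
    fitted value from the mean fitted value. Values far from 0 hint at
    a U-shaped (curved) pattern."""
    f_bar = mean(fitted)
    return correlation(residuals, [(f - f_bar) ** 2 for f in fitted])


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 400
x = [1 + 9 * i / (n - 1) for i in range(n)]  # evenly spaced from 1 to 10

# Three hypothetical data-generating processes, one per scenario.
scenarios = {
    "random scatter": [2 + 3 * xi + rng.gauss(0, 2) for xi in x],
    "curvature": [2 + 3 * xi + 0.5 * (xi - 5.5) ** 2 + rng.gauss(0, 1) for xi in x],
    "fanning": [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x],
}

results = {}
for name, y in scenarios.items():
    fitted, residuals = fits_and_residuals(x, y)
    results[name] = {
        "spread_ratio": spread_ratio(fitted, residuals),
        "curvature": curvature_score(fitted, residuals),
    }
    print(f"{name:15s} spread ratio = {results[name]['spread_ratio']:.2f}   "
          f"curvature score = {results[name]['curvature']:+.2f}")
```

Under these assumptions, the random scatter case should give a spread ratio near 1 and a curvature score near 0, the curvature case should give a curvature score well above 0, and the fanning case should give a spread ratio well above 1. In practice, the plot itself remains the primary tool; summaries like these only mirror what the eye picks up.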