Checking assumptions
Residual plots are a useful way to check model assumptions. They can be based on either ordinary or standardized residuals.
Histograms or normal probability plots (Q-Q plots) of \(\hat{\epsilon}_i\) (residuals)
This is useful to check the assumption of normality.
Plots of \(\hat{\epsilon}_i\) (residuals) versus the fitted values \(\hat{y_i}\)
This is used to detect changes in error variance and to check if the mean of the errors is zero.
Plots of \(\hat{\epsilon}_i\) (residuals), versus an explanatory variable \(x_{ji}\) in the model
This helps to check that variable \(x_j\) has a linear relationship with the response variable.
Plots of \(\hat{\epsilon}_i\) versus an explanatory variable \(x_{ki}\) not in the model
With this we can check whether the additional variable \(x_k\) might have a relationship with the response variable.
Plots of \(\hat{\epsilon}_i\) in the order that the observations were collected
This is useful to check whether errors might be correlated over time.
Furthermore, when we suspect that observations are not independent, we can plot \(\hat{\epsilon}_i\) versus \(\hat{\epsilon}_{i-1}\), where \(i\) denotes the order in which the observations were collected. This plot might identify for example whether high values of one residual are associated with high values of the previous residual, and low with low.
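The plots listed above can be sketched with base R graphics. This is a minimal illustration only: the data are simulated (they are not from any example in these notes), and the names `x`, `y`, `fit`, `res_ord`, `res_std` and `res_lag1` are hypothetical. The plot against a variable not in the model is omitted because the simulated data contain no such variable.

```r
# Simulated data for illustration only (assumption: a simple linear model)
set.seed(1)
n <- 30
x <- runif(n, 10, 40)
y <- 0.2 + 0.02 * x + rnorm(n, sd = 0.1)
fit <- lm(y ~ x)

res_ord  <- residuals(fit)   # ordinary residuals
res_std  <- rstandard(fit)   # standardized residuals
fitted_y <- fitted(fit)      # fitted values

# Lag-1 residuals: entry i holds the residual of observation i - 1
res_lag1 <- c(NA, head(res_ord, -1))

op <- par(mfrow = c(2, 3))   # 2 x 3 grid of plots
hist(res_std, main = "Histogram", xlab = "Standardized residuals")
qqnorm(res_std, main = "Normal Q-Q plot")
qqline(res_std)
plot(fitted_y, res_ord, main = "Residuals vs fitted",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
plot(x, res_ord, main = "Residuals vs x",
     xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 2)
plot(seq_len(n), res_ord, type = "b", main = "Residuals in order",
     xlab = "Observation order", ylab = "Residuals")
plot(res_lag1, res_ord, main = "Residuals vs lag 1",
     xlab = "Previous residual", ylab = "Residual")
par(op)                      # restore the previous graphics settings
```

Either `res_ord` or `res_std` could be used in any of the panels; standardized residuals have the advantage of being on a common scale.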
Q-Q plot
A Q-Q plot, short for quantile-quantile plot, is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles come from the same distribution, the points should form a roughly straight line. Here we use it as a graphical tool to assess whether a set of data (the residuals) plausibly came from a theoretical distribution (the normal distribution).
For example, are the observed data below consistent with a normal distribution? \[3.01, 5.21, 5.16, 5.82, 4.83, 3.43, 5.14, 3.99, 6.51, 8.12\]
For now, let’s consider the standard normal distribution \(N(0,1)\) (left hand side plot below). Since we have ten data points, we need ten values expected under this standard normal distribution (middle plot). The red lines divide the area under the curve into segments that each contain 10% of the area. The expected values, shown by the black points, are \[-1.28, -0.84, -0.52, -0.25, 0.00, 0.25, 0.52, 0.84, 1.28, 3.09\]
The first nine values are the points below which 10%, 20%, ..., 90% of the area lies. The last value, 3.09, is approximately the 99.9% point, since the 100% point of a normal distribution is infinite.
Comparing the ordered observed data with these expected values, we see that they match up well (right hand side plot). The superimposed line is nearly straight, which indicates that the observed data are approximately consistent with a normal distribution.
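The comparison above can be reproduced in a few lines of base R. This is a minimal sketch: the names `obs`, `probs` and `theo` are chosen here for illustration, and the probability 0.999 used for the last point is an assumption inferred from the value 3.09 quoted above.

```r
# Observed data from the example above
obs <- c(3.01, 5.21, 5.16, 5.82, 4.83, 3.43, 5.14, 3.99, 6.51, 8.12)

# Cumulative probabilities 10%, 20%, ..., 90%; the tenth point would be 100%,
# whose normal quantile is infinite, so 0.999 is used instead (assumption)
probs <- c(seq(0.1, 0.9, by = 0.1), 0.999)

# Expected values under the standard normal distribution N(0, 1)
theo <- qnorm(probs)
round(theo, 2)

# Pair the ordered data with the theoretical quantiles and plot them
plot(theo, sort(obs),
     xlab = "Theoretical quantiles", ylab = "Ordered observed data")

# Superimpose a least-squares line through the points as a visual guide
abline(lm(sort(obs) ~ theo))
```

Base R's `qqnorm` draws the same kind of plot but uses slightly different plotting positions, so its x-values will not match the list above exactly.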
Protein in Pregnancy
We can now fully define the regression model used in the protein in pregnancy example.
Data: \((y_i,x_i), \quad i=1,\dots,n\)
\(y_i\), protein level of mother \(i\)
\(x_i\), gestation of baby \(i\) (in weeks)
Model: \(y_i=0.2017 + 0.0228x_i + \epsilon_i\) where \(\epsilon_i \sim N(0,\sigma^2)\) and \(\epsilon_1, \ldots, \epsilon_n\) are independent.
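To make the quantities used in the residual plots explicit, the fitted values and residuals can be written out. This assumes, as the coefficients suggest, that 0.2017 and 0.0228 are the least-squares estimates of the intercept and slope:
\[\hat{y}_i = 0.2017 + 0.0228\,x_i, \qquad \hat{\epsilon}_i = y_i - \hat{y}_i, \quad i=1,\dots,n\]
For a hypothetical gestation of 20 weeks, for instance, the fitted protein level would be \(0.2017 + 0.0228 \times 20 = 0.6577\).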
Let us first look at a plot of \(\hat{\epsilon}_i\) (residuals) versus the fitted values \(\hat{y_i}\) and normal probability plots of \(\hat{\epsilon}_i\).
#Load needed libraries
library(ggfortify)
library(gridExtra)
#Plot residuals against fitted values and Normal QQ plot
autoplot(protein.lm, which=1:2)
#Create data frame with fitted values, residuals and lag-1 residuals
protein.fit<-data.frame(fit=protein.lm$fitted.values,
                        res=protein.lm$residuals,
                        ges=protein$Gestation,
                        #res1[i] is the previous residual, so the first entry is NA
                        res1=c(NA,head(protein.lm$residuals,-1)))
#Plot residuals against gestation
plot1<-ggplot(protein.fit, aes(x=ges,y=res)) +
geom_point() +
labs(x="Gestation", y="Residuals", title="Residuals vs Gestation")
#Plot residuals against the previous residuals (lag 1)
plot3<-ggplot(protein.fit, aes(x=res1,y=res)) +
geom_point() +
labs(y="Residuals", x="Residuals lag 1", title="Residual Independence")
#Create a grid with the two plots side-by-side
grid.arrange(plot1, plot3, ncol=2)
Interpreting Residual Plots
Histograms or normal probability plots of \(\hat{\epsilon}_i\) (residuals).
This is useful to check the assumption of normality. We can see from the Normal Q-Q plot that the residuals are approximately consistent with a normal distribution.
Plots of \(\hat{\epsilon}_i\) (residuals) versus the fitted values \(\hat{y_i}\).
This is used to detect changes in error variance and to check if the mean of the errors is zero. We can see from the residuals vs fitted values plot that the residuals are randomly scattered around the zero line, with no obvious change in spread. This suggests that the errors have constant variance and mean zero.
Plots of \(\hat{\epsilon}_i\) (residuals), versus an explanatory variable \(x_{ji}\) in the model.
This helps to check that variable \(x_j\) has a linear relationship with the response variable. We can see from the residuals vs gestation plot that the residuals are randomly scattered around the zero line. This suggests that there is no remaining relationship between the residuals and gestation, so a linear relationship between gestation and protein level appears adequate.
Plots of \(\hat{\epsilon}_i\) versus an explanatory variable \(x_{ki}\) not in the model.
With this we can check whether the additional variable \(x_k\) might have a relationship with the response variable. We do not have any other recorded data and so we are unable to do so.
Plots of \(\hat{\epsilon}_i\) in the order that the observations were collected.
This is useful to check whether errors might be correlated over time. We can plot \(\hat{\epsilon}_i\) versus \(\hat{\epsilon}_{i-1}\), that is with lag 1, where \(i\) denotes the order in which the observations were collected. This plot might identify, for example, whether high values of one residual are associated with high values of the previous residual, and low with low. In this case, we can see no obvious pattern in the residuals, which is consistent with the errors being independent.
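As a numerical companion to the lag 1 plot, the sample correlation between each residual and the previous one can be computed. The sketch below uses simulated series only, not the protein residuals; the helper `lag1_cor` and the series `indep` and `dep` are hypothetical names introduced for illustration.

```r
# Simulated series for illustration only
set.seed(42)
n <- 200

# Independent errors
indep <- rnorm(n)

# Dependent errors: each value is 0.8 times the previous value plus noise
dep <- as.numeric(stats::filter(rnorm(n), filter = 0.8, method = "recursive"))

# Correlation between e_i (i = 2, ..., n) and e_{i-1} (i = 2, ..., n)
lag1_cor <- function(e) cor(e[-1], e[-length(e)])

lag1_cor(indep)
lag1_cor(dep)
```

For the independent series the correlation should be close to zero, while for the dependent series it should be clearly positive, matching the "high with high, low with low" pattern described above.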