Regression Model Assumptions

It is helpful to formally define the assumptions of regression models.
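Throughout this section it helps to keep the model in mind in symbols. Writing the intercept and slope as \(\alpha\) and \(\beta\) (the particular symbols are not important), the straight-line model for the \(i\)th observation is

\[
y_i = \alpha + \beta x_i + \epsilon_i , \qquad i = 1, \ldots, n ,
\]

where \(\alpha + \beta x_i\) is the deterministic part of the model and \(\epsilon_i\) is the random error.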

The deterministic part of the model captures all the non-random structure in the data

This assumption implies that the mean of the errors, \(\boldsymbol{\epsilon}\), is zero. It applies only over the range of the explanatory variables in the data that we are considering. In the protein example, it is possible that the protein level changes markedly when the baby is born; we would be unwise to predict the protein level at any fetal age greater than the largest we have observed, namely 36 weeks. Similarly, we have no information on protein levels before week 10 of pregnancy, so again, in the absence of other information, it would be unwise to make predictions outwith this range.
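As a concrete illustration, the sketch below fits a straight line to hypothetical data standing in for the real protein measurements (the numbers are invented, not taken from the study) and simply refuses to predict outside the observed range of gestation; this is one simple way to respect the assumption in software.

```python
import numpy as np

# Hypothetical data standing in for the protein example: gestational age in
# weeks (x, observed between weeks 10 and 36) and protein level (y).
rng = np.random.default_rng(1)
gestation = np.linspace(10, 36, 50)
protein = 0.4 + 0.02 * gestation + rng.normal(0, 0.1, size=gestation.size)

# Fit the straight line by least squares; for deg=1 polyfit returns (slope, intercept).
slope, intercept = np.polyfit(gestation, protein, deg=1)

def predict(weeks):
    """Predict protein level, refusing to extrapolate beyond the observed range."""
    weeks = np.asarray(weeks, dtype=float)
    lo, hi = gestation.min(), gestation.max()
    if np.any((weeks < lo) | (weeks > hi)):
        raise ValueError(f"requested gestation outside the observed range [{lo:.0f}, {hi:.0f}] weeks")
    return intercept + slope * weeks

print(predict([12, 24, 36]))   # within the observed range: fine
# predict(40)                  # beyond week 36: raises an error rather than extrapolating
```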

The scale of the variability of the errors is constant at all values of the explanatory variables

In our models, we have been summarizing the scale of the scatter of the observations about the deterministic part of the model using \(\sigma\). This assumes that the scale is constant at all points on the regression line; otherwise a single number would not suffice. Let us recall the plot of the protein in pregnancy data overlaid with the fitted regression line. The spread of points about the line is roughly constant for different values of gestation, so the assumption appears to be valid for the protein in pregnancy data.
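One informal way to check this is to plot the data with the fitted line, and the residuals against the fitted values, looking for roughly constant vertical spread. The sketch below uses the same invented stand-in data as above, not the real protein measurements.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the protein data (see the earlier sketch).
rng = np.random.default_rng(1)
gestation = np.linspace(10, 36, 50)
protein = 0.4 + 0.02 * gestation + rng.normal(0, 0.1, size=gestation.size)

slope, intercept = np.polyfit(gestation, protein, deg=1)
fitted = intercept + slope * gestation
residuals = protein - fitted

# Left: data with fitted line; right: residuals against fitted values.
# A roughly constant vertical spread in both plots supports the assumption
# that sigma is the same at all values of the explanatory variable.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(gestation, protein)
ax1.plot(gestation, fitted, color="red")
ax1.set_xlabel("Gestation (weeks)")
ax1.set_ylabel("Protein level")
ax2.scatter(fitted, residuals)
ax2.axhline(0, color="red")
ax2.set_xlabel("Fitted values")
ax2.set_ylabel("Residuals")
plt.show()
```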

The errors are independent

Another assumption is implicit in the way the sum-of-squares function (in our least squares estimation approach) was constructed: by simply adding together the squared deviation of each observation from the deterministic part of the model, we treat the errors as independent.
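In symbols, with the straight-line model written as above, the least squares criterion is

\[
S(\alpha, \beta) = \sum_{i=1}^{n} \left( y_i - \alpha - \beta x_i \right)^2 ,
\]

a sum in which each observation contributes its own squared deviation and no term links the error of one observation to that of another.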

Broadly speaking, independence means that knowledge of the error attached to one observation does not give us any information about the error attached to another. Very often, this is ensured by the physical nature of the experiment, where observations correspond to a random sample of different people or animals. However, there are situations, particularly where something is being monitored over time, where the error at one point can have a “carry-over” effect on the error at the next. In the protein example, each observation corresponds to a different woman, and so we have quite strong justification for assuming that the observations are independent. If the observations had arisen by taking repeated measurements on the same woman at a variety of times throughout her pregnancy, then the assumption of independence would not be justified.
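If the observations did have a natural time order, a quick numerical check is the lag-1 autocorrelation of the residuals taken in that order; a value far from zero would warn of a carry-over effect. A minimal sketch follows, with invented residuals standing in for real ones.

```python
import numpy as np

# Sketch of a simple independence check for data that do have a time order
# (not needed for the protein data, where each observation is a different woman).
rng = np.random.default_rng(2)
residuals = rng.normal(0, 1, size=100)   # replace with real residuals, in time order

# Lag-1 autocorrelation: values far from zero suggest that the error at one
# time point carries information about the error at the next.
r1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"lag-1 autocorrelation of residuals: {r1:.3f}")
```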

The errors are normally distributed

Assuming that the errors are normally distributed allows us to describe the variation in the model’s parameter estimates, and therefore to make inferences about the population from which our sample was taken.
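A standard informal check is a normal quantile-quantile plot of the residuals; points lying close to the reference line are consistent with normally distributed errors. The sketch below uses invented residuals in place of those from the protein model.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical residuals standing in for those from the fitted protein model.
rng = np.random.default_rng(3)
residuals = rng.normal(0, 0.1, size=50)

# Normal quantile-quantile plot: points close to the reference line are
# consistent with normally distributed errors.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```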

The values of the explanatory variables are recorded without error

This last assumption is not one which we can check by examining the data; instead we have to consider the nature of the experiment. The presence of appreciable error in the explanatory variables can cause a surprising amount of difficulty in fitting and interpreting the model. With the protein data there is certainly scope for errors in \(x\), since this refers to time into pregnancy, a quantity which has to be calculated from recorded or remembered dates, or estimated from ultrasound measurements. In the present case we shall assume that these dates are known accurately, through careful calculation, ultrasound scans, and follow-up until the eventual birth. However, whenever we consider regression data, it is always wise to ask whether this assumption has been violated.
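The following small simulation (with invented numbers, not the protein data) illustrates one well-known consequence of error in \(x\): the fitted slope is pulled towards zero, an effect usually called attenuation.

```python
import numpy as np

# Small simulation of the effect of error in the explanatory variable
# (all numbers are illustrative, not from the protein study).
rng = np.random.default_rng(4)
true_x = np.linspace(10, 36, 200)                      # true gestational ages
y = 0.4 + 0.02 * true_x + rng.normal(0, 0.05, size=true_x.size)

slope_true_x, _ = np.polyfit(true_x, y, deg=1)

# Now suppose gestation is recorded with substantial error (e.g. poorly
# remembered dates): the fitted slope is pulled towards zero (attenuation).
noisy_x = true_x + rng.normal(0, 4, size=true_x.size)
slope_noisy_x, _ = np.polyfit(noisy_x, y, deg=1)

print(f"slope using true x:  {slope_true_x:.4f}")
print(f"slope using noisy x: {slope_noisy_x:.4f}")     # noticeably smaller in magnitude
```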