Residuals

Defining Residuals

A sensible strategy is to focus attention on the differences between the observed data \(y_1, \ldots, y_n\) and the fitted values \(\hat{y}_1, \ldots, \hat{y}_n\) from the fitted regression model. This is easiest to see in the case of simple linear regression, with one explanatory variable, \[y_i = \alpha+\beta x_i+\epsilon_i.\] Here the fitted values are defined as \[\hat{y}_i = \hat{\alpha}+\hat{\beta} x_i\] and the residuals as \[\hat{\epsilon}_i = y_i - \hat{y}_i.\] In the case of \(k\) explanatory variables, the fitted values are \[\hat{y}_i = \hat{\alpha}+\hat{\beta}_1 x_{1i}+ \dots +\hat{\beta}_{k} x_{ki}\] and the residuals are again \[\hat{\epsilon}_i = y_i - \hat{y}_i.\]
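As a concrete illustration, here is a minimal sketch in Python (the data values are made up for the example) of fitting a simple linear regression by least squares and computing the fitted values and residuals:

```python
import numpy as np

# Hypothetical data, assumed for this sketch.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates for y_i = alpha + beta * x_i + eps_i.
X = np.column_stack([np.ones_like(x), x])           # design matrix with intercept column
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

y_hat = alpha_hat + beta_hat * x                    # fitted values
resid = y - y_hat                                   # residuals
print(resid)
```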

The residuals are not the true errors; they are only estimates based on the observed data. If we observed different data from the same population, the estimated residuals would differ because of random variation. The residuals \(\hat{\epsilon}_i\) should have properties similar to those of the true error terms \(\epsilon_i\), but unlike the true errors they do not all have the same variance. To adjust for the fact that different \(\hat{\epsilon}_i\)'s can have different variances, we can use standardized residuals, defined as

\[r_i = \frac{\hat{\epsilon}_i}{\sqrt{\mathrm{Var}(\hat{\epsilon}_i)}},\] where \(\mathrm{Var}(\hat{\epsilon}_i) = \sigma^2(1 - h_{ii})\), with \(\sigma^2\) the error variance and \(h_{ii}\) the \(i\)-th leverage, i.e. the \(i\)-th diagonal entry of the hat matrix \(H = X(X^{\top}X)^{-1}X^{\top}\). In practice \(\sigma^2\) is replaced by its estimate \(\hat{\sigma}^2\), described in the next subsection.
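A short sketch, using the same hypothetical data as above, of computing the leverages and standardized residuals directly from the design matrix:

```python
import numpy as np

# Same hypothetical data and fit as in the previous sketch.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Var(eps_hat_i) = sigma^2 * (1 - h_ii), with h_ii the i-th diagonal
# entry of the hat matrix H = X (X'X)^{-1} X'.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)       # leverages h_ii

n, p = X.shape
sigma2_hat = resid @ resid / (n - p)                # RSS / (n - p), see the next subsection

r = resid / np.sqrt(sigma2_hat * (1 - h))           # standardized residuals
print(r)
```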

Estimating the Error Variance

The error variance \(\sigma^2\) is estimated as:

\[\hat{\sigma}^2 = \frac{RSS}{n-p}.\]

where \(RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^2\) is the residual sum of squares and \(p\) is the number of regression coefficients estimated in the model. Note that if the model includes an intercept term, then \(p\) is the number of explanatory variables plus 1. For example, in the simple linear regression model above, \(p = 2\).
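A minimal sketch of this estimate on the same hypothetical data; here \(p = 2\) because the design matrix contains an intercept column and one explanatory variable:

```python
import numpy as np

# Same hypothetical data and fit as above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])   # intercept column + one explanatory variable
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

n, p = X.shape                  # p = 2: the intercept plus one slope coefficient
rss = np.sum(resid**2)          # residual sum of squares
sigma2_hat = rss / (n - p)      # estimate of the error variance
print(rss, sigma2_hat)
```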