A summary of the fitted model

The Simple Linear Regression

Suppose we have one response variable y, an explanatory variable x, and two models as follows

Data: (y_i,x_{i}),\quad i=1,\dots,n

Model 0: E(y_i) = \alpha

Model 1: E(y_i) = \alpha+\beta x_{i}

In the case of simple linear regression with only one explanatory variable, this compares a line that slopes through the data (Model 1) with a line that runs through the data but lies parallel to the horizontal axis (Model 0).

In order to fit Model 0 to the data, that is, to estimate the parameter in this model by least squares, we minimise

S(\alpha) = \sum_{i=1}^n(y_i-\alpha)^2, \quad \mbox{which gives} \quad \hat{\alpha} = \bar{y},

as illustrated in the top left hand side plot below. Therefore, the residual sum-of-squares for Model 0 is: \begin{aligned} S(\hat{\alpha}) &= \sum_{i=1}^n(y_i-\bar{y})^2 \\ &= S_{yy} \\ \end{aligned} corresponding to the top right hand side plot below.
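As a quick numerical illustration, here is a minimal Python sketch (using numpy and simulated data, so the numbers and variable names are purely illustrative and not part of the notes): fitting Model 0 amounts to taking the sample mean, and its residual sum-of-squares equals S_yy.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # simulated data

alpha_hat_0 = y.mean()                       # least squares estimate of alpha in Model 0
S_yy = np.sum((y - y.mean()) ** 2)           # total corrected sum of squares
rss_model0 = np.sum((y - alpha_hat_0) ** 2)  # residual sum of squares of Model 0
print(np.isclose(rss_model0, S_yy))          # True: S(alpha_hat) = S_yy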

Denote the residual sum-of-squares for Model 1 as

\begin{aligned} S(\hat{\alpha}, \hat{\beta}) &= \sum_{i=1}^n(y_i-\{\hat{\alpha}+\hat{\beta} x_{i}\})^2 \\ &= \sum_{i=1}^n(y_i-\hat{y}_i)^2 \end{aligned}

Recall that we are calculating the distances from the observed values y_1,\ldots,y_n to the fitted values \hat{y}_1,\ldots,\hat{y}_n, corresponding to the bottom left hand side plot above.
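A minimal sketch of the corresponding calculation for Model 1, using the standard least squares estimates \hat{\beta} = S_{xy}/S_{xx} and \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} (again with simulated x and y, which are illustrative assumptions only):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # simulated data

S_xx = np.sum((x - x.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))
beta_hat = S_xy / S_xx                       # least squares slope
alpha_hat = y.mean() - beta_hat * x.mean()   # least squares intercept
y_hat = alpha_hat + beta_hat * x             # fitted values from Model 1
rss_model1 = np.sum((y - y_hat) ** 2)        # residual sum of squares of Model 1
print(rss_model1)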

For completeness, we can also look at the differences between the fitted values obtained from Model 0 and those obtained from Model 1

\sum_{i=1}^{n}(\bar{y} - \hat{y}_i)^2

corresponding to the bottom right hand side plot above.

Sums of Squares

The residual sum of squares of Model 0 is referred to as the Total corrected sum of squares, TSS, and the sum of squares between the fitted values obtained from Model 0 and Model 1 is referred to as the Model sum of squares, MSS. Writing RSS for the residual sum of squares of Model 1, the three quantities are related by TSS = MSS + RSS.
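The decomposition TSS = MSS + RSS can be checked numerically; the sketch below (simulated, purely illustrative data again) computes all three sums of squares for a simple linear regression:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # simulated data

beta_hat, alpha_hat = np.polyfit(x, y, deg=1)  # least squares slope and intercept
y_hat = alpha_hat + beta_hat * x
TSS = np.sum((y - y.mean()) ** 2)              # residual SS of Model 0
RSS = np.sum((y - y_hat) ** 2)                 # residual SS of Model 1
MSS = np.sum((y.mean() - y_hat) ** 2)          # model sum of squares
print(np.isclose(TSS, MSS + RSS))              # True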

Coefficient of Determination R^2

In our discussion of least squares, the residual sum-of-squares for a particular model was proposed as a numerical measure of how well the model fits the data. This leads to a natural measure of how much variation in the data our model has explained, by comparing RSS with TSS. A simple but useful measure of model fit is given by

R^2 = 1-\frac{RSS}{TSS}

where RSS is the residual sum-of-squares for Model 1, the fitted model of interest, and TSS = \sum_{i=1}^n(y_i-\bar{y})^2 = S_{yy} is the residual sum of squares of the null model. Since Model 0 is more restricted, its residual sum-of-squares can never be smaller than that of Model 1; that is, TSS \geq RSS.

R^2 quantifies how much of a drop in the residual sum-of-squares is accounted for by fitting the proposed model, and is often referred to as the coefficient of determination. This is expressed on a helpful scale, as a proportion of the total variation in the data.

  • Values of R^2 approaching 1 indicate that the model is a good fit.

  • Values of R^2 less than 0.5 suggest that the model gives at best a moderate fit to the data.

  • Working with real data, we often observe very small R^2 values, well below 0.5.
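A minimal sketch of computing R^2 = 1 - RSS/TSS directly (simulated data as in the earlier sketches; the variable names are illustrative):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # simulated data

beta_hat, alpha_hat = np.polyfit(x, y, deg=1)
y_hat = alpha_hat + beta_hat * x
RSS = np.sum((y - y_hat) ** 2)
TSS = np.sum((y - y.mean()) ** 2)
print(1 - RSS / TSS)   # R^2: close to 1 here, since the simulated relationship is strong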

In the case of simple linear regression

Model 1: E(y_i) = \alpha+\beta x_i

R^2=r^2

where R^2 is the coefficient of determination and r is the sample correlation coefficient. To show this, recall that

RSS = \sum_{i=1}^n (y_i-\{\hat{\alpha}+\hat{\beta} x_i\})^2 = S_{yy}-\frac{(S_{xy})^2}{S_{xx}},

so that

\begin{aligned} R^2 &= 1-\frac{RSS}{TSS}\\ &=1-\frac{\sum_{i=1}^n({y_i}-\hat{y}_i)^2}{\sum_{i=1}^n(y_i-\bar{y})^2}\\ &=\frac{S_{yy}-\left(S_{yy}-\frac{(S_{xy})^2}{S_{xx}}\right)}{S_{yy}}\\ &=\frac{(S_{xy})^2}{S_{xx}S_{yy}} \\ &= r^2. \end{aligned}

Hence R^2 = r^2, i.e. in the case of simple linear regression the coefficient of determination is the squared sample correlation coefficient. This result does not extend to multiple linear regression.
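The identity R^2 = r^2 can also be checked numerically, for example with the sketch below (simulated data; np.corrcoef returns the sample correlation coefficient):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # simulated data

beta_hat, alpha_hat = np.polyfit(x, y, deg=1)
y_hat = alpha_hat + beta_hat * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]              # sample correlation coefficient
print(np.isclose(r_squared, r ** 2))     # True, in line with the derivation above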

Nested Models

In the case of the simple linear regression

Model 0: E(y_i) = \alpha

Model 1: E(y_i) = \alpha+\beta x_{i}

these models are nested. By setting \beta=0 in Model 1 we retrieve Model 0. In other words, the simpler Model 0 is a special case of the more complex Model 1.

In the case of simple linear regression through the origin

Model 0: E(y_i) = \alpha

Model 1: E(y_i) = \beta x_{i}

the formula for R^2, with TSS = \sum_i (y_i-\bar{y})^2, cannot be used: Model 0 and Model 1 are not nested, since setting \beta = 0 in Model 1 gives E(y_i) = 0 rather than E(y_i) = \alpha.
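To see why, the sketch below (with simulated data chosen purely for illustration) fits a regression through the origin, for which the least squares slope is \hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2, and evaluates 1 - RSS/TSS with TSS = \sum_i (y_i-\bar{y})^2; the result can even be negative, showing that this quantity is no longer a sensible measure of fit here.

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 10.0 + rng.normal(scale=1.0, size=x.size)  # large mean level, no real dependence on x

beta_hat = np.sum(x * y) / np.sum(x ** 2)      # least squares slope through the origin
y_hat = beta_hat * x
RSS = np.sum((y - y_hat) ** 2)
TSS = np.sum((y - y.mean()) ** 2)
print(1 - RSS / TSS)   # negative for data like this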

The Multiple Linear Regression

Suppose now we have one response variable y and k explanatory variables x_1, \ldots, x_k and two models as follows

Data: (y_i,x_{1i}, \ldots, x_{ki}),\quad i=1,\dots,n

Model 0: E(y_i) = \alpha

Model 1: E(y_i) = \alpha+\beta_1 x_{1i} + \ldots + \beta_{k} x_{ki}

Then we can calculate the coefficient of determination R^2 in the same way. However, in the case of multiple linear regression, where there is more than one explanatory variable in the model, we often refer to a quantity called the adjusted R^2, written R^2(adj), instead of R^2. As explanatory variables are added to the model, R^2 can never decrease (and typically increases), whereas R^2(adj) adjusts for the number of explanatory variables in the model.
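As a small illustration of this point, the sketch below (a minimal example using numpy's least squares solver on simulated data; the variables and helper function are illustrative assumptions) shows that R^2 cannot decrease when a further, completely uninformative, explanatory variable is added:

import numpy as np

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                  # pure noise, unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def r_squared(X, y):
    # R^2 for the least squares fit of E(y) = alpha + X beta
    A = np.column_stack([np.ones(len(y)), X])    # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

print(r_squared(x1[:, None], y))                 # model with x1 only
print(r_squared(np.column_stack([x1, x2]), y))   # never smaller after adding x2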

R^2(adj) as a measure of model fit

For any multiple linear regression E(y_i) = \alpha+\beta_1x_{1i}+\dots+\beta_{k}x_{ki}, the R^2(adj) is defined as R^2 \mbox{(adj)} = 1-\frac{\frac{RSS}{n-k-1}}{\frac{TSS}{n-1}}, where k is the number of explanatory variables, i.e. the number of coefficients in the model excluding the constant \alpha term.

R^2(adj) can also be calculated from the following identity

R^{2} \mbox{(adj)} ={1-(1-R^{2}){n-1 \over n-k-1}}.
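The two expressions for R^2(adj) agree, as the following sketch checks numerically (simulated data and variable names are illustrative assumptions only):

import numpy as np

rng = np.random.default_rng(4)
n, k = 40, 3
X = rng.normal(size=(n, k))                       # k explanatory variables
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])              # design matrix with intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
RSS = np.sum((y - A @ coef) ** 2)
TSS = np.sum((y - y.mean()) ** 2)
R2 = 1 - RSS / TSS

adj_from_definition = 1 - (RSS / (n - k - 1)) / (TSS / (n - 1))
adj_from_identity = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print(np.isclose(adj_from_definition, adj_from_identity))  # True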