4.3 Measures of Fit

After fitting a linear regression model, a natural question is how well the model describes the data. Visually, this amounts to assessing whether the observations are tightly clustered around the regression line. Both the coefficient of determination and the standard error of the regression measure how well the OLS regression line fits the data.

The Coefficient of Determination

$R^2$, the coefficient of determination, is the fraction of the sample variance of $Y_i$ that is explained by $X_i$. Mathematically, $R^2$ can be written as the ratio of the explained sum of squares to the total sum of squares. The explained sum of squares ($ESS$) is the sum of squared deviations of the predicted values $\hat{Y}_i$ from the average of the $Y_i$. The total sum of squares ($TSS$) is the sum of squared deviations of the $Y_i$ from their average. Thus we have

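\begin{align}
  ESS &= \sum_{i=1}^n \left( \hat{Y}_i - \overline{Y} \right)^2, \\
  TSS &= \sum_{i=1}^n \left( Y_i - \overline{Y} \right)^2, \\
  R^2 &= \frac{ESS}{TSS}.
\end{align}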

Since $TSS = ESS + SSR$ we can also write

$$R^2 = 1 - \frac{SSR}{TSS}$$

where $SSR$ is the sum of squared residuals, a measure of the errors made when predicting $Y$ by $X$. The $SSR$ is defined as

$$SSR = \sum_{i=1}^n \hat{u}_i^2.$$

$R^2$ lies between $0$ and $1$. It is easy to see that a perfect fit, i.e., no errors made when fitting the regression line, implies $R^2 = 1$, since then we have $SSR = 0$. Conversely, if our estimated regression line does not explain any variation in the $Y_i$, we have $ESS = 0$ and consequently $R^2 = 0$.
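The decomposition $TSS = ESS + SSR$ and the two equivalent expressions for $R^2$ can be checked numerically. The following is a minimal sketch under stated assumptions: it uses R's built-in cars data set as a stand-in (not the test score data of this chapter), and the object names fit, Y_hat and u_hat are chosen for illustration only.

```r
# illustrative only: the built-in 'cars' data stand in for an arbitrary sample
fit <- lm(dist ~ speed, data = cars)

Y     <- cars$dist
Y_hat <- fitted(fit)      # predicted values
u_hat <- residuals(fit)   # residuals

# sums of squares as defined above
ESS <- sum((Y_hat - mean(Y))^2)
TSS <- sum((Y - mean(Y))^2)
SSR <- sum(u_hat^2)

# check the decomposition TSS = ESS + SSR (model includes an intercept)
all.equal(TSS, ESS + SSR)

# both expressions for R^2 should give the same value
R2_ratio <- ESS / TSS
R2_resid <- 1 - SSR / TSS
all.equal(R2_ratio, R2_resid)
```

Both calls to all.equal() should return TRUE up to numerical tolerance. Note that the sketch fits a model with an intercept, as is the case throughout this chapter.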

The Standard Error of the Regression

The Standard Error of the Regression ($SER$) is an estimator of the standard deviation of the regression error $u_i$, computed from the residuals $\hat{u}_i$. As such it measures the magnitude of a typical deviation from the regression line, i.e., the magnitude of a typical residual.

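$$SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2} \quad \text{where} \quad s_{\hat{u}}^2 = \frac{1}{n-2} \sum_{i=1}^n \hat{u}_i^2 = \frac{SSR}{n-2}$$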

Remember that the $u_i$ are unobserved. This is why we use their estimated counterparts, the residuals $\hat{u}_i$, instead. See Chapter 4.3 of the book for a more detailed comment on the $SER$.
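To make the role of the divisor $n-2$ concrete, here is a small sketch that computes the $SER$ from the residuals and compares it with R's own value. It again assumes the built-in cars data as a stand-in, and the names fit, u_hat and sd_n are hypothetical. The divisor $n-2$ reflects that two coefficients (intercept and slope) have been estimated.

```r
# illustrative only: built-in 'cars' data, hypothetical object names
fit   <- lm(dist ~ speed, data = cars)
u_hat <- residuals(fit)
n     <- nobs(fit)

# SER with the degrees-of-freedom correction n - 2
SER <- sqrt(sum(u_hat^2) / (n - 2))

# sigma() extracts the residual standard error of a fitted lm object
all.equal(SER, sigma(fit))

# dividing by n instead of n - 2 gives a slightly smaller value
sd_n <- sqrt(sum(u_hat^2) / n)
SER > sd_n
```

For large samples the difference between dividing by $n$ and by $n-2$ is negligible.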

Application to the Test Score Data

Both measures of fit can be obtained by using the function summary() with an lm object provided as the only argument. While the function lm() only prints out the estimated coefficients to the console, summary() provides additional predefined information such as the regression’s $R^2$ and the $SER$.

mod_summary <- summary(linear_model)
mod_summary
## 
## Call:
## lm(formula = score ~ STR, data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
## STR          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

The $R^2$ in the output is called Multiple R-squared and has a value of about $0.051$. Hence, $5.1\%$ of the variance of the dependent variable score is explained by the explanatory variable STR. That is, the regression explains little of the variance in score, and much of the variation in test scores remains unexplained (cf. Figure 4.3 of the book).

The $SER$ is called Residual standard error and equals $18.58$. The unit of the $SER$ is the same as the unit of the dependent variable. That is, the typical deviation of an observed test score from the regression line is about $18.58$ points.
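Both numbers can also be pulled out of the summary object directly instead of being read off the printed output. The sketch below assumes the built-in cars data and a hypothetical model object fit. The components r.squared and sigma of a summary.lm object hold the values printed as Multiple R-squared and Residual standard error.

```r
# illustrative only: built-in 'cars' data, hypothetical object names
fit         <- lm(dist ~ speed, data = cars)
fit_summary <- summary(fit)

# components of the summary.lm object
r2  <- fit_summary$r.squared   # printed as "Multiple R-squared"
ser <- fit_summary$sigma       # printed as "Residual standard error"

# residual degrees of freedom (n - 2 for one regressor plus intercept)
df_res <- df.residual(fit)

r2
ser
df_res
```

Applied to the test score regression, the same components of mod_summary would be expected to return the values shown in the output above.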

Now, let us check whether summary() uses the same definitions for $R^2$ and $SER$ as we do when computing them manually.

# compute R^2 manually
SSR <- sum(mod_summary$residuals^2)
TSS <- sum((CASchools$score - mean(CASchools$score))^2)
R2 <- 1 - SSR/TSS

# print the value to the console
R2
## [1] 0.05124009
# compute SER manually
n <- nrow(CASchools)
SER <- sqrt(SSR / (n-2))

# print the value to the console
SER
## [1] 18.58097

We find that the results coincide. Note that the values printed by summary() are rounded (to four significant digits in the output above). Can you reproduce this rounding using R?
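One possible way to round the manually computed values is sketched below. It assumes the built-in cars data and hypothetical object names. round() fixes the number of decimal places, whereas signif() fixes the number of significant digits.

```r
# illustrative only: built-in 'cars' data, hypothetical object names
fit         <- lm(dist ~ speed, data = cars)
fit_summary <- summary(fit)

# round() keeps a fixed number of decimal places
ser_rounded <- round(fit_summary$sigma, 2)

# signif() keeps a fixed number of significant digits
r2_rounded <- signif(fit_summary$r.squared, 4)

ser_rounded
r2_rounded
```

The same calls work on the objects SER and R2 computed above.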