4.3 Measures of Fit
After fitting a linear regression model, a natural question is how well the model describes the data. Visually, this amounts to assessing whether the observations are tightly clustered around the regression line. Both the coefficient of determination and the standard error of the regression measure how well the OLS regression line fits the data.
The Coefficient of Determination
$R^2$, the coefficient of determination, is the fraction of the sample variance of $Y_i$ that is explained by $X_i$. Mathematically, $R^2$ can be written as the ratio of the explained sum of squares to the total sum of squares. The explained sum of squares ($ESS$) is the sum of squared deviations of the predicted values $\hat{Y}_i$ from the average of the $Y_i$. The total sum of squares ($TSS$) is the sum of squared deviations of the $Y_i$ from their average. Thus we have
\[ ESS = \sum_{i=1}^n \left(\hat{Y}_i - \overline{Y}\right)^2, \quad TSS = \sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2, \quad R^2 = \frac{ESS}{TSS}. \]
Since $TSS = ESS + SSR$, we can also write
\[ R^2 = 1 - \frac{SSR}{TSS}, \]
where $SSR$ is the sum of squared residuals, a measure of the errors made when predicting $Y$ by $X$. The $SSR$ is defined as
\[ SSR = \sum_{i=1}^n \hat{u}_i^2. \]
$R^2$ lies between $0$ and $1$. It is easy to see that a perfect fit, i.e., no errors made when fitting the regression line, implies $R^2 = 1$, since then we have $SSR = 0$. On the contrary, if the estimated regression line does not explain any variation in the $Y_i$, we have $ESS = 0$ and consequently $R^2 = 0$.
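To make the decomposition concrete, here is a minimal sketch that verifies $TSS = ESS + SSR$ for the test score regression, assuming the objects linear_model and CASchools from the previous section are still in the workspace:
# verify the decomposition TSS = ESS + SSR
# (assumes 'linear_model' and 'CASchools' from the previous section)
y_hat <- fitted(linear_model)            # predicted values
y_bar <- mean(CASchools$score)           # sample average of the Y_i
ESS <- sum((y_hat - y_bar)^2)            # explained sum of squares
TSS <- sum((CASchools$score - y_bar)^2)  # total sum of squares
SSR <- sum(residuals(linear_model)^2)    # sum of squared residuals
all.equal(TSS, ESS + SSR)                # TRUE: the decomposition holds
ESS / TSS                                # equals 1 - SSR/TSS, i.e., the R^2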
The Standard Error of the Regression
The Standard Error of the Regression ($SER$) is an estimator of the standard deviation of the residuals $\hat{u}_i$. As such it measures the magnitude of a typical deviation from the regression line, i.e., the magnitude of a typical residual:
\[ SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2} \quad \text{where} \quad s_{\hat{u}}^2 = \frac{1}{n-2} \sum_{i=1}^n \hat{u}_i^2 = \frac{SSR}{n-2}. \]
Remember that the $u_i$ are unobserved. This is why we use their estimated counterparts, the residuals $\hat{u}_i$, instead. See Chapter 4.3 of the book for a more detailed comment on the $SER$.
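Note that the $SER$ is not quite the sample standard deviation of the residuals, because it divides by $n-2$ rather than $n-1$. A small sketch, again assuming linear_model is available, makes the difference visible:
# sd() divides by n-1, whereas the SER divides by n-2,
# so the two numbers are close but not identical
res <- residuals(linear_model)
n <- length(res)
sd(res)                      # sample standard deviation (divides by n-1)
sqrt(sum(res^2) / (n - 2))   # the SER (divides by n-2)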
Application to the Test Score Data
Both measures of fit can be obtained by using the function summary() with an lm object provided as the only argument. While the function lm() only prints out the estimated coefficients to the console, summary() provides additional predefined information such as the regression’s $R^2$ and the $SER$.
mod_summary <- summary(linear_model)
mod_summary
##
## Call:
## lm(formula = score ~ STR, data = CASchools)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -47.727 -14.251   0.483  12.822  48.540
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
## STR          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
The $R^2$ in the output is called Multiple R-squared and has a value of 0.051. Hence, 5.1% of the variance of the dependent variable score is explained by the explanatory variable STR. That is, the regression explains little of the variance in score, and much of the variation in test scores remains unexplained (cf. Figure 4.3 of the book).
The $SER$ is called Residual standard error and equals 18.58. The unit of the $SER$ is the same as the unit of the dependent variable. That is, on average, the deviation between an actual achieved test score and the regression line is 18.58 points.
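Instead of reading the values off the printed output, both measures can also be extracted programmatically: the summary of an lm object stores the $R^2$ in the component r.squared and the $SER$ in the component sigma.
# extract R^2 and the SER from the summary object
mod_summary$r.squared  # coefficient of determination
mod_summary$sigma      # standard error of the regression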
Now, let us check whether summary() uses the same definitions of $R^2$ and $SER$ as we do when computing them manually.
# compute R^2 manually
SSR <- sum(mod_summary$residuals^2)
TSS <- sum((CASchools$score - mean(CASchools$score))^2)
R2 <- 1 - SSR/TSS
# print the value to the console
R2
## [1] 0.05124009
# compute SER manually
n <- nrow(CASchools)
SER <- sqrt(SSR / (n-2))
# print the value to the console
SER
## [1] 18.58097
We find that the results coincide. Note that the values printed by summary() are rounded: both the $R^2$ and the $SER$ are displayed with four significant digits. Can you reproduce this rounding using R?
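As a hint, one way is the function signif(), which rounds a number to a given number of significant digits:
# reproduce the rounding in the summary() output (four significant digits)
signif(R2, digits = 4)
signif(SER, digits = 4)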