2.7 ANOVA and model fit

2.7.1 ANOVA

As we have seen in Sections 2.5 and 2.6, the variance of the error, $\sigma^2$, plays a fundamental role in the inference for the model coefficients and in prediction. In this section we will see how the variance of $Y$ is decomposed into two parts, one corresponding to the regression and the other to the error. This decomposition is called the ANalysis Of VAriance (ANOVA).

Before explaining ANOVA, it is important to recall an interesting result: the mean of the fitted values $\hat Y_1,\ldots,\hat Y_n$ is the mean of $Y_1,\ldots,Y_n$. This is easily seen by plugging in the expression of $\hat\beta_0$:
$$\frac{1}{n}\sum_{i=1}^n \hat Y_i = \frac{1}{n}\sum_{i=1}^n (\hat\beta_0 + \hat\beta_1 X_i) = \hat\beta_0 + \hat\beta_1 \bar X = (\bar Y - \hat\beta_1 \bar X) + \hat\beta_1 \bar X = \bar Y.$$
The ANOVA decomposition considers the following measures of variation related with the response:

  • $\text{SST} = \sum_{i=1}^n (Y_i - \bar Y)^2$, the total sum of squares. This is the total variation of $Y_1,\ldots,Y_n$, since $\text{SST} = n s_y^2$, where $s_y^2$ is the sample variance of $Y_1,\ldots,Y_n$.
  • $\text{SSR} = \sum_{i=1}^n (\hat Y_i - \bar Y)^2$, the regression sum of squares[12]. This is the variation explained by the regression line, that is, the variation from $\bar Y$ that is explained by the estimated conditional mean $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i$. $\text{SSR} = n s_{\hat y}^2$, where $s_{\hat y}^2$ is the sample variance of $\hat Y_1,\ldots,\hat Y_n$.
  • $\text{SSE} = \sum_{i=1}^n (Y_i - \hat Y_i)^2$, the sum of squared errors[13]. This is the variation around the conditional mean. Recall that $\text{SSE} = \sum_{i=1}^n \hat\varepsilon_i^2 = (n-2)\hat\sigma^2$, where $\hat\sigma^2$ is the sample variance of $\hat\varepsilon_1,\ldots,\hat\varepsilon_n$.

The ANOVA decomposition is
$$\underbrace{\text{SST}}_{\text{Variation of } Y_i\text{'s}} = \underbrace{\text{SSR}}_{\text{Variation of } \hat Y_i\text{'s}} + \underbrace{\text{SSE}}_{\text{Variation of } \hat\varepsilon_i\text{'s}} \tag{2.11}$$
or, equivalently (dividing by $n$ in (2.11)),
$$\underbrace{s_y^2}_{\text{Variance of } Y_i\text{'s}} = \underbrace{s_{\hat y}^2}_{\text{Variance of } \hat Y_i\text{'s}} + \underbrace{(n-2)/n \times \hat\sigma^2}_{\text{Variance of } \hat\varepsilon_i\text{'s}}.$$
The graphical interpretation of (2.11) is shown in Figures 2.24 and 2.25.


Figure 2.24: Visualization of the ANOVA decomposition. SST measures the variation of $Y_1,\ldots,Y_n$ with respect to $\bar Y$. SSR measures the variation with respect to the conditional means, $\hat\beta_0 + \hat\beta_1 X_i$. SSE collects the variation of the residuals.

Figure 2.25: Illustration of the ANOVA decomposition and its dependence on $\sigma^2$ and $\hat\sigma^2$. Application also available here.
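
The decomposition (2.11) can also be checked numerically. The following minimal sketch simulates a dataset (the objects x, y, and fit below are illustrative, not from the book's datasets) and verifies that SST = SSR + SSE and that the mean of the fitted values equals $\bar Y$.

# Simulated data to verify the ANOVA decomposition numerically
# (illustrative sketch; x, y, and fit are not objects used elsewhere in the book)
set.seed(12345)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

# Sums of squares
SST <- sum((y - mean(y))^2)
SSR <- sum((fit$fitted.values - mean(y))^2)
SSE <- sum(fit$residuals^2)

# SST = SSR + SSE, up to rounding error
SST - (SSR + SSE)

# The mean of the fitted values is the mean of the response
mean(fit$fitted.values) - mean(y)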

The ANOVA table summarizes the decomposition of the variance. Here it is given in the layout employed by R.

|           | Degrees of freedom | Sum of squares | Mean squares             | $F$-value                               | $p$-value |
|-----------|--------------------|----------------|--------------------------|-----------------------------------------|-----------|
| Predictor | $1$                | SSR            | $\frac{\text{SSR}}{1}$   | $\frac{\text{SSR}/1}{\text{SSE}/(n-2)}$ | $p$       |
| Residuals | $n-2$              | SSE            | $\frac{\text{SSE}}{n-2}$ |                                         |           |

The anova function in R takes a model as an input and returns the ANOVA table. In R Commander, the ANOVA table can be computed by going to 'Models' -> 'Hypothesis tests' -> 'ANOVA table...'. In the 'Type of Tests' option, select 'Sequential ("Type I")'. (This types anova() for you…)

# Fit a linear model (alternatively, using R Commander)
model <- lm(MathMean ~ ReadingMean, data = pisa)
summary(model)
## 
## Call:
## lm(formula = MathMean ~ ReadingMean, data = pisa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.039 -10.151  -2.187   7.804  50.241 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -62.65638   19.87265  -3.153  0.00248 ** 
## ReadingMean   1.13083    0.04172  27.102  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.72 on 63 degrees of freedom
## Multiple R-squared:  0.921,  Adjusted R-squared:  0.9198 
## F-statistic: 734.5 on 1 and 63 DF,  p-value: < 2.2e-16

# ANOVA table
anova(model)
## Analysis of Variance Table
## 
## Response: MathMean
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## ReadingMean  1 181525  181525  734.53 < 2.2e-16 ***
## Residuals   63  15569     247                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “$F$-value” of the ANOVA table represents the value of the $F$-statistic $\frac{\text{SSR}/1}{\text{SSE}/(n-2)}$. This statistic is employed to test
$$H_0:\beta_1=0 \quad \text{vs.} \quad H_1:\beta_1\neq 0,$$
that is, the hypothesis of no linear dependence of $Y$ on $X$. The result of this test is completely equivalent to the $t$-test for $\beta_1$ that we saw in Section 2.5 (this is something specific to simple linear regression – the $F$-test will not be equivalent to the $t$-test for $\beta_1$ in Chapter 3). It happens that
$$F = \frac{\text{SSR}/1}{\text{SSE}/(n-2)} \stackrel{H_0}{\sim} F_{1,n-2},$$
where $F_{1,n-2}$ is Snedecor's $F$ distribution[14] with $1$ and $n-2$ degrees of freedom. If $H_0$ is true, then $F$ is expected to be small, since SSR will be close to zero. The $p$-value of this test is the same as the $p$-value of the $t$-test for $H_0:\beta_1=0$.

Recall that the $F$-statistic, its $p$-value, and the degrees of freedom are also given in the output of summary.
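
As a sanity check, the entries of the ANOVA table can be recomputed by hand from the model fitted above (this sketch assumes that model and the pisa dataset are available, as in the previous code). The $F$-statistic follows from SSR and SSE, and its $p$-value is the upper tail probability of the $F_{1,n-2}$ distribution.

# Recompute the ANOVA table entries by hand (assumes model fitted above)
n <- nobs(model)
yBar <- mean(model$fitted.values) # equals the mean of the response
SSR <- sum((model$fitted.values - yBar)^2)
SSE <- sum(model$residuals^2)

# F-statistic and its p-value from the F_{1, n-2} distribution
Fval <- (SSR / 1) / (SSE / (n - 2))
pVal <- pf(Fval, df1 = 1, df2 = n - 2, lower.tail = FALSE)
c(F = Fval, p = pVal) # compare with anova(model) and summary(model)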

For the regressions y6 ~ x6 and y7 ~ x7 in the assumptions dataset, compute their ANOVA tables. Check that the $p$-values of the $t$-test for $\beta_1$ and of the $F$-test are the same.
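
A possible starting point for the exercise, assuming that the assumptions data frame is loaded and contains the columns x6, y6, x7, and y7:

# Sketch for the exercise (assumes the assumptions data frame is loaded)
mod6 <- lm(y6 ~ x6, data = assumptions)
mod7 <- lm(y7 ~ x7, data = assumptions)

# ANOVA tables
anova(mod6)
anova(mod7)

# The p-values of the t-tests for beta_1 should coincide with the F-test p-values
summary(mod6)$coefficients
summary(mod7)$coefficients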

2.7.2 The $R^2$

The coefficient of determination $R^2$ is closely related to the ANOVA decomposition. $R^2$ is defined as
$$R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\text{SSR}}{\text{SSR}+\text{SSE}} = \frac{\text{SSR}}{\text{SSR}+(n-2)\hat\sigma^2}.$$
$R^2$ measures the proportion of variation of the response variable $Y$ that is explained by the predictor $X$ through the regression. The proportion of the total variation of $Y$ that is not explained is $1-R^2=\frac{\text{SSE}}{\text{SST}}$. Intuitively, $R^2$ measures the tightness of the data cloud around the regression line, since it is directly related to $\hat\sigma^2$. Check in Figure 2.25 how changing the value of $\sigma^2$ (not $\hat\sigma^2$, although $\hat\sigma^2$ obviously depends on $\sigma^2$) affects the $R^2$.
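
The definition can be verified on the model for the pisa dataset fitted before (assuming model is still available): computing $1 - \text{SSE}/\text{SST}$ by hand should match the “Multiple R-squared” reported by summary.

# R^2 from the ANOVA decomposition (assumes model fitted above)
mathMean <- model$model$MathMean # response values used in the fit
SST <- sum((mathMean - mean(mathMean))^2)
SSE <- sum(model$residuals^2)
1 - SSE / SST # proportion of variation explained
summary(model)$r.squared # same value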

The $R^2$ is related to the sample correlation coefficient
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^n (X_i-\bar X)(Y_i-\bar Y)}{\sqrt{\sum_{i=1}^n (X_i-\bar X)^2}\sqrt{\sum_{i=1}^n (Y_i-\bar Y)^2}}$$
and it can be seen that $R^2 = r_{xy}^2$. Interestingly, it also holds that $R^2 = r_{y\hat y}^2$, that is, $R^2$ is the square of the sample correlation coefficient between $Y_1,\ldots,Y_n$ and $\hat Y_1,\ldots,\hat Y_n$, a fact that is not immediately evident. This can be easily checked by first noting that
$$\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i = (\bar Y - \hat\beta_1 \bar X) + \hat\beta_1 X_i = \bar Y + \hat\beta_1 (X_i - \bar X) \tag{2.12}$$
and then replacing (2.12) in
$$r_{y\hat y}^2 = \frac{s_{y\hat y}^2}{s_y^2 s_{\hat y}^2} = \frac{\left(\sum_{i=1}^n (Y_i-\bar Y)(\hat Y_i-\bar Y)\right)^2}{\sum_{i=1}^n (Y_i-\bar Y)^2 \sum_{i=1}^n (\hat Y_i-\bar Y)^2} = \frac{\left(\sum_{i=1}^n (Y_i-\bar Y)(\bar Y + \hat\beta_1(X_i-\bar X) - \bar Y)\right)^2}{\sum_{i=1}^n (Y_i-\bar Y)^2 \sum_{i=1}^n (\bar Y + \hat\beta_1(X_i-\bar X) - \bar Y)^2} = r_{xy}^2.$$

The equality $R^2 = r_{y\hat y}^2$ is still true for multiple linear regression, e.g. $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$. On the contrary, there is no correlation coefficient between three or more variables, so $r_{x_1 x_2 y}$ does not exist. Hence, $R^2 = r_{xy}^2$ is specific to simple linear regression.
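
A quick simulated check of the fact that $R^2 = r_{y\hat y}^2$ also in multiple linear regression (the data below are simulated just for illustration):

# R^2 equals the squared correlation between response and fitted values,
# also in multiple linear regression (illustrative simulated data)
set.seed(98765)
x1 <- rnorm(100)
x2 <- rnorm(100)
yMult <- 1 + 0.5 * x1 - x2 + rnorm(100)
modMult <- lm(yMult ~ x1 + x2)
summary(modMult)$r.squared
cor(yMult, modMult$fitted.values)^2 # same value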

The result $R^2 = r_{xy}^2 = r_{y\hat y}^2$ can be checked numerically and graphically with the next code.

# Responses generated following a linear model
set.seed(343567) # Fixes the seed so that the same random data is generated
x <- rnorm(50)
eps <- rnorm(50)
y <- -1 + 2 * x + eps

# Regression model
reg <- lm(y ~ x)
yHat <- reg$fitted.values

# Summary
summary(reg)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5834 -0.6015 -0.1466  0.6006  3.0419 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.9027     0.1503  -6.007 2.45e-07 ***
## x             2.1107     0.1443  14.623  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.059 on 48 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.8129 
## F-statistic: 213.8 on 1 and 48 DF,  p-value: < 2.2e-16

# Square of the correlation coefficient
cor(y, x)^2
## [1] 0.8166752
cor(y, yHat)^2
## [1] 0.8166752

# Plots
scatterplot(y ~ x, smooth = FALSE)
scatterplot(y ~ yHat, smooth = FALSE)

We conclude this section by pointing out two common sins regarding the use of the $R^2$. First, recall two important concepts regarding the application of any regression model in practice, in particular the linear model:

  1. Correctness. The linear model is built on certain assumptions, such as the ones we saw in Section 2.4. All the inferential results are based on these assumptions being true![15] A model is formally correct whenever the assumptions on which it is based are not violated in the data.

  2. Usefulness. The usefulness of the model is a more subjective concept, but it is usually measured by the accuracy in the prediction and explanation of the response $Y$ by the predictor $X$. For example, $Y = 0\cdot X + \varepsilon$ is a valid linear model, but it is completely useless for predicting $Y$ from $X$ (a quick simulation of this situation is sketched below).
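
A quick simulation of the $Y = 0\cdot X + \varepsilon$ situation mentioned above: the model is correct by construction, yet $X$ is useless for predicting $Y$ and the $R^2$ is close to zero (the data are simulated just for illustration).

# A correct but useless linear model: the true slope is zero (simulated data)
set.seed(4567)
xUseless <- rnorm(100)
yUseless <- 0 * xUseless + rnorm(100)
regUseless <- lm(yUseless ~ xUseless)
summary(regUseless)$r.squared # close to zero
summary(regUseless)$coefficients # the slope is not significantly different from zero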

Figure 2.25 shows a regression line fitted to a small dataset, for various levels of $\sigma^2$. All the linear models are correct by construction, but the ones with a larger $R^2$ are more useful for predicting/explaining $Y$ from $X$, since they do so in a more precise way.
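
The effect shown in Figure 2.25 can be mimicked with a short simulation: the same linear trend is fitted under increasing error standard deviations, and the resulting $R^2$ decreases as $\sigma$ grows (simulated data for illustration only).

# R^2 of the same linear trend under increasing error variance (illustration)
set.seed(111)
xSim <- rnorm(100)
sapply(c(0.1, 0.5, 1, 2, 4), function(sigma) {
  ySim <- -1 + 2 * xSim + rnorm(100, sd = sigma)
  summary(lm(ySim ~ xSim))$r.squared
})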

$R^2$ does not measure the correctness of a linear model but its usefulness (for prediction, for explaining the variance of $Y$), assuming the model is correct.

Blindly trusting the $R^2$ can lead to catastrophic conclusions, since the model may not be correct. Here is a counterexample: a linear regression performed on data that clearly does not satisfy the assumptions discussed in Section 2.4, but that nevertheless has a large $R^2$. Note how biased the predictions would be for $x = 0.35$ and $x = 0.65$!

# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x <- seq(0.15, 1, l = 100)
set.seed(123456)
eps <- rnorm(n = 100, sd = 0.25 * x^2)
y <- 1 - 2 * x * (1 + 0.25 * sin(4 * pi * x)) + eps

# Great R^2!?
reg <- lm(y ~ x)
summary(reg)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53525 -0.18020  0.02811  0.16882  0.46896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.87190    0.05860   14.88   <2e-16 ***
## x           -1.69268    0.09359  -18.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.232 on 98 degrees of freedom
## Multiple R-squared:  0.7695, Adjusted R-squared:  0.7671 
## F-statistic: 327.1 on 1 and 98 DF,  p-value: < 2.2e-16

# But prediction is obviously problematic
scatterplot(y ~ x, smooth = FALSE)

So remember:

A large $R^2$ means nothing if the assumptions of the model do not hold. $R^2$ is the proportion of variance of $Y$ explained by $X$, but, of course, only when the linear model is correct.


  12. Recall that SSR is different from RSS (Residual Sum of Squares, Section 2.3).

  13. Recall that SSE and RSS (for $(\hat\beta_0,\hat\beta_1)$) are just different names for referring to the same quantity: $\text{SSE} = \sum_{i=1}^n (Y_i - \hat Y_i)^2 = \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2 = \text{RSS}(\hat\beta_0,\hat\beta_1)$.

  14. The $F_{n,m}$ distribution arises as the quotient of two independent random variables $\chi^2_n$ and $\chi^2_m$: $\frac{\chi^2_n/n}{\chi^2_m/m}$.

  15. If the assumptions are not satisfied (there is a mismatch between what is assumed to happen in theory and what the data actually is), then the inference results may be misleading.