2.7 ANOVA and model fit

2.7.1 ANOVA

As we have seen in Sections 2.5 and 2.6, the variance of the error, $\sigma^2$, plays a fundamental role in the inference for the model coefficients and in prediction. In this section we will see how the variance of $Y$ is decomposed into two parts, one corresponding to the regression and the other to the error. This decomposition is called the ANalysis Of VAriance (ANOVA).

Before explaining ANOVA, it is important to recall an interesting result: the mean of the fitted values $\hat Y_1,\ldots,\hat Y_n$ is the mean of $Y_1,\ldots,Y_n$. This is easily seen by plugging in the expression of $\hat\beta_0$: $$\frac{1}{n}\sum_{i=1}^n\hat Y_i=\frac{1}{n}\sum_{i=1}^n(\hat\beta_0+\hat\beta_1X_i)=\hat\beta_0+\hat\beta_1\bar X=(\bar Y-\hat\beta_1\bar X)+\hat\beta_1\bar X=\bar Y.$$ The ANOVA decomposition considers the following measures of variation related with the response:

  • $\mathrm{SST}=\sum_{i=1}^n(Y_i-\bar Y)^2$, the total sum of squares. This is the total variation of $Y_1,\ldots,Y_n$, since $\mathrm{SST}=ns_y^2$, where $s_y^2$ is the sample variance of $Y_1,\ldots,Y_n$.
  • $\mathrm{SSR}=\sum_{i=1}^n(\hat Y_i-\bar Y)^2$, the regression sum of squares[12]. This is the variation explained by the regression line, that is, the variation from $\bar Y$ that is explained by the estimated conditional mean $\hat Y_i=\hat\beta_0+\hat\beta_1X_i$. $\mathrm{SSR}=ns_{\hat y}^2$, where $s_{\hat y}^2$ is the sample variance of $\hat Y_1,\ldots,\hat Y_n$.
  • $\mathrm{SSE}=\sum_{i=1}^n(Y_i-\hat Y_i)^2$, the sum of squared errors[13]. This is the variation around the conditional mean. Recall that $\mathrm{SSE}=\sum_{i=1}^n\hat\varepsilon_i^2=(n-2)\hat\sigma^2$, where $\hat\sigma^2=\frac{\mathrm{SSE}}{n-2}$ is the estimate of $\sigma^2$ built from the residuals $\hat\varepsilon_1,\ldots,\hat\varepsilon_n$.

The ANOVA decomposition is $$\underbrace{\mathrm{SST}}_{\text{Variation of }Y_i\text{'s}}=\underbrace{\mathrm{SSR}}_{\text{Variation of }\hat Y_i\text{'s}}+\underbrace{\mathrm{SSE}}_{\text{Variation of }\hat\varepsilon_i\text{'s}}\tag{2.11}$$ or, equivalently (dividing by $n$ in (2.11)), $$\underbrace{s_y^2}_{\text{Variance of }Y_i\text{'s}}=\underbrace{s_{\hat y}^2}_{\text{Variance of }\hat Y_i\text{'s}}+\underbrace{\tfrac{n-2}{n}\hat\sigma^2}_{\text{Variance of }\hat\varepsilon_i\text{'s}}.$$ The graphical interpretation of (2.11) is shown in Figures 2.24 and 2.25.


Figure 2.24: Visualization of the ANOVA decomposition. SST measures the variation of $Y_1,\ldots,Y_n$ with respect to $\bar Y$. SSR measures the variation with respect to the conditional means, $\hat\beta_0+\hat\beta_1X_i$. SSE collects the variation of the residuals.

Figure 2.25: Illustration of the ANOVA decomposition and its dependence on $\sigma^2$ and $\hat\sigma^2$. Application also available here.
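Both the identity $\frac{1}{n}\sum_{i=1}^n\hat Y_i=\bar Y$ and the decomposition (2.11) can be verified numerically. The following is a minimal sketch using simulated data (the variables and seed below are illustrative, not part of the book's datasets):

# Simulated data, for illustration purposes only
set.seed(12345)
x <- rnorm(100)
y <- 0.5 + 1.5 * x + rnorm(100, sd = 0.5)
mod <- lm(y ~ x)

# The mean of the fitted values equals the mean of the response
mean(mod$fitted.values)
mean(y)

# ANOVA decomposition: SST = SSR + SSE
SST <- sum((y - mean(y))^2)
SSR <- sum((mod$fitted.values - mean(y))^2)
SSE <- sum(mod$residuals^2)
SST
SSR + SSE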

The ANOVA table summarizes the decomposition of the variance. Here it is given in the layout employed by R.

            Degrees of freedom   Sum of Squares   Mean Squares   F-value               p-value
Predictor   1                    SSR              SSR/1          (SSR/1)/(SSE/(n-2))   p
Residuals   n-2                  SSE              SSE/(n-2)

The anova function in R takes a model as an input and returns the ANOVA table. In R Commander, the ANOVA table can be computed by going to 'Models' -> 'Hypothesis tests' -> 'ANOVA table...'. In the 'Type of Tests' option, select 'Sequential ("Type I")'. (This types anova() for you…)

# Fit a linear model (alternatively, using R Commander)
model <- lm(MathMean ~ ReadingMean, data = pisa)
summary(model)
## 
## Call:
## lm(formula = MathMean ~ ReadingMean, data = pisa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.039 -10.151  -2.187   7.804  50.241 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -62.65638   19.87265  -3.153  0.00248 ** 
## ReadingMean   1.13083    0.04172  27.102  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.72 on 63 degrees of freedom
## Multiple R-squared:  0.921,  Adjusted R-squared:  0.9198 
## F-statistic: 734.5 on 1 and 63 DF,  p-value: < 2.2e-16

# ANOVA table
anova(model)
## Analysis of Variance Table
## 
## Response: MathMean
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## ReadingMean  1 181525  181525  734.53 < 2.2e-16 ***
## Residuals   63  15569     247                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “F-value” of the ANOVA table is the value of the F-statistic $\frac{\mathrm{SSR}/1}{\mathrm{SSE}/(n-2)}$. This statistic is employed to test $$H_0:\beta_1=0\quad\text{vs.}\quad H_1:\beta_1\neq 0,$$ that is, the hypothesis of no linear dependence of $Y$ on $X$. The result of this test is completely equivalent to the t-test for $\beta_1$ that we saw in Section 2.5 (this is specific to simple linear regression – the F-test will not be equivalent to the t-test for $\beta_1$ in Chapter 3). It happens that $$F=\frac{\mathrm{SSR}/1}{\mathrm{SSE}/(n-2)}\stackrel{H_0}{\sim}F_{1,n-2},$$ where $F_{1,n-2}$ is Snedecor's F distribution[14] with $1$ and $n-2$ degrees of freedom. If $H_0$ is true, then $F$ is expected to be small, since SSR will be close to zero. The p-value of this test is the same as the p-value of the t-test for $H_0:\beta_1=0$.

Recall that the F-statistic, its p-value and the degrees of freedom are also given in the output of summary.
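The equivalence between the F-test and the t-test for $\beta_1$ in simple linear regression can be checked by computing the F-statistic from the sums of squares. The sketch below uses simulated data (illustrative only; any fitted simple linear model would do):

# Simulated data, for illustration purposes only
set.seed(6789)
n <- 50
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n)
mod <- lm(y ~ x)

# F-statistic computed from the sums of squares
SSR <- sum((mod$fitted.values - mean(y))^2)
SSE <- sum(mod$residuals^2)
Fval <- (SSR / 1) / (SSE / (n - 2))
Fval

# It equals the square of the t-statistic for beta_1...
summary(mod)$coefficients["x", "t value"]^2

# ...and its p-value (upper tail of F_{1,n-2}) equals the t-test p-value
pf(Fval, df1 = 1, df2 = n - 2, lower.tail = FALSE)
summary(mod)$coefficients["x", "Pr(>|t|)"]
anova(mod)["x", "Pr(>F)"]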

For the regressions y6 ~ x6 and y7 ~ x7 in the assumptions dataset, compute their ANOVA tables. Check that the p-values of the t-test for $\beta_1$ and of the F-test are the same.

2.7.2 The $R^2$

The coefficient of determination $R^2$ is closely related to the ANOVA decomposition. $R^2$ is defined as $$R^2=\frac{\mathrm{SSR}}{\mathrm{SST}}=\frac{\mathrm{SSR}}{\mathrm{SSR}+\mathrm{SSE}}=\frac{\mathrm{SSR}}{\mathrm{SSR}+(n-2)\hat\sigma^2}.$$ $R^2$ measures the proportion of variation of the response variable $Y$ that is explained by the predictor $X$ through the regression. The proportion of total variation of $Y$ that is not explained is $1-R^2=\frac{\mathrm{SSE}}{\mathrm{SST}}$. Intuitively, $R^2$ measures the tightness of the data cloud around the regression line, since it is directly related to $\hat\sigma^2$. Check in Figure 2.25 how changing the value of $\sigma^2$ (not $\hat\sigma^2$, although $\hat\sigma^2$ obviously depends on $\sigma^2$) affects the $R^2$.
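As a quick sanity check, $R^2=\mathrm{SSR}/\mathrm{SST}=1-\mathrm{SSE}/\mathrm{SST}$ can be computed directly and compared with the value reported by summary. A minimal sketch with simulated data (illustrative only):

# Simulated data, for illustration purposes only
set.seed(2468)
x <- rnorm(60)
y <- 2 - x + rnorm(60)
mod <- lm(y ~ x)

# R^2 from the ANOVA decomposition
SST <- sum((y - mean(y))^2)
SSR <- sum((mod$fitted.values - mean(y))^2)
SSE <- sum(mod$residuals^2)
SSR / SST
1 - SSE / SST

# Matches the "Multiple R-squared" reported by summary()
summary(mod)$r.squared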

The $R^2$ is related to the sample correlation coefficient $$r_{xy}=\frac{s_{xy}}{s_xs_y}=\frac{\sum_{i=1}^n(X_i-\bar X)(Y_i-\bar Y)}{\sqrt{\sum_{i=1}^n(X_i-\bar X)^2}\sqrt{\sum_{i=1}^n(Y_i-\bar Y)^2}}$$ and it can be seen that $R^2=r_{xy}^2$. Interestingly, it also holds that $R^2=r_{y\hat y}^2$, that is, the square of the sample correlation coefficient between $Y_1,\ldots,Y_n$ and $\hat Y_1,\ldots,\hat Y_n$ is $R^2$, a fact that is not immediately evident. This can be easily seen by first noting that $$\hat Y_i=\hat\beta_0+\hat\beta_1X_i=(\bar Y-\hat\beta_1\bar X)+\hat\beta_1X_i=\bar Y+\hat\beta_1(X_i-\bar X)\tag{2.12}$$ and then replacing (2.12) into $$r_{y\hat y}^2=\frac{s_{y\hat y}^2}{s_y^2s_{\hat y}^2}=\frac{\left(\sum_{i=1}^n(Y_i-\bar Y)(\hat Y_i-\bar Y)\right)^2}{\sum_{i=1}^n(Y_i-\bar Y)^2\sum_{i=1}^n(\hat Y_i-\bar Y)^2}=\frac{\hat\beta_1^2\left(\sum_{i=1}^n(Y_i-\bar Y)(X_i-\bar X)\right)^2}{\hat\beta_1^2\sum_{i=1}^n(Y_i-\bar Y)^2\sum_{i=1}^n(X_i-\bar X)^2}=r_{xy}^2.$$

The equality $R^2=r_{y\hat y}^2$ also holds for multiple linear regression, e.g., for $Y=\beta_0+\beta_1X_1+\beta_2X_2+\varepsilon$. On the contrary, there is no correlation coefficient between three or more variables, so $r_{x_1x_2y}$ does not exist. Hence, $R^2=r_{xy}^2$ is specific to simple linear regression.
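The following sketch illustrates, with a simulated two-predictor model (not one of the book's datasets), that $R^2=r_{y\hat y}^2$ also holds in multiple linear regression, while the squared correlation with a single predictor no longer gives $R^2$:

# Simulated multiple linear regression, for illustration purposes only
set.seed(1357)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + 2 * x1 - x2 + rnorm(100)
mod <- lm(y ~ x1 + x2)

# R^2 equals the squared correlation between response and fitted values
summary(mod)$r.squared
cor(y, mod$fitted.values)^2

# But it is not the squared correlation with a single predictor
cor(y, x1)^2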

The result $R^2=r_{xy}^2=r_{y\hat y}^2$ can be checked numerically and graphically with the next code.

# Responses generated following a linear model
set.seed(343567) # Fix the seed to generate the same random data every time
x <- rnorm(50)
eps <- rnorm(50)
y <- -1 + 2 * x + eps

# Regression model
reg <- lm(y ~ x)
yHat <- reg$fitted.values

# Summary
summary(reg)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5834 -0.6015 -0.1466  0.6006  3.0419 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.9027     0.1503  -6.007 2.45e-07 ***
## x             2.1107     0.1443  14.623  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.059 on 48 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.8129 
## F-statistic: 213.8 on 1 and 48 DF,  p-value: < 2.2e-16

# Square of the correlation coefficient
cor(y, x)^2
## [1] 0.8166752
cor(y, yHat)^2
## [1] 0.8166752

# Plots (scatterplot() comes from the car package, which is loaded by R Commander)
library(car)
scatterplot(y ~ x, smooth = FALSE)
scatterplot(y ~ yHat, smooth = FALSE)

We conclude this section by pointing out two common sins regarding the use of $R^2$. First, recall two important concepts regarding the application of any regression model in practice, in particular the linear model:

  1. Correctness. The linear model is built on certain assumptions, such as the ones we saw in Section 2.4. All the inferential results are based on these assumptions being true![15] A model is formally correct whenever the assumptions on which it is based are not violated in the data.

  2. Usefulness. The usefulness of the model is a more subjective concept, but it is usually measured by the accuracy in predicting and explaining the response $Y$ from the predictor $X$. For example, $Y=0\cdot X+\varepsilon$ is a valid linear model, but it is completely useless for predicting $Y$ from $X$ (a small simulation illustrating this is given below).
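To illustrate the difference between correctness and usefulness, the next sketch simulates data from a linear model in which all the assumptions hold but $\beta_1=0$ (a hypothetical example): the fitted model is formally correct, yet its $R^2$ is close to zero and it is useless for prediction.

# A correct but useless linear model: the assumptions hold, but beta_1 = 0
set.seed(9753)
x <- rnorm(100)
y <- 1 + 0 * x + rnorm(100)  # Y does not depend on X
mod <- lm(y ~ x)

# Correct by construction, yet R^2 is close to zero
summary(mod)$r.squared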

Figure 2.25 shows a regression line fitted to a small dataset, for various levels of $\sigma^2$. All the linear models are correct by construction, but the ones with a larger $R^2$ are more useful for predicting/explaining $Y$ from $X$, since they do so in a more precise way.

$R^2$ does not measure the correctness of a linear model but its usefulness (for prediction, for explaining the variance of $Y$), assuming the model is correct.

Blindly trusting the $R^2$ can lead to catastrophic conclusions, since the model may not be correct. Here is a counterexample: a linear regression performed on data that clearly does not satisfy the assumptions discussed in Section 2.4, yet has a large $R^2$. Note how biased the predictions will be for $x=0.35$ and $x=0.65$!

# Create data that:
# 1) does not follow a linear model
# 2) has heteroskedastic errors
x <- seq(0.15, 1, l = 100)
set.seed(123456)
eps <- rnorm(n = 100, sd = 0.25 * x^2)
y <- 1 - 2 * x * (1 + 0.25 * sin(4 * pi * x)) + eps

# Great R^2!?
reg <- lm(y ~ x)
summary(reg)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53525 -0.18020  0.02811  0.16882  0.46896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.87190    0.05860   14.88   <2e-16 ***
## x           -1.69268    0.09359  -18.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.232 on 98 degrees of freedom
## Multiple R-squared:  0.7695, Adjusted R-squared:  0.7671 
## F-statistic: 327.1 on 1 and 98 DF,  p-value: < 2.2e-16

# But prediction is obviously problematic
scatterplot(y ~ x, smooth = FALSE)

So remember:

A large $R^2$ means nothing if the assumptions of the model do not hold. $R^2$ is the proportion of variance of $Y$ explained by $X$, but, of course, only when the linear model is correct.


  12. Recall that SSR is different from RSS (Residual Sum of Squares, Section 2.3).

  13. Recall that SSE and RSS (for $(\hat\beta_0,\hat\beta_1)$) are just different names for referring to the same quantity: $\mathrm{SSE}=\sum_{i=1}^n(Y_i-\hat Y_i)^2=\sum_{i=1}^n(Y_i-\hat\beta_0-\hat\beta_1X_i)^2=\mathrm{RSS}(\hat\beta_0,\hat\beta_1)$.

  14. The $F_{n,m}$ distribution arises as the quotient of two independent random variables $\chi^2_n$ and $\chi^2_m$, each divided by its degrees of freedom: $\frac{\chi^2_n/n}{\chi^2_m/m}$.

  15. If the assumptions are not satisfied (i.e., there is a mismatch between what is assumed in theory and what the data actually is), then the inference results may be misleading.