3.6 ANOVA and model fit
3.6.1 ANOVA
The ANOVA decomposition for multiple linear regression is quite analogous to the one in simple linear regression. The ANOVA decomposes the variance of $Y$ into two parts, one corresponding to the regression and the other to the error. Since the difference between simple and multiple linear regression is the number of predictors – the response $Y$ is unique in both cases – the ANOVA decompositions are highly similar, as we will see.
As in simple linear regression, the mean of the fitted values $\hat Y_1, \ldots, \hat Y_n$ is the mean of $Y_1, \ldots, Y_n$. This is an important result that can be checked using matrix notation. The ANOVA decomposition considers the following measures of variation related to the response:
- $SST = \sum_{i=1}^n (Y_i - \bar Y)^2$, the total sum of squares. This is the total variation of $Y_1, \ldots, Y_n$, since $SST = n s_y^2$, where $s_y^2$ is the sample variance of $Y_1, \ldots, Y_n$.
- $SSR = \sum_{i=1}^n (\hat Y_i - \bar Y)^2$, the regression sum of squares. This is the variation explained by the regression plane, that is, the variation from $\bar Y$ that is explained by the estimated conditional mean $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_k X_{ik}$. $SSR = n s_{\hat y}^2$, where $s_{\hat y}^2$ is the sample variance of $\hat Y_1, \ldots, \hat Y_n$.
- $SSE = \sum_{i=1}^n (Y_i - \hat Y_i)^2$, the sum of squared errors²⁵. This is the variation around the conditional mean. Recall that $SSE = \sum_{i=1}^n \hat\varepsilon_i^2 = (n - k - 1)\hat\sigma^2$, where $\hat\sigma^2$ is the sample variance of $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$.
The ANOVA decomposition is exactly the same as in simple linear regression:
$$\underbrace{SST}_{\text{Variation of } Y_i\text{'s}} = \underbrace{SSR}_{\text{Variation of } \hat Y_i\text{'s}} + \underbrace{SSE}_{\text{Variation of } \hat\varepsilon_i\text{'s}} \tag{3.9}$$
or, equivalently (dividing by $n$ in (3.9)),
$$\underbrace{s_y^2}_{\text{Variance of } Y_i\text{'s}} = \underbrace{s_{\hat y}^2}_{\text{Variance of } \hat Y_i\text{'s}} + \underbrace{(n - k - 1)/n \times \hat\sigma^2}_{\text{Variance of } \hat\varepsilon_i\text{'s}}.$$
Notice the $n - k - 1$ instead of simple linear regression's $n - 2$, which is the main change. The graphical interpretation of (3.9) when $k = 2$ is shown in Figures 3.11 and 3.12.

Figure 3.11: Visualization of the ANOVA decomposition when $k = 2$. $SST$ measures the variation of $Y_1, \ldots, Y_n$ with respect to $\bar Y$. $SSR$ measures the variation with respect to the conditional means, $\hat\beta_0 + \hat\beta_1 X_{i1} + \hat\beta_2 X_{i2}$. $SSE$ collects the variation of the residuals.
Figure 3.12: Illustration of the ANOVA decomposition and its dependence on $\sigma^2$ and $\hat\sigma^2$. Application also available here.
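Both the mean-of-fitted-values result and the decomposition (3.9) can be checked numerically. Below is a minimal sketch with simulated data (all object names are illustrative, not part of the later examples):

# Numerical check of (3.9) with simulated data
set.seed(42)
x1 <- rnorm(50)
x2 <- rnorm(50)
y <- 1 + 2 * x1 - x2 + rnorm(50)
fit <- lm(y ~ x1 + x2)
mean(fitted(fit)) - mean(y)  # ~0: the fitted values average to the mean of y
SST <- sum((y - mean(y))^2)
SSR <- sum((fitted(fit) - mean(y))^2)
SSE <- sum(residuals(fit)^2)
SST - (SSR + SSE)            # ~0: SST = SSR + SSE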
The ANOVA table summarizes the decomposition of the variance.
|            | Degrees of freedom | Sum Squares | Mean Squares        | F-value                     | p-value |
|------------|--------------------|-------------|---------------------|-----------------------------|---------|
| Predictors | $k$                | $SSR$       | $\frac{SSR}{k}$     | $\frac{SSR/k}{SSE/(n-k-1)}$ | $p$     |
| Residuals  | $n-k-1$            | $SSE$       | $\frac{SSE}{n-k-1}$ |                             |         |
The “F-value” of the ANOVA table represents the value of the F-statistic $\frac{SSR/k}{SSE/(n-k-1)}$. This statistic is employed to test
$$H_0: \beta_1 = \cdots = \beta_k = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0 \text{ for some } j,$$
that is, the hypothesis of no linear dependence of $Y$ on $X_1, \ldots, X_k$ (the plane is completely flat, with no inclination). If $H_0$ is rejected, it means that at least one $\beta_j$ is significantly different from zero. It happens that
$$F = \frac{SSR/k}{SSE/(n-k-1)} \stackrel{H_0}{\sim} F_{k, n-k-1},$$
where $F_{k, n-k-1}$ is Snedecor's F distribution with $k$ and $n - k - 1$ degrees of freedom. If $H_0$ is true, then $F$ is expected to be small, since $SSR$ will be close to zero (little variation is explained by the regression model because $\hat\beta_1, \ldots, \hat\beta_k \approx 0$). The p-value of this test is not the same as the p-value of the t-test for $H_0: \beta_1 = 0$; they only coincide in simple linear regression, where $k = 1$!
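To build intuition on the null distribution of $F$, the following minimal sketch (simulated data, illustrative names) generates datasets under $H_0$ and compares the empirical distribution of the F-statistic with the $F_{k, n-k-1}$ density:

# Simulate the F-statistic under H0: the response does not depend on the predictors
set.seed(1)
n <- 50
k <- 3
Fstats <- replicate(1e3, {
  X <- matrix(rnorm(n * k), nrow = n, ncol = k)
  y <- rnorm(n)  # H0 holds
  summary(lm(y ~ X))$fstatistic[1]
})
hist(Fstats, breaks = 30, probability = TRUE, main = "F-statistic under H0")
curve(df(x, df1 = k, df2 = n - k - 1), add = TRUE, col = 2)  # F_{k, n-k-1} density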
The “ANOVA table” is a broad concept in statistics, with different variants. Here we are only covering the basic ANOVA table that stems from the relation $SST = SSR + SSE$. However, further sophistications are possible when $SSR$ is decomposed into the variations contributed by each predictor. In particular, for multiple linear regression, R's anova implements a sequential (type I) ANOVA table, which is not the previous table! The anova function in R takes a fitted model as input and returns the following sequential ANOVA table²⁶:
|             | Degrees of freedom | Sum Squares | Mean Squares        | F-value                       | p-value |
|-------------|--------------------|-------------|---------------------|-------------------------------|---------|
| Predictor 1 | $1$                | $SSR_1$     | $\frac{SSR_1}{1}$   | $\frac{SSR_1/1}{SSE/(n-k-1)}$ | $p_1$   |
| Predictor 2 | $1$                | $SSR_2$     | $\frac{SSR_2}{1}$   | $\frac{SSR_2/1}{SSE/(n-k-1)}$ | $p_2$   |
| ⋮           | ⋮                  | ⋮           | ⋮                   | ⋮                             | ⋮       |
| Predictor k | $1$                | $SSR_k$     | $\frac{SSR_k}{1}$   | $\frac{SSR_k/1}{SSE/(n-k-1)}$ | $p_k$   |
| Residuals   | $n-k-1$            | $SSE$       | $\frac{SSE}{n-k-1}$ |                               |         |
Here $SSR_j$ represents the regression sum of squares associated with the inclusion of $X_j$ in the model with predictors $X_1, \ldots, X_{j-1}$, that is: $SSR_j = SSR(X_1, \ldots, X_j) - SSR(X_1, \ldots, X_{j-1})$. The p-values $p_1, \ldots, p_k$ correspond to testing the hypotheses $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$, carried out inside the linear model $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_j X_j + \varepsilon$. This is like the t-test for $\beta_j$ in the model with predictors $X_1, \ldots, X_j$.
Let’s see how we can compute both ANOVA tables in R. The sequential table is simple to obtain: use anova. We illustrate it with the Boston dataset.
# Load data
library(MASS)
data(Boston)
# Fit a linear model
model <- lm(medv ~ crim + lstat + zn + nox, data = Boston)
summary(model)
##
## Call:
## lm(formula = medv ~ crim + lstat + zn + nox, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.972 -3.956 -1.344 2.148 25.076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.93462 1.72076 17.977 <2e-16 ***
## crim -0.08297 0.03677 -2.257 0.0245 *
## lstat -0.90940 0.05040 -18.044 <2e-16 ***
## zn 0.03493 0.01395 2.504 0.0126 *
## nox 5.42234 3.24241 1.672 0.0951 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.169 on 501 degrees of freedom
## Multiple R-squared: 0.5537, Adjusted R-squared: 0.5502
## F-statistic: 155.4 on 4 and 501 DF, p-value: < 2.2e-16
# ANOVA table with sequential test
anova(model)
## Analysis of Variance Table
##
## Response: medv
## Df Sum Sq Mean Sq F value Pr(>F)
## crim 1 6440.8 6440.8 169.2694 < 2e-16 ***
## lstat 1 16950.1 16950.1 445.4628 < 2e-16 ***
## zn 1 155.7 155.7 4.0929 0.04360 *
## nox 1 106.4 106.4 2.7967 0.09509 .
## Residuals 501 19063.3 38.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The last p-value coincides with that of the t-test for the last coefficient
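As a sanity check of the sequential decomposition, each $SSR_j$ can be recovered by comparing nested fits; a minimal sketch for the lstat row (which equals the reduction in $SSE$ when lstat is added to the model containing only crim):

# SSR_2 for lstat: reduction in SSE when adding lstat to the model with crim
fit1 <- lm(medv ~ crim, data = Boston)
fit2 <- lm(medv ~ crim + lstat, data = Boston)
sum(residuals(fit1)^2) - sum(residuals(fit2)^2)  # matches the lstat "Sum Sq" entry above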
In order to compute the simplified ANOVA table, we need to rely on an ad hoc function²⁷. The function takes a fitted lm object as input:
# This function computes the simplified anova from a linear model
simpleAnova <- function(object, ...) {

  # Compute anova table
  tab <- anova(object, ...)

  # Obtain number of predictors
  p <- nrow(tab) - 1

  # Add predictors row
  predictorsRow <- colSums(tab[1:p, 1:2])
  predictorsRow <- c(predictorsRow, predictorsRow[2] / predictorsRow[1])

  # F-quantities
  Fval <- predictorsRow[3] / tab[p + 1, 3]
  pval <- pf(Fval, df1 = p, df2 = tab$Df[p + 1], lower.tail = FALSE)
  predictorsRow <- c(predictorsRow, Fval, pval)

  # Simplified table
  tab <- rbind(predictorsRow, tab[p + 1, ])
  row.names(tab)[1] <- "Predictors"
  return(tab)
}
# Simplified ANOVA
simpleAnova(model)
## Analysis of Variance Table
##
## Response: medv
## Df Sum Sq Mean Sq F value Pr(>F)
## Predictors 4 23653 5913.3 155.41 < 2.2e-16 ***
## Residuals 501 19063 38.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Recall that the F-statistic, its p-value, and the degrees of freedom are also given in the output of summary.
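They can also be extracted programmatically; a minimal sketch:

# F-statistic with its numerator and denominator degrees of freedom, as stored by summary()
Fstat <- summary(model)$fstatistic
Fstat
# p-value of the F-test, recomputed with pf()
pf(Fstat[1], df1 = Fstat[2], df2 = Fstat[3], lower.tail = FALSE)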
Compute the ANOVA table for the regression Price ~ WinterRain + AGST + HarvestRain + Age in the wine dataset. Check that the p-values for the F-test given by summary and by simpleAnova are the same.
3.6.2 The $R^2$
The coefficient of determination $R^2$ is defined as in simple linear regression:
$$R^2 = \frac{SSR}{SST} = \frac{SSR}{SSR + SSE} = \frac{SSR}{SSR + (n - k - 1)\hat\sigma^2}.$$
$R^2$ measures the proportion of variation of the response variable $Y$ that is explained by the predictors $X_1, \ldots, X_k$ through the regression. Intuitively, $R^2$ measures the tightness of the data cloud around the regression plane. Check in Figure 3.12 how changing the value of $\sigma^2$ (not $\hat\sigma^2$, although $\hat\sigma^2$ obviously depends on $\sigma^2$) affects the $R^2$. Also, as we saw in Section 2.7, $R^2 = r^2_{y\hat y}$, that is, $R^2$ is the square of the sample correlation coefficient between $Y_1, \ldots, Y_n$ and $\hat Y_1, \ldots, \hat Y_n$.
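These identities can be checked numerically with the Boston model fitted above; a minimal sketch:

# R^2 as (SST - SSE) / SST and as the squared correlation between response and fitted values
SST <- sum((Boston$medv - mean(Boston$medv))^2)
SSE <- sum(residuals(model)^2)
(SST - SSE) / SST
cor(Boston$medv, fitted(model))^2
summary(model)$r.squared  # all three coincide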
Blindly trusting the $R^2$ can lead to catastrophic conclusions in model selection. Here is a counterexample of a multiple regression in which the $R^2$ is apparently large but the assumptions discussed in Section 3.3 are clearly not satisfied.
# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x1 <- seq(0.15, 1, l = 100)
set.seed(123456)
x2 <- runif(100, -3, 3)
eps <- rnorm(n = 100, sd = 0.25 * x1^2)
y <- 1 - 3 * x1 * (1 + 0.25 * sin(4 * pi * x1)) + 0.25 * cos(x2) + eps

# Great R^2!?
reg <- lm(y ~ x1 + x2)
summary(reg)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78737 -0.20946 0.01031 0.19652 1.05351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.788812 0.096418 8.181 1.1e-12 ***
## x1 -2.540073 0.154876 -16.401 < 2e-16 ***
## x2 0.002283 0.020954 0.109 0.913
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3754 on 97 degrees of freedom
## Multiple R-squared: 0.744, Adjusted R-squared: 0.7388
## F-statistic: 141 on 2 and 97 DF, p-value: < 2.2e-16
# But prediction is obviously problematic
scatter3d(y ~ x1 + x2, fit = "linear")
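(The scatter3d() call above comes from the car package and needs rgl for the interactive 3D plot.) The standard lm diagnostic plots also expose the problem; a minimal sketch:

# Diagnostic plots for reg: the residuals show clear structure and non-constant spread
plot(reg, which = 1)  # residuals vs. fitted values
plot(reg, which = 3)  # scale-location plot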
Remember that:
- $R^2$ does not measure the correctness of a linear model but its usefulness, assuming the model is correct.
- $R^2$ is the proportion of variance of $Y$ explained by $X_1, \ldots, X_k$, but, of course, only when the linear model is correct.
We close this section by pointing out a nice connection between the $R^2$, the ANOVA decomposition, and the least squares estimator $\hat\beta$: the ANOVA decomposition gives another interpretation of the least squares estimates, namely that $\hat\beta$ are the estimated coefficients that maximize the $R^2$ (among all the possible estimates we could think of). To see this, recall that
$$R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = \frac{SST - RSS(\hat\beta)}{SST},$$
so if $RSS(\hat\beta) = \min_{\beta \in \mathbb{R}^{k+1}} RSS(\beta)$, then $R^2$ is maximal for $\hat\beta$!
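This can also be checked empirically: for the Boston model fitted above, perturbing the least squares coefficients increases the $RSS$ and hence lowers $1 - RSS(\beta)/SST$. A minimal sketch (the perturbation is arbitrary and only illustrative):

# 1 - RSS(beta) / SST evaluated at the least squares fit and at a perturbation of it
Xmat <- model.matrix(model)
yBos <- Boston$medv
SST <- sum((yBos - mean(yBos))^2)
RSS <- function(beta) sum((yBos - Xmat %*% beta)^2)
1 - RSS(coef(model)) / SST                                          # equals summary(model)$r.squared
1 - RSS(coef(model) + rnorm(length(coef(model)), sd = 0.05)) / SST  # smaller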
3.6.3 The $R^2_{\text{Adj}}$
As we saw, these are equivalent forms for $R^2$:
$$R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST} = 1 - \frac{\hat\sigma^2}{SST} \times (n - k - 1). \tag{3.10}$$
The $SSE$ in the numerator always decreases as more predictors are added to the model, even if these are not significant. As a consequence, the $R^2$ always increases with $k$. Why is that so? Intuitively, because the complexity – hence the flexibility – of the model increases when we use more predictors to explain $Y$. Mathematically, because when $k$ approaches $n - 1$ the second term in (3.10) shrinks and, as a consequence, $R^2$ grows.
The adjusted $R^2$ is an important quantity specifically designed to overcome this flaw of the $R^2$, which is ubiquitous in multiple linear regression. The purpose is to have a better tool for comparing models without systematically favoring more complex models. This alternative coefficient is defined as
$$R^2_{\text{Adj}} = 1 - \frac{SSE/(n - k - 1)}{SST/(n - 1)} = 1 - \frac{SSE}{SST} \times \frac{n - 1}{n - k - 1} = 1 - \frac{\hat\sigma^2}{SST} \times (n - 1). \tag{3.11}$$
The $R^2_{\text{Adj}}$ is independent of $k$, at least explicitly. If $k = 1$, then $R^2_{\text{Adj}}$ is almost $R^2$ (practically identical if $n$ is large). Expressions (3.10) and (3.11) are quite similar except for the last factor, which in the latter does not depend on $k$. Therefore, (3.11) will only increase if $\hat\sigma^2$ is reduced with $k$ – in other words, if the new variables contribute to the reduction of variability around the regression plane.
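For illustration, $R^2_{\text{Adj}}$ can be recomputed from (3.11) for the Boston model fitted above; a minimal sketch:

# Recompute the adjusted R^2 from its definition (k = 4 predictors)
n <- nobs(model)
SSE <- sum(residuals(model)^2)
SST <- sum((Boston$medv - mean(Boston$medv))^2)
1 - (SSE / (n - 4 - 1)) / (SST / (n - 1))
summary(model)$adj.r.squared  # same value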
The different behavior of $R^2$ and $R^2_{\text{Adj}}$ can be visualized through a small simulation. Suppose that we generate a random dataset with $n = 200$ observations of a response $Y$ and two predictors $X_1, X_2$, that is, the sample $\{(X_{i1}, X_{i2}, Y_i)\}_{i=1}^n$ with
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, 1).$$
To this data, we add 196 garbage predictors that are completely independent of $Y$. Therefore, we end up with $k = 198$ predictors. Now we compute $R^2(j)$ and $R^2_{\text{Adj}}(j)$ for the models $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_j X_j + \varepsilon$, with $j = 1, \ldots, k$, and plot them as the curves $(j, R^2(j))$ and $(j, R^2_{\text{Adj}}(j))$. Since $R^2$ and $R^2_{\text{Adj}}$ are random variables, we repeat the procedure $M = 100$ times to have a measure of their variability.

Figure 3.13: Comparison of $R^2$ and $R^2_{\text{Adj}}$ for $n = 200$ and $k$ ranging from 1 to 198. $M = 100$ datasets were simulated with only the first two predictors being significant. The thicker curves are the means of each color's curves.
Figure 3.13 contains the results of this experiment. As you can see, $R^2$ increases linearly with the number of predictors considered, although only the first two are important! On the contrary, $R^2_{\text{Adj}}$ only increases for the first two variables and then is flat on average, although it has a huge variability when $k$ approaches $n - 2$. This is a consequence of the explosive variance of $\hat\sigma^2$ in that degenerate case (as we will see in Section 3.7). The experiment evidences that $R^2_{\text{Adj}}$ is more adequate than the $R^2$ for evaluating the fit of a multiple linear regression.
An example of a simulated dataset considered in the experiment of Figure 3.13:
# Generate data
k <- 198
n <- 200
set.seed(3456732)
beta <- c(0.5, -0.5, rep(0, k - 2))
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
Y <- drop(X %*% beta + rnorm(n, sd = 3))
data <- data.frame(y = Y, x = X)

# Regression on the two meaningful predictors
summary(lm(y ~ x.1 + x.2, data = data))

# Adding 20 garbage variables
summary(lm(y ~ X[, 1:22], data = data))
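From a single simulated dataset like the one above, one replicate of the curves in Figure 3.13 can be sketched as follows (this reproduces only one curve of each color, not the full $M = 100$ experiment):

# R^2(j) and adjusted R^2(j) for the models with the first j predictors
r2 <- r2Adj <- numeric(k)
for (j in 1:k) {
  fitj <- lm(Y ~ X[, 1:j, drop = FALSE])
  r2[j] <- summary(fitj)$r.squared
  r2Adj[j] <- summary(fitj)$adj.r.squared
}
plot(1:k, r2, type = "l", ylim = c(0, 1), xlab = "j", ylab = "")
lines(1:k, r2Adj, col = 2)
legend("topleft", legend = c("R^2", "Adjusted R^2"), col = 1:2, lwd = 1)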
The $R^2_{\text{Adj}}$ no longer measures the proportion of variation of $Y$ explained by the regression, but rather the result of correcting this proportion by the number of predictors employed. As a consequence, $R^2_{\text{Adj}} \leq 1$, but it can be negative!
The next code illustrates a situation where we have two predictors completely independent of the response. The fitted model has a negative $R^2_{\text{Adj}}$.
# Three independent variables
set.seed(234599)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + rnorm(100)

# Negative adjusted R^2
summary(lm(y ~ x1 + x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5081 -0.5021 -0.0191 0.5286 2.4750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.97024 0.10399 9.330 3.75e-15 ***
## x1 0.09003 0.10300 0.874 0.384
## x2 -0.05253 0.11090 -0.474 0.637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.034 on 97 degrees of freedom
## Multiple R-squared: 0.009797, Adjusted R-squared: -0.01062
## F-statistic: 0.4799 on 2 and 97 DF, p-value: 0.6203
Construct more predictors (x3, x4, ...) by sampling 100 points from a normal distribution (rnorm(100)). Check that, as the predictors are added to the model, the $R^2_{\text{Adj}}$ decreases and the $R^2$ increases.
25. SSE and RSS are two names for the same quantity (that appears in different contexts): $SSE = \sum_{i=1}^n (Y_i - \hat Y_i)^2 = \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_{i1} - \cdots - \hat\beta_k X_{ik})^2 = RSS(\hat\beta)$.
26. More complex – included here just for clarification of anova's output.
27. You will need to run this piece of code whenever you want to call simpleAnova, since it is not part of R nor R Commander.