3.6 ANOVA and model fit
3.6.1 ANOVA
The ANOVA decomposition for multiple linear regression is quite analogous to the one in simple linear regression. The ANOVA decomposes the variance of $Y$ into two parts, one corresponding to the regression and the other to the error. Since the only difference between simple and multiple linear regression is the number of predictors – the response is unique in both cases – the ANOVA decompositions are highly similar, as we will see.
As in simple linear regression, the mean of the fitted values $\hat{Y}_1, \ldots, \hat{Y}_n$ is the mean of $Y_1, \ldots, Y_n$, that is, $\bar{Y}$. This is an important result that can be checked using matrix notation. The ANOVA decomposition considers the following measures of variation related with the response:
- $\text{SST} = \sum_{i=1}^n (Y_i - \bar{Y})^2$, the total sum of squares. This is the total variation of $Y_1, \ldots, Y_n$, since $\text{SST} = n s_y^2$, where $s_y^2$ is the sample variance of $Y_1, \ldots, Y_n$.
- $\text{SSR} = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$, the regression sum of squares. This is the variation explained by the regression plane, that is, the variation from $\bar{Y}$ that is explained by the estimated conditional mean $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_k X_{ik}$. $\text{SSR} = n s_{\hat{y}}^2$, where $s_{\hat{y}}^2$ is the sample variance of $\hat{Y}_1, \ldots, \hat{Y}_n$.
- $\text{SSE} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n \hat\varepsilon_i^2$, the sum of squared errors. This is the variation around the conditional mean. Recall that $\text{SSE} = n s_{\hat\varepsilon}^2$, where $s_{\hat\varepsilon}^2$ is the sample variance of the residuals $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$.
The ANOVA decomposition is exactly the same as in simple linear regression:

$$\text{SST} = \text{SSR} + \text{SSE}, \tag{3.9}$$

or, equivalently (dividing by SST in (3.9)),

$$1 = \frac{\text{SSR}}{\text{SST}} + \frac{\text{SSE}}{\text{SST}}.$$

Notice the $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_k X_{ik}$ instead of simple linear regression's $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$, which is the main change. The graphical interpretation of (3.9) when $k = 2$ is shown in Figures 3.11 and 3.12.
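The decomposition can also be checked numerically from any fitted model. Below is a minimal sketch that does so with R's built-in `mtcars` dataset (chosen here just for illustration; it is not used elsewhere in this section), and that also verifies that the mean of the fitted values equals $\bar{Y}$.

# Sketch: numerical check of SST = SSR + SSE (mtcars is illustrative only)
fit <- lm(mpg ~ wt + hp, data = mtcars)
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
SSR <- sum((fit$fitted.values - mean(mtcars$mpg))^2)
SSE <- sum(fit$residuals^2)
SST - (SSR + SSE)  # Numerically zero
mean(fit$fitted.values) - mean(mtcars$mpg)  # Also numerically zero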

Figure 3.11: Visualization of the ANOVA decomposition when $k = 2$. SST measures the variation of $Y_1, \ldots, Y_n$ with respect to $\bar{Y}$. SSR measures the variation with respect to the conditional means, $\hat{Y}_1, \ldots, \hat{Y}_n$. SSE collects the variation of the residuals.
Figure 3.12: Illustration of the ANOVA decomposition and its dependence on $\sigma^2$ and $n$. Application also available here.
The ANOVA table summarizes the decomposition of the variance.
|  | Degrees of freedom | Sum of squares | Mean squares | $F$-value | $p$-value |
|---|---|---|---|---|---|
| Predictors | $k$ | SSR | $\frac{\text{SSR}}{k}$ | $\frac{\text{SSR}/k}{\text{SSE}/(n-k-1)}$ | $p$ |
| Residuals | $n - k - 1$ | SSE | $\frac{\text{SSE}}{n-k-1}$ |  |  |
The “$F$-value” of the ANOVA table represents the value of the $F$-statistic $\frac{\text{SSR}/k}{\text{SSE}/(n-k-1)}$. This statistic is employed to test

$$H_0: \beta_1 = \cdots = \beta_k = 0,$$

that is, the hypothesis of no linear dependence of $Y$ on $X_1, \ldots, X_k$ (the plane is completely flat, with no inclination). If $H_0$ is rejected, it means that at least one $\beta_j$, $j = 1, \ldots, k$, is significantly different from zero. It happens that

$$F = \frac{\text{SSR}/k}{\text{SSE}/(n-k-1)} \stackrel{H_0}{\sim} F_{k, n-k-1},$$

where $F_{k, n-k-1}$ is the Snedecor's $F$ distribution with $k$ and $n - k - 1$ degrees of freedom. If $H_0$ is true, then $F$ is expected to be small, since SSR will be close to zero (little variation is explained by the regression model because $\beta_1 = \cdots = \beta_k = 0$). The $p$-value of this test is not the same as the $p$-value of the $t$-test for $H_0: \beta_j = 0$ – that only happens in simple linear regression, where $k = 1$ and the $F$-statistic is the square of the $t$-statistic!
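As a hedged illustration of where the $F$-statistic comes from, the following sketch recomputes it from SSR and SSE and compares it with the value reported by `summary` (again on `mtcars`, used only as a stand-in dataset).

# Sketch: recomputing the F-statistic and its p-value by hand
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); k <- 2
SSR <- sum((fit$fitted.values - mean(mtcars$mpg))^2)
SSE <- sum(fit$residuals^2)
Fval <- (SSR / k) / (SSE / (n - k - 1))
pval <- pf(Fval, df1 = k, df2 = n - k - 1, lower.tail = FALSE)
c(Fval, pval)
summary(fit)$fstatistic  # Same F-value and degrees of freedom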
The “ANOVA table” is a broad concept in statistics, with different variants. Here we are only covering the basic ANOVA table stemming from the relation $\text{SST} = \text{SSR} + \text{SSE}$. However, further sophistications are possible when SSR is decomposed into the variations contributed by each predictor. In particular, for multiple linear regression R's `anova` implements a sequential (type I) ANOVA table, which is not the previous table!

The `anova` function in R takes a model as an input and returns the following sequential ANOVA table:
|  | Degrees of freedom | Sum of squares | Mean squares | $F$-value | $p$-value |
|---|---|---|---|---|---|
| Predictor 1 | $1$ | $\text{SSR}_1$ | $\frac{\text{SSR}_1}{1}$ | $\frac{\text{SSR}_1/1}{\text{SSE}/(n-k-1)}$ | $p_1$ |
| Predictor 2 | $1$ | $\text{SSR}_2$ | $\frac{\text{SSR}_2}{1}$ | $\frac{\text{SSR}_2/1}{\text{SSE}/(n-k-1)}$ | $p_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| Predictor $k$ | $1$ | $\text{SSR}_k$ | $\frac{\text{SSR}_k}{1}$ | $\frac{\text{SSR}_k/1}{\text{SSE}/(n-k-1)}$ | $p_k$ |
| Residuals | $n - k - 1$ | SSE | $\frac{\text{SSE}}{n-k-1}$ |  |  |
Here $\text{SSR}_j$ represents the regression sum of squares associated to the inclusion of $X_j$ in the model that already contains the predictors $X_1, \ldots, X_{j-1}$, this is:

$$\text{SSR}_j = \text{SSR}(X_1, \ldots, X_j) - \text{SSR}(X_1, \ldots, X_{j-1}).$$

The $p$-values correspond to the testing of the hypotheses $H_0: \beta_j = 0$ carried out inside the linear model $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_j X_j + \varepsilon$. This is like the $t$-test for $\beta_j$ in the model with predictors $X_1, \ldots, X_j$.
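The definition of $\text{SSR}_j$ can be checked by comparing nested fits: the sequential sum of squares of a predictor equals the drop in SSE obtained when adding it to the model with the previously listed predictors. A minimal sketch with the `Boston` dataset (also used in the next chunk) follows.

# Sketch: the sequential SSR of lstat given crim equals the drop in SSE
library(MASS)  # For the Boston dataset
fit1 <- lm(medv ~ crim, data = Boston)
fit2 <- lm(medv ~ crim + lstat, data = Boston)
sum(fit1$residuals^2) - sum(fit2$residuals^2)  # Sequential SSR of lstat
anova(fit2)  # Compare with the "Sum Sq" entry of lstat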
Let's see how we can compute both ANOVA tables in R. The sequential table is simple: use `anova`. We illustrate it with the `Boston` dataset.
# Load data
library(MASS)
data(Boston)
# Fit a linear model
model <- lm(medv ~ crim + lstat + zn + nox, data = Boston)
summary(model)
##
## Call:
## lm(formula = medv ~ crim + lstat + zn + nox, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.972 -3.956 -1.344 2.148 25.076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.93462 1.72076 17.977 <2e-16 ***
## crim -0.08297 0.03677 -2.257 0.0245 *
## lstat -0.90940 0.05040 -18.044 <2e-16 ***
## zn 0.03493 0.01395 2.504 0.0126 *
## nox 5.42234 3.24241 1.672 0.0951 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.169 on 501 degrees of freedom
## Multiple R-squared: 0.5537, Adjusted R-squared: 0.5502
## F-statistic: 155.4 on 4 and 501 DF, p-value: < 2.2e-16
# ANOVA table with sequential test
anova(model)
## Analysis of Variance Table
##
## Response: medv
## Df Sum Sq Mean Sq F value Pr(>F)
## crim 1 6440.8 6440.8 169.2694 < 2e-16 ***
## lstat 1 16950.1 16950.1 445.4628 < 2e-16 ***
## zn 1 155.7 155.7 4.0929 0.04360 *
## nox 1 106.4 106.4 2.7967 0.09509 .
## Residuals 501 19063.3 38.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The last p-value is the one of the last t-test
In order to compute the simplified ANOVA table, we need to rely on an ad-hoc function. The function takes as input a fitted `lm` object:
# This function computes the simplified anova from a linear model
simpleAnova <- function(object, ...) {

  # Compute anova table
  tab <- anova(object, ...)

  # Obtain number of predictors
  p <- nrow(tab) - 1

  # Add predictors row
  predictorsRow <- colSums(tab[1:p, 1:2])
  predictorsRow <- c(predictorsRow, predictorsRow[2] / predictorsRow[1])

  # F-quantities
  Fval <- predictorsRow[3] / tab[p + 1, 3]
  pval <- pf(Fval, df1 = p, df2 = tab$Df[p + 1], lower.tail = FALSE)
  predictorsRow <- c(predictorsRow, Fval, pval)

  # Simplified table
  tab <- rbind(predictorsRow, tab[p + 1, ])
  row.names(tab)[1] <- "Predictors"
  return(tab)
}
# Simplified ANOVA
simpleAnova(model)
## Analysis of Variance Table
##
## Response: medv
## Df Sum Sq Mean Sq F value Pr(>F)
## Predictors 4 23653 5913.3 155.41 < 2.2e-16 ***
## Residuals 501 19063 38.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Recall that the $F$-statistic, its $p$-value, and the degrees of freedom are also given in the output of `summary`.
Compute the ANOVA table for the regression `Price ~ WinterRain + AGST + HarvestRain + Age` in the `wine` dataset. Check that the $p$-value for the $F$-test given by `summary` and by `simpleAnova` is the same.
3.6.2 The $R^2$
The coefficient of determination $R^2$ is defined as in simple linear regression:

$$R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\text{SST} - \text{SSE}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}.$$

$R^2$ measures the proportion of variation of the response variable $Y$ that is explained by the predictors $X_1, \ldots, X_k$ through the regression. Intuitively, $R^2$ measures the tightness of the data cloud around the regression plane. Check in Figure 3.12 how changing the value of $\sigma^2$ (not $\hat\sigma^2$, but $\hat\sigma^2$ is obviously dependent on $\sigma^2$) affects the $R^2$. Also, as we saw in Section 2.7, $R^2 = r^2_{y\hat{y}}$, that is, $R^2$ is the square of the sample correlation coefficient between $Y_1, \ldots, Y_n$ and $\hat{Y}_1, \ldots, \hat{Y}_n$.
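The identity $R^2 = r^2_{y\hat{y}}$ can be checked directly, as in the following sketch (once more with `mtcars` as a placeholder dataset).

# Sketch: R^2 equals the squared correlation between response and fitted values
fit <- lm(mpg ~ wt + hp, data = mtcars)
cor(mtcars$mpg, fit$fitted.values)^2
summary(fit)$r.squared  # Same value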
Blindly trusting the $R^2$ can lead to catastrophic conclusions in model selection. Here is a counterexample of a multiple regression where the $R^2$ is apparently large, but the assumptions discussed in Section 3.3 are clearly not satisfied.
# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x1 <- seq(0.15, 1, l = 100)
set.seed(123456)
x2 <- runif(100, -3, 3)
eps <- rnorm(n = 100, sd = 0.25 * x1^2)
y <- 1 - 3 * x1 * (1 + 0.25 * sin(4 * pi * x1)) + 0.25 * cos(x2) + eps

# Great R^2!?
reg <- lm(y ~ x1 + x2)
summary(reg)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78737 -0.20946 0.01031 0.19652 1.05351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.788812 0.096418 8.181 1.1e-12 ***
## x1 -2.540073 0.154876 -16.401 < 2e-16 ***
## x2 0.002283 0.020954 0.109 0.913
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3754 on 97 degrees of freedom
## Multiple R-squared: 0.744, Adjusted R-squared: 0.7388
## F-statistic: 141 on 2 and 97 DF, p-value: < 2.2e-16
# But prediction is obviously problematic
library(car)  # scatter3d() is in the car package (renders via rgl)
scatter3d(y ~ x1 + x2, fit = "linear")
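If `rgl` is not available, a rough two-dimensional check along the same lines (this plot is not part of the original example) is to inspect the residuals against the fitted values; the curvature and the fan shape reveal the nonlinearity and the heteroskedasticity.

# Sketch: residuals vs. fitted values as a 2D alternative to scatter3d()
plot(reg$fitted.values, reg$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)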
Remember that:
- $R^2$ does not measure the correctness of a linear model but its usefulness, assuming the model is correct.
- $R^2$ is the proportion of variance of $Y$ explained by $X_1, \ldots, X_k$, but, of course, only when the linear model is correct.
We conclude by pointing out a nice connection between the $R^2$, the ANOVA decomposition, and the least squares estimates:

The ANOVA decomposition gives another interpretation of the least squares estimates: $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are the estimated coefficients that maximize the $R^2$ (among all the possible estimates we could think about). To see this, recall that $R^2 = 1 - \frac{\text{SSE}}{\text{SST}}$, so if the SSE is minimal – and the least squares estimates minimize the SSE by construction – then the $R^2$ is maximal for $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$!
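This maximality can be illustrated empirically: evaluating $1 - \frac{\text{SSE}(\boldsymbol\beta)}{\text{SST}}$ at any coefficients other than the least squares ones gives a smaller value. A minimal sketch (the perturbation below is arbitrary and `mtcars` is again a placeholder):

# Sketch: perturbing the least squares coefficients can only decrease the R^2
fit <- lm(mpg ~ wt + hp, data = mtcars)
R2 <- function(beta) {
  fitted <- cbind(1, mtcars$wt, mtcars$hp) %*% beta
  1 - sum((mtcars$mpg - fitted)^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)
}
R2(coef(fit))  # Equals summary(fit)$r.squared
R2(coef(fit) + c(0, 0.5, 0))  # Any perturbation yields a smaller value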
3.6.3 The $R^2_\text{Adj}$
As we saw, these are equivalent forms for $R^2$:

$$R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\text{SST} - \text{SSE}}{\text{SST}} = 1 - \frac{\text{SSE}/n}{\text{SST}/n}. \tag{3.10}$$

The SSE in the numerator always decreases as more predictors are added to the model, even if these are not significant. As a consequence, the $R^2$ always increases with the number of predictors $k$. Why is this so? Intuitively, because the complexity – hence the flexibility – of the model augments when we use more predictors to explain $Y$. Mathematically, because as $k$ grows the SSE can only shrink, so the second term in (3.10) is reduced and, as a consequence, $R^2$ grows.
The adjusted $R^2$, denoted $R^2_\text{Adj}$, is an important quantity specifically designed to overcome this flaw of the $R^2$, which is ubiquitous in multiple linear regression. The purpose is to have a better tool for comparing models without systematically favoring more complex ones. This alternative coefficient is defined as

$$R^2_\text{Adj} = 1 - \frac{\text{SSE}/(n - k - 1)}{\text{SST}/(n - 1)} = 1 - \frac{\text{SSE}}{\text{SST}} \frac{n - 1}{n - k - 1}. \tag{3.11}$$

Unlike the $R^2$, the $R^2_\text{Adj}$ does not automatically increase with $k$: the $n - k - 1$ in $\text{SSE}/(n - k - 1)$ compensates for the decrease of the SSE. If $k = 1$, then the $R^2_\text{Adj}$ is almost the $R^2$ (practically identical if $n$ is large). Both (3.10) and (3.11) are quite similar except for the last factor, which in the former does not depend on $k$. Therefore, (3.11) will only increase if the SSE is reduced sufficiently with each new predictor – in other words, if the new variables contribute to the reduction of variability around the regression plane.
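The formula (3.11) is easy to verify by hand from the `summary` output. The following sketch does so for the `Boston` model fitted before, and also adds a pure-noise predictor (created here just for illustration) to contrast the behaviors of the $R^2$ and the $R^2_\text{Adj}$.

# Sketch: manual computation of R^2_Adj and effect of a garbage predictor
library(MASS)
fit1 <- lm(medv ~ crim + lstat + zn + nox, data = Boston)
n <- nrow(Boston); k <- 4
SSE <- sum(fit1$residuals^2)
SST <- sum((Boston$medv - mean(Boston$medv))^2)
1 - (SSE / (n - k - 1)) / (SST / (n - 1))  # Matches summary(fit1)$adj.r.squared
# Add a noise predictor: R^2 necessarily grows, R^2_Adj barely moves (or drops)
set.seed(12345)
Boston$noise <- rnorm(n)
fit2 <- lm(medv ~ crim + lstat + zn + nox + noise, data = Boston)
c(summary(fit1)$r.squared, summary(fit2)$r.squared)
c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared)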
The different behavior between the $R^2$ and the $R^2_\text{Adj}$ can be visualized through a small simulation. Suppose that we generate a random dataset with $n = 200$ observations of a response $Y$ and two predictors $X_1$ and $X_2$, this is, the sample $\{(X_{i1}, X_{i2}, Y_i)\}_{i=1}^n$ with

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i, \quad i = 1, \ldots, n.$$

To this data we add garbage predictors that are completely independent from $Y$, so that we end up with $k = 198$ predictors in total. Now we compute the $R^2$ and the $R^2_\text{Adj}$ for the models with the first $j$ predictors, $j = 1, \ldots, k$, and we plot them as the curves $j \mapsto R^2(j)$ and $j \mapsto R^2_\text{Adj}(j)$. Since the $R^2$ and the $R^2_\text{Adj}$ are random variables, we repeat the procedure over many simulated datasets to have a measure of their variability.

Figure 3.13: Comparison of the $R^2$ and the $R^2_\text{Adj}$ for $n = 200$ and a number of predictors ranging from $1$ to $k = 198$. Several datasets were simulated, with only the first two predictors being significant. The thicker curves are the mean of each color's curves.
Figure 3.13 contains the results of this experiment. As you can see, the $R^2$ increases linearly with the number of predictors considered, although only the first two are important! On the contrary, the $R^2_\text{Adj}$ only increases for the first two variables and then is flat on average, but it has a huge variability when the number of predictors approaches $n$. This is a consequence of the exploding variability of the fit in that degenerate case (as we will see in Section 3.7). The experiment evidences that the $R^2_\text{Adj}$ is more adequate than the $R^2$ for evaluating the fit of a multiple linear regression.
An example of a simulated dataset considered in the experiment of Figure 3.13:
# Generate data
k <- 198
n <- 200
set.seed(3456732)
beta <- c(0.5, -0.5, rep(0, k - 2))
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
Y <- drop(X %*% beta + rnorm(n, sd = 3))
data <- data.frame(y = Y, x = X)

# Regression on the two meaningful predictors
summary(lm(y ~ x.1 + x.2, data = data))

# Adding 20 garbage variables
summary(lm(y ~ X[, 1:22], data = data))
The $R^2_\text{Adj}$ no longer measures the proportion of variation of $Y$ explained by the regression, but rather the result of correcting this proportion by the number of predictors employed. As a consequence of this, $R^2_\text{Adj} \leq 1$, but it can be negative!
The next code illustrates a situation where we have two predictors that are completely independent from the response. The fitted model has a negative $R^2_\text{Adj}$.
# Three independent variables
set.seed(234599)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + rnorm(100)

# Negative adjusted R^2
summary(lm(y ~ x1 + x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5081 -0.5021 -0.0191 0.5286 2.4750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.97024 0.10399 9.330 3.75e-15 ***
## x1 0.09003 0.10300 0.874 0.384
## x2 -0.05253 0.11090 -0.474 0.637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.034 on 97 degrees of freedom
## Multiple R-squared: 0.009797, Adjusted R-squared: -0.01062
## F-statistic: 0.4799 on 2 and 97 DF, p-value: 0.6203
Construct more predictors (`x3`, `x4`, …) by sampling 100 points from a normal (`rnorm(100)`). Check that, when these predictors are added to the model, the $R^2_\text{Adj}$ decreases and the $R^2$ increases.