3.6 ANOVA and model fit

3.6.1 ANOVA

The ANOVA decomposition for multiple linear regression is quite analogous to the one in simple linear regression. The ANOVA decomposes the variance of $Y$ into two parts, one corresponding to the regression and the other to the error. Since the difference between simple and multiple linear regression is the number of predictors – the response $Y$ is the same in both cases – the ANOVA decompositions are highly similar, as we will see.

As in simple linear regression, the mean of the fitted values $\hat Y_1,\ldots,\hat Y_n$ is the mean of $Y_1,\ldots,Y_n$. This is an important result that can be checked if we use matrix notation. The ANOVA decomposition considers the following measures of variation related with the response:

  • $\text{SST}=\sum_{i=1}^n(Y_i-\bar Y)^2$, the total sum of squares. This is the total variation of $Y_1,\ldots,Y_n$, since $\text{SST}=ns_y^2$, where $s_y^2$ is the sample variance of $Y_1,\ldots,Y_n$.
  • $\text{SSR}=\sum_{i=1}^n(\hat Y_i-\bar Y)^2$, the regression sum of squares. This is the variation explained by the regression plane, that is, the variation from $\bar Y$ that is explained by the estimated conditional mean $\hat Y_i=\hat\beta_0+\hat\beta_1X_{i1}+\cdots+\hat\beta_kX_{ik}$. $\text{SSR}=ns_{\hat y}^2$, where $s_{\hat y}^2$ is the sample variance of $\hat Y_1,\ldots,\hat Y_n$.
  • $\text{SSE}=\sum_{i=1}^n(Y_i-\hat Y_i)^2$, the sum of squared errors25. It is the variation around the conditional mean. Recall that $\text{SSE}=\sum_{i=1}^n\hat\varepsilon_i^2=(n-k-1)\hat\sigma^2$, where $\hat\sigma^2$ is the sample variance of $\hat\varepsilon_1,\ldots,\hat\varepsilon_n$.

The ANOVA decomposition is exactly the same as in simple linear regression:
$$\underbrace{\text{SST}}_{\text{Variation of }Y_i\text{'s}}=\underbrace{\text{SSR}}_{\text{Variation of }\hat Y_i\text{'s}}+\underbrace{\text{SSE}}_{\text{Variation of }\hat\varepsilon_i\text{'s}}\tag{3.9}$$
or, equivalently (dividing by $n$ in (3.9)),
$$\underbrace{s_y^2}_{\text{Variance of }Y_i\text{'s}}=\underbrace{s_{\hat y}^2}_{\text{Variance of }\hat Y_i\text{'s}}+\underbrace{(n-k-1)/n\times\hat\sigma^2}_{\text{Variance of }\hat\varepsilon_i\text{'s}}.$$
Notice the $n-k-1$ instead of simple linear regression's $n-2$, which is the main change. The graphical interpretation of (3.9) when $k=2$ is shown in Figures 3.11 and 3.12.


Figure 3.11: Visualization of the ANOVA decomposition when $k=2$. SST measures the variation of $Y_1,\ldots,Y_n$ with respect to $\bar Y$. SSR measures the variation with respect to the conditional means, $\hat\beta_0+\hat\beta_1X_{i1}+\hat\beta_2X_{i2}$. SSE collects the variation of the residuals.

Figure 3.12: Illustration of the ANOVA decomposition and its dependence on $\sigma^2$ and $\hat\sigma^2$. Application also available here.
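A quick numerical check of (3.9) may be reassuring. Below is a minimal sketch on simulated data (the simulation setup is arbitrary and only for illustration); SST coincides with SSR + SSE up to numerical error.

# Check SST = SSR + SSE on an arbitrary simulated dataset
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
SST <- sum((y - mean(y))^2)
SSR <- sum((fitted(fit) - mean(y))^2)
SSE <- sum(residuals(fit)^2)
SST - (SSR + SSE)  # numerically zero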

The ANOVA table summarizes the decomposition of the variance.

|            | Degrees of freedom | Sum of squares | Mean squares | F-value | p-value |
|------------|--------------------|----------------|--------------|---------|---------|
| Predictors | $k$ | SSR | $\frac{\text{SSR}}{k}$ | $\frac{\text{SSR}/k}{\text{SSE}/(n-k-1)}$ | $p$ |
| Residuals  | $n-k-1$ | SSE | $\frac{\text{SSE}}{n-k-1}$ | | |

The “F-value” of the ANOVA table represents the value of the F-statistic $\frac{\text{SSR}/k}{\text{SSE}/(n-k-1)}$. This statistic is employed to test
$$H_0:\beta_1=\cdots=\beta_k=0\quad\text{vs.}\quad H_1:\beta_j\neq 0\text{ for some }j,$$
that is, the hypothesis of no linear dependence of $Y$ on $X_1,\ldots,X_k$ (the plane is completely flat, with no inclination). If $H_0$ is rejected, it means that at least one $\beta_j$ is significantly different from zero. It happens that
$$F=\frac{\text{SSR}/k}{\text{SSE}/(n-k-1)}\overset{H_0}{\sim}F_{k,n-k-1},$$
where $F_{k,n-k-1}$ is Snedecor's F distribution with $k$ and $n-k-1$ degrees of freedom. If $H_0$ is true, then $F$ is expected to be small, since SSR will be close to zero (little variation is explained by the regression model, whose fit is then essentially the constant $\hat\beta_0$). The p-value of this test is not the same as the p-value of the t-test for $H_0:\beta_1=0$; the two only coincide in simple linear regression, because $k=1$!
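As a complement, the F-statistic and its p-value can be reproduced by hand from SSR and SSE. A minimal sketch, using R's built-in trees dataset purely for illustration:

# F-test by hand for an arbitrary built-in dataset
fit <- lm(Volume ~ Girth + Height, data = trees)
n <- nobs(fit)
k <- length(coef(fit)) - 1
SSR <- sum((fitted(fit) - mean(trees$Volume))^2)
SSE <- sum(residuals(fit)^2)
Fval <- (SSR / k) / (SSE / (n - k - 1))
Fval  # matches the "F-statistic" reported by summary(fit)
pf(Fval, df1 = k, df2 = n - k - 1, lower.tail = FALSE)  # its p-value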

The “ANOVA table” is a broad concept in statistics, with different variants. Here we are only covering the basic ANOVA table obtained from the relation $\text{SST}=\text{SSR}+\text{SSE}$. However, further refinements are possible in which SSR is decomposed into the variation contributed by each predictor. In particular, for multiple linear regression R’s anova implements a sequential (type I) ANOVA table, which is not the previous table!

The anova function in R takes a model as an input and returns the following sequential ANOVA table26:

|             | Degrees of freedom | Sum of squares | Mean squares | F-value | p-value |
|-------------|--------------------|----------------|--------------|---------|---------|
| Predictor 1 | $1$ | $\text{SSR}_1$ | $\frac{\text{SSR}_1}{1}$ | $\frac{\text{SSR}_1/1}{\text{SSE}/(n-k-1)}$ | $p_1$ |
| Predictor 2 | $1$ | $\text{SSR}_2$ | $\frac{\text{SSR}_2}{1}$ | $\frac{\text{SSR}_2/1}{\text{SSE}/(n-k-1)}$ | $p_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| Predictor $k$ | $1$ | $\text{SSR}_k$ | $\frac{\text{SSR}_k}{1}$ | $\frac{\text{SSR}_k/1}{\text{SSE}/(n-k-1)}$ | $p_k$ |
| Residuals | $n-k-1$ | SSE | $\frac{\text{SSE}}{n-k-1}$ | | |

Here $\text{SSR}_j$ represents the regression sum of squares associated with the inclusion of $X_j$ in the model that already contains the predictors $X_1,\ldots,X_{j-1}$, that is:
$$\text{SSR}_j=\text{SSR}(X_1,\ldots,X_j)-\text{SSR}(X_1,\ldots,X_{j-1}).$$
The p-values $p_1,\ldots,p_k$ correspond to the testing of the hypotheses
$$H_0:\beta_j=0\quad\text{vs.}\quad H_1:\beta_j\neq 0,$$
carried out inside the linear model $Y=\beta_0+\beta_1X_1+\cdots+\beta_jX_j+\varepsilon$. This is like the t-test for $\beta_j$ in the model with predictors $X_1,\ldots,X_j$.

Let’s see how we can compute both ANOVA tables in R. The sequential table is simple: use anova. We illustrate it with the Boston dataset.

# Load data
library(MASS)
data(Boston)

# Fit a linear model
model <- lm(medv ~ crim + lstat + zn + nox, data = Boston)
summary(model)
## 
## Call:
## lm(formula = medv ~ crim + lstat + zn + nox, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.972  -3.956  -1.344   2.148  25.076 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.93462    1.72076  17.977   <2e-16 ***
## crim        -0.08297    0.03677  -2.257   0.0245 *  
## lstat       -0.90940    0.05040 -18.044   <2e-16 ***
## zn           0.03493    0.01395   2.504   0.0126 *  
## nox          5.42234    3.24241   1.672   0.0951 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.169 on 501 degrees of freedom
## Multiple R-squared:  0.5537, Adjusted R-squared:  0.5502 
## F-statistic: 155.4 on 4 and 501 DF,  p-value: < 2.2e-16

# ANOVA table with sequential test
anova(model)
## Analysis of Variance Table
## 
## Response: medv
##            Df  Sum Sq Mean Sq  F value  Pr(>F)    
## crim        1  6440.8  6440.8 169.2694 < 2e-16 ***
## lstat       1 16950.1 16950.1 445.4628 < 2e-16 ***
## zn          1   155.7   155.7   4.0929 0.04360 *  
## nox         1   106.4   106.4   2.7967 0.09509 .  
## Residuals 501 19063.3    38.1                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The last p-value is the one of the last t-test
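The sequential nature of the table can be verified directly. A small sketch, reusing the Boston data loaded above: the Sum Sq of lstat equals the increase in the regression sum of squares when lstat is added to the model that already contains crim.

# Check SSR_j = SSR(X_1, ..., X_j) - SSR(X_1, ..., X_{j-1}) for j = 2 (lstat)
ssr <- function(fit) sum((fitted(fit) - mean(Boston$medv))^2)
ssr(lm(medv ~ crim + lstat, data = Boston)) - ssr(lm(medv ~ crim, data = Boston))
anova(model)["lstat", "Sum Sq"]  # the same quantity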

In order to compute the simplified ANOVA table, we need to rely on an ad-hoc function27. The function takes as input a fitted lm:

# This function computes the simplified anova from a linear model
simpleAnova <- function(object, ...) {

  # Compute anova table
  tab <- anova(object, ...)

  # Obtain number of predictors
  p <- nrow(tab) - 1

  # Add predictors row
  predictorsRow <- colSums(tab[1:p, 1:2])
  predictorsRow <- c(predictorsRow, predictorsRow[2] / predictorsRow[1])

  # F-quantities
  Fval <- predictorsRow[3] / tab[p + 1, 3]
  pval <- pf(Fval, df1 = p, df2 = tab$Df[p + 1], lower.tail = FALSE)
  predictorsRow <- c(predictorsRow, Fval, pval)

  # Simplified table
  tab <- rbind(predictorsRow, tab[p + 1, ])
  row.names(tab)[1] <- "Predictors"
  return(tab)

}

# Simplified ANOVA
simpleAnova(model)
## Analysis of Variance Table
## 
## Response: medv
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## Predictors   4  23653  5913.3  155.41 < 2.2e-16 ***
## Residuals  501  19063    38.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Recall that the F-statistic, its p-value and the degrees of freedom are also given in the output of summary.

Compute the ANOVA table for the regression Price ~ WinterRain + AGST + HarvestRain + Age in the wine dataset. Check that the p-value for the F-test given in summary and by simpleAnova are the same.

3.6.2 The $R^2$

The coefficient of determination $R^2$ is defined as in simple linear regression:
$$R^2=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SSR}}{\text{SSR}+\text{SSE}}=\frac{\text{SSR}}{\text{SSR}+(n-k-1)\hat\sigma^2}.$$
$R^2$ measures the proportion of variation of the response variable $Y$ that is explained by the predictors $X_1,\ldots,X_k$ through the regression. Intuitively, $R^2$ measures the tightness of the data cloud around the regression plane. Check in Figure 3.12 how changing the value of $\sigma^2$ (not $\hat\sigma^2$, although $\hat\sigma^2$ obviously depends on $\sigma^2$) affects the $R^2$. Also, as we saw in Section 2.7, $R^2=r^2_{y\hat y}$, that is, $R^2$ is the square of the sample correlation coefficient between $Y_1,\ldots,Y_n$ and $\hat Y_1,\ldots,\hat Y_n$.
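These identities are easy to check numerically. A minimal sketch, reusing the model fitted above to the Boston dataset:

# Several equivalent ways of computing the R^2 of the Boston fit
SST <- sum((Boston$medv - mean(Boston$medv))^2)
SSR <- sum((fitted(model) - mean(Boston$medv))^2)
SSE <- sum(residuals(model)^2)
SSR / SST
1 - SSE / SST
cor(Boston$medv, fitted(model))^2
summary(model)$r.squared  # all four coincide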

Blindly trusting the $R^2$ can lead to catastrophic conclusions in model selection. Here is a counterexample: a multiple regression in which the $R^2$ is apparently large, but where the assumptions discussed in Section 3.3 are clearly not satisfied.

# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x1 <- seq(0.15, 1, l = 100)
set.seed(123456)
x2 <- runif(100, -3, 3)
eps <- rnorm(n = 100, sd = 0.25 * x1^2)
y <- 1 - 3 * x1 * (1 + 0.25 * sin(4 * pi * x1)) + 0.25 * cos(x2) + eps

# Great R^2!?
reg <- lm(y ~ x1 + x2)
summary(reg)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78737 -0.20946  0.01031  0.19652  1.05351 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.788812   0.096418   8.181  1.1e-12 ***
## x1          -2.540073   0.154876 -16.401  < 2e-16 ***
## x2           0.002283   0.020954   0.109    0.913    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3754 on 97 degrees of freedom
## Multiple R-squared:  0.744,  Adjusted R-squared:  0.7388 
## F-statistic:   141 on 2 and 97 DF,  p-value: < 2.2e-16
# But prediction is obviously problematic (scatter3d is in the car package)
scatter3d(y ~ x1 + x2, fit = "linear")

Remember that:

  • $R^2$ does not measure the correctness of a linear model but its usefulness, assuming the model is correct.
  • $R^2$ is the proportion of variance of $Y$ explained by $X_1,\ldots,X_k$, but, of course, only when the linear model is correct.

We conclude by pointing out a nice connection between the $R^2$, the ANOVA decomposition, and the least squares estimator $\hat\beta$:

The ANOVA decomposition gives another interpretation of the least-squares estimates: $\hat\beta$ contains the estimated coefficients that maximize the $R^2$ (among all the possible estimates we could think of). To see this, recall that
$$R^2=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SST}-\text{SSE}}{\text{SST}}=\frac{\text{SST}-\text{RSS}(\hat\beta)}{\text{SST}},$$
so if $\text{RSS}(\hat\beta)=\min_{\beta\in\mathbb{R}^{k+1}}\text{RSS}(\beta)$, then $R^2$ is maximal for $\hat\beta$!
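A quick numerical illustration of this fact, reusing the Boston model from above (the perturbation is arbitrary; any nonzero perturbation of the coefficients increases the RSS and hence lowers the $R^2$):

# 1 - RSS(beta) / SST is maximized at the least-squares coefficients
X <- model.matrix(model)  # design matrix of the Boston fit, intercept included
SST <- sum((Boston$medv - mean(Boston$medv))^2)
R2 <- function(beta) 1 - sum((Boston$medv - X %*% beta)^2) / SST
R2(coef(model))  # equals summary(model)$r.squared
R2(coef(model) + rnorm(5, sd = 0.1))  # always smaller, for any perturbation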

3.6.3 The $R^2_\text{Adj}$

As we saw, these are equivalent forms for $R^2$:
$$R^2=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SST}-\text{SSE}}{\text{SST}}=1-\frac{\text{SSE}}{\text{SST}}=1-\frac{\hat\sigma^2}{\text{SST}}\times(n-k-1).\tag{3.10}$$
The SSE in the numerator always decreases as more predictors are added to the model, even if these are not significant. As a consequence, the $R^2$ always increases with $k$. Why is this so? Intuitively, because the complexity – hence the flexibility – of the model increases when we use more predictors to explain $Y$. Mathematically, because when $k$ approaches $n-1$ the second term in (3.10) is reduced and, as a consequence, $R^2$ grows.

The adjusted $R^2$, $R^2_\text{Adj}$, is an important quantity specifically designed to address this flaw of the $R^2$, which is ubiquitous in multiple linear regression. The purpose is to have a better tool for comparing models, without systematically favoring more complex ones. This alternative coefficient is defined as
$$R^2_\text{Adj}=1-\frac{\text{SSE}/(n-k-1)}{\text{SST}/(n-1)}=1-\frac{\text{SSE}}{\text{SST}}\times\frac{n-1}{n-k-1}=1-\frac{\hat\sigma^2}{\text{SST}}\times(n-1).\tag{3.11}$$
The $R^2_\text{Adj}$ is independent of $k$, at least explicitly. If $k=1$, then $R^2_\text{Adj}$ is almost $R^2$ (practically identical if $n$ is large). Both (3.10) and (3.11) are quite similar except for the last factor, which in the latter does not depend on $k$. Therefore, (3.11) will only increase if $\hat\sigma^2$ is reduced with $k$ – in other words, if the new variables contribute to the reduction of variability around the regression plane.
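As a check of (3.11), the adjusted $R^2$ can be recomputed by hand. A minimal sketch, again reusing the Boston model:

# Adjusted R^2 by hand for the Boston fit
n <- nobs(model)
k <- length(coef(model)) - 1
SSE <- sum(residuals(model)^2)
SST <- sum((Boston$medv - mean(Boston$medv))^2)
1 - (SSE / (n - k - 1)) / (SST / (n - 1))
summary(model)$adj.r.squared  # the same value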

The different behavior of $R^2$ and $R^2_\text{Adj}$ can be visualized through a small simulation. Suppose that we generate a random dataset with $n=200$ observations of a response $Y$ and two predictors $X_1$, $X_2$, that is, the sample $\{(X_{i1},X_{i2},Y_i)\}_{i=1}^n$ with
$$Y_i=\beta_0+\beta_1X_{i1}+\beta_2X_{i2}+\varepsilon_i,\quad\varepsilon_i\sim\mathcal{N}(0,1).$$
To this data we add 196 garbage predictors that are completely independent of $Y$. Therefore, we end up with $k=198$ predictors. Now we compute $R^2(j)$ and $R^2_\text{Adj}(j)$ for the models $Y=\beta_0+\beta_1X_1+\cdots+\beta_jX_j+\varepsilon$, with $j=1,\ldots,k$, and we plot them as the curves $(j,R^2(j))$ and $(j,R^2_\text{Adj}(j))$. Since $R^2$ and $R^2_\text{Adj}$ are random variables, we repeat the procedure $100$ times to have a measure of their variability.


Figure 3.13: Comparison of $R^2$ and $R^2_\text{Adj}$ for $n=200$ and $k$ ranging from $1$ to $198$. $M=100$ datasets were simulated with only the first two predictors being significant. The thicker curves are the mean of each color’s curves.

Figure 3.13 contains the results of this experiment. As you can see, $R^2$ increases linearly with the number of predictors considered, although only the first two are important! On the contrary, $R^2_\text{Adj}$ only increases for the first two variables and then is flat on average, although it has a huge variability when $k$ approaches $n-2$. This is a consequence of the explosive variance of $\hat\sigma^2$ in that degenerate case (as we will see in Section 3.7). The experiment evidences that $R^2_\text{Adj}$ is more adequate than $R^2$ for evaluating the fit of a multiple linear regression.

An example of a simulated dataset considered in the experiment of Figure 3.13:

# Generate data
k <- 198
n <- 200
set.seed(3456732)
beta <- c(0.5, -0.5, rep(0, k - 2))
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
Y <- drop(X %*% beta + rnorm(n, sd = 3))
data <- data.frame(y = Y, x = X)

# Regression on the two meaningful predictors
summary(lm(y ~ x.1 + x.2, data = data))

# Adding 20 garbage variables
summary(lm(y ~ X[, 1:22], data = data))
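For completeness, here is a minimal sketch of how the curves $(j,R^2(j))$ and $(j,R^2_\text{Adj}(j))$ of Figure 3.13 could be traced for this single simulated dataset (the figure averages $M=100$ such replicates; the plotting choices are arbitrary):

# R^2 and adjusted R^2 of the models with the first j predictors, j = 1, ..., k
r2 <- r2Adj <- numeric(k)
for (j in 1:k) {
  fit <- lm(y ~ ., data = data[, c(1, 1 + 1:j)])  # response plus first j predictors
  r2[j] <- summary(fit)$r.squared
  r2Adj[j] <- summary(fit)$adj.r.squared
}
plot(1:k, r2, type = "l", ylim = range(r2, r2Adj), xlab = "j", ylab = "")
lines(1:k, r2Adj, col = 2)
legend("bottomleft", legend = c(expression(R^2), expression(R[Adj]^2)), col = 1:2, lwd = 1)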

The $R^2_\text{Adj}$ no longer measures the proportion of variation of $Y$ explained by the regression, but the result of correcting this proportion by the number of predictors employed. As a consequence of this, $R^2_\text{Adj}\leq 1$, but it can be negative!

The next code illustrates a situation where we have two predictors that are completely independent of the response. The fitted model has a negative $R^2_\text{Adj}$.

# Three independent variables
set.seed(234599)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + rnorm(100)

# Negative adjusted R^2
summary(lm(y ~ x1 + x2))
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5081 -0.5021 -0.0191  0.5286  2.4750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.97024    0.10399   9.330 3.75e-15 ***
## x1           0.09003    0.10300   0.874    0.384    
## x2          -0.05253    0.11090  -0.474    0.637    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.034 on 97 degrees of freedom
## Multiple R-squared:  0.009797,   Adjusted R-squared:  -0.01062 
## F-statistic: 0.4799 on 2 and 97 DF,  p-value: 0.6203

Construct more predictors (x3, x4, …) by sampling 100 points from a normal (rnorm(100)). Check that, as the predictors are added to the model, the $R^2_\text{Adj}$ decreases and the $R^2$ increases.


  25. SSE and RSS are two names for the same quantity (that appears in different contexts): $\text{SSE}=\sum_{i=1}^n(Y_i-\hat Y_i)^2=\sum_{i=1}^n(Y_i-\hat\beta_0-\hat\beta_1X_{i1}-\cdots-\hat\beta_kX_{ik})^2=\text{RSS}(\hat\beta)$.

  26. More complex – included here just for clarification of anova’s output.

  27. You will need to run this piece of code whenever you want to call simpleAnova, since it is not part of R or R Commander.