## 2.7 ANOVA and model fit

### 2.7.1 ANOVA

As we have seen in Sections 2.5 and 2.6, the variance of the error, \(\sigma^2\), plays a fundamental role in the inference for the model coefficients and prediction. In this section we will see how the variance of \(Y\) is decomposed into two parts, each one corresponding to the regression and to the error, respectively. This decomposition is called the *ANalysis Of VAriance* (ANOVA).

*the mean of the fitted values \(\hat Y_1,\ldots,\hat Y_n\) is the mean of \(Y_1,\ldots, Y_n\)*. This is easily seen if we plug-in the expression of \(\hat\beta_0\): \[\begin{align*} \frac{1}{n}\sum_{i=1}^n \hat Y_i=\frac{1}{n}\sum_{i=1}^n \left(\hat \beta_0+\hat\beta_1X_i\right)=\hat \beta_0+\hat\beta_1\bar X=\left(\bar Y - \hat\beta_1\bar X \right) + \hat\beta_1\bar X=\bar Y. \end{align*}\]

The ANOVA decomposition considers the following measures of variation related with the response:

- \(\text{SST}=\sum_{i=1}^n\left(Y_i-\bar Y\right)^2\), the
**total sum of squares**. This is the*total variation*of \(Y_1,\ldots,Y_n\), since \(\text{SST}=ns_y^2\), where \(s_y^2\) is the sample variance of \(Y_1,\ldots,Y_n\). - \(\text{SSR}=\sum_{i=1}^n\left(\hat Y_i-\bar Y\right)^2\), the
**regression sum of squares**^{14}. This is the variation explained by the regression line, that is,*the variation from \(\bar Y\) that is explained by the estimated conditional mean \(\hat Y_i=\hat\beta_0+\hat\beta_1X_i\)*. \(\text{SSR}=ns_{\hat y}^2\), where \(s_{\hat y}^2\) is the sample variance of \(\hat Y_1,\ldots,\hat Y_n\). - \(\text{SSE}=\sum_{i=1}^n\left(Y_i-\hat Y_i\right)^2\), the
**sum of squared errors**^{15}. Is the variation around the conditional mean. Recall that \(\text{SSE}=\sum_{i=1}^n \hat\varepsilon_i^2=(n-2)\hat\sigma^2\), where \(\hat\sigma^2\) is the sample variance of \(\hat \varepsilon_1,\ldots,\hat \varepsilon_n\).

The graphical interpretation of (2.11) is shown in Figures 2.24 and 2.25.

The ANOVA table summarizes the decomposition of the variance. Here is given in the layout employed by `R`

.

Degrees of freedom | Sum Squares | Mean Squares | \(F\)-value | \(p\)-value | |
---|---|---|---|---|---|

Predictor | \(1\) | SSR | \(\frac{\text{SSR}}{1}\) | \(\frac{\text{SSR}/1}{\text{SSE}/(n-2)}\) | \(p\) |

Residuals | \(n - 2\) | SSE | \(\frac{\text{SSE}}{n-2}\) |

The `anova`

function in `R`

takes a model as an input and returns the ANOVA table. In `R Commander`

, the ANOVA table can be computed by going to `'Models' -> 'Hypothesis tests' -> 'ANOVA table...'`

. In the `'Type of Tests'`

option, select `'Sequential ("Type I")'`

. (This types `anova()`

for you…)

```
# Fit a linear model (alternatively, using R Commander)
model <- lm(MathMean ~ ReadingMean, data = pisa)
summary(model)
##
## Call:
## lm(formula = MathMean ~ ReadingMean, data = pisa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.039 -10.151 -2.187 7.804 50.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.65638 19.87265 -3.153 0.00248 **
## ReadingMean 1.13083 0.04172 27.102 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.72 on 63 degrees of freedom
## Multiple R-squared: 0.921, Adjusted R-squared: 0.9198
## F-statistic: 734.5 on 1 and 63 DF, p-value: < 2.2e-16
# ANOVA table
anova(model)
## Analysis of Variance Table
##
## Response: MathMean
## Df Sum Sq Mean Sq F value Pr(>F)
## ReadingMean 1 181525 181525 734.53 < 2.2e-16 ***
## Residuals 63 15569 247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

*specific for simple linear regression*– the \(F\)-test will not be equivalent to the \(t\)-test for \(\beta_1\) in Chapter 3). I It happens that \[\begin{align*} F=\frac{\text{SSR}/1}{\text{SSE}/(n-2)}\stackrel{H_0}{\sim} F_{1,n-2}, \end{align*}\]

where \(F_{1,n-2}\) is the *Snedecor’s \(F\) distribution*^{16} with \(1\) and \(n-2\) degrees of freedom. If \(H_0\) is true, then \(F\) is expected to be *small* since SSR will be close to zero. The \(p\)-value of this test is the same as the \(p\)-value of the \(t\)-test for \(H_0:\beta_1=0\).

Recall that the \(F\)-statistic, its \(p\)-value and the degrees of freedom are also given in the output of `summary`

.

For the `y6 ~ x6`

and `y7 ~ x7`

in the `assumptions`

dataset, compute their ANOVA tables. Check that the *p*-values of the *t*-test for *β*_{1} and the *F*-test are the same.

### 2.7.2 The \(R^2\)

The*coefficient of determination*\(R^2\) is closely related with the ANOVA decomposition. \(R^2\) is defined as \[\begin{align*} R^2=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SSR}}{\text{SSR}+\text{SSE}}=\frac{\text{SSR}}{\text{SSR}+(n-2)\hat\sigma^2}. \end{align*}\]

\(R^2\) measures the **proportion of variation** of the response variable \(Y\) that is **explained** by the predictor \(X\) through the regression. The proportion of total variation of \(Y\) that is not explained is \(1-R^2=\frac{\text{SSE}}{\text{SST}}\). Intuitively, \(R^2\) measures the **tightness of the data cloud around the regression line**, since is related directly with \(\hat\sigma^2\). Check in Figure 2.25 how changing the value of \(\sigma^2\) (not \(\hat\sigma^2\), but \(\hat\sigma^2\) is obviously dependent on \(\sigma^2\)) affects the \(R^2\).

*correlation coefficient*\[\begin{align*} r_{xy}=\frac{s_{xy}}{s_xs_y}=\frac{\sum_{i=1}^n \left(X_i-\bar X \right)\left(Y_i-\bar Y \right)}{\sqrt{\sum_{i=1}^n \left(X_i-\bar X \right)^2}\sqrt{\sum_{i=1}^n \left(Y_i-\bar Y \right)^2}} \end{align*}\] and it can be seen that \(R^2=r_{xy}^2\). Interestingly, it also holds that \(R^2=r^2_{y\hat y}\), that is,

*the square of the sample correlation coefficient between \(Y_1,\ldots,Y_n\) and \(\hat Y_1,\ldots,\hat Y_n\) is \(R^2\)*, a fact that is not immediately evident. This can be easily by first noting that \[\begin{align} \hat Y_i=\hat\beta_0+\hat\beta_1X_i=(\bar Y-\hat\beta_1X_i)+\hat\beta_1X_i=\bar Y+\hat\beta_1(X_i-\bar X) \tag{2.12} \end{align}\] and then replacing (2.12) into \[\begin{align*} r^2_{y\hat y}=\frac{s_{y\hat y}^2}{s_y^2s_{\hat y}^2}=\frac{\left(\sum_{i=1}^n \left(Y_i-\bar Y \right)\left(\hat Y_i-\bar Y \right)\right)^2}{\sum_{i=1}^n \left(Y_i-\bar Y \right)^2\sum_{i=1}^n \left(\hat Y_i-\bar Y \right)^2}=\frac{\left(\sum_{i=1}^n \left(Y_i-\bar Y \right)\left(\bar Y+\hat\beta_1(X_i-\bar X)-\bar Y \right)\right)^2}{\sum_{i=1}^n \left(Y_i-\bar Y \right)^2\sum_{i=1}^n \left(\bar Y+\hat\beta_1(X_i-\bar X)-\bar Y \right)^2}=r^2_{xy}. \end{align*}\]

The equality \(R^2=r^2_{y \hat y}\) is still true for the **multiple** linear regression, e.g. *Y* = *β*_{0} + *β*_{1}*X*_{1} + *β*_{2}*X*_{2} + *ε*. On the contrary, there is no coefficient of correlation between three or more variables, so *r*_{x1x2y} does not exist. Hence, *R*^{2} = *r*_{xy}^{2} is a specific fact for **simple** linear regression.

The result \(R^2=r^2_{xy}=r^2_{y\hat y}\) can be checked numerically and graphically with the next code.

```
# Responses generated following a linear model
set.seed(343567) # Fixes seed, allows to generate the same random data
x <- rnorm(50)
eps <- rnorm(50)
y <- -1 + 2 * x + eps
# Regression model
reg <- lm(y ~ x)
yHat <- reg$fitted.values
# Summary
summary(reg)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5834 -0.6015 -0.1466 0.6006 3.0419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9027 0.1503 -6.007 2.45e-07 ***
## x 2.1107 0.1443 14.623 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.059 on 48 degrees of freedom
## Multiple R-squared: 0.8167, Adjusted R-squared: 0.8129
## F-statistic: 213.8 on 1 and 48 DF, p-value: < 2.2e-16
# Square of the correlation coefficient
cor(y, x)^2
## [1] 0.8166752
cor(y, yHat)^2
## [1] 0.8166752
# Plots
scatterplot(y ~ x, smooth = FALSE)
scatterplot(y ~ yHat, smooth = FALSE)
```

We conclude this section by pointing out two common sins regarding the use of \(R^2\). First, recall two important concepts regarding the application of *any* regression model in practice, in particular the linear model:

**Correctness**. The linear model is built on certain assumptions, such as the ones we saw in Section 2.4.**All the inferential results are based on these assumptions being true!**^{17}. A model is formally*correct*whenever the assumptions on which is based are not violated in the data.**Usefulness**. The usefulness of the model is a more subjective concept, but is usually measured by the accuracy on the prediction and explanation of the response \(Y\) by the predictor \(X\). For example, \(Y=0X+\varepsilon\) is a valid linear model, but is completely useless for predicting \(Y\) from \(X\).

Figure 2.25 show a fitted regression line to a small dataset, for various levels of \(\sigma^2\). All the linear models are *correct* by construction, but the ones with a larger \(R^2\) are more *useful* for predicting/explaining \(Y\) from \(X\), since this is done in a more precise way.

*R*^{2} does not measure the correctness of a linear model but its **usefulness** (for prediction, for *explaining the variance* of *Y*), assuming the model is correct.

Trusting blindly the \(R^2\) can lead to catastrophic conclusions, since the model may not be correct. Here is a counterexample of a linear regression performed in a data that clearly does not satisfy the assumptions discussed in Section 2.4, but despite so it has a large \(R^2\). Recall how biased will be the predictions for \(x=0.35\) and \(x=0.65\)!

```
# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x <- seq(0.15, 1, l = 100)
set.seed(123456)
eps <- rnorm(n = 100, sd = 0.25 * x^2)
y <- 1 - 2 * x * (1 + 0.25 * sin(4 * pi * x)) + eps
# Great R^2!?
reg <- lm(y ~ x)
summary(reg)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53525 -0.18020 0.02811 0.16882 0.46896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.87190 0.05860 14.88 <2e-16 ***
## x -1.69268 0.09359 -18.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.232 on 98 degrees of freedom
## Multiple R-squared: 0.7695, Adjusted R-squared: 0.7671
## F-statistic: 327.1 on 1 and 98 DF, p-value: < 2.2e-16
# But prediction is obviously problematic
scatterplot(y ~ x, smooth = FALSE)
```

A large *R*^{2} means *nothing* if the **assumptions of the model do not hold**. *R*^{2} is the proportion of variance of *Y* explained by *X*, but, of course, *only when the linear model is correct*.

Recall that SSR is different from RSS (Residual Sum of Squares, Section 2.3).↩

Recall that SSE and RSS (for \((\hat \beta_0,\hat \beta_1)\)) are just different names for referring to the same quantity: \(\text{SSE}=\sum_{i=1}^n\left(Y_i-\hat Y_i\right)^2=\sum_{i=1}^n\left(Y_i-\hat \beta_0-\hat \beta_1X_i\right)^2=\mathrm{RSS}\left(\hat \beta_0,\hat \beta_1\right)\).↩

The \(F_{n,m}\) distribution arises as the quotient of two independent random variables \(\chi^2_n\) and \(\chi^2_m\), \(\frac{\chi^2_n/n}{\chi^2_m/m}\).↩

If the assumptions are not satisfied (mismatch between what is assumed to happen in theory and what the data is), then the inference results may be misleading.↩