Chapter 22 ANOVA Tables and F Tests

22.1 Introduction

In this Section we will focus on the sources of variation in a linear model and how these can be used to determine the most appropriate linear model. We start in Section 22.2 by considering the residuals of the linear model and their properties. In Section 22.3, we introduce the total sum-of-squares SStot, which is a measure of the total amount of variation in the model. This is comprised of two components: the regression sum-of-squares, SSreg, which measures the variability in the observations that is captured by the model, and the residual sum-of-squares, SSres, which measures the unexplained variability in the observations. In Section 22.4, we introduce ANOVA tables for summarising variability in the model and testing null hypotheses. In particular, we consider the Fuel Consumption example, introduced in Section 21.2, and show how the conclusions obtained in Section 21.4 can be presented in the form of an ANOVA table. In Section 22.5, we consider two linear models for a given data set, where one model is nested within the other model. This allows us to compare two models which lie between the full model (includes all variables) and the null model (excludes all variables). Finally, in Section 22.6 we extend the comparison of nested models to sequential sums-of-squares to find the most appropriate model out of a range of nested linear models.

22.2 The residuals

Consider the linear model
y=Zβ+ϵ.

Recall the following about the model:

  • E[ϵ] = 0;
  • Var(ϵ) = σ²Iₙ;
  • β̂ = (ZᵀZ)⁻¹Zᵀy is the LSE of β;
  • ŷ = Zβ̂ is the n×1 vector of fitted values;
  • ŷ = Py, where P = Z(ZᵀZ)⁻¹Zᵀ.
Let r = ϵ̂ = y − ŷ be the n×1 vector of residuals. Note that
r = ϵ̂ = y − ŷ = y − Zβ̂ = y − Py = (Iₙ − P)y,

where Iₙ − P is symmetric, idempotent and has trace(Iₙ − P) = rank(Iₙ − P) = n − p. Note that rank(Iₙ − P) = n − p denotes the degrees of freedom of the residuals and is equal to the number of observations, n, minus the number of coefficients (parameters), p.

Theorem 1   The vector of fitted values is orthogonal to the vector of residuals, that is, ŷᵀϵ̂ = ϵ̂ᵀŷ = 0.

The details of the proof may be omitted on a first reading.

Proof of Theorem 1.
ŷᵀϵ̂ = (Py)ᵀ(Iₙ − P)y = yᵀPᵀ(Iₙ − P)y = yᵀPᵀy − yᵀPᵀPy = yᵀPy − yᵀPPy = yᵀPy − yᵀPy = 0,

using that P is symmetric (Pᵀ = P) and idempotent (P² = P).

Therefore, y can be written as a linear combination of orthogonal vectors:
y = ŷ + ϵ̂.

The normal linear model assumes ϵ ~ N(0, σ²Iₙ). We would expect the sample residuals, ϵ̂, to exhibit many of the properties of the error terms. The properties of ϵ̂ can be explored via graphical methods as in Section 16.6 and Lab 9: Linear Models.
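These properties can also be checked numerically. The following sketch (made-up data, and plain Python rather than the R used in the labs) fits a simple linear regression by least squares and verifies that the residuals sum to zero and are orthogonal to the fitted values:

```python
# Sketch with made-up data: verify two residual properties of least
# squares numerically, sum(resid) = 0 and y-hat . eps-hat = 0.

def fit_simple(x, y):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]                      # made-up observations

b0, b1 = fit_simple(x, y)
fitted = [b0 + b1 * xi for xi in x]                # y-hat
resid = [yi - fi for yi, fi in zip(y, fitted)]     # eps-hat = y - y-hat

print(sum(resid))                                  # approximately 0
print(sum(f * r for f, r in zip(fitted, resid)))   # approximately 0
```

Both printed quantities are zero up to floating-point rounding, in line with Theorem 1 and the normal equations.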

22.3 Sums of squares

Let yi be the ith observation, y^i be the ith fitted value and y¯ be the mean of the observed values.

Model Deviance
The model deviance is given by
D = ∑_{i=1}^n (yᵢ − ŷᵢ)².
We have
(yᵢ − ȳ) = (yᵢ − ŷᵢ) + (ŷᵢ − ȳ),
(yᵢ − ȳ)² = [(yᵢ − ŷᵢ) + (ŷᵢ − ȳ)]² = (yᵢ − ŷᵢ)² + (ŷᵢ − ȳ)² + 2(yᵢ − ŷᵢ)(ŷᵢ − ȳ),
∑_{i=1}^n (yᵢ − ȳ)² = ∑_{i=1}^n (yᵢ − ŷᵢ)² + ∑_{i=1}^n (ŷᵢ − ȳ)² + 2∑_{i=1}^n (yᵢ − ŷᵢ)(ŷᵢ − ȳ).
Now,
∑_{i=1}^n (yᵢ − ŷᵢ)(ŷᵢ − ȳ) = ∑_{i=1}^n (yᵢ − ŷᵢ)ŷᵢ − ∑_{i=1}^n (yᵢ − ŷᵢ)ȳ = ∑_{i=1}^n ϵ̂ᵢŷᵢ − ȳ∑_{i=1}^n ϵ̂ᵢ = ϵ̂ᵀŷ − ȳ∑_{i=1}^n ϵ̂ᵢ = 0 − 0 = 0,
using that ϵ̂ and ŷ are orthogonal, and that ∑_{i=1}^n ϵ̂ᵢ = 0 is one of the normal equations. Therefore,
∑_{i=1}^n (yᵢ − ȳ)² = ∑_{i=1}^n (yᵢ − ŷᵢ)² + ∑_{i=1}^n (ŷᵢ − ȳ)².


Total sum of squares   Define SStot = ∑_{i=1}^n (yᵢ − ȳ)² as the total sum of squares. This is proportional to the total variability in y since SStot = (n − 1)Var(y), where Var(y) denotes the sample variance. It does not depend on the choice of predictor variables in Z.


Residual sum of squares   Define SSres = ∑_{i=1}^n (yᵢ − ŷᵢ)² as the residual sum of squares. This is a measure of the amount of variability in y that the model was unable to explain. It is equivalent to the deviance of the model, that is, SSres = D.


Regression sum of squares   Define SSreg = ∑_{i=1}^n (ŷᵢ − ȳ)² as the regression sum of squares. This is the difference between SStot and SSres and is a measure of the amount of variability in y that the model was able to explain.

From our above derivations, note that
∑_{i=1}^n (yᵢ − ȳ)² = ∑_{i=1}^n (yᵢ − ŷᵢ)² + ∑_{i=1}^n (ŷᵢ − ȳ)², that is, SStot = SSres + SSreg.
Coefficient of determination   The coefficient of determination is
R² = SSreg/SStot.

The coefficient of determination measures the proportion of variability explained by the regression. Additionally note that:

  • 0 ≤ R² ≤ 1;
  • R² = 1 − SSres/SStot;
  • R2 is often used as a measure of how well the regression model fits the data: the larger R2 is, the better the fit. One needs to be careful in interpreting what “large” is on a case-by-case basis.
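As a numerical illustration (a sketch with made-up data, not from the text), the decomposition SStot = SSres + SSreg and the resulting R² can be checked directly:

```python
# Sketch with made-up data: check SStot = SSres + SSreg and compute R^2
# for a simple linear regression fitted by least squares.

def fit_simple(x, y):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]                 # made-up observations

b0, b1 = fit_simple(x, y)
fitted = [b0 + b1 * xi for xi in x]
ybar = sum(y) / len(y)

sstot = sum((yi - ybar) ** 2 for yi in y)     # total sum of squares
ssres = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ssreg = sum((fi - ybar) ** 2 for fi in fitted)
r2 = ssreg / sstot

print(abs(sstot - (ssres + ssreg)))           # approximately 0
print(0.0 <= r2 <= 1.0)                       # True
```

The identity holds up to floating-point rounding, and R² equals 1 − SSres/SStot as stated in the bullet points.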
Adjusted R²   The adjusted R² is
R²adj = 1 − [SSres/(n − p)] / [SStot/(n − 1)].

The adjusted R2 is often used to compare the fit of models with different numbers of parameters.

Under the null hypothesis model, Yᵢ = β₀ + ϵᵢ and ȳ = ŷᵢ for each i. In this special case,
SStot = SSres = D,  SSreg = 0,  R² = R²adj = 0.

22.4 Analysis of Variance (ANOVA)

Recall from Section 21.4 that the F statistic used in the test for the existence of regression is F = [(D₀ − D₁)/(p − 1)] / [D₁/(n − p)], where D₁ and D₀ are the model deviances (or SSres) under the alternative and null hypotheses respectively. We noted above that D₀, the deviance under the null hypothesis, is equal to SStot for any model.

Mean square regression   The mean square regression is the numerator in the F statistic,
MSreg = (D₀ − D₁)/(p − 1) = (SStot − SSres)/(p − 1) = SSreg/(p − 1).
Mean square residual   The mean square residual is the denominator in the F statistic,
MSres = D₁/(n − p) = SSres/(n − p).

Note that the mean square residual is an unbiased estimator of σ². Similarly, the residual standard error, RSE = √MSres, is an estimate of σ.

The quantities involved in the calculation of the F statistic are usually displayed in an ANOVA table:

Source      Degrees of Freedom   Sum of Squares   Mean Square             F Statistic
Regression  p − 1                SSreg            MSreg = SSreg/(p − 1)   F = MSreg/MSres
Residual    n − p                SSres            MSres = SSres/(n − p)
Total       n − 1                SStot

Fuel consumption (continued)   For the data in Section 21.2, Example 1 (Fuel Consumption) the model fuel = β₀ + β₁dlic + β₂tax + β₃inc + β₄road was fitted to the n = 51 observations with residual standard error, RSE = 64.8912. Summary statistics show Var(fuel) = 7913.88. Complete an ANOVA table and compute R² for the fitted model.

We have

  • Note p − 1 = 4, n − p = 46 and n − 1 = 50;
  • SStot = (n − 1)Var(fuel) = 50 × 7913.88 = 395694;
  • MSres = RSE² = 64.8912² = 4210.87;
  • SSres = (n − p)MSres = 46 × 4210.87 = 193700;
  • SSreg = SStot − SSres = 395694 − 193700 = 201994;
  • MSreg = SSreg/(p − 1) = 201994/4 = 50498.50;
  • F = MSreg/MSres = 50498.5/4210.87 = 11.99.
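The arithmetic in the bullet points above can be reproduced directly (a Python sketch; the inputs n, p, RSE and Var(fuel) are the figures quoted in the example):

```python
# Reproduce the ANOVA arithmetic for the fuel consumption example.
n, p = 51, 5                        # observations, parameters
var_fuel = 7913.88                  # sample variance of fuel
rse = 64.8912                       # residual standard error

sstot = (n - 1) * var_fuel          # total sum of squares
msres = rse ** 2                    # mean square residual
ssres = (n - p) * msres             # residual sum of squares
ssreg = sstot - ssres               # regression sum of squares
msreg = ssreg / (p - 1)             # mean square regression
f_stat = msreg / msres              # F statistic
r2 = ssreg / sstot                  # coefficient of determination

print(round(sstot), round(ssres), round(ssreg))   # 395694 193700 201994
print(round(f_stat, 2), round(r2, 4))             # 11.99 0.5105
```

The printed values match the completed ANOVA table and the R² computed below.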

Hence the completed ANOVA table is

Source      Degrees of Freedom   Sum of Squares   Mean Square   F statistic
Regression  4                    201994           50498.50      11.99
Residual    46                   193700           4210.87
Total       50                   395694

We compare the computed F-statistic, 11.99, with an F_{p−1,n−p} = F_{4,46} distribution to obtain a p-value of P(F_{4,46} > 11.99) = 9.331×10⁻⁷. That is, if the null hypothesis (no regression parameters, β₁ = ⋯ = β₄ = 0) were true, there is probability 9.331×10⁻⁷ (just under one in a million) of observing an F-statistic larger than 11.99.

Finally, R² = SSreg/SStot = 201994/395694 = 0.5105.

22.5 Comparing models

Consider two models, M1 and M2, where M2 is a simplification of M1. For example,
M1: Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + ϵ,
M2: Y = β₀ + β₂X₂ + β₄X₄ + ϵ.

The residual sum of squares from model M1 will always be less than that from M2, but we can test the hypotheses H₀: β₁ = β₃ = 0 vs. H₁: β₁ ≠ 0 or β₃ ≠ 0 at significance level 100α% to assess whether removing these terms significantly increases the residual sum of squares.

Let D₁ = ∑_{i=1}^n (yᵢ − ŷᵢ)² be the model deviance, or SSres, for model M1.

Let D₂ = ∑_{i=1}^n (yᵢ − ŷᵢ)² be the model deviance, or SSres, for model M2.

The decision rule is to reject H₀ if F = [(D₂ − D₁)/q] / [D₁/(n − p)] > F_{q,n−p,α}, where n is the number of observations, p is the number of parameters in M1 and q is the number of parameters that are fixed to reduce M1 to M2.

For the example above, p=5 and q=2.
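The decision rule translates directly into code. The sketch below (the function name `nested_f` and the illustrative numbers are my own, not from the text) computes the F statistic for comparing nested models:

```python
# Sketch: F statistic for comparing nested linear models.
# d2, d1 are the residual sums of squares (deviances) of the reduced
# and full models, q the number of parameters fixed to reduce M1 to M2,
# n the number of observations and p the number of parameters in M1.

def nested_f(d2, d1, q, n, p):
    """F = [(D2 - D1)/q] / [D1/(n - p)]."""
    return ((d2 - d1) / q) / (d1 / (n - p))

# Illustrative (hypothetical) values: n = 30 observations, full model
# with p = 5 parameters, q = 2 parameters removed.
f = nested_f(d2=120.0, d1=100.0, q=2, n=30, p=5)
print(f)   # (20/2) / (100/25) = 2.5
```

The computed F would then be compared with the F_{q,n−p,α} critical value to decide whether to reject H₀.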

Watch Video 32 for a work through of comparing models using the F-distribution. The video uses Example 2 (Fuel consumption - continued) given below.

Video 32: Fuel consumption example (continued).

Fuel consumption (continued)   Let Model 1 be fuel = β₀ + β₁dlic + β₂tax + β₃inc + β₄road, and let Model 2 be fuel = β₀ + β₁dlic + β₃inc. The residual sum of squares is 193700 for Model 1 and 249264 for Model 2. Test which model fits the data better.

The question is equivalent to testing the hypotheses:
H₀: β₂ = β₄ = 0 vs. H₁: β₂ ≠ 0 or β₄ ≠ 0,

at α=0.05. The decision rule is to reject H0 if

F = [(D₂ − D₁)/q] / [D₁/(n − p)] > F_{q,n−p,α} = F_{2,46,0.05} = 3.20.

Substituting in the data gives,

F = [(249264 − 193700)/2] / [193700/(51 − 5)] = 27782/4210.87 = 6.60.

Consequently, we reject H₀. Model 1 fits the data better at α = 0.05.

We note that the p-value is P(F_{2,46} > 6.60) = 0.0030, which gives strong support in favour of Model 1.


Let’s consider the more general case where the basic model M1 is
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_{p−1}X_{p−1} + ϵ.
Denote
SSreg(M1) = ∑_{i=1}^n (ŷᵢ − ȳ)² = R(β₁, β₂, …, β_{p−1} | β₀),

assuming there is a constant in the model.

Our goal is to build a regression model which best describes the response variable. Hence we would like to explain as much of the variance in Y as possible, yet keep the model as simple as possible. This is known as the Principle of Parsimony. Consequently we want to determine which explanatory variables are worth including in the final model.

The idea is that explanatory variables should be included in the model if the extra portion of the regression sum of squares arising from their inclusion, called the extra sum of squares, is relatively large compared to the unexplained variability in the model, the residual sum of squares.

Consider a second model M2 which is a simplification of M1, specifically
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_{k−1}X_{k−1} + ϵ,
where k < p. Then
SSreg(M2) = R(β₁, β₂, …, β_{k−1} | β₀).
Extra sum of squares   The extra sum of squares due to the inclusion of the terms β_kX_k + ⋯ + β_{p−1}X_{p−1} in the model is
SSreg(M1) − SSreg(M2).
It is denoted
R(β_k, …, β_{p−1} | β₀, β₁, …, β_{k−1}) = R(β₁, β₂, …, β_{p−1} | β₀) − R(β₁, β₂, …, β_{k−1} | β₀).

The extra sum of squares has q = p − k degrees of freedom, where q is the number of explanatory variables added to the reduced model to make the full model.

We can test the hypotheses

  • H0: The reduced model, M2, best describes the data;
  • H1: The full model, M1, best describes the data.
The decision rule is to reject H₀ if
F = [R(β_k, …, β_{p−1} | β₀, …, β_{k−1})/q] / [SSres(M1)/(n − p)] > F_{q,n−p,α}.

Rejecting H₀ implies the full model describes the data better, so we should include the variables X_k, …, X_{p−1} in our model.

The test for the existence of regression is a special case of this type of test, where
H₀: β₁ = β₂ = ⋯ = β_{p−1} = 0,

that is, the reduced model is Y = β₀ + ϵ. Note that SSreg(M1) = R(β₁, β₂, …, β_{p−1} | β₀) is the extra sum of squares in this case.

22.6 Sequential sum of squares

Sequential sum of squares   The sequential sum of squares for each j, denoted SSseq_j, is
R(β_j | β₀, β₁, …, β_{j−1}) = R(β₁, β₂, …, β_j | β₀) − R(β₁, β₂, …, β_{j−1} | β₀)

and is the extra sum of squares that one incurs by adding the explanatory variable X_j to the model given that X₁, …, X_{j−1} are already present.

The sequential sum of squares is often given in addition to the basic ANOVA table.

Source     Degrees of Freedom   Sequential Sum of Squares   Mean Square                          F statistic
X₁         df₁                  SSseq₁                      MSseq₁ = SSseq₁/df₁                  F = MSseq₁/MSres
X₂         df₂                  SSseq₂                      MSseq₂ = SSseq₂/df₂                  F = MSseq₂/MSres
⋮          ⋮                    ⋮                           ⋮                                    ⋮
X_{p−1}    df_{p−1}             SSseq_{p−1}                 MSseq_{p−1} = SSseq_{p−1}/df_{p−1}   F = MSseq_{p−1}/MSres
Residuals  n − p                SSres                       MSres = SSres/(n − p)
Note that given the sequential sums of squares, one can calculate
R(β_j, β_{j+1}, …, β_k | β₀, β₁, …, β_{j−1}) = ∑_{i=j}^k SSseq_i.
However, one cannot calculate non-sequential sums of squares in this manner, such as
R(β₁, β₃, β₅ | β₀, β₂, β₄).
Note that in the fuel example
SSreg = R(β₁ | β₀) + R(β₂ | β₀, β₁) + R(β₃ | β₀, β₁, β₂) + R(β₄ | β₀, β₁, β₂, β₃).

Partial sum of squares   The partial sum of squares for each j is
R(β_j | β₀, β₁, …, β_{j−1}, β_{j+1}, …, β_{p−1}) = R(β₁, β₂, …, β_{p−1} | β₀) − R(β₁, …, β_{j−1}, β_{j+1}, …, β_{p−1} | β₀)
and is the extra sum of squares that one incurs by adding the explanatory variable X_j to the model given that X₁, …, X_{j−1}, X_{j+1}, …, X_{p−1} are already present.

Note that the F test for testing the hypotheses
H₀: β_j = 0 vs. H₁: β_j ≠ 0

at level α is equivalent to the t test for the individual parameter, since the square of a t_{n−p} random variable follows an F_{1,n−p} distribution.

Fuel consumption (continued)   For the data in Section 21.2, Example 1 (Fuel Consumption), we have the following ANOVA output table:

Source     Degrees of Freedom   Sequential Sum of Squares   Mean Square
dlic       1                    86854                       86854
tax        1                    19159                       19159
inc        1                    61408                       61408
road       1                    34573                       34573
Residuals  46                   193700                      4211

We want to test the following hypotheses:

  • H₀: Y = β₀ + β₁dlic + β₂tax + ϵ;
  • H₁: Y = β₀ + β₁dlic + β₂tax + β₃inc + β₄road + ϵ.


The decision rule is to reject H₀ if
F = [R(β₃, β₄ | β₀, β₁, β₂)/2] / [SSres/(n − p)] > F_{2,n−p,0.05},
where
R(β₃, β₄ | β₀, β₁, β₂) = R(β₃ | β₀, β₁, β₂) + R(β₄ | β₀, β₁, β₂, β₃) = 61408 + 34573 = 95981.

Hence, F = (95981/2)/4211 = 11.40 > F_{2,46,0.05} = 3.20. Therefore we reject H₀ at α = 0.05. Including the variables inc and road significantly improves the model. The p-value is P(F_{2,46} > 11.40) = 9.53×10⁻⁵.
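The sequential sums of squares in the table above reproduce both the extra sum of squares and the F statistic (a Python sketch of the arithmetic in this example):

```python
# Reproduce the sequential sum-of-squares test for the fuel example.
seq = {"dlic": 86854, "tax": 19159, "inc": 61408, "road": 34573}
ssres, n, p = 193700, 51, 5

ssreg = sum(seq.values())               # SSreg as the sum of sequential SS
extra = seq["inc"] + seq["road"]        # R(b3, b4 | b0, b1, b2)
msres = ssres / (n - p)                 # mean square residual
f_stat = (extra / 2) / msres            # F statistic with q = 2

print(ssreg)                            # 201994
print(extra)                            # 95981
print(round(f_stat, 2))                 # 11.4
```

Note that summing all four sequential sums of squares recovers SSreg = 201994 from the earlier ANOVA table.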

To compare the linear models:

  • H₀: Y = β₀ + β₁dlic + β₃inc + β₄road + ϵ;
  • H₁: Y = β₀ + β₁dlic + β₂tax + β₃inc + β₄road + ϵ

is equivalent to testing the hypotheses H₀: β₂ = 0 vs. H₁: β₂ ≠ 0.

The residual sums-of-squares under H₀ and H₁ are 211964 and 193700, respectively. Since the residual degrees of freedom under the full model is 46, the F-statistic is:
F = (211964 − 193700) / (193700/46) = 4.34.

The p-value is P(F_{1,46} > 4.34) = 0.0429, which coincides with the p-value for testing β₂ = 0 in Section 21.2, Example 1 (Fuel Consumption).

Task: Lab 12

Attempt the R Markdown file for Lab 12:
Lab 12: Linear Models II

Student Exercises

Attempt the exercises below.

Question 1.

The following R output is from the analysis of 43 years of weather records in California. 10 values denoted {i?} for i=1,,10 have been removed. What are the 10 missing values?

Call:
lm(formula = BSAAM ~ APMAM + OPRC)

Residuals:
     Min       1Q   Median       3Q      Max
-21893.1  -6742.5   -654.1   6725.7  27061.8

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16703.9     5033.7   3.318  0.00194 **
APMAM          815.0      501.6   1.625  0.11206
OPRC          4589.7      309.0    {1?}     {3?} {4?}
---
Signif. codes:   0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 9948 on {2?} degrees of freedom
Multiple R-squared: {7?},     Adjusted R-squared: 0.848
F-statistic: {8?} on {9?} and {10?} DF,  p-value: < 2.2e-16

Analysis of Variance Table

Response: BSAAM

          Df     Sum Sq    Mean Sq F value    Pr(>F)
APMAM      1 1.5567e+09 1.5567e+09  15.730 0.0002945 ***
OPRC       1 2.1836e+10 2.1836e+10    {6?} < 2.2e-16 ***
Residuals 40 3.9586e+09       {5?}
---
Signif. codes: 0   *** 0.001 ** 0.01 * 0.05 . 0.1 1
Solution to Question 1.
  1. t = β̂_j/SE(β̂_j) = 4589.7/309.0 = 14.853.
  2. df = n − p = 43 − 3 = 40.
  3. p = P(|t₄₀| > 14.853) = 2P(t₄₀ > 14.853) < 0.001.
  4. p-values less than 0.001 have significance code ***.
  5. MSres = SSres/(n − p) = 3.9586×10⁹/40 = 9.8965×10⁷.
  6. F = [R(β₂ | β₀, β₁)/q] / [SSres/(n − p)] = 2.1836×10¹⁰/(9.8965×10⁷) = 220.64.
  7. SSreg = 1.5567×10⁹ + 2.1836×10¹⁰ = 2.3393×10¹⁰ and
     SStot = SSreg + SSres = 2.3393×10¹⁰ + 3.9586×10⁹ = 2.7351×10¹⁰, so R² = SSreg/SStot = 2.3393×10¹⁰/2.7351×10¹⁰ = 0.855.
  8. F = [(D₀ − D₁)/(p − 1)] / [D₁/(n − p)] = [(2.7351×10¹⁰ − 3.9586×10⁹)/2] / [3.9586×10⁹/40] = 118.19.
  9. p − 1 = 2.
  10. n − p = 40.
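The missing values can be checked with a few lines of arithmetic (a Python sketch of the solution's computations):

```python
# Reconstruct the missing values in the R output for Question 1.
ss_apmam, ss_oprc = 1.5567e9, 2.1836e10      # sequential sums of squares
ssres, n, p = 3.9586e9, 43, 3

t = 4589.7 / 309.0                           # {1?}: t value for OPRC
df = n - p                                   # {2?}: residual df
msres = ssres / df                           # {5?}: mean square residual
f_oprc = ss_oprc / msres                     # {6?}: sequential F for OPRC
ssreg = ss_apmam + ss_oprc
sstot = ssreg + ssres
r2 = ssreg / sstot                           # {7?}: multiple R-squared
f_reg = (ssreg / (p - 1)) / msres            # {8?}: overall F statistic

print(round(t, 3), df)                                   # 14.853 40
print(round(f_oprc, 2), round(r2, 3), round(f_reg, 2))   # 220.64 0.855 118.19
```

Note that the overall F in this two-predictor model differs from the sequential F for OPRC, since the former pools both sequential sums of squares.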


Question 2.

An experiment was conducted to find out how long is a piece of string. Six pieces of string were measured along with their colour.
Length   Colour
9        Orange
28       Grey
8        Pink
31       Grey
6        Pink
11       Orange
  1. Write down an appropriate model to test for an association between colour and the length of string. Hence write down the design matrix.
  2. Find the least squares estimates for the parameters in your model. You may find the following inverse helpful:
    [6 2 2; 2 2 0; 2 0 2]⁻¹ = ½ [1 −1 −1; −1 2 1; −1 1 2].
  3. Find the fitted values and residuals for your model.
  4. Calculate the ANOVA table and then use it to test the hypothesis that colour affects the length of string.
Solution to Question 2.
  1. An appropriate model is
    yᵢ = α + ϵᵢ           if the string is orange,
    yᵢ = α + β + ϵᵢ       if the string is grey,
    yᵢ = α + γ + ϵᵢ       if the string is pink.
    Hence the design matrix for y = (9, 28, 8, 31, 6, 11)ᵀ and β = (α, β, γ)ᵀ is, with one row per observation,
    Z = [1 0 0; 1 1 0; 1 0 1; 1 1 0; 1 0 1; 1 0 0].
  2. The least squares estimate is β̂ = (ZᵀZ)⁻¹Zᵀy where
    ZᵀZ = [6 2 2; 2 2 0; 2 0 2]   and   Zᵀy = (93, 59, 14)ᵀ.
    Hence
    β̂ = (ZᵀZ)⁻¹Zᵀy = ½ [1 −1 −1; −1 2 1; −1 1 2] (93, 59, 14)ᵀ = (10, 19.5, −3)ᵀ.
  3. The fitted values and residuals are
    ŷ = Zβ̂ = (10, 29.5, 7, 29.5, 7, 10)ᵀ   and   ϵ̂ = y − ŷ = (−1, −1.5, 1, 1.5, −1, 1)ᵀ
    respectively.
  4. There are n = 6 observations and p = 3 parameters. The sums of squares are
    SStot = (n − 1)Var(y) = ∑_{i=1}^n yᵢ² − (∑_{i=1}^n yᵢ)²/n = 2047 − 93²/6 = 605.5,
    SSres = ϵ̂ᵀϵ̂ = (−1)² + (−1.5)² + 1² + 1.5² + (−1)² + 1² = 8.5,
    SSreg = SStot − SSres = 605.5 − 8.5 = 597.
    Hence the ANOVA table is
    Source      Df   Sum Sq   Mean Sq   F
    Regression  2    597.0    298.500   105.35
    Residual    3    8.5      2.833
    Total       5    605.5
    The test for the existence of regression has statistic F = 105.35 and critical value F_{2,3,0.05} = 9.55. Hence, we reject the null hypothesis and conclude that colour does affect the length of a piece of string.
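Because the model in Question 2 is a one-way classification, the least squares estimates are simply group means, which gives an independent check of the solution (a Python sketch; the group-mean route replaces the matrix algebra above):

```python
# Cross-check Question 2 via group means: for a one-way classification,
# least squares fits each colour group by its sample mean.
data = [(9, "Orange"), (28, "Grey"), (8, "Pink"),
        (31, "Grey"), (6, "Pink"), (11, "Orange")]
lengths = [length for length, _ in data]
n, p = len(data), 3

def mean(xs):
    return sum(xs) / len(xs)

group_mean = {c: mean([l for l, col in data if col == c])
              for c in ("Orange", "Grey", "Pink")}
alpha = group_mean["Orange"]           # estimate of alpha
beta = group_mean["Grey"] - alpha      # estimate of beta
gamma = group_mean["Pink"] - alpha     # estimate of gamma

fitted = [group_mean[col] for _, col in data]
resid = [l - f for l, f in zip(lengths, fitted)]

ybar = mean(lengths)
sstot = sum((l - ybar) ** 2 for l in lengths)    # total sum of squares
ssres = sum(r ** 2 for r in resid)               # residual sum of squares
ssreg = sstot - ssres                            # regression sum of squares
f_stat = (ssreg / (p - 1)) / (ssres / (n - p))   # F statistic

print(alpha, beta, gamma)                        # 10.0 19.5 -3.0
print(sstot, ssres, ssreg, round(f_stat, 2))     # 605.5 8.5 597.0 105.35
```

The estimates (10, 19.5, −3) and the sums of squares agree with the matrix calculation in the solution.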