Chapter 22 ANOVA Tables and F Tests

22.1 Introduction

In this section we will focus on the sources of variation in a linear model and how these can be used to determine the most appropriate linear model. We start in Section 22.2 by considering the residuals of the linear model and their properties. In Section 22.3, we introduce the total sum-of-squares, $SS_{tot}$, which is a measure of the total amount of variation in the model. It comprises two components: the regression sum-of-squares, $SS_{reg}$, which measures the variability in the observations that is captured by the model, and the residual sum-of-squares, $SS_{res}$, which measures the unexplained variability in the observations. In Section 22.4, we introduce ANOVA tables for summarising variability in the model and testing null hypotheses. In particular, we consider the Fuel Consumption example, introduced in Section 21.2, and show how the conclusions obtained in Section 21.4 can be presented in the form of an ANOVA table. In Section 22.5, we consider two linear models for a given data set, where one model is nested within the other. This allows us to compare two models which lie between the full model (includes all variables) and the null model (excludes all variables). Finally, in Section 22.6 we extend the comparison of nested models to sequential sums-of-squares to find the most appropriate model out of a range of nested linear models.

22.2 The residuals

Consider the linear model
$$y = Z\beta + \epsilon.$$

Recall the following model notation:

  • $E[\epsilon] = 0$;
  • $\mathrm{Var}(\epsilon) = \sigma^2 I_n$;
  • $\hat\beta = (Z^T Z)^{-1} Z^T y$ is the LSE of $\beta$;
  • $\hat y = Z\hat\beta$ is the $n \times 1$ vector of fitted values;
  • $\hat y = Py$, where $P = Z(Z^T Z)^{-1} Z^T$.
Let $r = \hat\epsilon = y - \hat y$ be the $n \times 1$ vector of residuals. Note that
$$r = \hat\epsilon = y - \hat y = y - Z\hat\beta = y - Py = (I_n - P)y,$$

where $I_n - P$ is symmetric, idempotent and has $\mathrm{trace}(I_n - P) = \mathrm{rank}(I_n - P) = n - p$. Note that $\mathrm{rank}(I_n - P) = n - p$ denotes the degrees of freedom of the residuals and is equal to the number of observations, $n$, minus the number of coefficients (parameters), $p$.


Theorem 22.2.1
The vector of fitted values is orthogonal to the vector of residuals, that is,
$$\hat y^T \hat\epsilon = \hat\epsilon^T \hat y = 0.$$

Details of the proof can be omitted.

Proof of Theorem 22.2.1.
$$\hat y^T \hat\epsilon = (Py)^T (I_n - P) y = y^T P^T (I_n - P) y = y^T P^T y - y^T P^T P y = y^T P y - y^T P P y = y^T P y - y^T P y = 0,$$

using that $P$ is symmetric ($P^T = P$) and idempotent ($P^2 = P$).

Therefore, $y$ can be written as the sum of two orthogonal vectors:
$$y = \hat y + \hat\epsilon.$$

The normal linear model assumes $\epsilon \sim N(0, \sigma^2 I_n)$. We would expect the sample residuals, $\hat\epsilon$, to exhibit many of the properties of the error terms. The properties of $\hat\epsilon$ can be explored via graphical methods as in Section 16.6 and Lab 9: Linear Models.
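
These algebraic properties are easy to check numerically. Below is a minimal R sketch (the data are simulated purely for illustration) that builds the hat matrix $P$ and verifies idempotency, the trace and the orthogonality of $\hat y$ and $\hat\epsilon$:

```r
set.seed(1)
n <- 20
x <- runif(n)
Z <- cbind(1, x)                           # n x 2 design matrix (intercept + slope)
y <- 2 + 3 * x + rnorm(n)                  # simulated response

P     <- Z %*% solve(t(Z) %*% Z) %*% t(Z)  # hat matrix P = Z (Z'Z)^{-1} Z'
y_hat <- P %*% y                           # fitted values
r     <- (diag(n) - P) %*% y               # residuals (I_n - P) y

all.equal(P %*% P, P)                      # P is idempotent
sum(diag(diag(n) - P))                     # trace(I_n - P) = n - p = 18
drop(t(y_hat) %*% r)                       # orthogonality: zero up to rounding
```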

22.3 Sums of squares

Let $y_i$ be the $i$th observation, $\hat y_i$ be the $i$th fitted value and $\bar y$ be the mean of the observed values.

Model Deviance

The model deviance is given by
$$D = \sum_{i=1}^n (y_i - \hat y_i)^2.$$
We have
$$(y_i - \bar y) = (y_i - \hat y_i) + (\hat y_i - \bar y),$$
$$(y_i - \bar y)^2 = \left[(y_i - \hat y_i) + (\hat y_i - \bar y)\right]^2 = (y_i - \hat y_i)^2 + (\hat y_i - \bar y)^2 + 2(y_i - \hat y_i)(\hat y_i - \bar y),$$
$$\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2 + 2\sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y).$$
Now,
$$\sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y) = \sum_{i=1}^n (y_i - \hat y_i)\hat y_i - \sum_{i=1}^n (y_i - \hat y_i)\bar y = \sum_{i=1}^n \hat\epsilon_i \hat y_i - \bar y \sum_{i=1}^n \hat\epsilon_i = \hat\epsilon^T \hat y - \bar y \sum_{i=1}^n \hat\epsilon_i = 0 - 0 = 0,$$
using that $\hat\epsilon$ and $\hat y$ are orthogonal, and that $\sum_{i=1}^n \hat\epsilon_i = 0$ is one of the normal equations. Therefore,
$$\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2.$$


Total sum of squares

Define $SS_{tot} = \sum_{i=1}^n (y_i - \bar y)^2$ as the total sum of squares. This is proportional to the total variability in $y$ since $SS_{tot} = (n-1)\mathrm{Var}(y)$. It does not depend on the choice of predictor variables in $Z$.


Residual sum of squares

Define $SS_{res} = \sum_{i=1}^n (y_i - \hat y_i)^2$ as the residual sum of squares. This is a measure of the amount of variability in $y$ the model was unable to explain. It is equivalent to the deviance of the model, that is, $SS_{res} = D$.


Regression sum of squares
Define $SS_{reg} = \sum_{i=1}^n (\hat y_i - \bar y)^2$ as the regression sum of squares. This is the difference between $SS_{tot}$ and $SS_{res}$ and is a measure of the amount of variability in $y$ the model was able to explain.

From our above derivations, note
$$\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2 \quad \Longleftrightarrow \quad SS_{tot} = SS_{res} + SS_{reg}.$$
Coefficient of determination

The coefficient of determination is
$$R^2 = \frac{SS_{reg}}{SS_{tot}}.$$

The coefficient of determination measures the proportion of variability explained by the regression. Additionally note that:

  • $0 \le R^2 \le 1$;
  • $R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}$;
  • $R^2$ is often used as a measure of how well the regression model fits the data: the larger $R^2$ is, the better the fit. One needs to be careful in interpreting what “large” is on a case-by-case basis.
Adjusted $R^2$

The adjusted $R^2$ is
$$R^2_{adj} = 1 - \frac{SS_{res}/(n-p)}{SS_{tot}/(n-1)}.$$

The adjusted $R^2$ is often used to compare the fit of models with different numbers of parameters.

Under the null hypothesis model, $Y_i = \beta_0 + \epsilon_i$ and $\hat y_i = \bar y$. In this special case,
$$SS_{tot} = SS_{res} = D, \qquad SS_{reg} = 0, \qquad R^2 = R^2_{adj} = 0.$$
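
In R, these sums of squares can be extracted directly from any fitted linear model. A minimal sketch, reusing the simulated y and x from the sketch in Section 22.2:

```r
fit <- lm(y ~ x)

SS_tot <- sum((y - mean(y))^2)            # equals (n - 1) * var(y)
SS_res <- sum(residuals(fit)^2)           # equals deviance(fit)
SS_reg <- sum((fitted(fit) - mean(y))^2)

all.equal(SS_tot, SS_res + SS_reg)        # the decomposition SStot = SSres + SSreg
SS_reg / SS_tot                           # R^2, matching summary(fit)$r.squared
n <- length(y); p <- length(coef(fit))
1 - (SS_res / (n - p)) / (SS_tot / (n - 1))   # adjusted R^2
```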

22.4 Analysis of Variance (ANOVA)

Recall from Section 21.4 that the $F$ statistic used in the test for the existence of regression is
$$F = \frac{(D_0 - D_1)/(p-1)}{D_1/(n-p)},$$
where $D_1$ and $D_0$ are the model deviances (or $SS_{res}$) under the alternative and null hypotheses, respectively. We noted above that $D_0$, the deviance under the null hypothesis, is equivalent to $SS_{tot}$ under any model.

Mean square regression

The mean square regression is the numerator in the $F$ statistic,
$$MS_{reg} = \frac{D_0 - D_1}{p-1} = \frac{SS_{tot} - SS_{res}}{p-1} = \frac{SS_{reg}}{p-1}.$$
Mean square residual

The mean square residual is the denominator in the $F$ statistic,
$$MS_{res} = \frac{D_1}{n-p} = \frac{SS_{res}}{n-p}.$$

Note the mean square residual is an unbiased estimator of $\sigma^2$. The residual standard error, $RSE = \sqrt{MS_{res}}$, is the corresponding estimate of $\sigma$.

The quantities involved in the calculation of the FF statistic are usually displayed in an ANOVA table:

| Source | Degrees of Freedom | Sum of Squares | Mean Square | F Statistic |
|---|---|---|---|---|
| Regression | $p-1$ | $SS_{reg}$ | $MS_{reg} = \dfrac{SS_{reg}}{p-1}$ | $F = \dfrac{MS_{reg}}{MS_{res}}$ |
| Residual | $n-p$ | $SS_{res}$ | $MS_{res} = \dfrac{SS_{res}}{n-p}$ | |
| Total | $n-1$ | $SS_{tot}$ | | |

Fuel consumption (continued)   For the data in Section 21.2, Example 21.2.1 (Fuel Consumption), the model
$$\text{fuel} = \beta_0 + \beta_1 \text{dlic} + \beta_2 \text{tax} + \beta_3 \text{inc} + \beta_4 \text{road}$$
was fitted to the $n = 51$ observations with residual standard error $RSE = 64.8912$. Summary statistics show $\mathrm{Var}(\text{fuel}) = 7913.88$. Complete an ANOVA table and compute $R^2$ for the fitted model.

We have

  • Note $p - 1 = 4$, $n - p = 46$ and $n - 1 = 50$;
  • $SS_{tot} = (n-1)\mathrm{Var}(\text{fuel}) = 50 \times 7913.88 = 395694$;
  • $MS_{res} = RSE^2 = 64.8912^2 = 4210.87$;
  • $SS_{res} = (n-p)\,MS_{res} = 46 \times 4210.87 = 193700$;
  • $SS_{reg} = SS_{tot} - SS_{res} = 395694 - 193700 = 201994$;
  • $MS_{reg} = SS_{reg}/(p-1) = 201994/4 = 50498.50$;
  • $F = MS_{reg}/MS_{res} = 50498.5/4210.87 = 11.99$.

Hence the completed ANOVA table is

| Source | Degrees of Freedom | Sum of Squares | Mean Square | F statistic |
|---|---|---|---|---|
| Regression | 4 | 201994 | 50498.50 | 11.99 |
| Residual | 46 | 193700 | 4210.87 | |
| Total | 50 | 395694 | | |

We compare the computed $F$-statistic, 11.99, with an $F_{p-1,n-p} = F_{4,46}$ distribution to obtain a $p$-value of $9.331 \times 10^{-7} = P(F_{4,46} > 11.99)$. That is, if the null hypothesis (no regression, $\beta_1 = \cdots = \beta_4 = 0$) were true, there is probability $9.331 \times 10^{-7}$ (just under one in a million) of observing an $F$-statistic larger than 11.99.

Finally, $R^2 = \dfrac{SS_{reg}}{SS_{tot}} = \dfrac{201994}{395694} = 0.5105$.
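
The whole calculation can be reproduced in R from the two quoted summary statistics alone:

```r
n <- 51; p <- 5
RSE      <- 64.8912
var_fuel <- 7913.88

SS_tot <- (n - 1) * var_fuel      # 395694
MS_res <- RSE^2                   # 4210.87
SS_res <- (n - p) * MS_res        # 193700
SS_reg <- SS_tot - SS_res         # 201994
MS_reg <- SS_reg / (p - 1)        # 50498.50
F_stat <- MS_reg / MS_res         # 11.99
SS_reg / SS_tot                   # R^2 = 0.5105
pf(F_stat, p - 1, n - p, lower.tail = FALSE)   # p-value, about 9.3e-07
```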

22.5 Comparing models

Consider two models, $M_1$ and $M_2$, where $M_2$ is a simplification of $M_1$. For example,
$$M_1: Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \epsilon,$$
$$M_2: Y = \beta_0 + \beta_2 X_2 + \beta_4 X_4 + \epsilon.$$

The residual sum of squares from model $M_1$ will always be less than that of $M_2$, but we can test the hypotheses $H_0: \beta_1 = \beta_3 = 0$ vs. $H_1: \beta_1 \neq 0$ or $\beta_3 \neq 0$ at significance level $\alpha$ to test whether removing these terms significantly increases the residual sum of squares.

Let $D_1 = \sum_{i=1}^n (y_i - \hat y_i)^2$ be the model deviance, or $SS_{res}$, for model $M_1$.

Let $D_2 = \sum_{i=1}^n (y_i - \hat y_i)^2$ be the model deviance, or $SS_{res}$, for model $M_2$, where in each case the fitted values $\hat y_i$ are computed under the respective model.

The decision rule is to reject $H_0$ if
$$F = \frac{(D_2 - D_1)/q}{D_1/(n-p)} > F_{q,n-p,\alpha},$$
where $n$ is the number of observations, $p$ is the number of parameters in $M_1$ and $q$ is the number of parameters that are fixed to reduce $M_1$ to $M_2$.

For the example above, $p = 5$ and $q = 2$.

Watch Video 32 for a walk-through of comparing models using the $F$-distribution. The video uses Example 22.5.1 (Fuel consumption, continued) given below.

Video 32: Model comparison.

Fuel consumption (continued)
Let Model 1 be
$$\text{fuel} = \beta_0 + \beta_1 \text{dlic} + \beta_2 \text{tax} + \beta_3 \text{inc} + \beta_4 \text{road},$$
and let Model 2 be
$$\text{fuel} = \beta_0 + \beta_1 \text{dlic} + \beta_3 \text{inc}.$$
The residual sum of squares is 193700 for Model 1 and 249264 for Model 2. Test which model fits the data better.

The question is equivalent to testing the hypotheses:
$$H_0: \beta_2 = \beta_4 = 0 \quad \text{vs.} \quad H_1: \beta_2 \neq 0 \text{ or } \beta_4 \neq 0,$$

at $\alpha = 0.05$. The decision rule is to reject $H_0$ if

$$F = \frac{(D_2 - D_1)/q}{D_1/(n-p)} > F_{q,n-p,\alpha} = F_{2,46,0.05} = 3.20.$$

Substituting in the data gives,

$$F = \frac{(249264 - 193700)/2}{193700/(51-5)} = 6.598.$$

Consequently, we reject $H_0$. Model 1 fits the data better at $\alpha = 0.05$.

We note that the $p$-value is $0.0030 = P(F_{2,46} > 6.598)$, which gives very strong support in favour of Model 1.
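
In R, the same nested-model test is produced by passing both fits to anova(). A sketch, assuming the fuel data live in a data frame; the name fuel_df is hypothetical:

```r
m1 <- lm(fuel ~ dlic + tax + inc + road, data = fuel_df)  # Model 1 (full)
m2 <- lm(fuel ~ dlic + inc,              data = fuel_df)  # Model 2 (reduced)
anova(m2, m1)   # reports F = ((D2 - D1)/q) / (D1/(n - p)) on (2, 46) df

# Equivalently, by hand from the residual sums of squares quoted above:
F_stat <- ((249264 - 193700) / 2) / (193700 / 46)         # 6.598
pf(F_stat, 2, 46, lower.tail = FALSE)                     # p-value, about 0.003
```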


Let’s consider the more general case where the basic model $M_1$ is
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} + \epsilon.$$
Denote
$$SS_{reg}(M_1) = \sum_{i=1}^n (\hat y_i - \bar y)^2 = R(\beta_1, \beta_2, \ldots, \beta_{p-1} \mid \beta_0),$$

assuming there is a constant in the model.

Our goal is to build a regression model which best describes the response variable. Hence we would like to explain as much of the variance in YY as possible, yet keep the model as simple as possible. This is known as the Principle of Parsimony. Consequently we want to determine which explanatory variables are worthwhile to include in the final model.

The idea is that explanatory variables should be included in the model if the extra portion of the regression sum of squares which arises from their inclusion, called the extra sum of squares, is large relative to the unexplained variability in the model, the residual sum of squares.

Consider a second model $M_2$ which is a simplification of $M_1$, specifically
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{k-1} X_{k-1} + \epsilon,$$
where $k < p$. Then
$$SS_{reg}(M_2) = R(\beta_1, \beta_2, \ldots, \beta_{k-1} \mid \beta_0).$$
Extra sum of squares
The extra sum of squares due to the inclusion of the terms $\beta_k X_k + \cdots + \beta_{p-1} X_{p-1}$ in the model is
$$SS_{reg}(M_1) - SS_{reg}(M_2).$$
It is denoted
$$R(\beta_k, \ldots, \beta_{p-1} \mid \beta_0, \beta_1, \ldots, \beta_{k-1}) = R(\beta_1, \beta_2, \ldots, \beta_{p-1} \mid \beta_0) - R(\beta_1, \beta_2, \ldots, \beta_{k-1} \mid \beta_0).$$

The extra sum of squares has $q = p - k$ degrees of freedom, where $q$ is the number of explanatory variables added to the reduced model to make the full model.

We can test the hypotheses:

  • H0: The reduced model, M2, best describes the data;
  • H1: The full model, M1, best describes the data.
The decision rule is to reject $H_0$ if
$$F = \frac{R(\beta_k, \ldots, \beta_{p-1} \mid \beta_0, \ldots, \beta_{k-1})/q}{SS_{res}(M_1)/(n-p)} > F_{q,n-p,\alpha}.$$

Rejecting H0 implies the full model describes the data better, so we should include the variables Xk,,Xp1 in our model.

The test for the existence of regression is a special case of this type of test, where
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0,$$

that is, the reduced model is $Y = \beta_0 + \epsilon$. Note that $SS_{reg}(M_1) = R(\beta_1, \beta_2, \ldots, \beta_{p-1} \mid \beta_0)$ is the extra sum of squares in this case.

22.6 Sequential sum of squares

Sequential sum of squares
The sequential sum of squares for each $j$, denoted $SS_{seq_j}$, is
$$R(\beta_j \mid \beta_0, \beta_1, \ldots, \beta_{j-1}) = R(\beta_1, \beta_2, \ldots, \beta_j \mid \beta_0) - R(\beta_1, \beta_2, \ldots, \beta_{j-1} \mid \beta_0),$$

and is the extra sum of squares that one incurs by adding the explanatory variable $X_j$ to the model given that $X_1, \ldots, X_{j-1}$ are already present.

The sequential sum of squares is often given in addition to the basic ANOVA table.

| Source | Degrees of Freedom | Sequential Sum of Squares | Mean Square | F statistic |
|---|---|---|---|---|
| $X_1$ | $df_1$ | $SS_{seq_1}$ | $MS_{seq_1} = \dfrac{SS_{seq_1}}{df_1}$ | $F = \dfrac{MS_{seq_1}}{MS_{res}}$ |
| $X_2$ | $df_2$ | $SS_{seq_2}$ | $MS_{seq_2} = \dfrac{SS_{seq_2}}{df_2}$ | $F = \dfrac{MS_{seq_2}}{MS_{res}}$ |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| $X_{p-1}$ | $df_{p-1}$ | $SS_{seq_{p-1}}$ | $MS_{seq_{p-1}} = \dfrac{SS_{seq_{p-1}}}{df_{p-1}}$ | $F = \dfrac{MS_{seq_{p-1}}}{MS_{res}}$ |
| Residuals | $n-p$ | $SS_{res}$ | $MS_{res} = \dfrac{SS_{res}}{n-p}$ | |
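
In R, anova() applied to a single fitted model returns exactly this table: the sequential (Type I) sums of squares in the order the terms appear in the model formula. A sketch, with hypothetical names dat, y, x1, x2, x3:

```r
fit <- lm(y ~ x1 + x2 + x3, data = dat)
anova(fit)   # rows: R(b1|b0), R(b2|b0,b1), R(b3|b0,b1,b2), then the residual line

# The sequential sums of squares depend on the order of the terms:
anova(lm(y ~ x3 + x2 + x1, data = dat))   # generally a different decomposition
```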
Note that given the sequential sums of squares, one can calculate
$$R(\beta_j, \beta_{j+1}, \ldots, \beta_k \mid \beta_0, \beta_1, \ldots, \beta_{j-1}) = \sum_{i=j}^k SS_{seq_i}.$$
However, one cannot calculate non-sequential sums of squares in this manner, such as
$$R(\beta_1, \beta_3, \beta_5 \mid \beta_0, \beta_2, \beta_4).$$
Note that in the fuel example
$$SS_{reg} = R(\beta_1 \mid \beta_0) + R(\beta_2 \mid \beta_0, \beta_1) + R(\beta_3 \mid \beta_0, \beta_1, \beta_2) + R(\beta_4 \mid \beta_0, \beta_1, \beta_2, \beta_3).$$

Partial sum of squares
The partial sum of squares for each $j$ is
$$R(\beta_j \mid \beta_0, \beta_1, \ldots, \beta_{j-1}, \beta_{j+1}, \ldots, \beta_{p-1}) = R(\beta_1, \beta_2, \ldots, \beta_{p-1} \mid \beta_0) - R(\beta_1, \ldots, \beta_{j-1}, \beta_{j+1}, \ldots, \beta_{p-1} \mid \beta_0),$$
and is the extra sum of squares that one incurs by adding the explanatory variable $X_j$ to the model given that $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_{p-1}$ are already present.

Note that the $F$ test for testing the hypotheses:
$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0,$$

at level $\alpha$, is equivalent to the $t$ test for the individual parameter since $t^2_{n-p} = F_{1,n-p}$.
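
This equivalence is easy to verify numerically in R, for example with the tax test from the fuel example below, where $t = \sqrt{4.34} \approx 2.083$ on 46 degrees of freedom:

```r
t_val <- sqrt(4.34); df <- 46

2 * pt(abs(t_val), df, lower.tail = FALSE)   # two-sided t test p-value
pf(t_val^2, 1, df, lower.tail = FALSE)       # 1-df F test p-value: identical

qt(0.975, df)^2    # squared t critical value ...
qf(0.95, 1, df)    # ... equals the F critical value
```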

Fuel consumption (continued)   For the data in Section 21.2, Example 21.2.1 (Fuel Consumption), we have the following ANOVA output table:

| Source | Degrees of Freedom | Sequential Sum of Squares | Mean Square |
|---|---|---|---|
| dlic | 1 | 86854 | 86854 |
| tax | 1 | 19159 | 19159 |
| inc | 1 | 61408 | 61408 |
| road | 1 | 34573 | 34573 |
| Residuals | 46 | 193700 | 4211 |

We want to test the following hypotheses:

  • $H_0: Y = \beta_0 + \beta_1 \text{dlic} + \beta_2 \text{tax} + \epsilon$;
  • $H_1: Y = \beta_0 + \beta_1 \text{dlic} + \beta_2 \text{tax} + \beta_3 \text{inc} + \beta_4 \text{road} + \epsilon$.


The decision rule is to reject H0 if
$$F = \frac{R(\beta_3, \beta_4 \mid \beta_0, \beta_1, \beta_2)/2}{SS_{res}/(n-p)} > F_{2,n-p,0.05},$$
where
$$R(\beta_3, \beta_4 \mid \beta_0, \beta_1, \beta_2) = R(\beta_3 \mid \beta_0, \beta_1, \beta_2) + R(\beta_4 \mid \beta_0, \beta_1, \beta_2, \beta_3) = 61408 + 34573 = 95981.$$

Hence, $F = \dfrac{95981/2}{4211} = 11.40 > F_{2,46,0.05} = 3.20$. Therefore we reject $H_0$ at $\alpha = 0.05$. Including the variables inc and road significantly improves the model. The $p$-value is $P(F_{2,46} > 11.40) = 9.53 \times 10^{-5}$.

To compare the linear models:

  • $H_0: Y = \beta_0 + \beta_1 \text{dlic} + \beta_3 \text{inc} + \beta_4 \text{road} + \epsilon$;
  • $H_1: Y = \beta_0 + \beta_1 \text{dlic} + \beta_2 \text{tax} + \beta_3 \text{inc} + \beta_4 \text{road} + \epsilon$;

we test the equivalent hypotheses $H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$.

The residual sums-of-squares under $H_0$ and $H_1$ are 211964 and 193700, respectively. Since the residual degrees of freedom under the full model is 46, the $F$-statistic is:
$$F = \frac{211964 - 193700}{193700/46} = 4.34.$$

The $p$-value is $P(F_{1,46} > 4.34) = 0.0429$, which coincides with the $p$-value for testing $\beta_2 = 0$ in Section 21.2, Example 21.2.1 (Fuel Consumption).
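
Both tests in this example can be reproduced in R directly from the numbers quoted above:

```r
# Extra sum of squares for inc and road, given dlic and tax:
R_extra <- 61408 + 34573                    # 95981
F_joint <- (R_extra / 2) / 4211             # 11.40
pf(F_joint, 2, 46, lower.tail = FALSE)      # about 9.5e-05

# Single-term test for tax from the two residual sums of squares:
F_tax <- (211964 - 193700) / (193700 / 46)  # 4.34
pf(F_tax, 1, 46, lower.tail = FALSE)        # about 0.043
```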

Task: Session 12

Attempt the R Markdown file for Session 12:
Session 12: Linear Models II

Student Exercises

Attempt the exercises below.


Exercise 22.1. The following R output is from the analysis of 43 years of weather records in California. 10 values denoted {i?} for $i = 1, \ldots, 10$ have been removed. What are the 10 missing values?

Call:
lm(formula = BSAAM ~ APMAM + OPRC)

Residuals:
     Min       1Q   Median       3Q      Max
-21893.1  -6742.5   -654.1   6725.7  27061.8

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16703.9     5033.7   3.318  0.00194 **
APMAM          815.0      501.6   1.625  0.11206
OPRC          4589.7      309.0    {1?}     {3?} {4?}
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9948 on {2?} degrees of freedom
Multiple R-squared: {7?},     Adjusted R-squared: 0.848
F-statistic: {8?} on {9?} and {10?} DF,  p-value: < 2.2e-16

Analysis of Variance Table

Response: BSAAM

          Df     Sum Sq    Mean Sq F value    Pr(>F)
APMAM      1 1.5567e+09 1.5567e+09  15.730 0.0002945 ***
OPRC       1 2.1836e+10 2.1836e+10    {6?} < 2.2e-16 ***
Residuals 40 3.9586e+09       {5?}
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Solution to Exercise 22.1.
  1. $t = \dfrac{\hat\beta_j}{SE(\hat\beta_j)} = \dfrac{4589.7}{309.0} = 14.853$.
  2. $df = n - p = 43 - 3 = 40$.
  3. $p = P(|t_{40}| > 14.853) = 2P(t_{40} > 14.853) < 0.001$.
  4. $p$-values less than 0.001 have significance code ***.
  5. $MS_{res} = \dfrac{SS_{res}}{n-p} = \dfrac{3.9586 \times 10^9}{40} = 9.8965 \times 10^7$.
  6. $F = \dfrac{R(\beta_2 \mid \beta_0, \beta_1)/q}{SS_{res}/(n-p)} = \dfrac{2.1836 \times 10^{10}}{9.8965 \times 10^7} = 220.644$.
  7. $SS_{reg} = 1.5567 \times 10^9 + 2.1836 \times 10^{10} = 2.3393 \times 10^{10}$ and $SS_{tot} = SS_{reg} + SS_{res} = 2.3393 \times 10^{10} + 3.9586 \times 10^9 = 2.7351 \times 10^{10}$, so $R^2 = \dfrac{SS_{reg}}{SS_{tot}} = \dfrac{2.3393 \times 10^{10}}{2.7351 \times 10^{10}} = 0.855$.
  8. $F = \dfrac{(D_0 - D_1)/(p-1)}{D_1/(n-p)} = \dfrac{(2.7351 \times 10^{10} - 3.9586 \times 10^9)/2}{3.9586 \times 10^9/40} = 118.19$.
  9. $p - 1 = 2$.
  10. $n - p = 40$.
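
Each missing value can be checked with a line of R arithmetic (no data required, only the quoted output):

```r
4589.7 / 309.0                                  # {1?}: t = 14.853
2 * pt(4589.7 / 309.0, 40, lower.tail = FALSE)  # {3?}: p-value, well below 0.001
3.9586e9 / 40                                   # {5?}: MSres = 9.8965e7
2.1836e10 / (3.9586e9 / 40)                     # {6?}: F = 220.64

SS_reg <- 1.5567e9 + 2.1836e10                  # 2.3393e10
SS_reg / (SS_reg + 3.9586e9)                    # {7?}: R^2 = 0.855
(SS_reg / 2) / (3.9586e9 / 40)                  # {8?}: overall F = 118.19
```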



An experiment was conducted to find out how long is a piece of string. Six pieces of string were measured along with their colour.

| Length | Colour |
|---|---|
| 9 | Orange |
| 28 | Grey |
| 8 | Pink |
| 31 | Grey |
| 6 | Pink |
| 11 | Orange |
  1. Write down an appropriate model to test for an association between colour and the length of string. Hence write down the design matrix.
  2. Find the least squares estimates for the parameters in your model. You may find the following inverse helpful:
     $$\begin{pmatrix} 6 & 2 & 2 \\ 2 & 2 & 0 \\ 2 & 0 & 2 \end{pmatrix}^{-1} = \frac{1}{2}\begin{pmatrix} 1 & -1 & -1 \\ -1 & 2 & 1 \\ -1 & 1 & 2 \end{pmatrix}.$$
  3. Find the fitted values and residuals for your model.
  4. Calculate the ANOVA table and then use it to test the hypothesis that colour affects the length of string.
Solution to Question 2.
  1. An appropriate model is
     $$y_i = \begin{cases} \alpha + \epsilon_i & \text{if the string is orange,} \\ \alpha + \beta + \epsilon_i & \text{if the string is grey,} \\ \alpha + \gamma + \epsilon_i & \text{if the string is pink.} \end{cases}$$
     Hence the design matrix for $y = (9, 28, 8, 31, 6, 11)^T$ and $\beta = (\alpha, \beta, \gamma)^T$ is
     $$Z = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}.$$
  2. The least squares estimate is $\hat\beta = (Z^T Z)^{-1} Z^T y$, where
     $$Z^T Z = \begin{pmatrix} 6 & 2 & 2 \\ 2 & 2 & 0 \\ 2 & 0 & 2 \end{pmatrix} \quad \text{and} \quad Z^T y = \begin{pmatrix} 93 \\ 59 \\ 14 \end{pmatrix}.$$
     Hence
     $$\hat\beta = (Z^T Z)^{-1} Z^T y = \frac{1}{2}\begin{pmatrix} 1 & -1 & -1 \\ -1 & 2 & 1 \\ -1 & 1 & 2 \end{pmatrix}\begin{pmatrix} 93 \\ 59 \\ 14 \end{pmatrix} = \begin{pmatrix} 10 \\ 19.5 \\ -3 \end{pmatrix}.$$
  3. The fitted values and residuals are
     $$\hat y = Z\hat\beta = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}\begin{pmatrix} 10 \\ 19.5 \\ -3 \end{pmatrix} = \begin{bmatrix} 10 \\ 29.5 \\ 7 \\ 29.5 \\ 7 \\ 10 \end{bmatrix} \quad \text{and} \quad \hat\epsilon = y - \hat y = \begin{bmatrix} -1 \\ -1.5 \\ 1 \\ 1.5 \\ -1 \\ 1 \end{bmatrix},$$
     respectively.
  4. There are $n = 6$ observations and $p = 3$ parameters. The sums of squares are
     $$SS_{tot} = (n-1)\mathrm{Var}(y) = \sum_{i=1}^n y_i^2 - \frac{1}{n}\left(\sum_{i=1}^n y_i\right)^2 = 2047 - \frac{93^2}{6} = 605.5,$$
     $$SS_{res} = \hat\epsilon^T \hat\epsilon = (-1)^2 + (-1.5)^2 + 1^2 + 1.5^2 + (-1)^2 + 1^2 = 8.5,$$
     $$SS_{reg} = SS_{tot} - SS_{res} = 605.5 - 8.5 = 597.$$
     Hence the ANOVA table is

     | Source | Df | Sum Sq | Mean Sq | F |
     |---|---|---|---|---|
     | Regression | 2 | 597.0 | 298.500 | 105.35 |
     | Residual | 3 | 8.5 | 2.833 | |
     | Total | 5 | 605.5 | | |

     The test for the existence of regression has statistic $F = 105.35$ and critical value $F_{2,3,0.05} = 9.55$. Hence, we reject the null hypothesis and conclude that colour does affect the length of a piece of string.
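
The hand calculation can be verified in R by fitting the model directly; setting Orange as the baseline level makes the coefficients match $(\alpha, \beta, \gamma)$:

```r
string <- data.frame(
  length = c(9, 28, 8, 31, 6, 11),
  colour = factor(c("Orange", "Grey", "Pink", "Grey", "Pink", "Orange"),
                  levels = c("Orange", "Grey", "Pink"))
)

fit <- lm(length ~ colour, data = string)
coef(fit)        # (Intercept), colourGrey, colourPink = 10, 19.5, -3
fitted(fit)      # 10, 29.5, 7, 29.5, 7, 10
residuals(fit)   # -1, -1.5, 1, 1.5, -1, 1
anova(fit)       # F = 105.35 on 2 and 3 df
```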