4.8 The overall F test for regression
Doing t tests can give us information about individual predictors and whether they help explain the response. But there’s a bigger picture to think about: does anything in the model help explain the response? Is this model, frankly, any better than just guessing $\bar{y}$ for every prediction?
Hopefully, yes. But let’s check.
For any regression model, we could say that doing the regression reduces the total variation of y to the variation around the line. Let’s make this more precise with some new sums of squares notation:
SSTO or the Total Sum of Squares is the total variation of y around its mean. It’s the numerator of the sample variance of y – ignoring anything to do with x:
$$SSTO = S_{yy} = \sum_i (y_i - \bar{y})^2$$
Note that this is not just the sum of the $y_i$ squared: we’ve subtracted $\bar{y}$ first. Strictly speaking, this is the corrected total sum of squares, or $SSTO_c$. In this class, you can assume this is what we mean by SSTO, but be careful when you’re reading other sources. In the regression context, we don’t usually refer to this as “SSY” (even though it is equal to $S_{yy}$), because Y is the response and so gets special names for things.
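If you like seeing formulas as code, here’s a minimal R sketch (with a made-up vector `y`, not anything from our data) that computes SSTO from the definition and checks that it really is the numerator of the sample variance:

```r
# A made-up response vector, just for illustration
y <- c(4.1, 5.3, 2.8, 6.0, 4.9, 3.7)
n <- length(y)

# SSTO: total variation of y around its mean
SSTO <- sum((y - mean(y))^2)

# Sanity check: SSTO is the numerator of the sample variance,
# so it should equal (n - 1) * var(y)
all.equal(SSTO, (n - 1) * var(y))
```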
The next step is to break up (or decompose) this sum of squares into two pieces. One piece represents what our model does for us; the other piece represents what it fails to do for us – the residual. In other words, the distance of $y_i$ from the horizontal line $\bar{y}$ can be broken into two parts:
$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$
And since that’s true for each i, it’s true for the sum:
$$\sum_i (y_i - \bar{y}) = \sum_i (y_i - \hat{y}_i) + \sum_i (\hat{y}_i - \bar{y})$$
Now I claim:
$$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2$$
“But wait!” you say. “I know FOIL,” you say. “Fine, $\sum_i (y_i - \bar{y})^2 = \sum_i \big((y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\big)^2$, so I grant you the $\sum_i (y_i - \hat{y}_i)^2$ and the $\sum_i (\hat{y}_i - \bar{y})^2$. But there should be some irritating cross terms,” you say, and you write:
$$2\sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$$
You’re correct – but it turns out that’s 0. Here’s how to show it:
First, recall that $\hat{y}_i = \bar{y} + b_1(x_i - \bar{x})$ and $b_1 = \frac{S_{xy}}{S_{xx}}$.
Use the first equality to substitute for $\hat{y}_i$, and keep an eye out for $S_{xy}$ or $S_{xx}$ terms to simplify:
$$\begin{aligned}
\sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) &= \sum_i \big(y_i - \bar{y} - b_1(x_i - \bar{x})\big)\big(\bar{y} + b_1(x_i - \bar{x}) - \bar{y}\big) \\
&= \sum_i \big(y_i - \bar{y} - b_1(x_i - \bar{x})\big)\, b_1(x_i - \bar{x}) \\
&= b_1 \sum_i (y_i - \bar{y})(x_i - \bar{x}) - b_1^2 \sum_i (x_i - \bar{x})(x_i - \bar{x}) \\
&= b_1 S_{xy} - b_1^2 S_{xx} \\
&= b_1 (b_1 S_{xx}) - b_1^2 S_{xx} = 0
\end{aligned}$$
So, like I said:
$$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2$$
Incidentally, there’s no consensus on this notation. Options include:
- Total Sum of Squares: $SSTO = S_{yy} = SST = SSTot$ (the latter especially in experimental design, where T can stand for “treatment”)
- Regression Sum of Squares: SSR=SSReg, or sometimes SST or SSTr in experimental design
- Sum of Squared Errors (or Residual Sum of Squares): SSE. Very occasionally SSR, but please don’t adopt that notation now because it will be unbearably confusing.
For the moment, we’ll stick with SSTO, SSR, and SSE.
Now that breakdown is interesting. That first piece looks like the (squared) residuals again – we’d like that to be small. The second piece represents the (squared) differences between the naïve prediction (using a constant) and our shiny new prediction (using a line). Both pieces are sums of squared things, so let’s name them accordingly:
$$SSTO = SSE + SSR$$
These sums of squares are part of what shows up in R’s ANOVA output! Check out that “Sum Sq” column next time you do ANOVA in R.
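For instance, here’s a sketch using R’s built-in `cars` data (any simple regression would work): compute SSTO, SSE, and SSR from their definitions, check that they add up, and compare to the “Sum Sq” column from `anova()`.

```r
# Fit a simple regression on R's built-in cars data (dist ~ speed)
fit  <- lm(dist ~ speed, data = cars)
y    <- cars$dist
yhat <- fitted(fit)

# Sums of squares, straight from the definitions
SSTO <- sum((y - mean(y))^2)
SSE  <- sum((y - yhat)^2)
SSR  <- sum((yhat - mean(y))^2)

# The decomposition: SSTO = SSE + SSR
all.equal(SSTO, SSE + SSR)

# In anova()'s "Sum Sq" column, the speed row is SSR and
# the Residuals row is SSE
anova(fit)
```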
“But wait,” you say. “That’s not fair! Every time I get a new data point, it’s going to have a residual. And all of the terms in these sums are squares – they have to be positive. All these sums of squares are just going to get bigger and bigger as n increases, regardless of whether the line’s doing a good job!”
You are of course correct. So we have to consider sums of squares relative to each other. If we’re going to argue that our regression was a good and useful idea, that SSR piece had better be – relatively speaking – large. How can we quantify that?
4.8.1 Mean Squares
Dividing a sum of squares by its degrees of freedom gives what’s called a Mean Square. Because the degrees of freedom for the residuals (and thus for SSE) depend on the number of data points, this division accounts for how many observations you have.
We noted previously that sums of squares look kind of like variances, but without the denominator. And indeed, the mean squares are variances of observed quantities (which we could use as estimates of “true” variances). For a single-predictor regression, we have:
- $\frac{SSTO}{n-1}$ is the (total) variance of $y$, ignoring $x$
- $\frac{SSE}{n-2} = MSE$, which you’ve seen before, is the variance of the residuals $e$
- $\frac{SSR}{1} = MSR$ is a measure of how much the regression differs from the mean. In simple linear regression, you’re estimating two quantities (slope and intercept), which is 1 more than the mean model (in which you’d estimate, well, the mean). So the degrees of freedom associated with your regression – as compared to the constant model – is one.
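Here’s the same `cars` sketch carried one step further (the variable names are mine, nothing official): divide each sum of squares by its degrees of freedom to get the mean squares.

```r
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)
SSE <- sum(resid(fit)^2)
SSR <- sum((fitted(fit) - mean(cars$dist))^2)

# Mean squares: sums of squares divided by their degrees of freedom
MSE <- SSE / (n - 2)   # estimate of the error variance
MSR <- SSR / 1         # one df: the slope, beyond the constant model

# MSE should match the squared residual standard error from summary()
all.equal(MSE, summary(fit)$sigma^2)
```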
This suggests that we could compare the MSR to the MSE, in order to figure out if the model is helpful – without getting confused by what our n happens to be. If the MSR is relatively large, we could conclude that our model is useful. But how large is “large?” That’s a hypothesis testing question!
4.8.2 Null distribution time!
In order to do a hypothesis test, we’ll need a test statistic and a null distribution. Here are a couple of handy facts:
$$E(MSE) = \sigma^2$$
and
$$E(MSR) = \sigma^2 + \beta_1^2 S_{xx}$$
I’m not going to prove these facts here; maybe I will put them on some practice problems or something. The important thing is to note that $\beta_1$ has now appeared! This reflects the idea that we could compare MSR and MSE to talk about $\beta_1$: if $\beta_1$ isn’t zero, MSR should be bigger than MSE. But if $\beta_1$ is 0, the two quantities are, in expectation, the same.
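If you want some empirical reassurance, here’s a simulation sketch: the true $\beta_0$, $\beta_1$, $\sigma$, and $n$ below are made up, but the averages of MSE and MSR over many simulated datasets should land near $\sigma^2$ and $\sigma^2 + \beta_1^2 S_{xx}$.

```r
# Simulation sketch: average MSE and MSR over many fake datasets
set.seed(1)
n <- 30; beta0 <- 2; beta1 <- 0.5; sigma <- 3
x <- runif(n, 0, 10)          # fixed x's, reused in every replication
Sxx <- sum((x - mean(x))^2)

one_rep <- function() {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  SSE <- sum(resid(fit)^2)
  SSR <- sum((fitted(fit) - mean(y))^2)
  c(MSE = SSE / (n - 2), MSR = SSR / 1)
}

sims <- replicate(10000, one_rep())
rowMeans(sims)                       # simulated E(MSE) and E(MSR)
c(sigma^2, sigma^2 + beta1^2 * Sxx)  # the claimed expectations
```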
This is what the F test is for. The ratio MSR/MSE is referred to as an F statistic, because under the null hypothesis, it will have an F distribution with 1 and $n-2$ degrees of freedom. If $\beta_1$ is really 0, the ratio should be about 1.
Technical note: the mean of an F is actually $\frac{df_2}{df_2 - 2}$, where $df_2$ is the denominator degrees of freedom, so for simple regression $E(F) = \frac{n-2}{n-4}$. As you can see, this is pretty close to 1 as long as $n$ is decently large. But in any event you don’t really have to worry about it, since R is doing the computations for you.
So, to test if the slope is zero, we compute
$$F_{1,n-2} = \frac{MSR}{MSE}$$
and see if it’s too big. (Check your understanding: why are we only interested in whether this ratio is too big?) Because we are frequentist-ish in this course, by “see if it’s too big,” I mean “compute the p-value.” If the p-value is smaller than our pre-determined $\alpha$ level, we say: “Well, if the true $\beta$ were 0, we’d be unlikely to get a ratio of MSR/MSE that’s this big. So I think there’s evidence that the true $\beta$ is not 0.”
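As a concrete sketch (again with the built-in `cars` data), here is the F statistic and its p-value computed by hand, alongside R’s own output:

```r
# F statistic and p-value, by hand, for the cars regression
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)
MSE <- sum(resid(fit)^2) / (n - 2)
MSR <- sum((fitted(fit) - mean(cars$dist))^2) / 1

F_stat <- MSR / MSE
p_val  <- pf(F_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)  # upper tail only
c(F = F_stat, p = p_val)

summary(fit)  # the last line of this output shows the same F statistic and p-value
```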
This all works on the same principle when you have multiple predictors in the model. They all contribute terms to the MSR, so if any of the β’s are nonzero, the MSR should be bigger than the MSE. If you see a significantly large F statistic, that’s evidence that something in the model is useful, with a nonzero β. Of course, you don’t know which term(s) in the model are useful until you go follow up with t tests!
4.8.3 WTF is the F?
This isn’t a super critical section for our purposes – it’s a bit more theory. But in case you are interested in where the F distribution comes from, read on!
If a bunch of $y_i$ are drawn from a normal distribution $N(\mu, \sigma)$, then $\frac{y_i - \mu}{\sigma} = z_i$, following a standard normal distribution.
If the $y_i$ are all independent, then:
$$z_1^2 + \dots + z_n^2 \sim \chi^2_n$$
You’re going to have to take my word for it on these distribution facts. Or take some more stats courses :)
The sum of squares of n independent z-scores has a chi-squared distribution with n degrees of freedom. And the ratio of two independent chi-squared variables (each divided by its degrees of freedom) is distributed thus:
$$\frac{\chi^2_{h_1}/h_1}{\chi^2_{h_2}/h_2} \sim F_{h_1, h_2}$$
You can also show (well, someone can show) that under the null hypothesis, the scaled regression sum of squares $SSR/\sigma^2$ has 1 df and is $\chi^2_1$, the scaled residual sum of squares $SSE/\sigma^2$ has $n-2$ df and is $\chi^2_{n-2}$, and the two are independent. The $\sigma^2$’s cancel in the ratio, so:
$$\frac{MSR}{MSE} \sim F_{1, n-2}$$
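You don’t have to take anyone’s word for this null distribution, though. Here’s a simulation sketch (the $n$ and the flat relationship are made up) where the true slope is 0, so the ratio MSR/MSE should look like draws from $F_{1, n-2}$:

```r
# Null distribution check: simulate MSR/MSE when the true slope is 0
set.seed(2)
n <- 25
x <- runif(n, 0, 10)

F_null <- replicate(5000, {
  y <- rnorm(n)          # y has no relationship with x at all
  fit <- lm(y ~ x)
  MSE <- sum(resid(fit)^2) / (n - 2)
  MSR <- sum((fitted(fit) - mean(y))^2) / 1
  MSR / MSE
})

# Compare the simulated ratios to the F(1, n - 2) density
hist(F_null, freq = FALSE, breaks = 50, main = "MSR/MSE under the null")
curve(df(x, df1 = 1, df2 = n - 2), add = TRUE, col = "red")
```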
It turns out that when the numerator degrees of freedom is 1, as is the case in simple regression:
$$F_{1,h} = t^2_h$$
So they are the same test! When we add more predictors, we have more slopes, so the MSR has additional degrees of freedom. Then the t tests test each individual slope and the F test tests the hypothesis that all the slopes are 0:
$$H_0: \beta_1 = \dots = \beta_p = 0$$
For simple regression, there’s only one slope, so these hypotheses are the same.
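And here’s a quick check of that $F_{1,h} = t^2_h$ fact, using the `cars` regression one more time:

```r
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

t_slope <- s$coefficients["speed", "t value"]  # t statistic for the slope
F_stat  <- s$fstatistic["value"]               # overall F statistic

all.equal(unname(F_stat), t_slope^2)           # should be TRUE: F = t^2 here
```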