4.8 The overall F test for regression
Doing \(t\) tests can give us information about individual predictors and whether they help explain the response. But there’s a bigger picture to think about: does anything in the model help explain the response? Is this model, frankly, any better than just guessing \(\bar{y}\) for every prediction?
Hopefully, yes. But let’s check.
For any regression model, we could say that doing the regression reduces the total variation of \(y\) to the variation around the line. Let’s make this more precise with some new sums of squares notation:
\(SSTO\) or the Total Sum of Squares is the total variation of \(y\) around its mean. It’s the numerator of the sample variance of \(y\) – ignoring anything to do with \(x\):
\[SSTO = S_{yy} = \sum(y_i - \bar{y})^2\]
Note that this is not just the sum of the \(y_i\) squared: we’ve subtracted \(\bar{y}\) first. Strictly speaking, this is the corrected total sum of squares, or \(SSTO_c\). In this class, you can assume this is what we mean by \(SSTO\), but be careful when you’re reading other sources. In the regression context, we don’t usually refer to this as “SSY” (even though it is equal to \(S_{yy}\)), because \(Y\) is the response and so gets special names for things.
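In R, this is a one-liner. Here’s a minimal sketch, using some simulated \(y\) values as a stand-in for real data:

```r
# Simulated y values as a stand-in for a real response (the choices here are arbitrary)
set.seed(1)
y <- rnorm(30, mean = 10, sd = 2)

# Total (corrected) sum of squares: variation of y around its own mean
SSTO <- sum((y - mean(y))^2)

# Equivalently, (n - 1) times the sample variance of y
all.equal(SSTO, (length(y) - 1) * var(y))
```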
The next step is to break up (or decompose) this sum of squares into two pieces. One piece represents what our model does for us; the other piece represents what it fails to do for us – the residual. In other words, the distance of \(y_i\) from the horizontal line \(\bar{y}\) can be broken into two parts:
\[y_i - \bar{y} = (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y})\]
And since that’s true for each \(i\), it’s true for the sum:
\[\sum(y_i - \bar{y}) = \sum(y_i - \hat{y_i}) + \sum(\hat{y_i} - \bar{y})\]
Now I claim:
\[\sum(y_i - \bar{y})^2 = \sum(y_i - \hat{y_i})^2 + \sum(\hat{y_i} - \bar{y})^2\]
“But wait!” you say. “I know FOIL,” you say. “Fine, \(\sum(y_i - \bar{y})^2 = \sum((y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}))^2\), so I grant you the \(\sum(y_i - \hat{y}_i)^2\) and the \(\sum(\hat{y}_i - \bar{y})^2\). But there should be some irritating cross terms,” you say, and you write:
\[2 \sum(y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \] You’re correct – but it turns out that’s 0. Here’s how to show it:
First, recall that \(\hat{y_i} = \bar{y} + b_1(x_i - \bar{x})\) and \(b_1 = \frac{S_{xy}}{S_{xx}}\).
Use the first equality to substitute for \(\hat{y}_i\), and keep an eye out for \(S_{xy}\) or \(S_{xx}\) terms to simplify:
\[ \begin{aligned} \sum(y_i - \hat{y_i})(\hat{y_i} - \bar{y}) &= \sum(y_i - \bar{y} - b_1(x_i - \bar{x}))(\bar{y} + b_1(x_i - \bar{x}) - \bar{y})\\ &= \sum(y_i - \bar{y} - b_1(x_i - \bar{x}))b_1(x_i - \bar{x})\\ &= b_1 \sum(y_i - \bar{y})(x_i - \bar{x}) - b_1^2\sum(x_i - \bar{x})(x_i - \bar{x})\\ &= b_1 S_{xy} - b_1^2 S_{xx}\\ &= b_1 (b_1 S_{xx}) - b_1^2 S_{xx} = 0 \end{aligned} \]
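If you’d rather trust the computer than the algebra, here’s a quick numerical check in R – simulated data, and the names dat and fit are just illustrative:

```r
# Simulated data and a fitted simple regression (names here are illustrative)
set.seed(1)
dat <- data.frame(x = runif(30, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(30)
fit <- lm(y ~ x, data = dat)

# The cross term: sum of (y_i - yhat_i)(yhat_i - ybar)
cross <- sum(residuals(fit) * (fitted(fit) - mean(dat$y)))
cross  # zero, up to floating-point error
```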
So, like I said: \[\sum(y_i - \bar{y})^2 = \sum(y_i - \hat{y_i})^2 + \sum(\hat{y_i} - \bar{y})^2\]
Incidentally, there’s no consensus on this notation. Options include:
- Total Sum of Squares: \(SSTO = S_{yy} = SST = SSTot\) (that last one shows up especially in experimental design, where a bare \(T\) might instead stand for “treatment”)
- Regression Sum of Squares: \(SSR = SSReg\), or sometimes \(SST\) or \(SSTr\) in experimental design
- Sum of Squared Errors (or Residual Sum of Squares): \(SSE\). Very occasionally \(SSR\), but please don’t adopt that notation now because it will be unbearably confusing.
For the moment, we’ll stick with \(SSTO\), \(SSR\), and \(SSE\).
Now that breakdown is interesting. That first piece looks like the (squared) residuals again – we’d like that to be small. The second piece represents the (squared) differences between the naïve prediction (using a constant) and our shiny new prediction (using a line). Both pieces are sums of squared things, so let’s name them accordingly:
\[SSTO = SSE + SSR\]
These sums of squares are part of what shows up in R’s ANOVA output! Check out that “Sum Sq” column next time you do ANOVA in R.
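Here’s a minimal sketch, with the same kind of simulated data as above, that computes all three sums of squares by hand and compares them to that “Sum Sq” column:

```r
set.seed(1)
dat <- data.frame(x = runif(30, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(30)
fit <- lm(y ~ x, data = dat)

SSTO <- sum((dat$y - mean(dat$y))^2)
SSE  <- sum(residuals(fit)^2)
SSR  <- sum((fitted(fit) - mean(dat$y))^2)

all.equal(SSTO, SSE + SSR)  # the decomposition holds

# In anova(fit), the "Sum Sq" entry for x is SSR and the one for Residuals is SSE
anova(fit)
```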
“But wait,” you say. “That’s not fair! Every time I get a new data point, it’s going to have a residual. And all of the terms in these sums are squares – they have to be positive. All these sums of squares are just going to get bigger and bigger as \(n\) increases, regardless of whether the line’s doing a good job!”
You are of course correct. So we have to consider sums of squares relative to each other. If we’re going to argue that our regression was a good and useful idea, that \(SSR\) piece had better be – relatively speaking – large. How can we quantify that?
4.8.1 Mean Squares
Dividing a sum of squares by its degrees of freedom gives what’s called a Mean Square. Because the residual degrees of freedom (and thus the divisor for \(SSE\)) depend on the number of data points, this division accounts for how many observations you have.
We noted previously that sums of squares look kind of like variances, but without the denominator. And indeed, the mean squares are variances of observed quantities (which we could use as estimates of “true” variances). For a single-predictor regression, we have:
- \(\frac{SSTO}{n-1}\) is the (total) sample variance of \(y\), ignoring \(x\)
- \(\frac{SSE}{n-2} = MSE\), which you’ve seen before, is the variance of the residuals \(e\)
- \(\frac{SSR}{1} = MSR\) is a measure of how much the regression differs from the mean. In simple linear regression, you’re estimating two quantities (slope and intercept), which is 1 more than the mean model (in which you’d estimate, well, the mean). So the degrees of freedom associated with your regression – as compared to the constant model – is one.
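To make these formulas concrete, here’s a small sketch (same simulated setup as before) that computes the mean squares by hand and compares them to the “Mean Sq” column of R’s ANOVA table:

```r
set.seed(1)
dat <- data.frame(x = runif(30, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(30)
fit <- lm(y ~ x, data = dat)
n <- nrow(dat)

SSE <- sum(residuals(fit)^2)
SSR <- sum((fitted(fit) - mean(dat$y))^2)

MSE <- SSE / (n - 2)  # same as sigma(fit)^2, the residual variance estimate
MSR <- SSR / 1        # the regression df is 1 in simple regression

c(MSE = MSE, MSR = MSR)
anova(fit)  # compare to the "Mean Sq" column
```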
This suggests that we could compare the MSR to the MSE, in order to figure out if the model is helpful – without getting confused by what our \(n\) happens to be. If the MSR is relatively large, we could conclude that our model is useful. But how large is “large?” That’s a hypothesis testing question!
4.8.2 Null distribution time!
In order to do a hypothesis test, we’ll need a test statistic and a null distribution. Here are a couple of handy facts:
\[E(MSE) = \sigma^2\]
and \[E(MSR) = \sigma^2 + \beta_1^2 S_{xx}\]
I’m not going to prove these facts here; maybe I will put them on some practice problems or something. The important thing is to note that \(\beta_1\) has now appeared! This reflects the idea that we could compare MSR and MSE to talk about \(\beta_1\): if \(\beta_1\) isn’t zero, MSR should be bigger than MSE. But if \(\beta_1\) is 0, the two quantities are, in expectation, the same.
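If you want some empirical reassurance instead of a proof, here’s a small Monte Carlo sketch – all the settings are arbitrary choices – that estimates \(E(MSE)\) and \(E(MSR)\) by simulation and compares them to the formulas above:

```r
# Monte Carlo check of E(MSE) and E(MSR); the settings are arbitrary choices
set.seed(1)
n <- 30; beta0 <- 2; beta1 <- 0.5; sigma <- 1
x <- runif(n, 0, 10)          # keep x fixed across replications
Sxx <- sum((x - mean(x))^2)

one_rep <- function() {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  SSE <- sum(residuals(fit)^2)
  SSR <- sum((fitted(fit) - mean(y))^2)
  c(MSE = SSE / (n - 2), MSR = SSR / 1)
}

ms <- replicate(5000, one_rep())
rowMeans(ms)                         # simulated E(MSE) and E(MSR)
c(sigma^2, sigma^2 + beta1^2 * Sxx)  # theoretical values
```

If you set beta1 to 0 and re-run, both averages should settle near \(\sigma^2\) – which is exactly the “same in expectation” situation described above.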
This is what the \(F\) test is for. The ratio \(MSR/MSE\) is referred to as an \(F\) statistic, because under the null hypothesis, it has an \(F\) distribution with \(1\) and \(n-2\) degrees of freedom. If \(\beta_1\) is really \(0\), the ratio should be about \(1\).
Technical note: the mean of an \(F\) distribution is actually \(df_2/(df_2-2)\), where \(df_2\) is the denominator degrees of freedom, so for simple regression \(E(F) = \dfrac{n-2}{n-4}\). As you can see, this is pretty close to 1 as long as \(n\) is decently large. But in any event you don’t really have to worry about it, since R is doing the computations for you.
So, to test whether the slope is zero, we compute \[F_{1,n-2} = \frac{MSR}{MSE}\] and see if it’s too big. Because we are frequentist-ish in this course, by “see if it’s too big,” I mean “compute the p-value.” If the p-value is smaller than our pre-determined \(\alpha\) level, we say: “Well, if the true \(\beta_1\) were 0, we’d be unlikely to get a ratio of \(MSR/MSE\) that’s this big. So I think there’s evidence that the true \(\beta_1\) is not 0.”
Check your understanding: why are we only interested in whether this ratio is too big?
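Here’s a minimal sketch (simulated data again) that builds the \(F\) statistic by hand, gets the p-value from pf(), and then checks against R’s built-in output:

```r
set.seed(1)
dat <- data.frame(x = runif(30, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(30)
fit <- lm(y ~ x, data = dat)
n <- nrow(dat)

MSE <- sum(residuals(fit)^2) / (n - 2)
MSR <- sum((fitted(fit) - mean(dat$y))^2) / 1
F_stat <- MSR / MSE

# Upper-tail p-value from the F(1, n-2) distribution: only "too big" counts
p_val <- pf(F_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
c(F = F_stat, p = p_val)

# Compare to the "F value" / "Pr(>F)" columns in anova(fit),
# or the "F-statistic" line at the bottom of summary(fit)
anova(fit)
```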
This all works on the same principle when you have multiple predictors in the model. They all contribute terms to the MSR, so if any of the \(\beta\)’s are nonzero, the MSR should be bigger than the MSE. If you see a significantly large \(F\) statistic, that’s evidence that something in the model is useful, with a nonzero \(\beta\). Of course, you don’t know which term(s) in the model are useful until you go follow up with \(t\) tests!
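For instance, here’s a sketch with two made-up predictors (in this simulation, x1 matters and x2 doesn’t): the overall \(F\) test at the bottom of summary() asks whether anything in the model helps, and the coefficients table gives the individual \(t\) tests.

```r
# Two hypothetical predictors; in this simulation only x1 actually matters
set.seed(1)
dat <- data.frame(x1 = runif(50), x2 = runif(50))
dat$y <- 1 + 2 * dat$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = dat)

# The "F-statistic" line at the bottom is the overall test of
# H0: beta_1 = beta_2 = 0; the Coefficients table holds the individual t tests
summary(fit)
```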
4.8.3 WTF is the F?
This isn’t a super critical section for our purposes – it’s a bit more theory. But in case you are interested in where the F distribution comes from, read on!
If a bunch of \(y_i\) are drawn from a normal distribution \(N(\mu,\sigma)\), then each \(z_i = \frac{y_i-\mu}{\sigma}\) follows a standard normal distribution.
If the \(y_i\) are all independent, then:
\[z^2_1 + ... + z^2_n \sim \chi^2_n\]
You’re going to have to take my word for it on these distribution facts. Or take some more stats courses :)
The sum of squares of \(n\) independent \(z\)-scores has a chi-squared distribution with \(n\) degrees of freedom. And the ratio of two independent chi-squared variables (each divided by its degrees of freedom) is distributed thus:
\[\frac{\chi^2_{h_1}/h_1}{\chi^2_{h_2}/h_2} \sim F_{h_1,h_2}\]
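You don’t have to take my word for it entirely: here’s a quick simulation sketch (degrees of freedom chosen arbitrarily) that builds chi-squared variables from squared normals, takes the scaled ratio, and compares its quantiles to the theoretical \(F\) distribution:

```r
# Simulation of the two distribution facts (degrees of freedom chosen arbitrarily)
set.seed(1)
h1 <- 1; h2 <- 20; reps <- 1e5

# Sums of h independent squared z-scores are chi-squared with h df
chisq1 <- colSums(matrix(rnorm(h1 * reps), nrow = h1)^2)
chisq2 <- colSums(matrix(rnorm(h2 * reps), nrow = h2)^2)

# The scaled ratio should behave like an F(h1, h2)
ratio <- (chisq1 / h1) / (chisq2 / h2)
quantile(ratio, c(0.5, 0.9, 0.99))
qf(c(0.5, 0.9, 0.99), df1 = h1, df2 = h2)  # should be close to the line above
```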
You can also show (well, someone can show) that under the null hypothesis, \(SSR/\sigma^2\) is \(\chi^2_1\) (1 df), \(SSE/\sigma^2\) is \(\chi^2_{n-2}\) (\(n-2\) df), and the two are independent. The \(\sigma^2\)s cancel when you take the ratio of the mean squares, so:
\[ \frac{MSR}{MSE} \sim F_{1,n-2}\]
It turns out that when the numerator degrees of freedom is \(1\), as is the case in simple regression:
\[F_{1,h} = t^2_{h}\]
So they are the same test! When we add more predictors, we have more slopes, so the \(MSR\) has additional degrees of freedom. Then the \(t\) tests test each individual slope and the \(F\) test tests the hypothesis that all the slopes are \(0\):
\[H_0: \beta_1 = \dots = \beta_p = 0\]
For simple regression, there’s only one slope, so these hypotheses are the same.
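And here’s one last sketch (simulated data again) checking that the squared \(t\) statistic for the slope matches the overall \(F\) statistic in simple regression:

```r
set.seed(1)
dat <- data.frame(x = runif(30, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(30)
fit <- lm(y ~ x, data = dat)

# t statistic for the slope, from the coefficients table
t_slope <- summary(fit)$coefficients["x", "t value"]

# F statistic for the overall test, from the ANOVA table
F_overall <- anova(fit)["x", "F value"]

all.equal(t_slope^2, F_overall)  # TRUE: in simple regression they're the same test
```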