6.9 Regression F Tests

Back in the simple linear regression days, it was (perhaps) a natural next step to start asking inference questions. Sure, I can observe a relationship between \(x\) and \(y\) in my sample, but am I confident that there really is a relationship at the population level?

Well, we want to ask the same kinds of questions about multiple regression models, too.

6.9.1 Athlete example

For an example, let’s use the dataset of physical measurements on elite Australian athletes from various sports. We want to predict each athlete’s red blood cell count, using their white blood cell count, their sport, and an interaction between them. The data look like this:

# the ais data: physical measurements on elite Australian athletes
# (assumed here to come from the DAAG package)
library(DAAG)
library(tidyverse)

athlete_cells_dat = ais %>%
  filter(sport %in% c("T_400m", "T_Sprnt", "Tennis"))
athlete_cells_dat %>%
  ggplot() +
  geom_point(aes(x = wcc, y = rcc,
                 color = sport))

If we were to fit separate regression lines for the athletes in each sport, we’d see that each line is different. Not only do they have different intercepts, but they have different slopes – in particular, the relationship between red cell count and white cell count seems to be different for tennis players than for either type of track athlete.

athlete_cells_dat %>%
  ggplot(aes(x = wcc, y = rcc,
             color = sport)) +
  geom_point() +
  # one fitted line per sport: the color aesthetic defines the groups
  geom_smooth(method = "lm", se = FALSE)

But, I don’t know…maybe I’m not completely convinced. I mean, sure, I can always draw a regression line if I want to, but these relationships don’t look super strong. Maybe I’m just “fitting noise” that I happened to see in this particular sample. Am I really confident that this model helps me predict an athlete’s red cell count?

That’s where the F test comes in!

6.9.2 The full F test

The “full” F test on a regression model asks a very broad question: is anything in this model useful? Does using the model improve our predictions of the response, \(y\)? Or could we pretty much do as well just guessing the average, \(\overline{y}\), for every point? The null hypothesis is that nothing in the model really helps – it’s very general.

The F test in regression is based on the concept of analysis of variance, or ANOVA. We’ve seen ANOVA before and there’s a lot to say about it, but for right now, just think about the phrase at face value: we’re going to analyze variances to tell us whether the model is useful.

You’ll see that a lot of the mathematical reasoning here is quite similar to the version of ANOVA we saw previously, where we looked at whether different groups had different mean values of some quantitative variable. For example, “do students in different class years sleep different amounts, on average?”

Secretly (well, not that secretly, I guess), the math looks the same because it is the same. The question “do students in different class years sleep different amounts, on average?” is actually the question “is there a relationship between class year and sleep quantity?” which is the same as asking “can I use class year to improve my estimate of how much a student sleeps?”…which is a regression equation with a categorical predictor.

The ANOVA we saw before is, in a way, just a special version of doing a regression F test.

The basic premise is this: the \(y\) values in the dataset have some variance. But if the model is useful, we should be able to predict something about which \(y\) values are higher or lower. We’ll still have residuals, and those residuals will have variance…but that spread should be smaller than the original spread of the \(y\) values.

Sound familiar? This is similar to the reasoning behind \(R^2\)! \(R^2\) was the proportion of the variation in \(y\) that we could explain using the model. If the “leftovers” – the residuals – had a noticeably smaller spread than the original \(y\) values, we concluded that there was a relationship between \(x\) and \(y\): the model was helping.

Now our question is: is using this model any better than just using \(\overline{y}\) as our prediction all the time?

There are a lot of ways to think about this mathematically, but here’s one.

6.9.3 Sums of Squares

\(SSTot\) or the Total Sum of Squares is the total variation of \(y\) around its mean. It’s the numerator of the sample variance of \(y\) – ignoring anything to do with the predictors. If we say that \(y_i\) is the response value for point \(i\), we have:

\[SSTot = S_{yy} = \sum(y_i - \overline{y})^2\]

Strictly speaking, this is the corrected total sum of squares, or \(SSTot_c\). In this class, you can assume this is what we mean by \(SSTot\), but be careful when you’re reading other sources.

In the regression context, we don’t usually refer to this as “SSY,” because \(Y\) is the response and so it gets special names for things.

Note that this is not just the sum of the \(y_i\) squared: we’ve subtracted \(\bar{y}\) first.
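
Here’s a quick sketch of that calculation in R, applied to the athletes’ red cell counts from earlier: it’s just the definition above, written out directly for the rcc column.

sum((athlete_cells_dat$rcc - mean(athlete_cells_dat$rcc))^2)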

The next step is to break up (or decompose) this sum of squares into two pieces. First we go from the constant model’s prediction (\(\overline{y}\)) to the linear model’s prediction (\(\hat{y}_i\)), then we go from \(\hat{y}_i\) to the true value \(y_i\). If the linear model gets us most of the way there, it’s doing well!

So the distance of \(y_i\) from the horizontal line \(\bar{y}\) can be broken into two parts:

\[y_i - \bar{y} = (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y})\]

And since that’s true for each \(i\), it’s true for the sum:

\[\sum(y_i - \bar{y}) = \sum(y_i - \hat{y_i}) + \sum(\hat{y_i} - \bar{y})\]

Now I claim:

\[\sum(y_i - \bar{y})^2 = \sum(y_i - \hat{y_i})^2 + \sum(\hat{y_i} - \bar{y})^2\]

“But wait!” you say. “I know FOIL,” you say. “Shouldn’t there be some sort of cross terms coming out of that multiplication?”

You’re quite right. But it turns out that they’re 0 – there’s an exciting proof for you to do later in life. For now, you can take my word for it.

Now that breakdown is interesting. That first piece on the right-hand side looks like the (squared) residuals again – we’d like that to be small. The second piece represents the (squared) differences between the naive prediction (using a constant) and our shiny new prediction (using a line). Both pieces are sums of squared things, so let’s name them accordingly:

\[SSTot = SSE + SSR\]
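
If you want to see this decomposition in action, here’s a sketch that checks it numerically using the athlete data and the same interaction model we’ll fit below (object names like check_lm are just for illustration):

check_lm = lm(rcc ~ wcc*sport, data = athlete_cells_dat)
SSTot = sum((athlete_cells_dat$rcc - mean(athlete_cells_dat$rcc))^2)
SSE = sum(residuals(check_lm)^2)
SSR = sum((fitted(check_lm) - mean(athlete_cells_dat$rcc))^2)
c(SSTot = SSTot, "SSE + SSR" = SSE + SSR)  # these two numbers should match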

Incidentally, there’s no consensus on this notation. Options include:

  • Total Sum of Squares: \(SSTot = S_{yy} = SST = SSTO\) (\(SST\) is confusing if you’re doing experimental design though, because \(T\) can stand for “treatment”)

  • Regression Sum of Squares: \(SSR = SSReg\), or sometimes \(SST\) or \(SSTr\) in experimental design

  • Sum of Squared Errors (or Residual Sum of Squares): \(SSE\). Very occasionally \(SSR\), but please don’t adopt that notation now because it will be unbearably confusing.

“But wait,” you say. “That’s not fair! Every time I get a new data point, it’s going to have a residual. And all of the terms in these sums are squares – they have to be positive. All these sums of squares are going to get bigger and bigger, regardless of whether the line’s doing a good job!”

You are of course correct. So we have to consider sums of squares relative to each other. If we’re going to argue that our regression was a good and useful idea, that \(SSR\) piece had better be – relatively speaking – large. How can we quantify that?

For \(R^2\), we look at the ratio of sums of squares \(SSR/SSTot\). But for the \(F\) test, we take a slightly different approach.
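
(For the athlete model, that ratio \(SSR/SSTot\) comes out to about 0.265; you’ll see it echoed as the “Multiple R-squared” value in the regression output below.)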

6.9.4 Mean Squares

Dividing a sum of squares by its degrees of freedom gives what’s called a Mean Square. We’ve seen degrees of freedom before in \(t\) tests. In a multiple regression context, the model gets one degree of freedom for each coefficient you estimate, including the intercept. So a “model” that says “just use the mean” has 1 degree of freedom, while a model with \(k\) terms (each with its own coefficient) plus an intercept has \(k+1\) degrees of freedom. The total degrees of freedom associated with \(SSTot\) is \(n-1\), where \(n\) is the number of data points you have.
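
For instance, in the athlete model we’re about to fit, there are \(k = 5\) terms besides the intercept (one for wcc, two for sport, and two for the interaction), so the model uses \(5 + 1 = 6\) degrees of freedom. With \(n = 55\) athletes in these three sports, that leaves \(n - k - 1 = 49\) degrees of freedom for the residuals and \(n - 1 = 54\) for \(SSTot\); you’ll see the 49 echoed in the regression output below.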

We noted earlier that sums of squares look kind of like variances, but without the denominator. And indeed, the mean squares are variances:

  • \(\frac{SSTot}{n-1} = MSTot\) is the (total) variance of \(y\) ignoring \(x\)
  • \(\frac{SSE}{n - k - 1} = MSE\), which you’ve seen before, is the variance of the residuals \(e\)
  • \(\frac{SSR}{k} = MSR\) is a measure of how much the regression predictions differ from the mean. It has \(k\) degrees of freedom because our model estimates \(k\) more coefficients than the “just use the mean” model (which only estimates \(\overline{y}\) and that’s it).

In ANOVA, we look at the ratio \(MSR/MSE\). If it’s large, the regression is doing most of the work. How large is large? Well, that sounds like a hypothesis testing question!

To do an \(F\) test, we use that ratio, \(MSR/MSE\), as our test statistic; it’s called an \(F\) statistic.

\[F = MSR/MSE\]

It turns out that under the null hypothesis – namely, that nothing in the model is actually useful – that \(F\) statistic is drawn from a distribution called the \(F\) distribution (sensibly enough), which is defined using two separate degrees of freedom: here, \(k\) for the numerator and \(n-k-1\) for the denominator. We won’t go into exactly what that distribution is or how we know this, but the point is, we have a test statistic and a null distribution to compare it to! So we can get a p-value and continue as usual.
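
To make all of this concrete, here’s a sketch of the whole calculation done “by hand” in R for the athlete model we’re about to fit. The object names are just for illustration, and R will do all of this for us in a moment anyway.

check_lm = lm(rcc ~ wcc*sport, data = athlete_cells_dat)
n = nrow(athlete_cells_dat)
k = length(coef(check_lm)) - 1        # number of terms besides the intercept (here, 5)
SSE = sum(residuals(check_lm)^2)
SSR = sum((fitted(check_lm) - mean(athlete_cells_dat$rcc))^2)
MSE = SSE / (n - k - 1)               # variance of the residuals
MSR = SSR / k                         # regression mean square
F_stat = MSR / MSE                    # should match the F-statistic in the summary output below
pf(F_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)   # the p-value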

6.9.5 Back to the athletes

Now, you’re not going to have to go around calculating sums of squares and all that by hand. That’s what R is for. Let’s fit a model that predicts red blood cell count using white blood cell count, sport, and the interaction:

# wcc*sport expands to wcc + sport + wcc:sport: both main effects plus their interaction
athlete_cells_lm = lm(rcc ~ wcc*sport, data = athlete_cells_dat)
athlete_cells_lm %>% summary()
## 
## Call:
## lm(formula = rcc ~ wcc * sport, data = athlete_cells_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75133 -0.28865 -0.04293  0.30145  1.52911 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.14331    0.45684   9.069 4.65e-12 ***
## wcc               0.08467    0.06938   1.220    0.228    
## sportT_Sprnt      0.72214    0.68068   1.061    0.294    
## sportTennis      -0.59307    0.79779  -0.743    0.461    
## wcc:sportT_Sprnt -0.03883    0.09753  -0.398    0.692    
## wcc:sportTennis   0.10357    0.11924   0.869    0.389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4601 on 49 degrees of freedom
## Multiple R-squared:  0.2654, Adjusted R-squared:  0.1904 
## F-statistic:  3.54 on 5 and 49 DF,  p-value: 0.008233

Check out the bottom of that summary output. That F-statistic is what we just defined above! It’s the ratio of what the model does explain to what it doesn’t explain. And it comes with its very own p-value. Assuming we’re using a typical \(\alpha\) of 0.01 or something, this p-value is less than \(\alpha\). We reject the null hypothesis that nothing in the model is useful. We have evidence that this model is better than just guessing \(\overline{y}\) for every prediction we make.
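
By the way, if you’d like to see this framed explicitly as a model comparison, one option (just a sketch; the mean_only_lm name is mine) is to fit the intercept-only model and hand both models to anova(), which reproduces the same F statistic and p-value:

mean_only_lm = lm(rcc ~ 1, data = athlete_cells_dat)   # the "just use the mean" model
anova(mean_only_lm, athlete_cells_lm)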

Response moment: We rejected \(H_0\) in this overall \(F\) test. What can we actually conclude from that? (What’s the alternative hypothesis?)