6.10 F Test Follow-Ups

6.10.1 Why aren’t we done yet?

So, you’ve come up with a multiple regression model, fit it to data, gotten some estimates of the coefficients, and even done an F test. You rejected \(H_0\) and concluded that there’s evidence your model works – your predictions are better than just using \(\overline{y}\) all the time. Yay!

But…now what? You might have a lot of different terms and predictors in your model. Which ones are useful?

The F test doesn’t tell you this, even if you reject the null hypothesis. It’s too general: it just tells you something is working. So your next step is to try and figure out which terms in your model are actually helping, and which you can leave out.

For an example, let’s go back to the Australian athlete data. We want to predict each athlete’s red blood cell count, using their white blood cell count, their sport, and an interaction between them. We can fit a model using all those terms:

athlete_cells_lm = lm(rcc ~ wcc*sport, data = athlete_cells_dat)
athlete_cells_lm %>% summary()
## 
## Call:
## lm(formula = rcc ~ wcc * sport, data = athlete_cells_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75133 -0.28865 -0.04293  0.30145  1.52911 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.14331    0.45684   9.069 4.65e-12 ***
## wcc               0.08467    0.06938   1.220    0.228    
## sportT_Sprnt      0.72214    0.68068   1.061    0.294    
## sportTennis      -0.59307    0.79779  -0.743    0.461    
## wcc:sportT_Sprnt -0.03883    0.09753  -0.398    0.692    
## wcc:sportTennis   0.10357    0.11924   0.869    0.389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4601 on 49 degrees of freedom
## Multiple R-squared:  0.2654, Adjusted R-squared:  0.1904 
## F-statistic:  3.54 on 5 and 49 DF,  p-value: 0.008233

The overall F test is described at the very bottom of the output. The associated p-value is 0.008, so at \(\alpha=0.01\), say, we’d reject \(H_0\) and conclude that this model is useful – its predictions are better than just using \(\overline{rcc}\) for everybody. But what about this model is useful?
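
If you don’t feel like squinting at the bottom of the printout, you can also pull that test out of the summary object directly. Here’s a small sketch (athlete_cells_summ is just a name I’m making up for the stored summary): summary() keeps the F statistic and its degrees of freedom in a component called fstatistic, and pf() turns those into the p-value.

athlete_cells_summ = summary(athlete_cells_lm)
athlete_cells_summ$fstatistic  # F statistic, numerator df, denominator df

# Recompute the overall p-value as the upper tail of the F distribution
pf(athlete_cells_summ$fstatistic["value"],
   athlete_cells_summ$fstatistic["numdf"],
   athlete_cells_summ$fstatistic["dendf"],
   lower.tail = FALSE)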

6.10.2 \(t\) tests for individual variables

The first technique that may come to mind is one we’ve used before. In simple linear regression, we did \(t\) tests to check whether the slope of a predictor was 0 – that is, whether the predictor really had a significant relationship with the response. Why not use that again? We could keep the predictors that have significant relationships with \(y\), and drop the rest.

In fact, you don’t even have to do anything extra for this. R performs those \(t\) tests on the coefficients of each predictor and shows them to you in the summary output!

athlete_cells_lm %>% summary()
## 
## Call:
## lm(formula = rcc ~ wcc * sport, data = athlete_cells_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75133 -0.28865 -0.04293  0.30145  1.52911 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.14331    0.45684   9.069 4.65e-12 ***
## wcc               0.08467    0.06938   1.220    0.228    
## sportT_Sprnt      0.72214    0.68068   1.061    0.294    
## sportTennis      -0.59307    0.79779  -0.743    0.461    
## wcc:sportT_Sprnt -0.03883    0.09753  -0.398    0.692    
## wcc:sportTennis   0.10357    0.11924   0.869    0.389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4601 on 49 degrees of freedom
## Multiple R-squared:  0.2654, Adjusted R-squared:  0.1904 
## F-statistic:  3.54 on 5 and 49 DF,  p-value: 0.008233

Each line in that coefficients table corresponds to a different term in the model, from the intercept all the way down to the interaction \(wcc*I_{tennis}\). The “Estimate” column has the associated coefficient \(b_{whatever}\); then R calculates the standard error of that coefficient estimate, the corresponding \(t\) statistic, and finally the p-value. That p-value is from the \(t\) test for that coefficient. So for example, the wcc row describes a \(t\) test with the null hypothesis \(\beta_{wcc} = 0\).
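
If you’d like to poke at that table yourself, it’s stored inside the summary object as a matrix, and you can check that each \(t\) statistic really is just the estimate divided by its standard error. A quick sketch (coef_table is just a name I’m choosing for convenience):

# The coefficients table lives inside the summary object
coef_table = summary(athlete_cells_lm)$coefficients
coef_table

# Each t value is the estimate divided by its standard error;
# for wcc, that's 0.08467 / 0.06938, or about 1.22
coef_table[, "Estimate"] / coef_table[, "Std. Error"]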

Now we’re cooking!

…or are we? Apart from the intercept (which is not interesting), none of these terms have significant p-values. I mean, not even close. We’re not actually confident that any of these \(\beta\)’s are not 0. And yet we were confident that the model, overall, was doing something! So what’s the deal?

Well, remember that key phrase when you’re talking about multiple regression: after accounting for other predictors. It could be that no individual term has a significant relationship with the response after you account for all the other stuff.

Look what happens if we fit a model using just white blood cell count as a predictor:

athlete_cells_lm_wcc = lm(rcc ~ wcc, data = athlete_cells_dat)
athlete_cells_lm_wcc %>% summary()
## 
## Call:
## lm(formula = rcc ~ wcc, data = athlete_cells_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86908 -0.24397  0.01004  0.28753  1.82326 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.07273    0.30856  13.199   <2e-16 ***
## wcc          0.11606    0.04513   2.571    0.013 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4867 on 53 degrees of freedom
## Multiple R-squared:  0.1109, Adjusted R-squared:  0.09414 
## F-statistic: 6.612 on 1 and 53 DF,  p-value: 0.01297

Incidentally, notice the p-value for wcc and the p-value for the overall F test. They’re the same! That’s because the only difference between this model and the “just use the mean” constant model is wcc. So asking whether “the model” is useful is the same thing as asking whether wcc is useful.

It’s no coincidence that the actual p-values are exactly the same (although the one in the coefficients table has been rounded) – when there’s only one predictor in the model, the \(t\) test and the F test are mathematically equivalent. In fact, the F statistic is just the square of the \(t\) statistic: here, \(2.571^2 \approx 6.61\). This has to do with some cool math about distributions that you can do later in life.
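
You can check this yourself if you like. Here’s a tiny sketch (t_wcc is just a made-up name for the stored \(t\) statistic):

# The t statistic for wcc in the one-predictor model, squared...
t_wcc = summary(athlete_cells_lm_wcc)$coefficients["wcc", "t value"]
t_wcc^2

# ...should match the overall F statistic (about 6.61)
summary(athlete_cells_lm_wcc)$fstatistic["value"]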

Huh. Now it looks significant!

What if we just use the athlete’s sport?

athlete_cells_lm_sport = lm(rcc ~ sport, data = athlete_cells_dat)
athlete_cells_lm_sport %>% summary()
## 
## Call:
## lm(formula = rcc ~ sport, data = athlete_cells_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79103 -0.37218 -0.03333  0.27288  1.52667 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.69103    0.08763  53.532  < 2e-16 ***
## sportT_Sprnt  0.50230    0.15008   3.347  0.00152 ** 
## sportTennis   0.09987    0.16710   0.598  0.55265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4719 on 52 degrees of freedom
## Multiple R-squared:  0.1798, Adjusted R-squared:  0.1483 
## F-statistic: 5.701 on 2 and 52 DF,  p-value: 0.005772

Well, the coefficient of \(I_{tennis}\) doesn’t look significant – that is, we’re not confident that there is a difference in average red cell count between tennis players and the baseline group, 400-meter runners. But we are pretty confident that \(\beta_{sprint}\) is not 0 and there is a difference between 400-meter runners and sprinters.

Why don’t these terms show up as significant in the big model? This can happen for a number of reasons, but one reason could be that they are sort of redundant – especially with those interaction terms in there. None of the terms adds that much once you already know all the others.
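
One way to get a feel for that overlap is to look at the predictors themselves. For example, here’s a quick sketch (assuming the tidyverse is loaded, since we’ve been using %>% all along) that checks whether white cell count tends to differ by sport. If it does, then wcc and the sport indicators are partly carrying the same information:

# If the sports differ in their typical white cell counts, then wcc and
# sport overlap in the information they give the model
athlete_cells_dat %>%
  group_by(sport) %>%
  summarize(mean_wcc = mean(wcc), sd_wcc = sd(wcc))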

Well, this confuses things. What should we take out of the model? If we do take things out, will the model get worse?

This brings us to…

6.10.3 Partial F tests

In the overall F test, we decided that our model was better than the “just the mean” model. What if we tried a smaller model, one that had some of the predictors in this model but not all of them? Would that do just as well?

Suppose we have two different “candidate” models. Each one tries to predict the same response variable – red cell count – but they use different predictors to do it. In particular, let’s say that these models are nested. That means that the larger one has all of the terms that the smaller one does, plus some extras. So the smaller one is nested inside the other one. We sometimes refer to the two models as the full and reduced models.

The null hypothesis here is that the two models are effectively the same. That is, the extra terms that we added to get the full model from the reduced model don’t really do anything – in symbols, the coefficients on all of those extra terms are 0.

Now, back in the overall F test, we compared the mean square for regression, MSR, to the mean square for error, MSE:

\[F = \frac{MSR}{MSE}\]

The numerator was about “how much better is our model than the constant model?” and the denominator was “okay but how much error is still left when we use our model?”

We want to modify this a bit: instead of comparing a model to the constant model, let’s compare the full model to the reduced model that’s nested within it.

The denominator should still be about error – error that even our full model can’t get rid of. Just to be clear, we’ll call it \(MSE_{full}\): that’s the mean squared error from the big model.

In the numerator, we now want to know how much better the full model is than the reduced model. Well, let’s compare their errors: \(SSE_{full}\) and \(SSE_{reduced}\). Because the full model gets to use all the information in the reduced model plus some extra stuff, its predictions can only get better, not worse: \(SSE_{full}\) can never be larger than \(SSE_{reduced}\). So let’s just subtract them: \(SSE_{reduced} - SSE_{full}\).

There’s one more step, though – in the earlier F test, we had to divide the sum of squares by their degrees of freedom to get mean squares, and then use the ratio of those.

In this case, the degrees of freedom for the numerator is the number of “extra” terms that are in the full model but not the reduced model.

So, all put together, we have:

\[F = \frac{(SSE_{reduced} - SSE_{full})/(\mbox{number of extra terms})}{MSE_{full}}\]

There’s a lot of math happening here, but again, it comes down to the numerator being about how much better the full model is, and the denominator being about how much the full model fails to explain.

This is the partial F-statistic. It serves as a test statistic! Under the null hypothesis, it follows an \(F\) distribution (with appropriate degrees of freedom). And so we can get ourselves a p-value and carry on as usual.
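
If you like, you can trace through this calculation by hand before letting R do it for you. Here’s a sketch (with made-up object names) that uses the full interaction model and the wcc-only model from earlier as the full and reduced models; it should agree with the built-in function we’ll meet in the next section.

# SSE for each model: the sum of squared residuals
sse_full = sum(residuals(athlete_cells_lm)^2)
sse_reduced = sum(residuals(athlete_cells_lm_wcc)^2)

# The full model has 4 extra terms: two sport indicators and two interactions
extra_terms = 4

# MSE of the full model: its SSE divided by its residual degrees of freedom
mse_full = sse_full / df.residual(athlete_cells_lm)

# Partial F statistic and its p-value
f_partial = ((sse_reduced - sse_full) / extra_terms) / mse_full
f_partial
pf(f_partial, extra_terms, df.residual(athlete_cells_lm), lower.tail = FALSE)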

6.10.4 Partial F tests with data

Let’s see what this looks like on our dataset. Here our full model, athlete_cells_lm, has terms for wcc, sport, and the wcc:sport interaction. But earlier, we saw that the model with just wcc seems to work pretty well too. So we might ask: is this big model actually better than a reduced model that just uses wcc?

We use the R function anova to perform this test. (Yes, the anova() function that you use to work with linear regression models can also be used for “difference of group means” ANOVA :) ) When you give it the names of two different lm objects, corresponding to a pair of nested models, it automatically does the partial F test:

anova(athlete_cells_lm, athlete_cells_lm_wcc)
## Analysis of Variance Table
## 
## Model 1: rcc ~ wcc * sport
## Model 2: rcc ~ wcc
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     49 10.372                              
## 2     53 12.553 -4   -2.1806 2.5754 0.04902 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This output is quite informative! First it lists both models, so you know what terms are in them. Then it shows you information about the partial F test. Notice the Df column: the two models differ by 4 degrees of freedom, because there are four coefficients in the full model that aren’t in the reduced one – the indicators \(b_{sprint}\) and \(b_{tennis}\), and the interaction terms \(b_{wcc:sprint}\) and \(b_{wcc:tennis}\). (That’s also why the reduced model has 4 more residual degrees of freedom: 53 instead of 49.) And finally, R gives you the partial F statistic and its p-value.

What do we conclude? At an alpha level of 0.05, we reject the null hypothesis that the two models work equally well. We have evidence that adding sport and the interaction do improve the model’s performance.

Well, okay, but…do we really need the interaction? Let’s fit an additive model – one where the effects of the predictors are only added, with no interaction effects:

athlete_cells_lm_add = lm(rcc ~ wcc + sport, data = athlete_cells_dat)

Now, is the full model with the interaction effects actually better than this model without interaction?

anova(athlete_cells_lm, athlete_cells_lm_add)
## Analysis of Variance Table
## 
## Model 1: rcc ~ wcc * sport
## Model 2: rcc ~ wcc + sport
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     49 10.372                           
## 2     51 10.679 -2  -0.30633 0.7236 0.4901

Ooh. Nope. At least, we don’t have evidence that it’s better. We can, however, confirm that the additive model (using white cell count and sport but not their interaction) is better than just using white cell count alone:

anova(athlete_cells_lm_add, athlete_cells_lm_wcc)
## Analysis of Variance Table
## 
## Model 1: rcc ~ wcc + sport
## Model 2: rcc ~ wcc
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     51 10.679                              
## 2     53 12.553 -2   -1.8743 4.4757 0.01619 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
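
As an aside: if you want to test removing each term from a model one at a time, the built-in drop1() function does a partial F test for each candidate removal. Here’s a quick sketch on the additive model – its sport line should match the comparison we just made:

# Partial F tests for dropping each term from the additive model, one at a time
drop1(athlete_cells_lm_add, test = "F")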

Let’s check the summary output for this additive model:

athlete_cells_lm_add %>% summary()
## 
## Call:
## lm(formula = rcc ~ wcc + sport, data = athlete_cells_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74888 -0.29775 -0.01978  0.28530  1.53146 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.10954    0.29289  14.031  < 2e-16 ***
## wcc           0.08989    0.04333   2.075  0.04309 *  
## sportT_Sprnt  0.44078    0.14852   2.968  0.00456 ** 
## sportTennis   0.08891    0.16212   0.548  0.58579    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4576 on 51 degrees of freedom
## Multiple R-squared:  0.2437, Adjusted R-squared:  0.1992 
## F-statistic: 5.477 on 3 and 51 DF,  p-value: 0.002433

Notice how the coefficients for both wcc and sportT_Sprnt now look significant! By removing extra, redundant stuff from the model, we’ve become more confident that these particular predictors really do matter.

Response moment: The coefficient for sportTennis does not have a significant p-value, but I left it in anyway. Why do you think I did this? What would happen if I took the \(I_{tennis}\) term out of the model?