2.5 Multiple Comparison: issues and fixes

2.5.1 Introduction: why worry?

We’ve been having a grand old time so far with all this analysis. Using ANOVA, we could test whether a particular factor mattered overall – whether any or all of its levels were significantly different. Then we could test specific contrasts to follow up – whether two particular groups were different, or even something more complicated. It has been a statistical wonderland of knowledge.

But we’ve forgotten an important point. As humans often do, we have forgotten that we might be wrong.

Every time we run one of these tests, there is a chance that our conclusion is wrong. And the more tests we do, the more likely we are to be wrong about some of them.

2.5.2 A brief recapitulation of errors

Various people have also, seriously or semi-seriously, proposed definitions for type III and type IV errors. A popular definition for type III error is, as Kimball put it, “giving the right answer to the wrong problem.” Tragically, there is no sample size or \(\alpha\) value that can safeguard you from such an error.

See Wikipedia for more.

In hypothesis testing, we talk about two types of errors. Here is a visualization from our colleagues at Columbia:

If \(H_0\) is in fact true, but you reject it, you have made a type I error. Your \(\alpha\) level is in fact your probability of making such an error if \(H_0\) is true – it’s a conditional probability. So when you choose \(\alpha\), you’re describing your acceptable level of risk of getting excited and rejecting the null when in fact there’s nothing going on.

If, on the other hand, \(H_0\) is not true, but you fail to reject it, you have made a type II error. The conditional probability of making a type II error – the probability of failing to reject when \(H_0\) is actually false – is called \(\beta\). The “opposite” situation – the probability of correctly rejecting \(H_0\) if it’s actually false – is \(1-\beta\). This quantity is called power.

Now, as an important side note: \(\beta\) and power are a bit complicated because they depend on how wrong \(H_0\) is. For example, if we are weighing baby chicks who’ve been fed two different diets, the null hypothesis is that \(\mu_1 = \mu_2\): the average weight in the two diet groups is the same. There are lots of ways this null hypothesis could be false! For example, suppose that actually \(\mu_1 = \mu_2 + 0.00001\). Then \(H_0\) is false, but it’s going to be very hard to tell that based on observed data. But if in truth \(\mu_1 = \mu_2 + 200\), that’s going to be really easy to notice. Even if I look at just a few chickens, I’ll probably see a big enough difference to reject \(H_0\). So when we talk about power we usually say “power to detect an effect of size \(\Delta\).”

You may recall from intro days that there is a tradeoff between type I and type II errors. If I use a very low, strict \(\alpha\) value, I’m not likely to make a type I error. But if \(H_0\) really is false, I’m not likely to reject it, because I’m so strict about my p-values – so I have a higher chance of a type II error, and lower power. I can increase power by rejecting more often, using a higher and more generous \(\alpha\), but then I run an increased risk of committing a type I error and rejecting \(H_0\) when I shouldn’t.

You may also recall that the way out of this dilemma is: increase the sample size. With a larger \(n\), you just have a better idea of what is happening. You are less likely to reject when you shouldn’t, and also more likely to notice if there really is a difference.

There’s a whole heap of methodology and lucrative software offerings out there about calculating your ideal sample size based on your desired \(\alpha\) and power, but we won’t really go into it now. At the moment, I’m mostly concerned with this \(\alpha\).

2.5.3 Multiple hypothesis testing

So suppose you are a conventionalist and thus performing a hypothesis test with an \(\alpha\) of 0.05. That means that, if that null hypothesis is true, there’s a 5% chance you’re going to reject it anyway – you just happened to get a goofy sample that looked interesting when the truth is that nothing’s going on.

Okay, but 5% isn’t that bad. Remember the statistician’s philosophy: weird things happen, but not to me. You’re pretty confident that your sample wasn’t one of the goofy ones.

But then you come back tomorrow and you do another test. And then the next day. And so on every day for three weeks.

By the end of it, you’ve performed 21 different hypothesis tests. And while the chance of a type I error on any single one of them was only 5%, when you think of them all together, well…you’ve probably goofed by now. Even if every single one of your null hypotheses was actually true, it’s pretty likely that you rejected one or more of them.

This is the problem with multiple hypothesis testing, also called multiple comparison testing. Taking an acceptable risk of type I error over and over again leads to a much higher level of overall risk.

Response moment: Suppose you do an ANOVA test on some grouping factor, and then to follow up, you want to do all the pairwise group-vs.-group \(t\) tests: level 1 vs. level 2, level 1 vs. level 3, and so on. If your grouping factor has four levels, how many \(t\) tests will you have to do? What if your grouping factor has 10 levels?

This is why it’s not okay to just do an ANOVA test and then try all the contrasts and group-vs.-group tests you can think of. Even if nothing is really happening, you will probably manage to dig up some contrast that looks significant at the 0.05 level, and then everyone will get excited over nothing. This is one example of p-hacking – you try lots of different subgroups and tests in the hopes of finding something that looks significant and will allow you to publish.

So you have three options here.

One: plan ahead. Think about what contrasts or tests are of greatest scientific interest. What do you really want to know? Then only do those tests.

Two: go ahead and try everything. Mine the socks off that dataset, see what looks interesting. But then, before you make any claims that what you’ve discovered is the truth, get a fresh, independent set of observations and try again on those. If your finding still looks significant on the new data, now you’ve got yourself a real result.

But maybe you’re not sure what will be interesting in advance – or your list of possibilities is so long that you’d be doing a whole bunch of tests anyway. And maybe you don’t have the option of getting more data.

This brings us to option three: math.

2.5.4 Multiple comparison fixes

People have done a lot of work on the multiple comparison problem for obvious reasons, namely, it’s really annoying. So there are a lot of methods for handling it, with varying degrees of complexity. We’ll look at the simplest one and then give a shout-out to one of the others that’s commonly used, but you can always look up more options later!

The first method that everybody learns for this is called the Bonferroni correction. It works like this: suppose you’re really comfortable with some value of \(\alpha\). If you’re going to do \(m\) different hypothesis tests, then instead of using \(\alpha\) as your confidence level for each test, use \(\alpha/m\). So if your preferred \(\alpha\) is 0.05 and you want to do 10 tests, you do each one with an \(\alpha\) level of 0.005. There’s also an extension of this to confidence intervals developed by Olive Jean Dunn, who didn’t get it named after her.

The Bonferroni approach is nice because it’s really simple; it’s easy to explain what you’re doing, and it doesn’t require any extra tools or functions to implement it. The problem is, it’s also really harsh. It does ensure that your overall chance of making any type I error is no higher than \(\alpha\), but at a cost – it becomes extremely difficult to reject \(H_0\) in any given test. You lose a lot of statistical power.

Another approach is called Tukey’s HSD, which stands for “honestly significant difference.” This gets used in the context where you’re comparing all possible pairs of groups: so your null hypotheses are \(H_0: \mu_i = \mu_j\) for all combinations of \(i\) and \(j\).

We won’t go into the math here, but the basic idea is this. If you’re doing a bunch of group-vs.-group comparisons, if anything is going to look significant, it’ll be the comparison between the highest group and the lowest group. So Tukey asks: if \(H_0\) is true overall and none of the groups are really different, how far apart would we expect the highest group and the lowest group to be? If we can’t find a difference bigger than that, then probably nothing is going on. This is called Tukey’s range statistic, and that’s what the HSD method is based on.

In R, the function TukeyHSD in the stats library helpfully performs this analysis for us. It works on an aov object:

TukeyHSD(x = chicken_aov, ordered = FALSE, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## Fit: aov.default(formula = weight ~ Diet, data = chicken_dat)
## $Diet
##          diff         lwr       upr     p adj
## 2-1  35.18824 -27.0570451  97.43352 0.4394508
## 3-1  88.48824  26.2429549 150.73352 0.0024869
## 4-1  63.47712  -0.9086564 127.86290 0.0545918
## 3-2  53.30000 -16.5496130 123.14961 0.1895019
## 4-2  28.28889 -43.4747665 100.05254 0.7186387
## 4-3 -25.01111 -96.7747665  46.75254 0.7877511

This output gives you a lot of information! Each row corresponds to a pair of groups: 2 vs. 1, 3 vs. 1, and so on. The diff column gives the difference between the two group means. Then lwr and upr give a confidence interval for the true difference between the groups – adjusted for the fact that you’re doing multiple comparisons. And finally p adj gives the adjusted p-value for that comparison.

You can see how these results align with what you saw in the boxplot:

chicken_dat %>%
  ggplot() +
  geom_boxplot(aes(x = Diet, y = weight))

These results tell us that diet 3 does seem to be different from diet 1, but I’m not really confident that any of the other pairs are different. Mayyyybe 4 vs. 1 if I’m feeling generous. But note that that 3-vs.-1 difference was enough to reject the ANOVA hypothesis that all the group means were the same!