Chapter 11 Post-hoc comparisons

11.1 Introduction

Analysis of variance, as we have seen, can be used to test null-hypotheses about overall effects of certain factors (categorical variable) or combinations of factors (moderation). This is done with \(F\)-test statistics, with degrees of freedom that depend on the number of (combinations of) categories. The regression table, with \(t\)-tests in the output, can be used to compute specific contrasts, either the standard contrasts based on dummy coding, or contrasts based on alternative coding schemes.

In the previous chapter, all alternatives for specifying contrasts have been discussed in the context of specific research questions. It is important to make a distinction here between research questions that are posed before the data gathering and analysis, and research questions that pop up during the data analysis. Research questions of the first kind we call a priori (“at the outset”) questions, and questions of the second kind we call post hoc (“after the fact”) questions.

We’ve seen that the whole data analysis approach for inference is based on sampling distributions, for instance the sampling distribution of the \(t\)-statistic given that a population value equals 0. We then look at what \(t\)-value can be deemed large enough to reject the null-hypothesis (or to construct a confidence interval). Such a critical \(t\)-value is chosen in a way that if the null-hypothesis is true, it only happens in a low percentage of cases (\(\alpha\)) that we find a \(t\)-value more extreme than this critical value. This helps us to reject the null-hypothesis: we see something extreme that can’t be explained very well by the null-hypothesis.

However, if we look at a linear regression output table, we often see many \(t\)-values: one for the intercept, several for slope coefficients, and if the model includes moderation, we also may see several \(t\)-values for interaction effects. Every single \(t\)-value is based on a hypothesis that the null-hypothesis is true, that is, that the actual parameter value (or contrast) is 0 in the population. For every single \(t\)-test, we therefore know that if we would draw many many samples, in only \(\alpha\)% of the samples, we would find a \(t\)-value more extreme than the critical value (given that the null-hypothesis is true). But if we have for instance 6 different \(t\)-values in our output, how large is the probability that any of these 6 different \(t\)-values is more extreme than than the critical value?

Let’s use a very simple example. Let’s assume we have a relatively large data set, so that the \(t\)-distibution is very close to the normal distribution. When we assume we use two-sided testing with an \(\alpha\) of 5%, we know that the critical values for the null-hypothesis are \(-1.96\) and \(1.96\). Imagine we have a dependent variable \(Y\) and a categorical independent variable \(X\) that consists of two levels, A and B. If we run a standard linear model on those variables \(Y\) and \(X\), using dummy coding, we will see two parameters in the regression table: one for reference level A (labelled “(Intercept)”), and one coefficient for the difference between level B and A (labelled “XB”). Suppose that in reality, the population values for these parameters are both 0. That would mean that the two group means are equal to 0. When we do the research many times, drawing a large sample and repeat taking new samples 100 times, how many times will the intercept have a \(t\)-value more extreme than \(\pm 1.96\)? Well, by definition, that frequency would be about 5, because we know that for the \(t\)-distribution, 5% of this distribution has values more extreme than \(\pm 1.96\). Thus, if the intercept is truly 0 in the population, we will see a significant \(t\)-value in 5% of the samples.

The same is true for the second parameter “XB”: if this value is 0 in the population, then we will see a significant \(t\)-value for this parameter in the output in 5% of the samples.

Both of these events would be Type I errors: the kind of error that you make when you reject the null-hypothesis while it is actually true (see Chapter 2).

For any given sample, there can be either no Type I error, or there is one Type I error, of there are two Type I errors. Now, if the probability for a Type I error is 5% for a significant value for “(Intercept),” and the probability is 5% for a type I error for “XB,” what then is the probability for at least one Type I error?

This is a question for probability theory. If we assume that the Type I errors for the intercept and the slope are independent, we can use the binomial distribution (Ch. 3) and know that the probability of finding no Type I errors equals

\[P(errors = 0 | \alpha = 0.05) = {2 \choose 0} \times 0.05^0 \times (1-0.05)^2 = 0.95^2 = 0.9025\]

[Note: In case you skipped Chapter 3, \(4 \choose 2\) is pronounced as “4 choose 2” and it stands for the number of combinations of two elements that you can have when you have four elements. For instance, if you have 4 letters A, B, C and D, then there are 6 possible pairs: AB, AC, AD, BC, BD, and CD. The general case of \(a \choose b\) can be computed by R using choose (a, b). \(2 \choose 0\) is defined as 1 (there is only one way in which neither of the 2 tests results in a Type I error.)]

Therefore, since probabilities sum to 1, we know that the probability of at least one Type I error equals \(1- 0.9025= 0.0975\).

We see that when we look at two \(t\)-tests in one analysis, the probability of a Type I error is no longer 5%, but almost twice that: 9.75%. This is under the assumption that the \(t\)-tests are independent of each other, which is often not the case. We will discuss what we mean with independent later. For now it suffices to know that the more null-hypotheses you test, the higher the risk of a Type I error.

For instance, if you have a categorical variable \(X\) with not two, but ten different groups, your regression output table will contain ten null-hypothesis tests: one for the intercept (reference category) and nine tests for the difference between the remaining groups and the reference group. In that case, the probability of at least one Type I error, if you perform each test with an \(\alpha\) of 5%, will be

\[1 - P(errors = 0| \alpha = 0.05) = 1 - { 10 \choose 0} \times 0.05^0 \times 0.95^{10} = 1 - 0.95^{10} = 0.4013\] And all this is only in the situation that you stick to the default output of a regression. Imagine that you not only test the difference between each group and the reference group, but that you also make many other contrasts: difference between group 2 and group 9, etcetera. If we would look at each possible pair among these ten groups, there would be \({10 \choose 2} = 144\) contrasts and consequently 144 \(t\)-tests. The probability of a Type I error will then be very close to 1, almost certainty!

The problem is further complicated by the fact that tests for all these possible pairs cannot be independent of each other. This is easy to see: If Jim is 5 centimetres taller than Sophie, and Sophie is 4 centimetres taller than Wendy, we already know the difference in height between Jim and Sophie: \(5 + 4 = 9\) centimetres. In other words, if you want to know the contrast between Jim and Wendy, you don’t need a new analysis: the answer is already there in the other two contrasts. Such a dependence in the contrasts that you estimate can lead to an even higher probability of a Type I error.

In order to get some grip on this dependence problem, we can use the so-called Bonferroni inequality. In the context of null-hypothesis testing, this states that the probability of at least one Type I error is less than or equal to \(\alpha\) times the number of contrasts J. This is called the upper bound.

\[P (errors > 0) = 1 - P (errors = 0) \leq J\alpha\]

This inequality is true whether two contrast are heavily dependent (as in the height example above) or only slightly, or not at all. For instance, if you have two contrasts in your output (the intercept and the slope), the probability of at least one Type I error equals 0.0975, but only if we assume these two contrasts are independent. In contrasts, if the two contrasts are dependent, we can use the Bonferroni inequality to know that the probability of at least one Type I error is less than or equal to \(0.05 \times 2 = 0.10\). Thus, if there is dependency we know that the probability of at least one Type I error is at the most 0.10 (it could be less bad).

Note that if \(J\alpha > 1\), then the upper bound is set equal to 1.

This Bonferroni upperbound can help us to take control over the overall probability of making Type I errors. Here we make a distinction between the test-wise Type I error rate, \(\alpha_{TW}\), and the family-wise Type I error rate, \(\alpha_{FW}\). Here, \(\alpha_{TW}\) is the probability of a Type I error used for one individual hypothesis test, and \(\alpha_{FW}\) is the probability of at least one Type I error among all tests performed. If we have a series of null-hypothesis tests, and if we want to have an overall probability of at most 5% (i.e., \(\alpha_{FW}=0.05\)), then we should set the level for any individual test \(\alpha_{TW}\) at \(\alpha_{FW}/J\). Then we know that the probability of at least one error is 5% or less.

Note that what is true here for null-hypothesis testing is also true for the calculation of confidence intervals. Also note that we should only look at output for which we have research questions. Below we see an example of how to apply these principles.


We use the ChickWeight data available in R. It is a data set on the weight of chicks over time, where the chicks are categorised into four different groups, each on a different feeding regime. Suppose we do a study on diet in chicks with one control group and three experimental groups. For each of these three experimental groups, we want to estimate the difference with the control condition (hence there are three research questions). We perform a regression analysis with dummy coding with the control condition (Diet 3) as the reference group to obtain these three contrasts. For the calculation of the confidence intervals, we want to have a family-wise Type I error rate of 0.05. That means that we need to have a test-wise Type I error rate of 0.05/3 = 0.0167. We therefore need to compute \(100-1.67 = 98.33\)% confidence intervals and we do null-hypothesis testing where we reject the null-hypothesis if \(p < 0.0167\). The R syntax would look something like the following:

ChickWeight <- ChickWeight %>% 
  mutate(Diet = relevel(Diet, ref = "3"))
model <- ChickWeight %>% 
  lm(weight ~ Diet, data =.) 
model %>% 
  tidy( = TRUE, conf.level = 0.9833)
## # A tibble: 4 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   143.        6.33    22.6   2.59e-81    128.     158.  
## 2 Diet1         -40.3       7.87    -5.12  4.11e- 7    -59.2    -21.4 
## 3 Diet2         -20.3       8.95    -2.27  2.35e- 2    -41.8      1.15
## 4 Diet4          -7.69      8.99    -0.855 3.93e- 1    -29.3     13.9
## [1] 574

In the output we see the three contrasts that we need to answer our research questions. We can then report:

"We estimated the difference between each of the three experimental diet with the control diet. In order to control the family-wise Type I error rate and keep it below 5%, we used Bonferroni correction and chose a test-wise significance level of 0.0167 and computed 98.3% confidence intervals. The chicks on Diet 1 had a significantly lower weight than the chicks in the control conditions (Diet 3, \(b = -40.3\), \(SE = 7.87\), \(t(574) = -5.12\), \(p < .001\), 98.33% CI: -59.2, -21.4). The chicks on Diet 2 also had a lower weight than chicks on Diet 3, although the null-hypothesis could not be rejected (\(b = -20.3\), \(SE = 8.95\), \(t(574) = -2.27\), \(p = .0235\), 98.33% CI: -41.8, 1.15). The chicks on Diet 4 also had a weight not significantly different from chicks on Diet 3, (\(b = -7.69\), \(SE = 8.99\), \(t(574) = -0.855\), \(p = .0393\), 98.33% CI: -29.3, 13.9). "

Note that we do not report on the Intercept. Since we had no research question about the average weight of chicks on Diet 3, we ignore those results in the regression table, and divide the desired family-wise error rate by 3 (and not 4).

As we said, there are two kinds of research questions: a priori questions and post hoc questions. A priori questions are questions posed before the data collection. Often they are the whole reason why data were collected in the first place. Post hoc questions are questions posed during data analysis. When analysing data, some findings may strike you and they inspire you to do some more analyses. In order to explain the difference, let’s think of two different scenarios for analysing the ChickWeight data.

In Scenario 1, researchers are interested in the relationship between the diet and the weight of chicks. They see that in different farms, chicks show different mean sizes, and they are also on different diets. The researchers suspect that the weight differences are induced by the different diets, but they are not sure, because there are also many other differences between the farms (differences in climate, chicken breed and daily regime). In order to control for these other differences, the researchers pick one specific farm, they use one particular breed of chicken, and assign the chicks randomly to one of four diets. They reason that if they find differences between the four groups regarding weight, then diet is the factor responsible for those differences. Thus, their research question is: “Are there any differences in mean weight as a function of diet?” They run an ANOVA and find the following results:

model %>% anova() %>% tidy()
## # A tibble: 2 × 6
##   term         df    sumsq meansq statistic      p.value
##   <chr>     <int>    <dbl>  <dbl>     <dbl>        <dbl>
## 1 Diet          3  155863. 51954.      10.8  0.000000643
## 2 Residuals   574 2758693.  4806.      NA   NA

They answer their (a priori) research question in the following way.

"We tested the null-hypothesis of equal mean weights in the four diet groups at an alpha of 5%, and found that it could be rejected, \(F(3, 574), MSE = 4806, p < .001\). We conclude that diet does have an influence on the mean weight in chicks."

When analysing the data more closely, they also look at the mean weights per group.

ChickWeight %>% 
  group_by(Diet) %>% 
  summarise(mean = mean(weight))
## # A tibble: 4 × 2
##   Diet   mean
##   <fct> <dbl>
## 1 3      143.
## 2 1      103.
## 3 2      123.
## 4 4      135.

They are struck by the relatively small difference in mean weight between Diet 4 and Diet 3. They are surprised because they know that one of them contains a lot more protein than the other. They are therefore curious to see whether the difference between Diet 4 and Diet 3 is actually significant. Moreover, they are very keen on finding out which Diets are different from each other, and which Diets are not different from each other. They decide to perform 6 additional \(t\)-tests: one for every possible pair of diets.

In this Scenario I, we see two kinds of research questions: the initial a priori question was whether Diet affects weight, and they answer this question with one test. The second question only arose during the analysis of the data. They look at the means, see some kind of pattern and want to know whether all means are different from each other. This follow-up question that arises during data analysis is a post hoc question.

Let’s look at Scenario II. A group of researchers wonder whether they can get chicks to grow more quickly using alternative diets. There is one diet, Diet 3, that is used in most farms across the world. They browse through the scientific literature and find three alternative diets. These alternative diets each have a special ingredient that makes the researchers suspect that they might lead to weight gain in chicks. The objective of their research is to estimate the differences between these three alternative diets and the standard diet, Diet 3. They use one farm and one breed of chicken and assign chicks randomly to one of the four diets. They perform the regression analysis with the dummy coding and reference group Diet 3 as shown above, and find that the differences between the three experimental diets and the standard diet are all negative: they show slower growth rather than faster growth. They report the estimates, the standard errors and the confidence intervals.

In general we can say that a priori questions can be answered with regular alphas and confidence intervals. For instance, if you state that your Type I error rate \(\alpha\) is set at 0.05, then you can use this \(\alpha = 0.05\) also for all the individual tests that you perform confidence intervals that you calculate. However, for post hoc questions, where your questions are guided by the results that you see, you should correct the test-wise error rate \(\alpha_{TW}\) in such a way that you control the family-wise error rate \(\alpha_{FW}\).

Returning to the two scenarios, let’s look at the question whether Diet 4 differs from Diet 3. In Scenario I, this is a post hoc question, where in total you have 6 post hoc questions. You should therefore do the hypothesis test with an alpha of 0.05/6 = 0.0083, and/or compute a 99.17% confidence interval. In contrast, in Scenario II, the same question about Diets 4 and 3 is an a priori question, and can therefore be answered with an \(\alpha=5\)% and/or a 95% confidence interval.

Summarising, for post hoc questions you adjust your test-wise type I error rate, whereas you do not for a priori questions. The reason for this different treatment has to do with the dependency in contrasts that we talked about earlier. It also has to do with the fact that you only have a limited number of model degrees of freedom. In the example of the ChickWeight data, we have four groups, hence we can estimate only four contrasts. In the regression analysis with dummy coding, we see one contrast for the intercept and then three contrasts between the experimental groups and the reference group. Also if we use Helmert contrasts, we will only obtain four estimates in the output. This has to do with the dependency between contrasts: if you know that group A has a mean of 5, group B differs from group A by +2 and group C differs from group A by +3, you don’t need to estimate the difference between B and C any more, because you know that based on these numbers, the difference can only be +1. In other words, the contrast C to B totally depends on the contrast A versus B and A versus C. The next section discusses the dependency problem in more detail.

11.2 Independent (orthogonal) contrasts

Whether two contrasts are dependent is easily determined. Suppose we have \(J\) independent samples (groups), each containing values from a population of normally distributed values (assumption of normality). Each group is assumed to come from a population with the same variance \(\sigma^2_e\) (assumption of equal variance). For the moment also assume that the \(J\) groups have an equal sample size \(n\). Any group \(j\) will have a mean \(\bar{Y}_j\). Now imagine two contrasts among these means. The first contrast, \(L1\), has the weights \(c_{1j}\), and the second contrasts, \(L2\), has the weights \(c_{2j}\). Then we know that contrasts \(L1\) and \(L2\) are independent if

\[\sum_{j=1}^J c_{1j}c_{2j}=0\]

Thus, if you have \(J\) independent samples (groups), each of size \(n\), one can decide if two contrasts are dependent by checking if the products of the weights sum to zero:

\[c_{11}c_{21} + c_{12}c_{22} + \dots + c_{1J}c_{2J} = 0\]

Another word for independent is orthogonal. Two contrasts are said to be orthogonal if the two contrast are independent. Let’s look at some examples for a situation of four groups: one set of dependent contrasts and a set of orthogonal contrasts. For the first example, we look at default dummy coding. For contrast \(L1\), we estimate the mean of group 1. Hence

\[L1 = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}\]

Let contrast \(L2\) be the contrast between group 2 and group 1:

\[L2 = \begin{bmatrix} -1 & 1 & 0 & 0 \end{bmatrix}\]

If we calculate the products of the weights, we get:

\[\sum_{j} c_{1j}c_{2j} = 1\times -1 + 0 \times 1 + 0 \times 0 + 0\times 0 = -1\] So we see that when we use dummy coding, the contrasts are not independent (not orthogonal).

For the second example, we look at Helmert contrasts. Helmert contrasts are independent (orthogonal). The Helmert contrast matrix for four groups looks like

\[L = \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\ -1 & 1 & 0 & 0 \\ -\frac{1}{2} & -\frac{1}{2} & 1 & 0 \\ -\frac{1}{3} & -\frac{1}{3} & -\frac{1}{3} & 1 \end{bmatrix}\]

For the first two contrasts, we see that the product of the weights equals zero:

\[\sum_{j} c_{1j}c_{2j} = \frac{1}{4} \times -1 + \frac{1}{4} \times 1 + \frac{1}{4} \times 0 + \frac{1}{4} \times 0 = 0\] Check for yourself and find that all four Helmert contrasts are independent of each other.

11.3 The number of independent contrasts is limited

Earlier we saw that there is only so much information you can gain from a data set. Once you have certain information, asking further questions leads to answers that depend on the answers already available.

This dependency has a bearing on the number of orthogonal comparisons that can be made with \(J\) group means. Given \(J\) independent sample means, there can be, apart from the grand mean, no more than \(J-1\) comparisons, without them being dependent on each other. This means that if you have \(J\) completely independent contrasts for \(J\) group means, it is impossible to find one more comparison which is also orthogonal to the first \(J\) ones.

This implies that if you ask more questions (i.e., ask for more contrasts) you should tread carefully. If you ask more questions, the answers to your questions will not be independent of each other (you are to some extent asking the same thing twice).

As an example, earlier we saw that if you know that group B differs from group A by +2 and group C differs from group A by -3, you don’t need to estimate the difference between B and C any more, because you know that based on these numbers, the difference can only be 5. In other words, the contrast C to B totally depends on the contrasts A versus B and A versus C. You can also see this in the contrast matrix for groups A, B and C below:

\[L = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & -1 & 1 \end{bmatrix} \]

The last contrast is dependent both on the third and the second contrast. Contrast \(L4\) can be calculated as \(L3 - L2\) by doing the calculation element-wise:

\[ \begin{bmatrix} -1 & 0 & 1 \end{bmatrix} - \begin{bmatrix} -1 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & -1 & 1 \end{bmatrix} \]

In other words, \(L4\) is a linear combination (weighted sum) of \(L2\) and \(L3\): \(L4 = 1\times L3 - 1 \times L2\). Statistically therefore, contrast \(L4\) is completely redundant given the contrasts \(L2\) and \(L3\): it doesn’t provide any extra information.

It should however be clear that if you have a research question that can be answered with contrast \(L4\), it is perfectly valid to make this contrast. However, you should realise that the number of independent research questions is limited. It is a wise idea to limit the number of research questions to the number of contrasts you can make: apart from the grand mean, you should make no more than \(J-1\) comparisons (your regression table should have no more than \(J\) parameters).

These contrasts that you specify belong to the a priori research question. Good research has a limited number of precisely worded research questions that should be answerable by a limited number of contrasts, usually 2 or 3, sometimes only 1. These can be answered by using the regular significance level. In social and behavioural sciences, oftentimes \(\alpha\) for each individual test or confidence interval equals 5%. However, the follow-up questions that arise only after the initial data analysis (computing means, plotting the data, etc.), should however be corrected to control the overall Type I error rate.

11.4 Fishing expeditions

Research and data analysis can sometimes be viewed as a fishing expedition. Imagine you fish the high seas for herring. Given your experience and what colleagues tell you (you browse the scientific literature, so to speak), you choose a specific location where you expect a lot of herring. By choosing this location, you maximise the probability of finding herring. This is analogous to the setting up of a data collection scheme where you maximise the probability of finding a statistically significant effect, or you maximise the precision of your estimates; in other words, you maximise statistical power (see Chapter 5). However, while fishing in that particular spot for herring, irrespective of whether you actually find herring, you find a lot of other fish and seafood. This is all coincidence, as you never planned to find these kinds of fish and seafood in your nets. The fact that you find a crab in your nets, might seem very interesting, but it should never be reported as if you were looking for that crab. You would have equally regarded it interesting if you had found a lobster, or a seahorse, or a baseball hat. You have to realise that it is pure random sampling error: you hang out your nets, and just pick up what’s there by chance. In research it works the same way: if you do a lot of statistical tests, or compute a large number of confidence intervals, you’re bound to find something that seems interesting, but is actually merely random noise due to sampling error. If the family-wise error rate is large, say 60%, then you cannot tell your family and friends ashore that the base-ball hat you found is very fascinating. Similarly, in research you have to control the number of Type I errors by adjusting the test-wise error rate in such a way that the family-wise error rate is low.

11.5 Several ways to define your post hoc questions

One question that often arises when we find that a categorical variable has an effect in an ANOVA, is to ask where this overall significant effect is coming from. For instance, we find that the four diets result in different mean weights in the chicks. This was demonstrated with an \(F\)-test at an \(\alpha\) of 5%. A follow-up question might then be, what diets are different from each other. You might then set up contrasts for all \({4 \choose 2} = 6\) possible pairs of the four diets.

Alternatively, you may specify your post hoc questions as simple or more complex contrasts in the same way as for your a priori questions, but now with no limit on how many. For instance, you may ask what alternative diets are significantly different from the standard diet (Diet 3). The number of comparisons is then limited to 3. Additionally, you might ask whether the alternative diets combined (grand mean of diets 1, 2 and 4) are significantly different from Diet 3.

Be aware, however, that the more comparisons you make, the more severe the correction must be to control the family-wise Type I error rate.

The analyses for the questions that you answer by setting up the entire data collection, and that are thus planned before the data collection (a priori), can be called confirmatory analyses. You would like to confirm the workings of an intervention, or you want to precisely estimate the size of a certain effect. Preferably, the questions that you have are statistically independent of each other, that is, the contrasts that you compute should preferably be orthogonal (independent).

In contrast, the analyses that you do for questions that arise while analysing the data (post hoc) are called exploratory analyses. You explore the data for any interesting patterns. Usually, while exploring data, a couple of questions are not statistically independent. Any interesting findings in these exploratory analyses could then be followed up by confirmatory analyses using a new data collection scheme, purposely set up to confirm the earlier findings. It is important to do that with a new or different sample, since the finding could have resulted from mere sampling error (i.e., a Type I error).

Also be aware of the immoral practice of \(p\)-hacking. \(P\)-hacking, sometimes referred to as selective reporting, is defining your research questions and setting up your analysis (contrasts) in such a way that you have as many significant results as possible. With p-hacking one presents their research in such a way that they find all these interesting results, ignoring the fact that they made a selection of the results based on what they saw in the data (post-hoc). For instance, their research was set up to find evidence for the workings of medicine A on the alleviation of migraine. Their study included a questionnaire on all sorts of other complaints and daily activities, for the sake of completeness. When analysing the results, they might find that the contrast between medicine A with placebo is not significant for migraine. But when exploring the data further, they find that medicine A was significantly better with regards to bloodpressure and the number of walks in the park. A \(p\)-hacker would write up the research as a study of the workings of medicine A on bloodpressure and walks in the park. This form of \(p\)-hacking is called cherry-picking: only reporting statistically significant findings and pretending you never set out to find the other things and not reporting them. Another \(p\)-hacking example would be to make a clever selection of the migraine data after which the effect becomes significant, for instance by filtering out the males in the study. Thus, \(p\)-hacking is the practice of trying to select the data or choose the method of analysis in such a way that the \(p\)-values in the report are as small as possible. The research questions are then changed from exploratory to confirmatory, without informing the reader.

11.6 Controlling the family-wise Type I error rate

There are several strategies that control the number of Type I errors. One is the Bonferroni method, where we adjust the test-wise error rate by dividing the family-wise error rate by the number of comparisons, \(\alpha_{TW} = \alpha_{FW} / J\). This method is pretty conservative, in that the \(\alpha_{TW}\) becomes low with already a couple of comparisons, so that the statistical power to spot differences that also exist in the population becomes very low. The upside is that this method is easy to understand and perform. Alternative ways of addressing the problem are Scheffé’s procedure, and the Tukey HSD method. Of these two, Scheffé’s procedure is also relatively conservative (i.e., little statistical power). The logic of the Tukey HSD method is based on the probability that a difference between two group means is more than a critical value, by chance alone. This critical value is called Honestly Significant Difference (HSD). We fix the probability of finding such a difference (or more) between the group means under the null-hypothesis at \(\alpha_{FW}\). The details of the actual method will not be discussed here. Interested readers may refer to Wikipedia and references therein.

11.7 Post-hoc analysis in R

In general, post hoc contrasts can be done in the same way as in the previous chapter: specifying the contrasts in an \(\mathbf{L}\) matrix, taking the inverse and assigning the matrix to the variable in your model. Here, you are therefore limited to the number of levels of a factor: you can only have \(J-1\) new variables, apart from the intercept of 1s. You can then adjust \(\alpha\) yourself using Bonferroni. For instance if you want to have a family-wise type I error rate of 0.05, and you look at two post-hoc contrasts, you can declare a contrast significant if the corresponding \(p\)-value is less than 0.025.

There are also other options in R to get post hoc contrasts, where you can ask for as many comparisons as you want.

There are two ways in which you can control the overall Type I error rate: either by using an adjusted \(\alpha\) yourself (as above), or adjusting the \(p\)-value. For now we assume you generally want to test at an \(\alpha\) of 5%. But of course this can be any value.

In the first approach, you run the model and the contrast just as you would normally do. If the output contains answers to post hoc questions, you do not use \(\alpha = 0.05\)%, but you use 0.05 divided by the number of tests that you inspect: \(\alpha_{TW}= \alpha_{FW}/k\), with \(k\) being the number of tests you do.

For instance, if the output for a linear model with a factor with four levels contains the comparison on groups 1 and 2, and it applies to an a priori question, you simply report the statistics and concludes significance if the \(p < .05\).

If the contrast pertains to a post hoc question and you compare all six possible pairs, you report the usual statistics and conclude significance if the \(p < \frac{0.05}{6}\).

In the second approach, you can change the \(p\)-value itself: you multiply the plotted value by the number of comparisons and declare a difference to be significant if the corresponding adjusted \(p\)-value is less than 0.05. As an example, suppose you make six comparisons. Then you multiply the usual \(p\)-values by a factor 6: \(p_{adj}= 6p\). Thus, if you see a \(p\)-value of 0.04, you compute \(p_{adj}\) to be 0.24 and conclude that the contrast is not significant. This is often done in R: the output yields \(adjusted\) \(p\)-values. Care should be taken with the confidence intervals: make sure that you know whether these are adjusted 95% confidence intervals or not. If not, then you should compute your own. Note that when you use the adjusted \(p\)-values, you should no longer adjust the \(\alpha\). Thus, an adjusted \(p\)-value of 0.24 is not significant, because \(p_{adj}>.05\).

In this section we will see how to perform post hoc comparisons in two situations: either with only one factor in your model, or when you have two factors in your model.

11.7.1 ANOVA with only one factor

We give an example of an ANOVA post hoc analysis with only one factor, using the data from the four diets. We first run an ANOVA to answer the primary research question whether diet has any effect on weight gain in chicks.

ChickWeight %>% 
  lm(weight ~ Diet, data = .) %>% 
  anova() %>% 
## # A tibble: 2 × 6
##   term         df    sumsq meansq statistic      p.value
##   <chr>     <int>    <dbl>  <dbl>     <dbl>        <dbl>
## 1 Diet          3  155863. 51954.      10.8  0.000000643
## 2 Residuals   574 2758693.  4806.      NA   NA

Seeing these results, noting that there is indeed a significant effect of diet, a secondary question pops up: “Which pairs of two diets show significant differences?” We answer that by doing a post hoc analysis, where we study each pair of diets, and control Type I error rate using the Bonferroni method. We can do that in the following way:

# Check your sample means
ChickWeight %>%
  group_by(Diet) %>% 
  summarise(mean = mean(weight))
## # A tibble: 4 × 2
##   Diet   mean
##   <fct> <dbl>
## 1 3      143.
## 2 1      103.
## 3 2      123.
## 4 4      135.
pairwise.t.test(ChickWeight$weight, # dependent variable
                ChickWeight$Diet,  # independent variable
                p.adjust.method = 'bonferroni') # adjustment method
##  Pairwise comparisons using t tests with pooled SD 
## data:  ChickWeight$weight and ChickWeight$Diet 
##   3         1       2      
## 1 0.0000025 -       -      
## 2 0.14077   0.06838 -      
## 4 1.00000   0.00026 0.95977
## P value adjustment method: bonferroni

In the output we see six Bonferroni adjusted \(p\)-values for all six possible pairs. The column and row numbers refer to the levels of the Diet factor: Diet 3, Diet 1 and Diet 2 in the three columns, and Diet 1, Diet 2 and Diet 4 in the three rows. We see that all \(p\)-values are non significant (\(p > 0.05\)), except for two comparisons: the difference between Diet 3 and Diet 1 is significant, \(p < 0.05\). as well as the difference between Diet 4 and Diet 1, \(p < 0.05\).

"An analysis of variance showed that the mean weight was signficantly different for the four diets, \(F(3, 574) = 10.8, p < .001\). We performed post hoc pair-wise comparisons, for all six possible pairs of diets. A family-wise Type I error rate of 0.05 was used, with Bonferroni correction. The difference between Diet 1 and Diet 3 was significant, and the difference between Diets 4 and 1 was significant. All other differences were not signficantly different from 0. "

11.7.2 ANOVA with two factors and moderation

In the previous subsection we did pair-wise comparisons in a one-way ANOVA (i.e., ANOVA with only one factor). In the previous chapter we also discussed how to set up contrasts in the situation of two factors that are modelled with interaction effects (Ch. 10). Let’s return to that example.

data("jobsatisfaction", package = "datarium")

# Run linear model with interaction effect and compute simple slopes
sslopes <- jobsatisfaction %>% 
  lm(score ~ gender + education_level + gender:education_level, data = .) %>% 
  simple_slopes(confint = TRUE, ci.width = 0.95)
##   gender     education_level Test Estimate Std. Error    2.5%   97.5% t value
## 1   male    sstest (college)        0.7967     0.2593  0.2764  1.3170  3.0726
## 2   male sstest (university)        3.8653     0.2527  3.3582  4.3724 15.2950
## 3 female    sstest (college)        0.7220     0.2460  0.2284  1.2156  2.9352
## 4 female sstest (university)        2.6650     0.2460  2.1714  3.1586 10.8343
## 5 sstest              school        0.3143     0.2527 -0.1928  0.8214  1.2438
## 6 sstest             college        0.2397     0.2527 -0.2674  0.7468  0.9484
## 7 sstest          university       -0.8860     0.2460 -1.3796 -0.3924 -3.6020
##   df              Pr(>|t|) Sig.
## 1 52             0.0033741   **
## 2 52 < 0.00000000000000022  ***
## 3 52             0.0049525   **
## 4 52  0.000000000000006074  ***
## 5 52             0.2191473     
## 6 52             0.3473349     
## 7 52             0.0007058  ***

In the example, we were only interested in the gender effect for each of the education levels. That means only the last three lines are relevant.

In the simple_slopes() code we used the argument ci.width = 0.95. That was because we hadn’t discussed post hoc analysis yet, nor adjustment of \(p\) or \(\alpha\). In the case that we want to control the Type I error rate, we could use Bonferroni correction. We should then make the relevant \(p\)-values three times bigger than what they are uncorrected, because we are interested in three contrasts.

# Obtain adjusted p-values
p_uncorrected <- sslopes[5:7, "Pr(>|t|)"] # get rows 5:7 and column with p-values
p_uncorrected # check you got the right ones
## [1] 0.2191473106 0.3473348890 0.0007057855
p_corrected <- p_uncorrected*3 # Bonferroni correction, multiplying by number of contrasts
p_corrected  # for school, college and university, respectively
## [1] 0.657441932 1.042004667 0.002117357

Confidence intervals should also be changed. For that we need to adjust the Type I error rate \(\alpha\).

# Correct Type I error rate
alpha <- 0.05 / 3 # adjust alpha
CI <- 1 - alpha # adjust confidence interval
## [1] 0.9833333
sslopes <- jobsatisfaction %>% 
  lm(score ~ gender + education_level + gender:education_level, data = .) %>% 
  simple_slopes(confint = TRUE, ci.width = CI) # adjusted conf. interval
##   gender     education_level Test Estimate Std. Error 0.833333333333336%
## 1   male    sstest (college)        0.7967     0.2593             0.1552
## 2   male sstest (university)        3.8653     0.2527             3.2401
## 3 female    sstest (college)        0.7220     0.2460             0.1135
## 4 female sstest (university)        2.6650     0.2460             2.0565
## 5 sstest              school        0.3143     0.2527            -0.3109
## 6 sstest             college        0.2397     0.2527            -0.3855
## 7 sstest          university       -0.8860     0.2460            -1.4945
##   99.1666666666667% t value df              Pr(>|t|) Sig.
## 1            1.4381  3.0726 52             0.0033741   **
## 2            4.4905 15.2950 52 < 0.00000000000000022  ***
## 3            1.3305  2.9352 52             0.0049525   **
## 4            3.2735 10.8343 52  0.000000000000006074  ***
## 5            0.9395  1.2438 52             0.2191473     
## 6            0.8649  0.9484 52             0.3473349     
## 7           -0.2775 -3.6020 52             0.0007058  ***

In the output we see adjusted confidence intervals (note that the \(p\)-values are the original ones). We conclude for our three contrasts that the females score 0.31 (adjusted 95% CI: -0.3109, 0.9395) higher than boys in the “school” group, females score 0.24 (adjusted 95% CI: -0.3855, 0.8649) higher than boys in the “college” group, and 0.89 (adjusted 95% CI: -1.4945, -0.2775) below the boys in the “university” group.

The same logic of adjusting \(p\)-values and adjusting confidence intervals can be applied in situations with numeric independent variables.

11.8 Take-away points

  • Your main research questions are generally very limited in number. If they can be translated into contrasts, we call these a priori contrasts.

  • Your a priori contrasts can be answered using a pre-set level of significance, in the social and behavioural sciences this is often 5% for \(p\)-values and using 95% for confidence intervals. No adjustment necessary.

  • This pre-set level of significance, \(\alpha\), should be set before looking at the data (if possible before the collection of the data).

  • If you are looking at the data and want to answer specific research questions that arise because of what you see in the data (post hoc), you should use adjusted \(p\)-values and confidence intervals.

  • There are several ways of adjusting the test-wise \(\alpha\)s to obtain a reasonable family-wise \(\alpha\): Bonferroni is the simplest method but rather conservative (low statistical power). Many alternative methods exist, among them are Scheffé’s procedure, and Tukey HSD method.

Key concepts

  • a priori
  • post hoc
  • orthogonality/independence
  • \(p\)-hacking
  • family-wise Type I error rate
  • test-wise Type I error rate
  • Bernoulli correction