
1. The structure of this page

This page guides the reader through the logic of performing ANOVAs in R as accurately as possible. It first covers the functions available in R for the job, then compares the output of these functions, and finally recommends the most appropriate approaches for different types of data. In the examples, the chunks of code that generate the random sample data are commented out, so that the content of the page doesn’t change every time it is rebuilt.

 

2. The different ANOVAs in R

There are several functions in R that you can choose from to run an ANOVA. The most common stand-alone ones are:

      aov() (nested within summary())

      anova_test()

However, you can also fit a linear model to your data with lm(), and then run an ANOVA on the model using:

      anova(lm())

      Anova(lm())

(To make things more confusing, you can also run an aov() and then nest it within anova() or Anova(). We’ll talk about that later.)

Some of the functions above are appropriate only for certain types of ANOVA. Let’s use an example to find out the differences.

Let’s say we did a study where we had three groups of participants: healthy controls (HC), people with anxiety (A) and people with depression (D). The numbers of participants per group were: n(HC)=40, n(A)=32, n(D)=28. We measured the participants’ mood with the PHQ-9 at baseline (the T1 column below), asked them to eat a slice of cake every day for a month, and then measured their PHQ-9 scores again (the T2 column). The data look like this:

# df = data.frame(paste("SUBJECT",seq(from=1, to=100),sep=""),
#                 c(rep("HC",40),rep("A",32),rep("D",28)),
#                 c(round(runif(n=40, min=1, max=15)),round(runif(n=60,min=5,max=25))),
#                 c(round(runif(n=40, min=1, max=15)),round(runif(n=60,min=5,max=25))),
#                 round(runif(n=100, min=18, max=60)),
#                 sample(c(rep("Female",50),rep("Male",50))))
# colnames(df)=c("ID","Group","T1","T2","Age","Sex")
# write.csv(df, "C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data.csv", row.names = FALSE)
df = read.csv("C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data.csv")
head(df)
##         ID Group T1 T2 Age    Sex
## 1 SUBJECT1    HC  9  4  46 Female
## 2 SUBJECT2    HC  7 10  34 Female
## 3 SUBJECT3    HC  5  1  34   Male
## 4 SUBJECT4    HC  4 12  58 Female
## 5 SUBJECT5    HC 10  5  27 Female
## 6 SUBJECT6    HC 14  2  23   Male

Let’s say that in the first step, we want to check if there were group differences at baseline in participants’ PHQ-9 scores. The basic syntax for this example, no matter what function you pick for your ANOVA, will be:

function(phq9 ~ group, data=df)

Let’s see if all the functions above give us the same output if we run that syntax through them:

First, we’ll code Group and Sex as factors, so that R won’t mistake them for numeric variables:

Note: get into the habit of setting any nominal variables as factors in R. If any of these variables are ever set as numbers (e.g., sex is set to 1 and 2 rather than female and male), R will assume they are numeric and your analyses will be off.
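A minimal illustration of the problem, with a made-up 1/2 coding for sex:

sex_coded = c(1, 2, 2, 1, 1)           # sex stored as numbers
mean(sex_coded)                        # R happily treats this as a continuous variable
sex_factor = as.factor(sex_coded)      # now R knows these are categories, not quantities
levels(sex_factor)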

df$Group = as.factor(df$Group)
df$Sex = as.factor(df$Sex)
summary(aov(T1 ~ Group, data=df))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2  910.2   455.1   21.76 1.56e-08 ***
## Residuals   97 2028.7    20.9                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix::anova_test(T1 ~ Group, data=df)
## ANOVA Table (type II tests)
## 
##   Effect DFn DFd      F        p p<.05  ges
## 1  Group   2  97 21.759 1.56e-08     * 0.31
anova(lm(T1 ~ Group, data=df))
## Analysis of Variance Table
## 
## Response: T1
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Group      2  910.15  455.08  21.759 1.561e-08 ***
## Residuals 97 2028.69   20.91                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(lm(T1 ~ Group, data=df))
## Anova Table (Type II tests)
## 
## Response: T1
##            Sum Sq Df F value    Pr(>F)    
## Group      910.15  2  21.759 1.561e-08 ***
## Residuals 2028.69 97                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov()), anova_test(), anova(lm()) and Anova(lm()) all gave us the same p-value. So it’s all fine then, right?

Sadly, no. Let’s see what happens if we keep T1 as the outcome but add Age as a covariate, using the same syntax for all the functions we considered above:

summary(aov(T1 ~ Group+Age, data=df))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2  910.2   455.1  21.905 1.46e-08 ***
## Age          1   34.3    34.3   1.652    0.202    
## Residuals   96 1994.4    20.8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix::anova_test(T1 ~ Group+Age, data=df)
## ANOVA Table (type II tests)
## 
##   Effect DFn DFd      F        p p<.05   ges
## 1  Group   2  96 21.640 1.75e-08     * 0.311
## 2    Age   1  96  1.652 2.02e-01       0.017
anova(lm(T1 ~ Group+Age, data=df))
## Analysis of Variance Table
## 
## Response: T1
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Group      2  910.15  455.08 21.9052 1.456e-08 ***
## Age        1   34.31   34.31  1.6516    0.2018    
## Residuals 96 1994.38   20.77                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(lm(T1 ~ Group+Age, data=df))
## Anova Table (Type II tests)
## 
## Response: T1
##            Sum Sq Df F value    Pr(>F)    
## Group      899.15  2 21.6405 1.747e-08 ***
## Age         34.31  1  1.6516    0.2018    
## Residuals 1994.38 96                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can see that this time, aov() and anova(lm()) gave us the same results, and anova_test() and Anova(lm()) gave us the same results, but the two pairs of functions did not agree with each other. What’s going on? The explanation lies in the default sum-of-squares calculation methods that these functions assume.

 

3. The different types of ANOVAs

At the core of it, an ANOVA compares the between-group variance with the within-group variance. The F value is the ratio of the between-group and within-group mean squares (MS), and each mean square is calculated by dividing a sum of squares by its degrees of freedom (between and within groups, respectively):

\[\LARGE F = \frac{MS_{bg}}{MS_{wg}} = \frac{\frac{SS_{bg}}{df_{bg}}}{\frac{SS_{wg}}{df_{wg}}}\] where F - F statistic, MS - mean square, SS - sum of squares, df - degrees of freedom, bg - between groups, wg - within groups

equation source
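As a quick sanity check, we can plug the sums of squares and degrees of freedom from the anova(lm(T1 ~ Group)) table above back into this formula:

SS_bg = 910.15;  df_bg = 2     # between groups (Group)
SS_wg = 2028.69; df_wg = 97    # within groups (Residuals)
(SS_bg / df_bg) / (SS_wg / df_wg)  # ~21.76, the F value reported by all four functions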

The sum of squares can be calculated in different ways (ref1, ref2, ref3, ref4):

      Type I (sequential) - terms are evaluated one after another, in the order they appear in the formula, so each term is only credited with the variance left over by the terms before it.

      Type II (hierarchical) - each main effect is evaluated after all other main effects, but before any interaction terms.

      Type III (simultaneous/marginal) - each term, including interactions, is evaluated after all other terms in the model.

Note: with each type of SS calculation, you can say you have estimated the contribution of your predictor of interest (e.g., group) while controlling for another predictor (of no interest, e.g., age). But you’ll have to be careful about the syntax you use (in type I) or the interpretation you draw (in types II/III). Only in type III can you take any estimate from the output and say it is the result while controlling for all other predictors.

 

Now, let’s see which SS calculation method each of the functions above assumes by default, and whether we can change that default if we need to:

      summary(aov()) and anova(lm()) - type I, with no argument to change it.

      anova_test() - type II by default (as its output header says), changeable through the type argument (type = 1, 2 or 3).

      Anova(lm()) - type II by default, changeable through the type argument (type 2 or 3 only - it does not offer type I).
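For example, to request type III SS explicitly (a minimal sketch, reusing the ANCOVA from above):

rstatix::anova_test(T1 ~ Group+Age, data=df, type = 3)
car::Anova(lm(T1 ~ Group+Age, data=df), type = 3)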

 

To summarise:

      aov() and anova() are the devil. (That’s why we will ignore nesting aov() in anything from now on. The trust is just not there anymore.)

      anova_test() and Anova(lm()) are cool, although Anova(lm()) is more limited than anova_test().

 

4. Does SS type choice matter if I have no covariates?

It depends, of course! We saw that if we have only one factor, it does not matter. But what if we had more than one factor?

Let’s first see what syntax we would use to run a two-way ANOVA. Say, we randomly allocated people from each of our groups (HC, A and D) to either the ‘cake’ or the ‘no cake’ condition and we wanted to test the effect of treatment on T2 scores across groups:

# df$treatment = sample(c(rep("cake",50),rep("no cake",50)), replace=TRUE)
# df$treatment = as.factor(df$treatment)
# write.csv(df, "C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data_CAKE.csv", row.names = FALSE)
df.cake = read.csv("C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data_CAKE.csv")
head(df.cake)
##         ID Group T1 T2 Age    Sex treatment
## 1 SUBJECT1    HC  9  4  46 Female   no cake
## 2 SUBJECT2    HC  7 10  34 Female      cake
## 3 SUBJECT3    HC  5  1  34   Male      cake
## 4 SUBJECT4    HC  4 12  58 Female   no cake
## 5 SUBJECT5    HC 10  5  27 Female   no cake
## 6 SUBJECT6    HC 14  2  23   Male      cake

The syntax for running this two-way ANOVA with no covariates is:

df.cake$treatment = as.factor(df.cake$treatment)
df.cake$Group = as.factor(df.cake$Group)
df.cake$Sex = as.factor(df.cake$Sex)
summary(aov(T2 ~ Group*treatment, data=df.cake))
anova(lm(T2 ~ Group*treatment, data=df.cake))
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake)
car::Anova(lm(T2 ~ Group*treatment, data=df.cake))

You can immediately see that even though we did not put any “covariates” in our analysis (like age or sex), the syntax is the same as if one of our factors were a covariate. Hence, we encounter the same SS type issue as we did in the one-way ANCOVA. How do we decide which SS type to use in our two-way ANOVA?

The first thing to consider is that we will likely not want to treat our factors sequentially, as we want to look at the effects of group and treatment simultaneously. Thus, type I SS - and with it, aov() and anova(lm()) - are unlikely to be our choice.

The second thing to consider is how balanced our sub-groups are. In other words: do we have balanced numbers of participants in each treatment arm (cake/no cake) per group (HC/A/D)? Let’s see what would happen if we did:

# df.cake.balanced = df
# df.cake.balanced$treatment = c(rep("cake",20),rep("no cake",20),rep("cake",16),rep("no cake",16),rep("cake",14),rep("no cake",14))
# df.cake.balanced$treatment = as.factor(df.cake.balanced$treatment)
# write.csv(df.cake.balanced, "C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data_CAKE_balanced.csv", row.names = FALSE)
df.cake.balanced = read.csv("C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data_CAKE_balanced.csv")
df.cake.balanced$treatment = as.factor(df.cake.balanced$treatment)
df.cake.balanced$Group = as.factor(df.cake.balanced$Group)
df.cake.balanced$Sex = as.factor(df.cake.balanced$Sex)
table(df.cake.balanced$Group, df.cake.balanced$treatment)
##     
##      cake no cake
##   A    16      16
##   D    14      14
##   HC   20      20
summary(aov(T2 ~ Group*treatment, data=df.cake.balanced))
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## Group            2 1594.7   797.4  29.091 1.47e-10 ***
## treatment        1    0.5     0.5   0.018    0.894    
## Group:treatment  2   34.8    17.4   0.635    0.532    
## Residuals       94 2576.5    27.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(T2 ~ Group*treatment, data=df.cake.balanced))
## Analysis of Variance Table
## 
## Response: T2
##                 Df  Sum Sq Mean Sq F value    Pr(>F)    
## Group            2 1594.73  797.37 29.0912 1.465e-10 ***
## treatment        1    0.49    0.49  0.0179    0.8939    
## Group:treatment  2   34.82   17.41  0.6352    0.5321    
## Residuals       94 2576.46   27.41                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake.balanced)
## ANOVA Table (type II tests)
## 
##            Effect DFn DFd      F        p p<.05     ges
## 1           Group   2  94 29.091 1.47e-10     * 0.38200
## 2       treatment   1  94  0.018 8.94e-01       0.00019
## 3 Group:treatment   2  94  0.635 5.32e-01       0.01300
car::Anova(lm(T2 ~ Group*treatment, data=df.cake.balanced))
## Anova Table (Type II tests)
## 
## Response: T2
##                  Sum Sq Df F value    Pr(>F)    
## Group           1594.73  2 29.0912 1.465e-10 ***
## treatment          0.49  1  0.0179    0.8939    
## Group:treatment   34.82  2  0.6352    0.5321    
## Residuals       2576.46 94                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All results are the same. But in our example above, we don’t have balanced sub-groups (and it’s unlikely that in real life, we ever would):

table(df.cake$Group, df.cake$treatment)
##     
##      cake no cake
##   A    15      17
##   D    16      12
##   HC   18      22

How does that influence our results?

summary(aov(T2 ~ Group*treatment, data=df.cake))
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## Group            2 1594.7   797.4  29.068 1.49e-10 ***
## treatment        1    2.3     2.3   0.085    0.771    
## Group:treatment  2   30.9    15.5   0.563    0.571    
## Residuals       94 2578.5    27.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(T2 ~ Group*treatment, data=df.cake))
## Analysis of Variance Table
## 
## Response: T2
##                 Df  Sum Sq Mean Sq F value    Pr(>F)    
## Group            2 1594.73  797.37 29.0680 1.486e-10 ***
## treatment        1    2.34    2.34  0.0852    0.7711    
## Group:treatment  2   30.91   15.46  0.5634    0.5712    
## Residuals       94 2578.53   27.43                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake)
## ANOVA Table (type II tests)
## 
##            Effect DFn DFd      F        p p<.05      ges
## 1           Group   2  94 29.102 1.46e-10     * 0.382000
## 2       treatment   1  94  0.085 7.71e-01       0.000905
## 3 Group:treatment   2  94  0.563 5.71e-01       0.012000
car::Anova(lm(T2 ~ Group*treatment, data=df.cake))
## Anova Table (Type II tests)
## 
## Response: T2
##                  Sum Sq Df F value    Pr(>F)    
## Group           1596.60  2 29.1020 1.455e-10 ***
## treatment          2.34  1  0.0852    0.7711    
## Group:treatment   30.91  2  0.5634    0.5712    
## Residuals       2578.53 94                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We are back to square one: aov() and anova(lm()) are giving us different results to anova_test() and Anova(lm()). Why?

In addition to differing in their sequential, simultaneous or mixed methods of SS calculation, the three types of SS also treat balanced and unbalanced designs differently: as we have just seen, in a balanced design they all give the same answer, whereas in an unbalanced design they generally do not - so the choice of SS type matters here too.*

*This is another reason to hate aov(). Even though its documentation says it’s good for “balanced or unbalanced experimental designs” (ref), there is no way to specify the SS calculation in it, and we can clearly see that it matters. What’s worse, the output will sometimes warn you that the estimated effects “may be unbalanced” (which means nothing if you don’t know about the SS types) and sometimes it won’t. Horrible!

 

5. Any other traps you know about?

Trap 1: The order of terms in the formula when using type I SS

The order of terms in the formula matters only if the type of SS you pick uses sequential SS calculation at all (so, type I, or type II where interactions are involved). For example, if type I SS are used, function(phq9 ~ group+age, data=df) produces a different result to function(phq9 ~ age+group, data=df). This is because in type I SS, once the variance is attributed to the first predictor, the estimate of the second predictor only explains the LEFT-OVER variance. In other words, only the final term in the formula answers the question: ‘once I’ve estimated the contribution to the variance of all other predictors (covariates), what’s the unique contribution of my predictor of interest?’ - so if you want group controlling for age, group has to go last (ref). This will also be the case for type II ANOVAs with interactions, as interaction effects are calculated sequentially after all main effects.* A quick demonstration follows after the footnote below.

*since both aov() and anova() assume type I SS (and you cannot change that), this applies to both these functions
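A minimal sketch of the order effect under type I SS, using the sample data from above:

# Type I (sequential) SS: the term listed first soaks up any shared variance
anova(lm(T1 ~ Group + Age, data=df))  # Group evaluated first, Age tested on the left-overs
anova(lm(T1 ~ Age + Group, data=df))  # Age evaluated first, Group tested on the left-overs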

IF YOU DISCOVER A TRAP NOT COVERED ON THIS PAGE, PLEASE TELL ME SO I CAN LEARN ABOUT IT AND ADD IT HERE


 

Trap 2: specifying covariates in anova_test()

You might think: anova_test() is more robust than Anova(lm()), so I’ll stick to that. What could go wrong?

There is a trap in anova_test() that is unrelated to SS calculation. Namely, anova_test(x ~ y+a) and anova_test(x ~ y, covariate=a) will give you different results regardless of the SS type used. Let’s compare the output of a two-way type III ANOVA on our sample data:

rstatix::anova_test(T2 ~ Group*treatment, data=df.cake, type = 3)
## ANOVA Table (type III tests)
## 
##            Effect DFn DFd      F        p p<.05      ges
## 1           Group   2  94 29.024 1.53e-10     * 0.382000
## 2       treatment   1  94  0.017 8.96e-01       0.000183
## 3 Group:treatment   2  94  0.563 5.71e-01       0.012000
rstatix::anova_test(T2 ~ Group, data=df.cake, type = 3, covariate = treatment)
## ANOVA Table (type III tests)
## 
##   Effect DFn DFd      F        p p<.05   ges
## 1  Group   2  97 29.614 9.14e-11     * 0.379
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake, type = 3, covariate = treatment)
## ANOVA Table (type III tests)
## 
##            Effect DFn DFd      F        p p<.05      ges
## 1           Group   2  94 29.024 1.53e-10     * 0.382000
## 2       treatment   1  94  0.017 8.96e-01       0.000183
## 3 Group:treatment   2  94  0.563 5.71e-01       0.012000

The last syntax seems to give us the same results as the first one, so we can ignore it from now on.

Immediately, we can see that if we specify treatment as a covariate, anova_test() won’t use it as a factor. What if we entered factors in the formula, but covariates in the “covariate” argument?

rstatix::anova_test(T2 ~ Group*treatment+Age, data=df.cake, type = 3)
## ANOVA Table (type III tests)
## 
##            Effect DFn DFd      F        p p<.05      ges
## 1           Group   2  93 28.882 1.75e-10     * 0.383000
## 2       treatment   1  93  0.011 9.18e-01       0.000113
## 3             Age   1  93  0.325 5.70e-01       0.003000
## 4 Group:treatment   2  93  0.575 5.65e-01       0.012000
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake, type = 3, covariate = Age)
## ANOVA Table (type III tests)
## 
##            Effect DFn DFd      F        p p<.05      ges
## 1           Group   2  94 29.024 1.53e-10     * 0.382000
## 2       treatment   1  94  0.017 8.96e-01       0.000183
## 3 Group:treatment   2  94  0.563 5.71e-01       0.012000

The output is different. In the first syntax, Age seems to be included in the model, whereas in the second one, it seems to disappear. Where is it disappearing to?

rstatix::anova_test(T2 ~ Group*treatment, data=df.cake, type = 3)
## ANOVA Table (type III tests)
## 
##            Effect DFn DFd      F        p p<.05      ges
## 1           Group   2  94 29.024 1.53e-10     * 0.382000
## 2       treatment   1  94  0.017 8.96e-01       0.000183
## 3 Group:treatment   2  94  0.563 5.71e-01       0.012000

It looks like it’s being completely left out of the calculations! Just when we thought anova_test() was the most robust, trustworthy AN(C)OVA function…
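For now, the safe pattern seems to be to put every covariate directly in the formula and never rely on the covariate argument, e.g.:

rstatix::anova_test(T2 ~ Group*treatment + Age, data=df.cake, type = 3)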

 

6. How do I do post-hoc tests after an AN(C)OVA?

Firstly: what are post-hoc tests?

Post-hoc tests are pairwise comparisons done after your AN(C)OVA. Since AN(C)OVA can only tell you about main effects and interactions, you will still not know which pairwise comparisons drive these effects. For instance, if we compared three groups with each other, the most precise answer you could get from an AN(C)OVA would be: yes, they differ. Post-hoc tests answer the question: which ones?

Secondly: should you run post-hoc tests, or not?

It depends on your hypothesis #pre-register_EVERYTHING. “Post-hoc” means “after this” in Latin. So, you can run post-hoc tests if your primary hypothesis concerns the main effects/interactions, but you want to find out what drives them. Note that if you have no main effects or interactions, there is no need to run any post-hoc tests - the AN(C)OVA has already told you there are no differences to explore.

An interesting question is whether to run post-hoc tests if your primary hypothesis is about a difference between two groups. Well, if they’re supposed to be run “after this”, then by definition - no. If your primary hypothesis doesn’t concern main effects or interactions, you can pre-plan specific pairwise comparisons to answer your question. But then, you fall into the multiple comparison correction dilemma. Read through the page, take in the pros and cons, and discuss with your supervisor what you are going to do.

OK, I want to run post-hoc tests. How do I do that in R?

There are a few functions we can consider. The main ones that come up in a Google search are:

      tukey_hsd(). Needs to be fed data and a formula. Does not accept covariates. (ref)

      emmeans_test(). Needs to be fed data and a formula. Accepts covariates (well, kind of…). (ref)

      glht() (nested within summary()). Needs to be fed a model (i.e., not an Anova() or anova_test() object), and that model can include covariates. (ref)

Note 1: We will be assuming a Tukey test in the following section. If you want to specify any of the other post-hoc tests, here is how you do it (page 9).

Note 2: Since tukey_hsd() does not accept covariates, it is not robust for practical uses. Thus, we will stay away from tukey_hsd() from now on.

 
Say, we wanted to run post-hoc tests for a one-way type III ANOVA for our sample data. The two equivalent ways to calculate this would be:

rstatix::anova_test(T1 ~ Group, data=df.cake, type = 3)
## ANOVA Table (type III tests)
## 
##   Effect DFn DFd      F        p p<.05  ges
## 1  Group   2  97 21.759 1.56e-08     * 0.31
car::Anova(lm(T1 ~ Group, data=df.cake), type = 3)
## Anova Table (Type III tests)
## 
## Response: T1
##             Sum Sq Df F value    Pr(>F)    
## (Intercept) 6441.1  1 307.977 < 2.2e-16 ***
## Group        910.2  2  21.759 1.561e-08 ***
## Residuals   2028.7 97                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 
With either of the mentioned post-hoc functions, we could run Tukey post-hoc tests like so:

rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake)
## # A tibble: 3 × 9
##   term  .y.   group1 group2    df statistic            p      p.adj p.adj.signif
## * <chr> <chr> <chr>  <chr>  <dbl>     <dbl>        <dbl>      <dbl> <chr>       
## 1 Group T1    A      D         97     0.340 0.735           1   e+0 ns          
## 2 Group T1    A      HC        97     5.84  0.0000000686    2.06e-7 ****        
## 3 Group T1    D      HC        97     5.27  0.000000833     2.50e-6 ****
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = T1 ~ Group, data = df.cake)
## 
## Linear Hypotheses:
##             Estimate Std. Error t value Pr(>|t|)    
## D - A == 0   -0.4018     1.1834  -0.340    0.938    
## HC - A == 0  -6.3375     1.0846  -5.843   <1e-05 ***
## HC - D == 0  -5.9357     1.1269  -5.268   <1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

(Note that glht() is fed the same lm() model as Anova(). This is neat if you run them one after the other, so it might be preferable if you ever need to revisit the code in the future. Anyways!)
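A minimal sketch of that fit-once, reuse-twice pattern (the object name mod is just an illustration):

mod = lm(T1 ~ Group, data=df.cake)     # fit the model once
car::Anova(mod, type = 3)              # omnibus AN(C)OVA on that object
summary(multcomp::glht(mod, linfct = multcomp::mcp(Group="Tukey")))  # post-hoc tests on the same object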

Have a look at the value in the D-A comparison row of the output from emmeans_test(), column “p”. Now look at the D-A comparison row of the output from glht(), column “Pr(>|t|)”. The results do not converge. Why?

The reason is the defaults they assume (of course). But this time, it’s not the SS type - it’s the multiple comparison correction. Before we dive into whether/when/how to do multiple comparison corrections, let’s just see what the defaults are in emmeans_test() and glht():

rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake)
## # A tibble: 3 × 9
##   term  .y.   group1 group2    df statistic            p      p.adj p.adj.signif
## * <chr> <chr> <chr>  <chr>  <dbl>     <dbl>        <dbl>      <dbl> <chr>       
## 1 Group T1    A      D         97     0.340 0.735           1   e+0 ns          
## 2 Group T1    A      HC        97     5.84  0.0000000686    2.06e-7 ****        
## 3 Group T1    D      HC        97     5.27  0.000000833     2.50e-6 ****
rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake, p.adjust.method = "none")
## # A tibble: 3 × 9
##   term  .y.   group1 group2    df statistic            p      p.adj p.adj.signif
## * <chr> <chr> <chr>  <chr>  <dbl>     <dbl>        <dbl>      <dbl> <chr>       
## 1 Group T1    A      D         97     0.340 0.735           7.35e-1 ns          
## 2 Group T1    A      HC        97     5.84  0.0000000686    6.86e-8 ****        
## 3 Group T1    D      HC        97     5.27  0.000000833     8.33e-7 ****
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = T1 ~ Group, data = df.cake)
## 
## Linear Hypotheses:
##             Estimate Std. Error t value Pr(>|t|)    
## D - A == 0   -0.4018     1.1834  -0.340    0.938    
## HC - A == 0  -6.3375     1.0846  -5.843   <1e-05 ***
## HC - D == 0  -5.9357     1.1269  -5.268   <1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted("single-step"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = T1 ~ Group, data = df.cake)
## 
## Linear Hypotheses:
##             Estimate Std. Error t value Pr(>|t|)    
## D - A == 0   -0.4018     1.1834  -0.340    0.938    
## HC - A == 0  -6.3375     1.0846  -5.843   <1e-05 ***
## HC - D == 0  -5.9357     1.1269  -5.268   <1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted("none"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = T1 ~ Group, data = df.cake)
## 
## Linear Hypotheses:
##             Estimate Std. Error t value Pr(>|t|)    
## D - A == 0   -0.4018     1.1834  -0.340    0.735    
## HC - A == 0  -6.3375     1.0846  -5.843 6.86e-08 ***
## HC - D == 0  -5.9357     1.1269  -5.268 8.33e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)

emmeans_test() gave us p=0.735 in the “p” column both when we didn’t specify a correction method and when we specified it as “none” - so the “p” column is always the uncorrected p-value, and the correction only shows up in the “p.adj” column (whose default, judging by the p.adj values above - each raw p multiplied by 3 and capped at 1 - looks like Bonferroni).

glht() gave us p=0.938 when we did not specify correction method, and when we did specify it as “single-step” (don’t worry about what it means, we’ll tackle it later!) - therefore, “single-step” must be its default.

Notice that when we specified the correction method as “none”, glht() gave us the same p-value as emmeans_test().

 

7. How do I correct my post-hoc tests for multiple comparisons in R?

Firstly: what is the main idea behind multiple comparison correction?

The principle of multiple comparison correction is that if your alpha is 5% (i.e., your p-value threshold is 0.05), you accept that any single significant result has a 5% chance of being a false positive. With each extra test you perform, the chance of at least one false positive grows (e.g., doing three comparisons at alpha=5% gives you an overall, family-wise alpha of 1 - (.95 * .95 * .95) = .143). Adjusting the alpha for the number of tests you are doing is supposed to protect you from false positives due to running multiple tests (ref).
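The arithmetic in R, for anyone who wants to check:

1 - 0.95^3   # family-wise alpha for three tests at alpha=0.05: ~0.143
0.05 / 3     # the Bonferroni-adjusted per-test threshold for three tests: ~0.017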

Secondly: should I correct for multiple comparisons, or not?

Sadly, there is no consensus here. Below are some considerations I managed to find which might help you make your decision.

In the GraphPad guide, we read that a correction is not needed if:

  1. You pre-planned to only analyse certain pairwise comparisons, but not others - and stick to that after you get your data. They don’t say if there is any evidence of anyone ever stopping themselves after getting a negative result on these pre-planned analyses.

  2. The multiple comparisons ask the same question that the main test asked (e.g., big picture: does a drug prevent a negative outcome of any kind (A, B or C); small pictures: separately, does a drug prevent negative outcome A, negative outcome B and negative outcome C)

Thirdly: what methods are available?

There are probably a million out there (ref) and the only way to understand them is to be a mathematician, which I am definitely not. Take this section as a starting point and please do extra reading before you settle on a method.

Multiple comparison correction methods are broadly categorised as single-step - determining a single adjusted threshold for significance for all pairwise comparisons - and step-wise - determining an adjusted threshold for significance for each individual pairwise comparison (ref).

Another useful distinction is between FWE (family-wise error) and FDR (false-discovery rate), best put into plain language in this book on page 430:

By focusing on the familywise error rate we are obsessing about the possibility of making one or more Type I errors. The corresponding belief system can be summed up as ‘if I make even one Type I error then my entire set of conclusions is meaningless’. Benjamini and Hochberg think about things differently. Their belief system can be summed up as the rather more joyful ‘let’s try to estimate how many Type I errors (or false discoveries) we have made’. The FDR is simply the proportion of falsely rejected null hypotheses. [quote slightly trimmed]

With the power of hindsight, I can narrow the list down to the methods that I know for sure we can run in R. These are:

      Single-step methods:

  • Tukey’s HSD - as above - it assumes equal group sample sizes, so it is unlikely to be our friend.

  • Tukey-Kramer method - Tukey’s method, but adapted to unbalanced designs (ref).

  • Dunnett - this is only applicable when we want to compare several groups to a single HC group, but not with each other (ref). Again, unlikely in real life.

  • what R calls ‘single-step’ - I think these are the default single-step methods applicable to your pairwise comparison test of choice. In other words, if you tell R (glht() specifically, more on it below) to run post-hoc Tukey tests, it will automatically include Tukey’s single-step correction method. I think that because when you go down the rabbit hole about the ‘single-step’ method glht() uses, you end up in ‘multivariate t-distribution’ math hell; Tukey’s test uses the studentized range distribution and Dunnett’s uses Dunnett’s t-distribution (ref). If this rings any math bells for you, please let me know.

  • Bonferroni - this is the correction we all know as a very strict one. It simply changes the threshold of our p-value: if our initial p-value threshold was 0.05, it divides 0.05 by the number of tests, and the result is our new threshold for significance. In R, the reverse is often done instead: the p-value you get from your test is multiplied by the number of tests (and capped at 1). This is why, if you use Bonferroni in R, some of your p-values will be exactly 1 (see the short p.adjust() sketch after the step-wise list below).

 

      Step-wise methods:

  • BH - the Benjamini-Hochberg procedure (or: the FDR procedure) ranks all p-values you received from a statistical test with multiple comparisons, and adjusts each p-value by (1) dividing it by its rank, and then (2) multiplying the result by the number of tests done (ref). Simply put here: “In other words, FDR is the expected proportion of false positives among all positives which rejected the null hypothesis and not among all the tests undertaken.”

  • Hochberg - the BH method improves on this approach (ref), so we can ignore it.

  • Holm - a step-wise version of Bonferroni: the p-values are ranked and each is adjusted by the number of tests remaining at its rank, rather than by the total number of tests (page 430 here)

  • BY - the Benjamini-Yekutieli procedure is a different flavour of the FDR (BH) method (ref). If we trust GraphPad, it seems to be better and it works by looking at the distribution of received p-values to estimate what fraction is likely to be false negatives (ref).

  • Hommel - according to R documentation, very similar to the Hochberg method, but slower (ref). Ignore!

  • Shaffer - an extension of Holm’s method, stopping at the first insignificant result (page 34 and 95 here)

  • Westfall - an extension of Shaffer’s method (page 62 here)

  • what R calls ‘free’ - “implements multiple testing procedures under free combinations” (ref). At this point, I’m too scared to even ask what that means in practice.
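As promised above, here is a tiny p.adjust() sketch (the p-values are made up) showing a few of these methods side by side, including Bonferroni’s capping at 1:

p = c(0.005, 0.02, 0.03, 0.6)       # hypothetical raw p-values from four comparisons
p.adjust(p, method = "bonferroni")  # each p multiplied by 4, capped at 1
p.adjust(p, method = "holm")        # step-wise FWE correction
p.adjust(p, method = "BH")          # step-wise FDR correction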

Fourthly: which correction method do I choose?

I do not dare to give any advice here, or order the methods above by any metric - I don’t understand the math behind them well enough to do that. However, what I have gathered from my reading is that you can:

  1. Make your choice according to whether you want to apply an FWE or an FDR method. FDR might be preferable, because it reduces false-positives and minimises false-negatives (ref).

  2. Make your choice according to whether you want to do a single-step or a step-wise method (and, if step-wise, whether you want your test to have higher power by stopping at the first non-significant result).

  3. Pick Hommel over Hochberg, and Hochberg over Holm (page 62 here)

  4. Pick Holm over Bonferroni (page 62 here)

  5. Pick BY over BH (ref)

  6. Consider Tukey stricter than Shaffer, and Shaffer stricter than Westfall (page 100 here)

OK, I want to correct for multiple comparisons and I know which method I want to use. How do I do that in R?

Thankfully, this bit is easy. Let’s have a look again at the syntax of emmeans_test() and glht() to see how to specify our correction method:

rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake, p.adjust.method = "none")
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted(type="none"))

We can see that emmeans_test() uses the “p.adjust.method=” argument, and glht() uses “test=multcomp::adjusted(type=)”. These are the options these two arguments take:

  • “p.adjust.method=” takes: “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr” and “none” (ref)

  • “test=multcomp::adjusted(type=)” takes anything p.adjust.method does (“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr” and “none”) plus: “single-step”, “Shaffer”, “Westfall” and “free” (ref1, ref2: page 17 here)

Note 1: “fdr” is the same thing as “BH” in the adjusted() function (ref)
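For example, to ask for Holm in emmeans_test() and the Westfall method in glht() (a minimal sketch, reusing the calls from above):

rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake, p.adjust.method = "holm")
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted(type="Westfall"))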

Note 2: The classic Tukey’s test was not developed for unbalanced data - only for balanced datasets (ref). A modified Tukey’s procedure, the Tukey-Kramer method, was developed for unbalanced data. I couldn’t find any official documentation on the type of Tukey correction tukey_hsd() or glht() uses (we can ignore emmeans_test() for these considerations, because it doesn’t have an argument where you can put in the Tukey correction method). I was tempted to trust this post, which is reassuring about tukey_hsd(), but with R’s default-trap behaviour, I wanted to have some more certainty. So, I asked my friend Dr Amanda Kiemes to run a one-way ANOVA with Tukey post-hoc tests on my sample data in GraphPad (which I knew uses Tukey-Kramer) and compared the result to the default Tukey post-hoc tests in tukey_hsd() and glht():

rstatix::tukey_hsd(lm(T1 ~ Group, data=df.cake))
## # A tibble: 3 × 9
##   term  group1 group2 null.value estimate conf.low conf.high       p.adj
## * <chr> <chr>  <chr>       <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
## 1 Group A      D               0   -0.402    -3.22      2.42 0.938      
## 2 Group A      HC              0   -6.34     -8.92     -3.76 0.000000205
## 3 Group D      HC              0   -5.94     -8.62     -3.25 0.00000248 
## # ℹ 1 more variable: p.adj.signif <chr>
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = T1 ~ Group, data = df.cake)
## 
## Linear Hypotheses:
##             Estimate Std. Error t value Pr(>|t|)    
## D - A == 0   -0.4018     1.1834  -0.340    0.938    
## HC - A == 0  -6.3375     1.0846  -5.843   <1e-05 ***
## HC - D == 0  -5.9357     1.1269  -5.268   <1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

GraphPad output:

The output seems to align, so I think tukey_hsd() and glht() default to the Tukey-Kramer method.

 

In case you are still worried about which method to use, here’s a comparison of the outputs from glht() with different multiple comparison correction methods, with each column colour-coded separately from the lowest p-value in red to the highest in blue. In each case, the syntax is:

summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted(type="METHOD"))$test$pvalues[[P.VALUE.INDEX.IN.OUTPUT]]
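If you ever want to regenerate such a table yourself, a minimal sketch (looping over correction methods and pulling out the adjusted p-values) could look like this:

# One row per correction method, one column per pairwise comparison
methods = c("single-step", "Shaffer", "Westfall", "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "none")
fit = multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey"))
t(sapply(methods, function(m) summary(fit, test = multcomp::adjusted(type = m))$test$pvalues))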

8. Can I run these post-hoc tests on ANCOVA models, too?

Let’s see!* Let’s start by including just one covariate - age:

*Unless you don’t want to see, of course. The tl;dr is: yes with glht(), but not in all cases with emmeans_test().

rstatix::emmeans_test(formula=T1 ~ Group+Age, data=df.cake, p.adjust.method = "none")
## Error in `pull()`:
## ! `!!enquo(var)` must select exactly one column.

It looks like the emmeans_test() formula does not like covariates. What if we specified them separately?

rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake, p.adjust.method = "none", covariate=Age)
## # A tibble: 3 × 9
##   term      .y.   group1 group2    df statistic           p   p.adj p.adj.signif
## * <chr>     <chr> <chr>  <chr>  <dbl>     <dbl>       <dbl>   <dbl> <chr>       
## 1 Age*Group T1    A      D         96     0.357     7.22e-1 7.22e-1 ns          
## 2 Age*Group T1    A      HC        96     5.84      7.21e-8 7.21e-8 ****        
## 3 Age*Group T1    D      HC        96     5.24      9.43e-7 9.43e-7 ****

Okay, that worked! What if we included both age and sex as covariates?

rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake, p.adjust.method = "none", covariate=c(Age,Sex))
## Error in contrast.emmGrid(res.emmeans, by = grouping.vars, method = method, : Nonconforming number of contrast coefficients

Not working again. What if Sex was in the formula?

rstatix::emmeans_test(formula=T1 ~ Group+Sex, data=df.cake, p.adjust.method = "none", covaraiate=Age)
## Error in rstatix::emmeans_test(formula = T1 ~ Group + Sex, data = df.cake, : unused argument (covaraiate = Age)
rstatix::emmeans_test(formula=T1 ~ Group*Sex, data=df.cake, p.adjust.method = "none", covaraiate=Age)
## Error in rstatix::emmeans_test(formula = T1 ~ Group * Sex, data = df.cake, : unused argument (covaraiate = Age)

Still not working - although note that the argument is misspelled (“covaraiate”) in the two calls above, which is what the “unused argument” error is complaining about. Either way, despite a long search, I haven’t found a way to use nominal covariates in emmeans_test().

PLEASE MESSAGE ME IF YOU WORK IT OUT!

 
Let’s see if glht() can deal with ANCOVAs. Including just one covariate - age:

summary(multcomp::glht(lm(T1 ~ Group+Age, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted(type="none"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = T1 ~ Group + Age, data = df.cake)
## 
## Linear Hypotheses:
##             Estimate Std. Error t value Pr(>|t|)    
## D - A == 0   -0.4209     1.1796  -0.357    0.722    
## HC - A == 0  -6.3108     1.0812  -5.837 7.21e-08 ***
## HC - D == 0  -5.8900     1.1237  -5.242 9.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)

And now including both age and sex:

summary(multcomp::glht(lm(T1 ~ Group+Age+Sex, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted(type="none"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = T1 ~ Group + Age + Sex, data = df.cake)
## 
## Linear Hypotheses:
##             Estimate Std. Error t value Pr(>|t|)    
## D - A == 0   -0.4466     1.1809  -0.378    0.706    
## HC - A == 0  -6.3580     1.0834  -5.869 6.40e-08 ***
## HC - D == 0  -5.9114     1.1249  -5.255 9.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)

Success!

For now, my conclusion is that glht() is more robust than emmeans_test(), because it can deal with both ANOVAs and ANCOVAs with different covariate types.

 

9. Can I do all of the above tests just using lm()?

This page is already long enough, and this question opens a whole new world of statistical headaches. For all things lm(), have a look at the separate page I made on the topic. If you don’t fancy doing that, avoid stand-alone summary(lm()) as you might interpret the results wrong.

 

10. Take-home messages

  1. The only time you don’t need to care about the SS type your ANOVA is using is when you are ONLY running a one-way ANOVA without covariates. Since that never happens in real life (if only because you would want to follow it up with an ANCOVA including nuisance regressors like age and/or sex), in practice you can assume type I is not your friend.

  2. Avoid aov() and anova(). They assume type I SS and there is nothing you can do about it. Use anova_test() or Anova(lm()) instead.

  3. When using anova_test() or Anova(lm()), ALWAYS specify the type of SS you want to use.

  4. When using anova_test(), NEVER use the ‘covariate’ argument - it does absolutely nothing.

  5. Whether you do post-hoc tests or not depends on your a priori hypothesis.

  6. If doing post-hoc tests on an ANCOVA, use glht(). For reasons I do not know, emmeans_test() does not accept nominal covariates like sex.

  7. Whether you do correction for multiple comparisons in your post-hoc tests depends on your a priori hypothesis. The choice is complicated and sadly, there is no consensus method to choose.