This page walks you through the logic of performing ANOVAs in R as accurately as possible. It first covers the functions available in R for the job, then compares the output of these functions, and finally concludes with the best approaches for different types of data. Some chunks of code in the examples (the ones that generate the random sample data) are commented out, so that the content of the page doesn’t change every time it is rendered.
There are several functions in R to choose from when running an ANOVA. The most common stand-alone ones are aov() and rstatix::anova_test(). However, you can also fit a linear model to your data with lm(), and then run an ANOVA on the model using anova() or car::Anova().
(To make things more confusing, you can also run an aov() and then nest it within anova() or Anova(). We’ll talk about that later.)
Some of the functions above are appropriate only for certain types of ANOVA. Let’s use an example to find out the differences.
Let’s say we did a study where we had three groups of participants: healthy controls (HC), people with anxiety (A) and people with depression (D). The numbers of participants per group were: n(HC)=40, n(A)=32, n(D)=28. We measured the participants’ mood with the PHQ-9, asked them to eat a slice of cake every day for a month, and then measured their PHQ-9 scores again. The data looks like this:
# df = data.frame(paste("SUBJECT",seq(from=1, to=100),sep=""),
# c(rep("HC",40),rep("A",32),rep("D",28)),
# c(round(runif(n=40, min=1, max=15)),round(runif(n=60,min=5,max=25))),
# c(round(runif(n=40, min=1, max=15)),round(runif(n=60,min=5,max=25))),
# round(runif(n=100, min=18, max=60)),
# sample(c(rep("Female",50),rep("Male",50))))
# colnames(df)=c("ID","Group","T1","T2","Age","Sex")
# write.csv(df, "C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data.csv", row.names = FALSE)
df = read.csv("C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data.csv")
head(df)
## ID Group T1 T2 Age Sex
## 1 SUBJECT1 HC 9 4 46 Female
## 2 SUBJECT2 HC 7 10 34 Female
## 3 SUBJECT3 HC 5 1 34 Male
## 4 SUBJECT4 HC 4 12 58 Female
## 5 SUBJECT5 HC 10 5 27 Female
## 6 SUBJECT6 HC 14 2 23 Male
Let’s say that in the first step, we want to check if there were group differences at baseline in participants’ PHQ-9 scores. The basic syntax for this example, no matter what function you pick for your ANOVA, will be:
function(phq9 ~ group, data=df)
Let’s see if all the functions above give us the same output if we run that syntax through them:
First, we’ll code Group and Sex as factors, so that R doesn’t confuse them with numeric variables:
Note: get into the habit of setting any nominal variables as factors in R. If any of these variables is ever coded with numbers (e.g., sex set to 1 and 2 rather than female and male), R will assume it is numeric and your analyses will be wrong.
df$Group = as.factor(df$Group)
df$Sex = as.factor(df$Sex)
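If you want to double-check that the conversion worked, you can ask R directly. (This is just a sanity check I’d suggest; nothing below depends on it.)
sapply(df, class)  # every column's class; Group and Sex should now say "factor"
levels(df$Group)   # the categories R will use, sorted alphabetically: "A" "D" "HC"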
summary(aov(T1 ~ Group, data=df))
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 910.2 455.1 21.76 1.56e-08 ***
## Residuals 97 2028.7 20.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix::anova_test(T1 ~ Group, data=df)
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 97 21.759 1.56e-08 * 0.31
anova(lm(T1 ~ Group, data=df))
## Analysis of Variance Table
##
## Response: T1
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 910.15 455.08 21.759 1.561e-08 ***
## Residuals 97 2028.69 20.91
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(lm(T1 ~ Group, data=df))
## Anova Table (Type II tests)
##
## Response: T1
## Sum Sq Df F value Pr(>F)
## Group 910.15 2 21.759 1.561e-08 ***
## Residuals 2028.69 97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov()), anova_test(), anova(lm()) and Anova(lm()) all gave us the same p-value. So it’s fine then, right?
Sadly, no. Let’s see what happens if we add Age as a covariate, using the same syntax for all the functions we considered above:
summary(aov(T1 ~ Group+Age, data=df))
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 910.2 455.1 21.905 1.46e-08 ***
## Age 1 34.3 34.3 1.652 0.202
## Residuals 96 1994.4 20.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix::anova_test(T1 ~ Group+Age, data=df)
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 96 21.640 1.75e-08 * 0.311
## 2 Age 1 96 1.652 2.02e-01 0.017
anova(lm(T1 ~ Group+Age, data=df))
## Analysis of Variance Table
##
## Response: T1
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 910.15 455.08 21.9052 1.456e-08 ***
## Age 1 34.31 34.31 1.6516 0.2018
## Residuals 96 1994.38 20.77
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(lm(T1 ~ Group+Age, data=df))
## Anova Table (Type II tests)
##
## Response: T1
## Sum Sq Df F value Pr(>F)
## Group 899.15 2 21.6405 1.747e-08 ***
## Age 34.31 1 1.6516 0.2018
## Residuals 1994.38 96
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can see that this time, aov() and anova(lm()) gave us the same results, and anova_test() and Anova(lm()) gave us the same results, but the two pairs of functions did not agree with each other (compare the Group rows). What’s going on? The answer lies in the default calculation methods that these functions assume.
At its core, an ANOVA compares the between-group variance with the within-group variance. The F value is the ratio of the between-group and within-group variance measures (Mean Squares), each calculated by dividing a sum of squares by its degrees of freedom (between and within groups, respectively):
\[\LARGE F = \frac{MS_{bg}}{MS_{wg}} = \frac{\frac{SS_{bg}}{df_{bg}}}{\frac{SS_{wg}}{df_{wg}}}\]
where F = F statistic, MS = mean square, SS = sum of squares, df = degrees of freedom, bg = between groups, wg = within groups.
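As a sanity check, we can plug the Sum Sq and Df values from the one-way output above (summary(aov(T1 ~ Group, data=df))) into this formula by hand:
# Between groups: SS = 910.2 on 2 df; within groups (residuals): SS = 2028.7 on 97 df
ms_bg = 910.2 / 2     # 455.1
ms_wg = 2028.7 / 97   # ~20.9
ms_bg / ms_wg         # ~21.76, matching the F value R reported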
The sum of squares can be calculated in different ways (ref1, ref2, ref3, ref4):

Type I sum of squares attributes to each term the portion of the variation in your observed data (represented by its SS) that has not yet been attributed to any of the PREVIOUS terms. This means that your result will depend on the order in which you feed the script your predictors, and that some predictors will always be prioritised over others. As you can imagine, this also means that the order of predictors in your syntax will matter if you use type I SS in your ANCOVA in R (more on that later).
Type II sum of squares attributes to each predictor the portion of the variation in your observed data that is not attributable to any of the other predictors - that is, each predictor’s unique contribution to the variation in your data. It does not account for variance attributable to interactions, though; these are calculated sequentially after the main effects. In other words, it calculates the effects of interactions only after ‘getting rid’ of any variance attributable to the individual predictors. Or, put simply, it treats predictors equally unless they are an interaction.
Type III sum of squares is calculated like type II (i.e., it estimates each predictor’s unique contribution to the variation in the data), but it DOES account for variance attributable to interactions. In other words, it treats interaction terms on par with individual predictors and calculates the interactions’ and the predictors’ unique contributions simultaneously.
Note: with each type of SS calculation, you can say you have estimated the contribution of your predictor of interest (e.g., group) while controlling for another predictor (of no interest, e.g., age). But you’ll have to be careful about the syntax you use (in type I) or the interpretation you draw (in types II/III). Only with type III can you take any estimate from the output and say it is the result while controlling for all other predictors.
Now, let’s see which SS calculation method the functions above assume by default, and if we can change these defaults in case we need to:
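As a quick sketch (the type arguments below are real; take the exact calls as illustrations rather than recommendations):
# aov() and anova(lm()) always use type I (sequential) SS -
# there is no argument to change this
summary(aov(T1 ~ Group+Age, data=df))
anova(lm(T1 ~ Group+Age, data=df))

# anova_test() defaults to type II; override it with the type argument (1, 2 or 3)
rstatix::anova_test(T1 ~ Group+Age, data=df, type = 3)

# car::Anova() also defaults to type II; override it with type = 3 (or type = "III")
car::Anova(lm(T1 ~ Group+Age, data=df), type = 3)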
So which function should you pick? It depends, of course! We saw that with only one predictor it does not matter, but that the functions part ways as soon as a second term enters the model. What if we had more than one factor?
Let’s first see what syntax we would use to run a two-way ANOVA. Say, we randomly allocated people from each of our groups (HC, A and D) to either the ‘cake’ or the ‘no cake’ condition and we wanted to test the effect of treatment on T2 scores across groups:
# df$treatment = sample(c(rep("cake",50),rep("no cake",50)), replace=TRUE)
# df$treatment = as.factor(df$treatment)
# write.csv(df, "C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data_CAKE.csv", row.names = FALSE)
df.cake = read.csv("C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data_CAKE.csv")
head(df.cake)
## ID Group T1 T2 Age Sex treatment
## 1 SUBJECT1 HC 9 4 46 Female no cake
## 2 SUBJECT2 HC 7 10 34 Female cake
## 3 SUBJECT3 HC 5 1 34 Male cake
## 4 SUBJECT4 HC 4 12 58 Female no cake
## 5 SUBJECT5 HC 10 5 27 Female no cake
## 6 SUBJECT6 HC 14 2 23 Male cake
The syntax for running this two-way ANOVA with no covariates is:
df.cake$treatment = as.factor(df.cake$treatment)
df.cake$Group = as.factor(df.cake$Group)
df.cake$Sex = as.factor(df.cake$Sex)
summary(aov(T2 ~ Group*treatment, data=df.cake))
anova(lm(T2 ~ Group*treatment, data=df.cake))
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake)
car::Anova(lm(T2 ~ Group*treatment, data=df.cake))
You can immediately see that even though we did not put any “covariates” in our analysis (like age or sex), the syntax is the same as if one of our factors were a covariate. Hence, we encounter the same SS-type issue as we did in the one-way ANCOVA. How do we decide which SS type to use in our two-way ANOVA?
The first thing to consider is that we will likely not want to treat our factors sequentially, as we want to look at the effects of group and treatment simultaneously. Thus, type I SS - and with it, aov() and anova(lm()) - is unlikely to be our choice.
The second thing to consider is how balanced our sub-groups are. In other words: do we have balanced numbers of participants in each treatment arm (cake/no cake) per group (HC/A/D)? Let’s see what would happen if we did:
# df.cake.balanced = df
# df.cake.balanced$treatment = c(rep("cake",20),rep("no cake",20),rep("cake",16),rep("no cake",16),rep("cake",14),rep("no cake",14))
# df.cake.balanced$treatment = as.factor(df.cake.balanced$treatment)
# write.csv(df.cake.balanced, "C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data_CAKE_balanced.csv", row.names = FALSE)
df.cake.balanced = read.csv("C:/Users/plukow/OneDrive - University College London/Posters and presentations/Internal/lab_clinic_data_CAKE_balanced.csv")
df.cake.balanced$treatment = as.factor(df.cake.balanced$treatment)
df.cake.balanced$Group = as.factor(df.cake.balanced$Group)
df.cake.balanced$Sex = as.factor(df.cake.balanced$Sex)
table(df.cake.balanced$Group, df.cake.balanced$treatment)
##
## cake no cake
## A 16 16
## D 14 14
## HC 20 20
summary(aov(T2 ~ Group*treatment, data=df.cake.balanced))
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 1594.7 797.4 29.091 1.47e-10 ***
## treatment 1 0.5 0.5 0.018 0.894
## Group:treatment 2 34.8 17.4 0.635 0.532
## Residuals 94 2576.5 27.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(T2 ~ Group*treatment, data=df.cake.balanced))
## Analysis of Variance Table
##
## Response: T2
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 1594.73 797.37 29.0912 1.465e-10 ***
## treatment 1 0.49 0.49 0.0179 0.8939
## Group:treatment 2 34.82 17.41 0.6352 0.5321
## Residuals 94 2576.46 27.41
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake.balanced)
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 94 29.091 1.47e-10 * 0.38200
## 2 treatment 1 94 0.018 8.94e-01 0.00019
## 3 Group:treatment 2 94 0.635 5.32e-01 0.01300
car::Anova(lm(T2 ~ Group*treatment, data=df.cake.balanced))
## Anova Table (Type II tests)
##
## Response: T2
## Sum Sq Df F value Pr(>F)
## Group 1594.73 2 29.0912 1.465e-10 ***
## treatment 0.49 1 0.0179 0.8939
## Group:treatment 34.82 2 0.6352 0.5321
## Residuals 2576.46 94
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All results are the same. But in our example above, we don’t have balanced sub-groups (and it’s unlikely that in real life, we ever would):
table(df.cake$Group, df.cake$treatment)
##
## cake no cake
## A 15 17
## D 16 12
## HC 18 22
How does that influence our results?
summary(aov(T2 ~ Group*treatment, data=df.cake))
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 1594.7 797.4 29.068 1.49e-10 ***
## treatment 1 2.3 2.3 0.085 0.771
## Group:treatment 2 30.9 15.5 0.563 0.571
## Residuals 94 2578.5 27.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(T2 ~ Group*treatment, data=df.cake))
## Analysis of Variance Table
##
## Response: T2
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 1594.73 797.37 29.0680 1.486e-10 ***
## treatment 1 2.34 2.34 0.0852 0.7711
## Group:treatment 2 30.91 15.46 0.5634 0.5712
## Residuals 94 2578.53 27.43
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake)
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 94 29.102 1.46e-10 * 0.382000
## 2 treatment 1 94 0.085 7.71e-01 0.000905
## 3 Group:treatment 2 94 0.563 5.71e-01 0.012000
car::Anova(lm(T2 ~ Group*treatment, data=df.cake))
## Anova Table (Type II tests)
##
## Response: T2
## Sum Sq Df F value Pr(>F)
## Group 1596.60 2 29.1020 1.455e-10 ***
## treatment 2.34 1 0.0852 0.7711
## Group:treatment 30.91 2 0.5634 0.5712
## Residuals 2578.53 94
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We are back to square one: aov() and anova(lm()) are giving us different results to anova_test() and Anova(lm()). Why?
On top of the sequential, simultaneous or mixed methods of SS calculation, the three types of SS also treat balanced and unbalanced designs differently:
Type I sum of squares is only appropriate for balanced data, i.e., data with no collinearity and equal numbers of data points in each factor combination (e.g., the same number of HCs and patients in both the cake and no-cake arms). This is because it uses the weighted marginal means approach (i.e., the marginal means are weighted by the sub-group sample sizes, so the estimated main effect of, say, group is distorted whenever the sub-groups are unequal).*
Type II sum of squares uses unweighted marginal means calculation, so it accounts for differing sample sizes. If the data is balanced, it gives the same output as a function using type I SS.
Type III sum of squares also uses unweighted marginal means calculation, so it also accounts for differing sample sizes. If the data is balanced, it gives the same output as a function using type I SS.
*This is another reason to hate aov(). Even though its documentation says it’s good for “balanced or unbalanced experimental designs” (ref), there is no way to specify the SS calculation in it, and we have clearly seen that it matters. What’s worse, the output will sometimes warn you that the estimates might not be correct (or ‘balanced’, which means nothing if you don’t know about the SS types) and sometimes it won’t. Horrible!
The order of arguments in your syntax matters only if the type of SS you pick uses sequential SS calculation at all (so type I, or type II with interactions). For example, if type I SS is used, function(phq9 ~ group+age, data=df) produces a different result to function(phq9 ~ age+group, data=df). This is because in type I SS, once the variance is attributed to the first predictor, the estimate of the second predictor explains only the LEFT-OVER variance. In other words, the syntax will answer the question: ‘once I’ve estimated the contribution to the variance of all other predictors (covariates), what’s the unique contribution of my main predictor of interest (the final predictor - in our case, group)?’* (ref) This will also be the case for type II ANOVAs with interactions, as interaction effects are calculated sequentially after all main effects. You can see the order-dependence in action in the demonstration just below the footnote.
*since both aov() and anova() assume type I SS (and you cannot change that), this applies to both these functions
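You can convince yourself of this with a quick experiment (same data as above; only the order of the predictors changes):
# Type I SS: Group gets first claim on the shared variance
summary(aov(T1 ~ Group+Age, data=df))

# Type I SS again, but now Age goes first - the F values can change,
# because Age now gets first claim on any variance it shares with Group
summary(aov(T1 ~ Age+Group, data=df))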
IF YOU DISCOVER A TRAP NOT COVERED ON THIS PAGE, PLEASE TELL ME SO I CAN LEARN ABOUT IT AND ADD IT HERE
ASSUMPTIONS?
You might think: anova_test() is more robust than Anova(lm()), so I’ll stick to that. What could go wrong?
There is a trap in anova_test() that is unrelated to SS calculation. Namely, anova_test(x ~ y+a) and anova_test(x ~ y, covariate=a) will give you different results regardless of the SS type used. Let’s compare the output of a two-way type III ANOVA on our sample data:
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake, type = 3)
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 94 29.024 1.53e-10 * 0.382000
## 2 treatment 1 94 0.017 8.96e-01 0.000183
## 3 Group:treatment 2 94 0.563 5.71e-01 0.012000
rstatix::anova_test(T2 ~ Group, data=df.cake, type = 3, covariate = treatment)
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 97 29.614 9.14e-11 * 0.379
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake, type = 3, covariate = treatment)
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 94 29.024 1.53e-10 * 0.382000
## 2 treatment 1 94 0.017 8.96e-01 0.000183
## 3 Group:treatment 2 94 0.563 5.71e-01 0.012000
The last syntax seems to give us the same results as the first one, so we can ignore it from now on.
Immediately, we can see that if we specify treatment as a covariate, anova_test() won’t use it as a factor. What if we entered factors in the formula, but covariates in the “covariate” argument?
rstatix::anova_test(T2 ~ Group*treatment+Age, data=df.cake, type = 3)
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 93 28.882 1.75e-10 * 0.383000
## 2 treatment 1 93 0.011 9.18e-01 0.000113
## 3 Age 1 93 0.325 5.70e-01 0.003000
## 4 Group:treatment 2 93 0.575 5.65e-01 0.012000
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake, type = 3, covariate = Age)
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 94 29.024 1.53e-10 * 0.382000
## 2 treatment 1 94 0.017 8.96e-01 0.000183
## 3 Group:treatment 2 94 0.563 5.71e-01 0.012000
The output is different. In the first syntax, Age seems to be included in the model, whereas in the second one, it seems to disappear. Where is it disappearing to?
rstatix::anova_test(T2 ~ Group*treatment, data=df.cake, type = 3)
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 94 29.024 1.53e-10 * 0.382000
## 2 treatment 1 94 0.017 8.96e-01 0.000183
## 3 Group:treatment 2 94 0.563 5.71e-01 0.012000
It looks like it’s being completely ignored in the calculations! Just when we thought anova_test() was the most robust, trustworthy AN(C)OVA function…
Post-hoc tests are pairwise comparisons done after your AN(C)OVA. Since AN(C)OVA can only tell you about main effects and interactions, you will still not know which pairwise comparisons drive these effects. For instance, if we compared three groups with each other, the most precise answer you could get from an AN(C)OVA would be: yes, they differ. Post-hoc tests answer the question: which ones?
It depends on your hypothesis #pre-register_EVERYTHING. “Post hoc” means “after this” in Latin. So, you can run post-hoc tests if your primary hypothesis concerns the main effects/interactions, but you want to find out what drives them. Note that if you have no main effects or interactions, there is no need to run any post-hoc tests - the AN(C)OVA has already told you there are no differences to explore.
An interesting question is whether to run post-hoc tests if your primary hypothesis is about a difference between two groups. Well, if they’re supposed to be run “after this”, then by definition - no. If your primary hypothesis doesn’t concern main effects or interactions, you can pre-plan specific pairwise comparisons to answer your question. But then, you fall into the multiple comparison correction dilemma. Read through the page, take in the pros and cons, and discuss with your supervisor what you are going to do.
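If you do go down the pre-planned route, one way to run a single planned comparison is with multcomp::glht(), which we’ll meet properly in a moment. The contrast specification below is my own illustration, not a prescription:
# One pre-planned comparison (HC vs D), written as a symbolic contrast;
# with a single comparison there is nothing to correct for
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake),
                       linfct = multcomp::mcp(Group = c("HC - D = 0"))))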
There are a few functions we can consider. The main ones that come up in a Google search are rstatix::tukey_hsd(), rstatix::emmeans_test() and multcomp::glht().
Note 1: We will be assuming a Tukey test in the following section. If you want to specify any of the other post-hoc tests, here is how you do it (page 9).
Note 2: Since tukey_hsd() does not accept covariates, it is not flexible enough for practical use. Thus, we will stay away from tukey_hsd() from now on.
Say, we wanted to run post-hoc tests for a one-way type III ANOVA for our sample data. The two equivalent ways to calculate this would be:
rstatix::anova_test(T1 ~ Group, data=df.cake, type = 3)
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 97 21.759 1.56e-08 * 0.31
car::Anova(lm(T1 ~ Group, data=df.cake), type = 3)
## Anova Table (Type III tests)
##
## Response: T1
## Sum Sq Df F value Pr(>F)
## (Intercept) 6441.1 1 307.977 < 2.2e-16 ***
## Group 910.2 2 21.759 1.561e-08 ***
## Residuals 2028.7 97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With either of the mentioned post-hoc functions, we could run Tukey post-hoc tests like so:
rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake)
## # A tibble: 3 × 9
## term .y. group1 group2 df statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Group T1 A D 97 0.340 0.735 1 e+0 ns
## 2 Group T1 A HC 97 5.84 0.0000000686 2.06e-7 ****
## 3 Group T1 D HC 97 5.27 0.000000833 2.50e-6 ****
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = T1 ~ Group, data = df.cake)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## D - A == 0 -0.4018 1.1834 -0.340 0.938
## HC - A == 0 -6.3375 1.0846 -5.843 <1e-05 ***
## HC - D == 0 -5.9357 1.1269 -5.268 <1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
(Note that when glht() is used, it is fed the same model as Anova(). This is neat when they are run one after the other - see the pattern sketched just below - so it might be preferable if you ever need to revisit the code in the future. Anyways!)
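In practice, that fit-once, reuse-everywhere pattern looks something like this (a sketch; the object name mod is mine):
# Fit the model once...
mod = lm(T1 ~ Group, data=df.cake)
# ...then reuse it for both the omnibus test and the post-hoc comparisons
car::Anova(mod, type = 3)
summary(multcomp::glht(mod, linfct = multcomp::mcp(Group="Tukey")))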
Have a look at the value in the D-A comparison row of the output from emmeans_test(), column “p”. Now look at the D-A comparison row of the output from glht(), column “Pr(>|t|)”. The results do not converge. Why?
The reason is the defaults they assume (of course). But this time, it’s not the SS type - it’s the multiple comparison correction. Before we dive into whether/when/how to do multiple comparison corrections, let’s just see what the defaults are in emmeans_test() and glht():
rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake)
## # A tibble: 3 × 9
## term .y. group1 group2 df statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Group T1 A D 97 0.340 0.735 1 e+0 ns
## 2 Group T1 A HC 97 5.84 0.0000000686 2.06e-7 ****
## 3 Group T1 D HC 97 5.27 0.000000833 2.50e-6 ****
rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake, p.adjust.method = "none")
## # A tibble: 3 × 9
## term .y. group1 group2 df statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Group T1 A D 97 0.340 0.735 7.35e-1 ns
## 2 Group T1 A HC 97 5.84 0.0000000686 6.86e-8 ****
## 3 Group T1 D HC 97 5.27 0.000000833 8.33e-7 ****
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = T1 ~ Group, data = df.cake)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## D - A == 0 -0.4018 1.1834 -0.340 0.938
## HC - A == 0 -6.3375 1.0846 -5.843 <1e-05 ***
## HC - D == 0 -5.9357 1.1269 -5.268 <1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted("single-step"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = T1 ~ Group, data = df.cake)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## D - A == 0 -0.4018 1.1834 -0.340 0.938
## HC - A == 0 -6.3375 1.0846 -5.843 <1e-05 ***
## HC - D == 0 -5.9357 1.1269 -5.268 <1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted("none"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = T1 ~ Group, data = df.cake)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## D - A == 0 -0.4018 1.1834 -0.340 0.735
## HC - A == 0 -6.3375 1.0846 -5.843 6.86e-08 ***
## HC - D == 0 -5.9357 1.1269 -5.268 8.33e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)
emmeans_test() gave us p=0.735 both when we didn’t specify correction method, and when we specified it as “none” - therefore, “none” must be its default.
glht() gave us p=0.938 when we did not specify correction method, and when we did specify it as “single-step” (don’t worry about what it means, we’ll tackle it later!) - therefore, “single-step” must be its default.
Notice that when we specified the correction method as “none”, glht() gave us the same p-value as emmeans_test().
The principle of multiple comparison correction is that if your alpha is 5% (i.e., your p-value threshold is 0.05), you accept that a significant result has a 5% chance of being a false positive. With each extra test you perform, these chances compound (e.g., doing three comparisons at alpha=5% gives you an overall chance of at least one false positive of 1 - (.95 * .95 * .95) = .143). Adjusting the alpha for the number of tests you are doing is supposed to protect you from false positives due to running multiple tests (ref).
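You can check that arithmetic, and see what a correction does to it, in R itself (p.adjust() is base R; the three raw p-values fed to it are made up for illustration):
# Chance of at least one false positive across three tests at alpha = 5%
1 - (1 - 0.05)^3  # 0.142625, i.e., ~14.3%

# Bonferroni adjustment multiplies each p-value by the number of tests (capped at 1)
p.adjust(c(0.010, 0.030, 0.049), method = "bonferroni")  # 0.030 0.090 0.147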
Sadly, there is no consensus here. Below are some considerations I managed to find which might help you make your decision.
In the GraphPad guide, we read that a correction is not needed if:
There are probably a million out there (ref) and the only way to understand them is to be a mathematician, which I am definitely not. Take this section as a starting point and please do extra reading before you settle on a method.
Multiple comparison correction methods are broadly categorised as single-step - determining a single adjusted threshold for significance for all pairwise comparisons - and step-wise - determining an adjusted threshold for significance for each individual pairwise comparison (ref).
Another useful distinction is between FWE (family-wise error) and FDR (false-discovery rate), best put into plain language in this book on page 430:
By focusing on the familywise error rate we are obsessing about the possibility of making one or more Type I errors. The corresponding belief system can be summed up as ‘if I make even one Type I error then my entire set of conclusions is meaningless’. Benjamini and Hochberg think about things differently. Their belief system can be summed up as the rather more joyful ‘let’s try to estimate how many Type I errors (or false discoveries) we have made’. The FDR is simply the proportion of falsely rejected null hypotheses. [quote slightly trimmed]
With the power of hindsight, I can narrow the list down to the methods that I know for sure we can run in R. These are:
I do not dare to give any advice here, or to rank the methods above by any metric - I don’t understand the math behind them well enough to do that. However, what I have gathered from my reading is that you can:
Thankfully, this bit is easy. Let’s have a look again at the syntax of emmeans_test() and glht() to see how to specify our correction method:
rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake, p.adjust.method = "none")
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted(type="none"))
We can see that emmeans_test() uses the “p.adjust.method=” argument, and glht() uses “test=multcomp::adjusted(type=)”. These are the options these two arguments take:
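For reference, base R keeps the canonical list of p.adjust() methods in a built-in constant, and (as far as I can tell from the multcomp documentation) adjusted() accepts those plus a few of its own:
# Methods accepted by stats::p.adjust() - and, as far as I can tell,
# by emmeans_test()'s p.adjust.method argument, which relies on it:
p.adjust.methods
# "holm" "hochberg" "hommel" "bonferroni" "BH" "BY" "fdr" "none"

# multcomp::adjusted() accepts the above plus "single-step", "Shaffer",
# "Westfall" and "free"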
Note 1: “fdr” is the same thing as “BH” in the adjusted() function (ref)
Note 2: The classic Tukey’s test was developed only for balanced datasets, not unbalanced data (ref). A modified Tukey’s procedure, the Tukey-Kramer method, was developed for unbalanced data. I couldn’t find any official documentation on the type of Tukey correction tukey_hsd() or glht() uses (we can ignore emmeans_test() for these considerations, because it doesn’t have an argument where you can specify the Tukey correction method). I was tempted to trust this post, which is reassuring about tukey_hsd(), but given R’s default trap behaviour, I wanted some more certainty. So, I asked my friend Dr Amanda Kiemes to run a one-way ANOVA on my sample data in GraphPad (which I knew uses Tukey-Kramer) and compared it to the default Tukey post-hoc tests in tukey_hsd(), emmeans_test() and glht():
rstatix::tukey_hsd(lm(T1 ~ Group, data=df.cake))
## # A tibble: 3 × 9
## term group1 group2 null.value estimate conf.low conf.high p.adj
## * <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Group A D 0 -0.402 -3.22 2.42 0.938
## 2 Group A HC 0 -6.34 -8.92 -3.76 0.000000205
## 3 Group D HC 0 -5.94 -8.62 -3.25 0.00000248
## # ℹ 1 more variable: p.adj.signif <chr>
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = T1 ~ Group, data = df.cake)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## D - A == 0 -0.4018 1.1834 -0.340 0.938
## HC - A == 0 -6.3375 1.0846 -5.843 <1e-05 ***
## HC - D == 0 -5.9357 1.1269 -5.268 <1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
GraphPad output:
The outputs seem to align, so I think tukey_hsd() and glht() default to the Tukey-Kramer method.
In case you are still worried about which method to use, here’s a comparison of the outputs from glht() with different multiple comparison correction methods. In each case, the syntax is:
summary(multcomp::glht(lm(T1 ~ Group, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted(type="METHOD"))$test$pvalues[[P.VALUE.INDEX.IN.OUTPUT]]
Here is the comparison. Each column is colour-coded separately, from the lowest p-value in red to the highest in blue:
Let’s see!* We’ll start by including just one covariate - age:
*Unless you don’t want to see, of course. The tl;dr is: yes with glht(), but not in all cases with emmeans_test().
rstatix::emmeans_test(formula=T1 ~ Group+Age, data=df.cake, p.adjust.method = "none")
## Error in `pull()`:
## ! `!!enquo(var)` must select exactly one column.
It looks like the emmeans_test() formula does not like covariates. What if we specified them separately?
rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake, p.adjust.method = "none", covariate=Age)
## # A tibble: 3 × 9
## term .y. group1 group2 df statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Age*Group T1 A D 96 0.357 7.22e-1 7.22e-1 ns
## 2 Age*Group T1 A HC 96 5.84 7.21e-8 7.21e-8 ****
## 3 Age*Group T1 D HC 96 5.24 9.43e-7 9.43e-7 ****
Okay, that worked! What if we included both age and sex as covariates?
rstatix::emmeans_test(formula=T1 ~ Group, data=df.cake, p.adjust.method = "none", covariate=c(Age,Sex))
## Error in contrast.emmGrid(res.emmeans, by = grouping.vars, method = method, : Nonconforming number of contrast coefficients
Not working again. What if Sex was in the formula?
rstatix::emmeans_test(formula=T1 ~ Group+Sex, data=df.cake, p.adjust.method = "none", covaraiate=Age)
## Error in rstatix::emmeans_test(formula = T1 ~ Group + Sex, data = df.cake, : unused argument (covaraiate = Age)
rstatix::emmeans_test(formula=T1 ~ Group*Sex, data=df.cake, p.adjust.method = "none", covaraiate=Age)
## Error in rstatix::emmeans_test(formula = T1 ~ Group * Sex, data = df.cake, : unused argument (covaraiate = Age)
Still not working. (Note, though, that the two calls above misspell covariate as covaraiate - that typo alone is enough to produce the ‘unused argument’ error, so they would be worth re-running with the correct spelling.) Despite a long search, I haven’t found a solution for using nominal covariates in emmeans_test().
PLEASE MESSAGE ME IF YOU WORK IT OUT!
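One avenue that might be worth exploring - an untested sketch, not a verified solution - is calling the emmeans package (which rstatix wraps) directly on a model that already contains the covariates:
# Untested lead: fit the covariates into the lm() yourself, then ask emmeans
# for the pairwise Group comparisons (Tukey-adjusted by default)
mod = lm(T1 ~ Group + Age + Sex, data=df.cake)
emmeans::emmeans(mod, pairwise ~ Group)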
Let’s see if glht() can deal with ANCOVAs. Including just one covariate - age:
summary(multcomp::glht(lm(T1 ~ Group+Age, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted(type="none"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = T1 ~ Group + Age, data = df.cake)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## D - A == 0 -0.4209 1.1796 -0.357 0.722
## HC - A == 0 -6.3108 1.0812 -5.837 7.21e-08 ***
## HC - D == 0 -5.8900 1.1237 -5.242 9.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)
And now including both age and sex:
summary(multcomp::glht(lm(T1 ~ Group+Age+Sex, data=df.cake), linfct = multcomp::mcp(Group="Tukey")), test=multcomp::adjusted(type="none"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = T1 ~ Group + Age + Sex, data = df.cake)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## D - A == 0 -0.4466 1.1809 -0.378 0.706
## HC - A == 0 -6.3580 1.0834 -5.869 6.40e-08 ***
## HC - D == 0 -5.9114 1.1249 -5.255 9.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)
Success!
For now, my conclusion is that glht() is more robust than emmeans_test(), because it can deal with both ANOVAs and ANCOVAs with different covariate types.
This page is already long enough, and this question opens up a whole new world of statistical headaches. For all things lm(), have a look at the separate page I made on the topic. If you don’t fancy doing that, avoid stand-alone summary(lm()), as you might misinterpret the results.