14 ANOVA

14.1 Introduction: What is ANOVA and when do we use it?

This session continues looking at the data of Nettle and Saxe’s study on intuitions about social sharing (Nettle and Saxe 2020). We will be exploring a type of statistical test that we have not encountered so far: ANOVA, which stands for Analysis of Variance. It is a test that produces a test statistic (the F-ratio) and an associated p-value, interpreted in the same way as the other p-values we have met so far.

In traditional statistics courses, ANOVA is taught as if it were a completely separate type of thing from linear regression and other instances of the General Linear Model. This is wrong though; underlying every ANOVA is a General Linear Model (or Linear Mixed Model), and every General Linear Model can be used to produce an ANOVA table. ANOVA is just a different way of testing statistically whether each of your predictor variables is significantly associated with the outcome or not.

You are likely to use or encounter ANOVA statistics particularly in the following circumstances:

  • The study is an experiment rather than observational.

  • There is a continuous outcome variable.

  • There are categorical predictor variables.

  • And especially, when at least one categorical predictor variable has more than two levels. This is where ANOVA is really useful.

ANOVA is quite common in experimental psychology and cognitive science, exactly because there are often experiments with multiple, categorical independent variables, and sometimes some of these have more than two levels.

In this session, we will work with two datasets from Nettle and Saxe (2020) (study 7 and the one we have already seen, study 1), testing hypotheses both in the familiar way of looking at the General Linear Model parameter estimates and their confidence intervals, and using an ANOVA table instead. This will allow us to see why in some cases ANOVA might be a good approach.

It is easiest to understand ANOVA through a practical example, so let us plunge straight in.

14.2 An example: Nettle and Saxe (2020), study 7

14.2.1 Background

Today we are going to use the data from study 7 of the intuitions about sharing paper. We are focusing on this one because it is the only study in the paper that uses a between-subjects manipulation (i.e. there were three conditions, and each of the nearly 1800 participants was in only one of them). So, we do not need a Linear Mixed Model, and can fit a straightforward General Linear Model.

The dependent variable in study 7 was once again the percentage of the harvest that the participant thought should be shared out between the villages (in the data, variable redistlevel). The independent variable (Condition) was again the importance of luck in the production of food. It had three levels: High (participant is told luck is important); Low (participant is told luck is not very important); and Unspecified (participant is not told anything about the role of luck).

Condition was the only experimental independent variable in study 7. However, as ANOVA is most useful when there are multiple independent variables, we are going to cheat and make another categorical predictor. The researchers measured the political orientation of the participants on a continuous scale of left to right. We are going to split this at the median and thereby categorically divide our participants into ‘left-wing’ and ‘right-wing’ groups. I am not necessarily saying this would be the right analysis for these data; only that we could use another categorical predictor for the sake of the exercise.

The predictions of the study were that how much the participant thought should be shared would depend on: how important luck was; whether they identified as left or right wing; and (maybe) some interaction between these two predictors.

14.2.2 Getting and preparing the data

The data are archived at: https://osf.io/xrqae/. Today we want the file ‘study7.data.csv’. You know the deal by now; save the file into your working directory, and then:

library(tidyverse)
d7 <- read_csv("study7.data.csv")

I am calling it d7 because later in the session we will get the data from study 1 back as well.

Check the column names; the ones that are going to matter for us are redistlevel (the dependent); Condition (the experimental independent); and leftright (the self-identified political orientation). The others do not matter for today.

colnames(d7)
##  [1] "StartDate"             "Duration (in seconds)"
##  [3] "ResponseId"            "prolificid"           
##  [5] "redistlevel"           "importanceluck"       
##  [7] "redistributionmoral_1" "redistributionmoral_2"
##  [9] "redistributionmoral_3" "redistributionmoral_4"
## [11] "housebetter"           "housemoral_1"         
## [13] "housemoral_2"          "housemoral_3"         
## [15] "housemoral_4"          "deathapproval"        
## [17] "deathmoral_1"          "deathmoral_2"         
## [19] "deathmoral_3"          "deathmoral_4"         
## [21] "gender"                "age"                  
## [23] "SDO_1"                 "SDO_2"                
## [25] "SDO_3"                 "SDO_4"                
## [27] "SDO_5"                 "SDO_6"                
## [29] "SDO_7"                 "SDO_8"                
## [31] "leftright"             "welfare"              
## [33] "effortluck"            "SDO"                  
## [35] "Condition"

Now, as I said, we want to make a categorical variable out of leftright by splitting it at the median. Here’s how we do this:

d7 <- d7 %>% mutate(Leftwing = (leftright < median(leftright, na.rm = TRUE)))

If you find this confusing, you can break it down into smaller steps too:

split.point <- median(d7$leftright, na.rm=TRUE)
d7 <- d7 %>% mutate(Leftwing = (leftright < split.point))

You might wonder why I needed to specify na.rm=TRUE in the call to median(). It is because there are some missing values in leftright, and R, ever the stickler, will return NA (missing) as the median of a vector containing any NA values.
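
If you find that surprising, you can see it for yourself with a toy example (the numbers here are invented purely for illustration):

median(c(10, 20, NA, 40))                 # returns NA because of the missing value
median(c(10, 20, NA, 40), na.rm = TRUE)   # drops the NA first and returns 20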

Our ‘Leftwing’ variable now has two values, TRUE and FALSE, but maybe we would prefer the labels Left and Right. So let’s recode:

d7 <- d7 %>% mutate(Leftwing = 
                      case_when(Leftwing == TRUE ~ "Left",
                                Leftwing == FALSE ~ "Right"))

Let’s just check that we have successfully divided the dataset into two groups:

table(d7$Leftwing)
## 
##  Left Right 
##   798   994

The right wing group is bigger than the left wing one, even though we split at the median. This is because a lot of participants were at exactly the median (40), and we (arbitrarily) decided to put them into the Right group.
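
If you are curious how many participants sat exactly at the median, you can count them directly (an optional check, not part of the main analysis):

# Number of participants whose leftright score is exactly the median
sum(d7$leftright == median(d7$leftright, na.rm = TRUE), na.rm = TRUE)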

Now, we are going to do one more thing, which is to make our predictor variables Condition and Leftwing into factors. Factors are a special class in R for representing categorical variables: a factor has a defined set of levels, in a set order, and can carry some other attributes. This is often useful in ANOVA, so we will do it here.

d7 <- d7 %>% mutate(Condition = as.factor(Condition), 
                    Leftwing = as.factor(Leftwing))
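
If you want to check that the conversion has worked, and see what order the levels are in (the first level listed is the one R will treat as the reference level in models), you can ask for the levels:

# The first level shown for each factor is its reference level
levels(d7$Condition)
levels(d7$Leftwing)

Since R orders levels alphabetically by default, you should see High, Low, Unspecified for Condition and Left, Right for Leftwing; this will matter when we come to think about reference levels below.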

14.2.3 Problems with hypothesis testing without ANOVA

Now, we are going to model the data using our standard General Linear Model approach. So, we fit a model m1 with redistlevel as the outcome and Condition, Leftwing and their interaction as the predictors.

m1 <- lm(redistlevel ~ Condition*Leftwing, data=d7)
summary(m1)
## 
## Call:
## lm(formula = redistlevel ~ Condition * Leftwing, data = d7)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.254 -18.712  -5.863  14.746  61.568 
## 
## Coefficients:
##                                    Estimate Std. Error t value
## (Intercept)                          45.254      1.411  32.066
## ConditionLow                         -4.296      2.013  -2.134
## ConditionUnspecified                 -4.643      2.015  -2.304
## LeftwingRight                        -3.391      1.909  -1.777
## ConditionLow:LeftwingRight            1.144      2.707   0.423
## ConditionUnspecified:LeftwingRight    1.213      2.709   0.448
##                                    Pr(>|t|)    
## (Intercept)                          <2e-16 ***
## ConditionLow                         0.0330 *  
## ConditionUnspecified                 0.0213 *  
## LeftwingRight                        0.0758 .  
## ConditionLow:LeftwingRight           0.6726    
## ConditionUnspecified:LeftwingRight   0.6545    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.28 on 1785 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.009321,   Adjusted R-squared:  0.006546 
## F-statistic: 3.359 on 5 and 1785 DF,  p-value: 0.005042

This is fine, but it is quite a complicated output. Why is it complicated? It is to do with the fact that Condition has three levels. So we have a coefficient representing the deviation of the mean redistlevel when Condition is Low rather than the reference level of High; and another coefficient representing the deviation of the mean redistlevel when Condition is Unspecified rather than the reference level of High. For the interactions, it is also complicated: we have one coefficient representing the modification of the effect of being left wing on redistlevel when Condition is Low rather than High, and another representing the modification of the effect of being left wing when Condition is Unspecified rather than High. If we had more categorical predictors, this explosion of different coefficients would be even worse.
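
If you want to see exactly how R has turned the three-level Condition variable into two dummy variables behind the scenes, you can inspect its contrast matrix (purely for illustration; you do not need this for the analysis):

# One column per dummy variable; the reference level scores 0 on both
contrasts(d7$Condition)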

More than just the output being complex, conclusions about significance of variables are somewhat complex too. Notably, conclusions about significance depend on the choice of reference level. In model m1, the reference level for Condition was High. This is because by default, R orders the levels alphabetically. But let’s choose a different reference level and refit the model:

d7$Condition <- relevel(d7$Condition, ref="Unspecified")
m2 <- lm(redistlevel ~ Condition*Leftwing, data=d7)
summary(m2)
## 
## Call:
## lm(formula = redistlevel ~ Condition * Leftwing, data = d7)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.254 -18.712  -5.863  14.746  61.568 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 40.61069    1.43794  28.242   <2e-16 ***
## ConditionHigh                4.64299    2.01478   2.304   0.0213 *  
## ConditionLow                 0.34749    2.03162   0.171   0.8642    
## LeftwingRight               -2.17825    1.92210  -1.133   0.2573    
## ConditionHigh:LeftwingRight -1.21262    2.70883  -0.448   0.6545    
## ConditionLow:LeftwingRight  -0.06821    2.71682  -0.025   0.9800    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.28 on 1785 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.009321,   Adjusted R-squared:  0.006546 
## F-statistic: 3.359 on 5 and 1785 DF,  p-value: 0.005042

In m1, we had two significant effects of Condition, whereas in m2 we only have one. This is because High differs significantly from Low, and from Unspecified, but Unspecified does not differ significantly from Low. And the parameter estimates for all the interaction terms are also changed by changing the reference level for Condition.

If we think about it, there are actually quite a few different predictions you could test the moment you have a variable with more than two levels: that High will differ from Low, Low from Unspecified, and so on. But the prediction we actually wanted to test was more general than this: it was simply that Condition will make some difference to the mean of redistlevel. In other words, we might want to test the prediction that at least one of the Condition groups will differ from at least one of the others, without wanting to commit a priori to which one(s) it will be. Same for the interaction term: the test we want is: are there any cases where the level of Condition modifies the effect of being left wing?

It is this general test that the ANOVA test statistic, the F-ratio, gives you. In other words, if you have a significant F-statistic in an ANOVA, it means that that predictor or interaction makes some difference to the outcome, without regard to where among the various levels that difference resides. It does this by examining whether the differences in the means of the groups are larger than you would expect if every group were simply sampling at random from a distribution whose mean was the overall mean of the data. The further the F-ratio is above one, the stronger the evidence that the differences in means between the groups are too big to be compatible with the null hypothesis that all groups are sampled from distributions with the same mean. The ANOVA statistics, unlike the parameter estimates, are insensitive to the choice of reference level.

14.2.4 Producing the ANOVA table

So, let’s get our ANOVA table for model m1.

anova(m1)
## Analysis of Variance Table
## 
## Response: redistlevel
##                      Df Sum Sq Mean Sq F value   Pr(>F)   
## Condition             2   5951 2975.43  5.4925 0.004188 **
## Leftwing              1   3010 3009.57  5.5555 0.018530 * 
## Condition:Leftwing    2    137   68.71  0.1268 0.880888   
## Residuals          1785 966987  541.73                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So what this table tells us is that Condition significantly affects redistlevel, as does being left wing, but there is no evidence for an interaction between the two. The rows for Condition and Leftwing are known as the ‘main effects’, since they respectively represent the differences between conditions averaging across left and right wing; and the effects of being left or right wing averaging across conditions. The F-ratio of about 5.5 for the main effect of Condition tells us that the variation between the means of the three condition groups is about 5.5 times larger than we would expect if the responses for each condition group were sampled from distributions with the same mean.
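
As a sanity check on where that number comes from, you can recompute it by hand from the table above: the F-ratio is just the mean square for the predictor divided by the residual mean square.

# F for the main effect of Condition = Mean Sq (Condition) / Mean Sq (Residuals)
2975.43 / 541.73    # approximately 5.49, matching the F value in the table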

A word of warning. There are several ways of generating the sums of squares on which the F-ratios are based, usually called type I, type II and type III. When the numbers of cases in each group are perfectly balanced, they all give the same answer, but with unbalanced data they can differ. Current recommendations, and intuitions about what a main effect ought to test, favour type II (Langsrud 2003). Beware, though: the anova() function in base R that we used above computes sequential (type I) sums of squares, and the default in some widely-used non-R statistical programmes (and even some R contributed packages) is type III. So, you should check which type your software is producing, and report in your methods which type of sums of squares your ANOVA is based on.
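
If you want type II sums of squares explicitly for a General Linear Model, one option (assuming you have the contributed package car installed) is its Anova() function. For these data the results should be very close to the table above, because the groups are nearly balanced:

# Type II ANOVA table for the same model, using the car package
library(car)
Anova(m1, type = 2)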

14.2.5 Reporting your ANOVA

You will want to know how to report the ANOVA results shown above. The F-ratio statistic has two degrees of freedom values associated with it, respectively called the numerator and denominator degrees of freedom. You can see these in the ANOVA table, one on the row for the predictor variable in question, and one on the row called Residuals. For the main effect of Condition, the degrees of freedom are respectively 2 and 1785. The first of these is the number of levels of Condition, minus 1; and the second is the number of observations in the model, minus the number of groups. You must always report both degrees of freedom values for any F-statistic, usually in brackets as shown below. So, here is how I would report our ANOVA results:

“We fitted a model predicting the mean level of redistribution from condition, left wing status, and their interaction. There were significant main effects of condition (F(2, 1785) = 5.49, p = 0.004) and of left-wing status (F(1, 1785) = 5.56, p = 0.019), but their interaction was not significant (F(2, 1785) = 0.13, p = 0.881).”

This is fine as an overall statement. It does not tell the reader which conditions differed significantly from which others, so you would want to do follow up analyses to investigate this. You can do this with tools you already have (set a reference level, and test individual parameter estimates against the reference category).
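
For example, using the models we have already fitted, you could inspect the individual pairwise comparisons and their confidence intervals with base R’s confint() (a quick sketch; m1 has High as its reference level and m2 has Unspecified):

# 95% confidence intervals for Low vs High and Unspecified vs High (from m1)
confint(m1, parm = c("ConditionLow", "ConditionUnspecified"))
# And for High vs Unspecified and Low vs Unspecified (from m2)
confint(m2, parm = c("ConditionHigh", "ConditionLow"))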

14.3 ANOVA for Linear Mixed Models

You might still want to perform an ANOVA test, but for a case where you have a within-subjects manipulation, or some other form of clustering, and hence are using a Linear Mixed Model. That’s fine: you can still get your ANOVA table, though there are some statistical adjustments (automatically performed) in producing it, to take account of the clustering in the data. In this section, I will show you how to do that using the study 1 data that we analysed with a Linear Mixed Model in the previous session.

First, let’s get the data again (you should have it saved from the previous session), load it in, and convert the independent variables into factors.

d1 <- read_csv("study1.data.csv") %>%
  mutate(luck=as.factor(luck), 
         heterogeneity=as.factor(heterogeneity))
## New names:
## Rows: 600 Columns: 18
## ── Column specification ───────────────────────────────────────────────
## Delimiter: ","
## chr (4): participant, luck, heterogeneity, gender
## dbl (12): ...1, level, mode, age, left.right, support.for.welfare, ...
## lgl (2): passed.luck.comprehension, passed.het.comprehension
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Then we are going to fit our Linear Mixed Model as we did in the previous session, though here we are going to include the interaction term as well:

library(lmerTest)
l1 <- lmer(level ~ luck*heterogeneity + (1|participant), data=d1)

Now, you should have your model object l1 in your environment. Getting the ANOVA table is quite simple; though beware! This ANOVA is provided by lmerTest, which uses type III sums of squares by default. Following the recommendation above, we therefore specify that we require type II. As it happens, it makes no difference in this case because the groups are perfectly balanced, but I am including the step for completeness.

anova(l1, type=2)
## Type II Analysis of Variance Table with Satterthwaite's method
##                     Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
## luck               29205.1 14602.6     2   495 29.7506 6.294e-13 ***
## heterogeneity       2020.3  2020.3     1   495  4.1161   0.04301 *  
## luck:heterogeneity   187.0    93.5     2   495  0.1905   0.82664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The only thing to note that is different from the General Linear Model case is that where your ANOVA table comes from a Linear Mixed Model, you can get non-integer degrees of freedom (degrees of freedom are always integers in the simple General Linear Model case). This is to do with the statistical adjustment for the clustering in the data (Satterthwaite’s method). Otherwise, you report your F-ratios in just the same way as you did for study 7.
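
If you wanted to reassure yourself that this adjustment is not driving your conclusions, lmerTest can also compute the degrees of freedom with the Kenward-Roger method instead (this is optional, and assumes you have the pbkrtest package installed):

# The same type II ANOVA table, but with Kenward-Roger degrees of freedom
anova(l1, type = 2, ddf = "Kenward-Roger")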

14.4 Summary

In this unit, we have:

  • Introduced ANOVA tables;

  • Described what they are for and how they relate to General Linear Models;

  • Worked through a practical example where the independent variable has three levels and ANOVA is a useful test;

  • And worked through a second example where there is clustering in the data and hence a Linear Mixed Model.


References

Langsrud, Øyvind. 2003. “ANOVA for Unbalanced Data: Use Type II Instead of Type III Sums of Squares.” Statistics and Computing 13 (2): 163–67. https://doi.org/10.1023/A:1023260610025.
Nettle, Daniel, and Rebecca Saxe. 2020. “Preferences for Redistribution Are Sensitive to Perceived Luck, Social Homogeneity, War and Scarcity.” Cognition 198 (May): 104234. https://doi.org/10.1016/j.cognition.2020.104234.