19 Equivalence testing

19.1 Introduction

In earlier units, we introduced the idea of null hypothesis significance testing. We estimated a parameter, and if zero fell outside the 95% confidence interval of the estimate, we said that the null hypothesis could be rejected. In such a case, we have evidence for the claim that the parameter of interest, in the world, is probably not zero.

But what if we fail to reject the null hypothesis? That is, the confidence interval includes zero. Are we justified in the inference that the parameter in question is zero? Or, is it simply that our evidence is inconclusive?

To give a more concrete example: suppose that in a trial of the effect of ibuprofen on depressive symptoms, the effect size for the difference in symptoms between the ibuprofen group and the control group is an SMD of -0.3, with a confidence interval from -0.7 to +0.1. Under the conventions of NHST, we are not justified in saying that we have evidence from this study that ibuprofen reduces symptoms, because zero is in the confidence interval. On the other hand, it would be odd to conclude that we do have evidence that ibuprofen has absolutely no effect on symptoms. Although zero is within our confidence interval, it is not in the middle of that interval, and in fact our best estimate (-0.3) is some distance below zero. There are lots of scenarios compatible with our data in which ibuprofen has a non-zero effect, for example if the true effect is -0.1, -0.2, -0.3, -0.4, and so on. In short, we don’t have evidence that would lead us to confidently reject the null, but we don’t obviously have evidence that the null is true either.

Another way of putting this problem is that NHST never gives you positive evidence that the null hypothesis is true. Mathematically, you can’t even calculate the probability that a parameter is exactly zero. The best you can do is work out the probability, given the data, that it lies in some small interval that contains zero. More practically, there are plenty of non-zero numbers that amount to zero for all intents and purposes. If, in the world, ibuprofen reduced depressive symptoms, but only by 0.00000001 standard deviations, then the null hypothesis would be technically false: the parameter is not exactly zero. On the other hand, the effect of ibuprofen would be so small as to be negligible from any conceivable practical or theoretical point of view. It is as good as zero, even if, pedantically, it is not exactly zero. What is the status of the null hypothesis then, true or false?

This chapter introduces an additional type of statistical testing, equivalence testing (Lakens, Scheel, and Isager 2018). Equivalence testing allows you to go a bit further than NHST. It allows you to ask, in effect: do I have positive evidence that my effect is as good as zero; positive evidence that my effect is meaningfully different from zero; or just inconclusive evidence? You can perform equivalence tests on parameter estimates from your familiar models such as those fitted using lm(), glm() or lmer(). Equivalence testing is a complement to NHST and is usually used in conjunction with it. We return to how and when you might use it at the end of the chapter.

19.2 The region of practical equivalence (ROPE)

The first step in performing an equivalence test is to define a region of practical equivalence (ROPE). The ROPE is the region of the parameter space that you take to be as good as zero. The ROPE is closely related to the SESOI (smallest effect size of interest), which we encountered in the chapter on power. A usual choice of ROPE is the region that lies between the SESOI in the negative direction and the SESOI in the positive direction, including zero.

Let’s return to the example of ibuprofen and depressive symptoms. You might decide that if ibuprofen produces a difference from control of less than 0.2 standard deviations of symptom score, then from a clinical and scientific point of view, its effect is so small as to be as good as zero. In this case, you define your ROPE as the interval \((-0.2, +0.2)\). You are going to say that any parameter value within the ROPE is equivalent to zero for all meaningful purposes.

Once you get the results of your study, there are three possible situations (the sketch after this list makes the decision rule concrete):

  • The confidence interval for the effect of ibuprofen might be completely within the ROPE. Let’s say you got a parameter estimate of -0.002 with a confidence interval -0.06 to +0.05. In this case, an equivalence test would tell you to accept the hypothesis that your effect is equivalent to zero.
  • The confidence interval for the effect of ibuprofen might be completely outside the ROPE. For example, you might get a parameter estimate of -0.5 with a confidence interval from -0.8 to -0.25. No value compatible with these data is equivalent to zero with a ROPE of -0.2 to 0.2, so an equivalence test would tell you to reject the hypothesis of equivalence to zero.
  • The parameter estimate for the effect of ibuprofen might be within the ROPE, but some of its confidence interval is not. For example, you might get a parameter estimate of -0.1, with a confidence interval from -0.5 to +0.2. This interval includes some effects that are not meaningfully different from zero, and some that are. The equivalence test would tell you that the hypothesis of equivalence to zero is undecided.
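
To make the decision rule concrete, here is a minimal sketch in R. The function rope_decision() is a hypothetical helper written purely for illustration; the equivalence_test() function we meet later in the chapter implements this logic for you.

# Hypothetical helper: classify a confidence interval against a ROPE
rope_decision <- function(ci_low, ci_high, rope = c(-0.2, 0.2)) {
  if (ci_low >= rope[1] && ci_high <= rope[2]) {
    "Accepted"    # interval entirely inside the ROPE
  } else if (ci_high < rope[1] || ci_low > rope[2]) {
    "Rejected"    # interval entirely outside the ROPE
  } else {
    "Undecided"   # interval straddles a ROPE boundary
  }
}

rope_decision(-0.06, 0.05)   # "Accepted"
rope_decision(-0.80, -0.25)  # "Rejected"
rope_decision(-0.50, 0.20)   # "Undecided"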

19.3 Equivalence tests and null hypothesis significance tests

The surprising thing about equivalence tests is that their results are not just the mirror image of null hypothesis significance tests of the same parameter on the same data. You might think that there would only be two possibilities: the NHST is significant and the hypothesis of equivalence to zero is rejected; or the NHST is not significant and the hypothesis of equivalence to zero is accepted. In fact, these are not the only possibilities.

The possible results of the NHST are {significant, not significant}, and the possible results of the equivalence test are that equivalence is {accepted, rejected, undecided}. Of the six combinations, five are actually possible: if the NHST is not significant, zero is in the confidence interval, so the interval must overlap a ROPE that contains zero, and equivalence can never be rejected. The five scenarios are illustrated in the forest plot below, where we have defined the ROPE as the interval \((-0.2, 0.2)\) in terms of SMD.
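
Here is a minimal sketch of how such a forest plot can be drawn, using hypothetical estimates and 90% confidence intervals chosen purely to illustrate the five scenarios (these numbers are not real data):

library(tidyverse)

studies <- tibble(
  study    = factor(paste("Study", 1:5), levels = paste("Study", 5:1)),
  estimate = c(-0.60, -0.10, -0.45,  0.00, -0.15),
  lower    = c(-0.90, -0.18, -0.85, -0.15, -0.45),
  upper    = c(-0.30, -0.02, -0.05,  0.15,  0.15)
)

ggplot(studies, aes(x = estimate, y = study)) +
  # Shade the ROPE, (-0.2, 0.2), and mark zero
  annotate("rect", xmin = -0.2, xmax = 0.2, ymin = -Inf, ymax = Inf, alpha = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_pointrange(aes(xmin = lower, xmax = upper)) +
  labs(x = "SMD (90% confidence interval)", y = NULL) +
  theme_minimal()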

Here:

  • In study 1 we have an effect which is significant by NHST (zero is not in the confidence interval), and the hypothesis of equivalence to zero is rejected (all of the confidence interval is outside the ROPE). Colloquially, the study showed that there is an effect and that it is big enough to matter.

  • In study 2, we have an effect which is significant by NHST, but the effect is so small that the entire confidence interval is within the ROPE, so the hypothesis of equivalence to zero is accepted. The effect is significantly different from zero, but equivalent to zero according to the criteria given. Colloquially, the study showed that there is an effect but also showed that it is too small to matter.

  • In study 3, we have an effect which is significant by NHST, but the equivalence test is undecided, because some of the confidence interval of the parameter is inside the ROPE, and some of it is outside. Colloquially, the study showed that there is an effect but did not provide enough evidence to know if the effect is big enough to matter.

  • In study 4, we have an effect which is not significant by NHST, and the hypothesis of equivalence to zero is accepted. Colloquially, the study showed that there is no effect here big enough to matter.

  • In study 5, we have an effect which is not significant by NHST, and the hypothesis of equivalence to zero is undecided because some of the confidence interval is outside the ROPE. Colloquially, the study did not establish that there is an effect, nor whether or not any effect is big enough to matter.

I hope that your studies give you results like study 1 or study 4. At least they are clear-cut! Often though, they will not, and you will find yourself in one of the more inconclusive scenarios. In particular, results like study 3 or study 5 could indicate that your study was a bit short on statistical power for clearly answering your question.

19.4 How to do equivalence tests

A nice function for performing equivalence tests is to be found in the contributed package parameters. The function is called equivalence_test() and it takes as its input any model object, such as those created by lm(), glm() or lmer(). Now is the moment to install the parameters package, if you have not done so before, using install.packages("parameters").

Let us go right back to the first data set we encountered in this book, the data on behavioural inhibition (Paál, Carpenter, and Nettle 2015). In fact, any of the other data sets we have encountered would have given us material for an example. Look back to the earlier unit where we analysed this data (section 6) if necessary.

First, let’s load in the data, and recode the Condition variable to be nicer.

library(tidyverse)  # assuming this is not already loaded from earlier chapters
d <- read_csv("https://dfzljdn9uc3pi.cloudfront.net/2015/964/1/data_supplement.csv")
## New names:
## • `` -> `...1`
## Rows: 58 Columns: 13
## ── Column specification ──────────────────────────────────
## Delimiter: ","
## chr  (1): Experimenter
## dbl (12): ...1, Sex, Age, Deprivation_Rank, Deprivat...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d <- d %>%
  mutate(Condition = case_when(
    Mood_induction_condition == 1 ~ "Negative",
    Mood_induction_condition == 2 ~ "Neutral"
  ))

Now we are going to fit a model predicting the outcome SSRT from the variables Condition, Deprivation_Score, Age and GRT. Critically, in addition to scaling all the continuous predictors as usual, we are also going to scale the outcome variable. This means that each parameter estimate is interpreted as the number of standard deviations of change in SSRT associated with that predictor. This standardisation will be important when we define our ROPE.

m1 <- lm(scale(SSRT) ~ Condition + scale(Deprivation_Score) + scale(Age) + scale(GRT), data = d)
summary(m1)
## 
## Call:
## lm(formula = scale(SSRT) ~ Condition + scale(Deprivation_Score) + 
##     scale(Age) + scale(GRT), data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04214 -0.53479 -0.03738  0.51124  2.73117 
## 
## Coefficients:
##                          Estimate Std. Error t value
## (Intercept)               0.09004    0.16701   0.539
## ConditionNeutral         -0.17390    0.24563  -0.708
## scale(Deprivation_Score)  0.32352    0.12188   2.654
## scale(Age)                0.38712    0.14033   2.759
## scale(GRT)               -0.29113    0.14032  -2.075
##                          Pr(>|t|)   
## (Intercept)               0.59211   
## ConditionNeutral          0.48212   
## scale(Deprivation_Score)  0.01052 * 
## scale(Age)                0.00799 **
## scale(GRT)                0.04297 * 
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9024 on 52 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2553, Adjusted R-squared:  0.198 
## F-statistic: 4.457 on 4 and 52 DF,  p-value: 0.003585

Ok, so we have a non-significant effect of Condition, and significant associations between the outcome and all the other predictors. But which of these effects are equivalent to zero?

We need to choose our ROPE. I will suggest the interval \((-0.2, 0.2)\): a shift of 0.2 standard deviations is conventionally considered a small effect size (look back to the chapter on statistical power to see how small a 0.2 sd shift in the mean of a distribution really is). I think it is fair to say that any effect on SSRT of less than 0.2 standard deviations per unit change in the predictor is so small as to be negligible. We get our equivalence tests of all the parameter estimates in m1 as shown below. Note that we have to specify our ROPE manually using range =. If you do not, the function will supply a default that may not be appropriate.

library(parameters)
## Warning: package 'parameters' was built under R version
## 4.3.3
equivalence_test(m1, range=c(-0.2, 0.2))
## # TOST-test for Practical Equivalence
## 
##   ROPE: [-0.20 0.20]
## 
## Parameter           |         90% CI |  SGPV | Equivalence |     p
## ------------------------------------------------------------------
## (Intercept)         | [-0.19,  0.37] | 0.758 |   Undecided | 0.301
## Condition [Neutral] | [-0.59,  0.24] | 0.512 |   Undecided | 0.525
## Deprivation Score   | [ 0.12,  0.53] | 0.118 |    Rejected | 0.842
## Age                 | [ 0.15,  0.62] | 0.059 |    Rejected | 0.906
## GRT                 | [-0.53, -0.06] | 0.224 |    Rejected | 0.741

So, here we would say that although the effect of Condition is not significantly different from zero (B = -0.17, se = 0.25, t = -0.71, p = 0.48), the test for equivalence to zero with a ROPE of (-0.2, 0.2) is undecided (\(p_{equivalence}\) = 0.53). The other parameter estimates are all significantly different from zero, and the hypotheses that they are equivalent to zero with a ROPE of (-0.2, 0.2) can be rejected.

Note that equivalence tests are based on the 90% confidence interval for the parameter estimate, whereas you will more commonly see 95% confidence intervals reported for parameter estimates. This is because the procedure consists of two one-sided tests (TOST), each performed at the 5% level, which together correspond to a 90% confidence interval.
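
You can check the correspondence yourself; the intervals printed by equivalence_test() should match base R’s 90% confidence intervals for the model parameters.

# 90% confidence intervals for the parameters of m1; these should
# match the intervals printed by equivalence_test()
confint(m1, level = 0.90)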

The results of the equivalence test depend strongly on the ROPE you choose. If you choose a very wide ROPE, you are effectively saying ‘an effect would have to be really big before I counted it as meaningful’. With a wide ROPE, more equivalence tests will tend to become significant. Let’s rerun our equivalence test on m1 with a wider ROPE of \((-0.6, 0.6)\).

equivalence_test(m1, range=c(-0.6, 0.6))
## # TOST-test for Practical Equivalence
## 
##   ROPE: [-0.60 0.60]
## 
## Parameter           |         90% CI |   SGPV | Equivalence |     p
## -------------------------------------------------------------------
## (Intercept)         | [-0.19,  0.37] | > .999 |    Accepted | 0.002
## Condition [Neutral] | [-0.59,  0.24] | 0.979  |    Accepted | 0.046
## Deprivation Score   | [ 0.12,  0.53] | 0.996  |    Accepted | 0.014
## Age                 | [ 0.15,  0.62] | 0.962  |    Rejected | 0.068
## GRT                 | [-0.53, -0.06] | 0.995  |    Accepted | 0.016

Now all of my equivalence tests are significant other than the one for Age (and that one is not far off). Against a ROPE this big, we have positive evidence that all of the parameters other than Age are smaller than the smallest effect we are prepared to consider meaningful.

In my examples, the ROPE has been symmetrical about zero. Though this is usually the case in practice, it does not have to be. There could be cases where for some reason you conclude that the SESOI in one direction is different from the SESOI in the other, and then you would set your range asymmetrically.
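
For example, here is a sketch with a hypothetical asymmetric ROPE, supposing you had decided that negative effects smaller in magnitude than 0.1 sd were negligible but that positive effects needed to exceed 0.3 sd to matter:

# Hypothetical asymmetric ROPE: negligible from -0.1 up to +0.3
equivalence_test(m1, range = c(-0.1, 0.3))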

19.5 When should you report equivalence tests?

Equivalence testing entered psychological and behavioural science as an addendum to NHST, to report in cases where the result was not significant. The idea was to add extra information, allowing the reader to distinguish between results that provided strong evidence of no meaningful effect and results that were simply inconclusive as to whether there was an effect or not. Thus, one scenario where you may well want to report equivalence tests is where you have important hypothesis tests whose results are non-significant.

In fact, however, you don’t need to restrict equivalence tests to just this scenario. You can use them more generally to ask: do I have a meaningful effect here, as opposed to just one that is different from zero by an arbitrarily small amount? In other words, equivalence tests ask about practical and theoretical significance, not just narrow statistical significance. There is therefore a case for using equivalence tests very widely in studies based on NHST. I would report both the NHST tests and the equivalence tests, as complements to one another.

Finally, equivalence tests can be used as statistical justifications for treating data sets as comparable. For example, you might want to show that the people in your patient group and the people in your healthy volunteer group were equivalent in terms of age, income, or BMI. The equivalence tests here are used to provide evidence that the groups are not too different on dimensions other than the disease you are studying. Or, equivalence tests could be used to justify pooling two different data sets and doing an analysis of them combined. If the two data sets are equivalent on key variables, it is reasonable to treat them as representing the same underlying population.
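
As a rough sketch using the data set we already have loaded, you could ask whether the two mood-induction conditions are equivalent on Age. The model below is constructed purely for illustration.

# Sketch: are the two mood-induction conditions equivalent on Age?
m_age <- lm(scale(Age) ~ Condition, data = d)
equivalence_test(m_age, range = c(-0.2, 0.2))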

However you use them, the results of equivalence tests and the conclusions from them depend entirely on the choice of ROPE. Set a huge ROPE, and everything will be equivalent to zero. Set a narrow ROPE, and nothing will be. You should set your ROPE and your SESOI at the time of your preregistration. This is a prompt to think carefully about what size of effect or association would be theoretically and practically meaningful for the phenomenon you study, and what difference is so small as to be negligible. You can fall back on arbitrary conventions (such as an SMD of 0.2 for a small effect and 0.5 for a medium one). If you can, though, think more specifically: you can draw on other findings in your area, or projections of what would be needed to make a practical difference. This is a useful exercise, as it helps clarify your thinking about what you are trying to find out. Setting your SESOI and ROPE can form part of developing your sample size justification and methodology, and then you can carry them through to the choice of equivalence tests in your analysis strategy.

19.6 Summary

This chapter introduced equivalence testing. We discussed how it relates to and how it differs from more familiar null hypothesis significance testing. We also considered the situations in which you might want to include it as part of your analysis strategy. Finally, we met a simple practical way of getting equivalence tests for your statistical models, using the equivalence_test() function from the R package parameters.


References

Lakens, Daniël, Anne M. Scheel, and Peder M. Isager. 2018. “Equivalence Testing for Psychological Research: A Tutorial.” Advances in Methods and Practices in Psychological Science 1 (2): 259–69. https://doi.org/10.1177/2515245918770963.
Paál, Tünde, Thomas Carpenter, and Daniel Nettle. 2015. “Childhood Socioeconomic Deprivation, but Not Current Mood, Is Associated with Behavioural Disinhibition in Adults.” PeerJ 3 (May): e964. https://doi.org/10.7717/peerj.964.