Chapter 6 Between-study Heterogeneity

By now, we have shown you how to pool effect sizes in a meta-analysis. In meta-analytic pooling, we aim to synthesize the effects of many different studies into one single effect. However, this only makes sense if we aren’t comparing apples and oranges. For example, the overall effect we calculate in the meta-analysis may be small while a few studies still report very high effect sizes. Such information is lost in the aggregate effect, but it is very important to know whether all studies, or interventions, yield small effect sizes, or whether there are exceptions.

It could also be that some very extreme effect sizes, so-called outliers, were included in the meta-analysis. Such outliers may have distorted our overall effect, and it is important to know what the overall effect would have looked like without them.

The extent to which effect sizes vary within a meta-analysis is called heterogeneity. It is very important to assess heterogeneity in meta-analyses, as high heterogeneity could be caused by the fact that there are actually two or more subgroups of studies present in the data, which have a different true effect. Such information could be very valuable for research, because this might allow us to find certain interventions or populations for which effects are lower or higher. From a statistical standpoint, high heterogeneity is also problematic. Very high heterogeneity could mean that the studies have nothing in common, and that there is no “real” true effect behind our data, meaning that it makes no sense to report the pooled effect at all (Borenstein et al. 2011).

The idea behind heterogeneity

Rücker and colleagues (Rücker et al. 2008) name three types of heterogeneity in meta-analyses:

  1. Clinical baseline heterogeneity. These are differences between the sample characteristics of the studies. For example, one study might have included rather old people, while another might have recruited participants who were mostly quite young.
  2. Statistical heterogeneity. This is the heterogeneity we find in our collected effect size data. Such heterogeneity might be important from a clinical standpoint (e.g., when we do not know if a treatment is very or only marginally effective because the effects vary a lot from study to study), or from a statistical standpoint (because it reduces the confidence we have in our pooled effect).
  3. Other sources of heterogeneity, such as design-related heterogeneity.

Points 1 and 3 may be controlled for to some extent by restricting the scope of our search for studies to certain well-defined intervention types, populations, and outcomes.

Point 2, on the other hand, has to be assessed once we have pooled the studies. This is what this chapter focuses on.

Heterogeneity Measures

There are three types of heterogeneity measures which are commonly used to assess the degree of heterogeneity. In the following formulas, \(k\) denotes an individual study, \(K\) denotes the total number of studies in our meta-analysis, \(\hat \theta_k\) is the estimated effect of study \(k\) with a variance of \(\hat \sigma^{2}_k\), and \(w_k\) is the individual weight of the study (i.e., its inverse variance: \(w_k = \frac{1}{\hat \sigma^{2}_k}\); see infobox in Chapter 4.1.1 for more details).

1. Cochran’s Q

Cochran’s Q-statistic is based on the differences between the observed effect sizes and the fixed-effect model estimate of the effect size. Each difference is squared, weighted by \(w_k\), and the results are then summed.

\[ Q = \sum\limits_{k=1}^K w_k \left(\hat\theta_k - \frac{\sum\limits_{k=1}^K w_k \hat\theta_k}{\sum\limits_{k=1}^K w_k}\right)^{2}\]
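The formula for \(Q\) can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a dedicated meta-analysis package; the function name `cochran_q` and the effect sizes and variances below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def cochran_q(effects, variances):
    """Return Cochran's Q and the fixed-effect pooled estimate.

    effects   -- observed effect sizes (theta_k), one per study
    variances -- their sampling variances (sigma_k^2)
    """
    # Inverse-variance weights: w_k = 1 / sigma_k^2
    weights = [1.0 / v for v in variances]
    # Fixed-effect pooled estimate: weighted mean of the effect sizes
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Weighted sum of squared deviations from the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return q, pooled


# Hypothetical data: four studies with identical precision
effects = [0.10, 0.30, 0.50, 0.90]
variances = [0.04, 0.04, 0.04, 0.04]

q, pooled = cochran_q(effects, variances)
print(f"pooled = {pooled:.2f}, Q = {q:.2f}")
```

Because all four hypothetical studies have the same variance, the pooled estimate is simply the arithmetic mean of the effects, which makes the result easy to check by hand.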

2. Higgins & Thompson’s \(I^{2}\)

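\(I^{2}\) (Higgins and Thompson 2002) is the percentage of variability in the effect sizes which is not caused by sampling error. It is derived from \(Q\); the formula below yields a proportion between 0 and 1, which is multiplied by 100 to express it as a percentage:

\[I^{2} = \max \left\{0, \frac{Q-(K-1)}{Q} \right\}\]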
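As a small sketch of this formula, the Python function below computes \(I^{2}\) from a given \(Q\) and number of studies \(K\). The function name `i_squared` and the input values are hypothetical and serve only as an illustration.

```python
def i_squared(q, k):
    """Return I^2 as a proportion (0 to 1) from Cochran's Q and K studies."""
    if q <= 0:
        # No observed dispersion at all; also avoids division by zero
        return 0.0
    # Excess of Q over its degrees of freedom (K - 1), relative to Q,
    # truncated at zero
    return max(0.0, (q - (k - 1)) / q)


# Hypothetical values: Q clearly exceeds its degrees of freedom (K - 1 = 3)
high = i_squared(8.75, 4)
# Hypothetical values: Q is below its degrees of freedom, so I^2 is set to 0
none = i_squared(2.0, 4)

print(f"I^2 = {100 * high:.1f}%")
print(f"I^2 = {100 * none:.1f}%")
```

The truncation at zero matters in practice: whenever \(Q\) is smaller than \(K-1\), the raw ratio would be negative, which cannot be interpreted as a share of variability.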

3. Tau-squared

\(\tau^{2}\) is the between-study variance in our meta-analysis. As we show in Chapter 4.2.1, there are various proposed ways to calculate \(\tau^{2}\).
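To make this more concrete, here is a sketch of one widely used estimator, the DerSimonian-Laird method-of-moments estimator. It is picked here purely for illustration and is only one of several options; the function name and the data are hypothetical.

```python
def tau2_dersimonian_laird(effects, variances):
    """DerSimonian-Laird estimate of the between-study variance tau^2."""
    k = len(effects)
    weights = [1.0 / v for v in variances]  # w_k = 1 / sigma_k^2
    sum_w = sum(weights)
    # Fixed-effect pooled estimate and Cochran's Q
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum_w
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    # Scaling constant that puts the excess of Q on the variance scale
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    # Truncate at zero, since a variance cannot be negative
    return max(0.0, (q - (k - 1)) / c)


# Hypothetical data: four studies with identical precision
effects = [0.10, 0.30, 0.50, 0.90]
variances = [0.04, 0.04, 0.04, 0.04]

tau2 = tau2_dersimonian_laird(effects, variances)
print(f"tau^2 = {tau2:.4f}")
```

Unlike \(I^{2}\), the result is expressed on the squared scale of the effect size itself, so its square root can be read as the standard deviation of the true effects.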

Which measure should I use?

Generally, when we assess and report heterogeneity in a meta-analysis, we need a measure which is robust, and not too easily influenced by statistical power. Cochran’s Q increases both when the number of studies (\(K\)) increases, and when the precision (i.e., the sample size \(N\) of a study) increases. Therefore, \(Q\), and whether it is significant, depends heavily on the size of your meta-analysis, and thus on its statistical power. We should therefore not rely on \(Q\) alone when assessing heterogeneity.

\(I^2\), on the other hand, is not sensitive to changes in the number of studies in the analysis. \(I^2\) is therefore used extensively in medical and psychological research, especially since there is a “rule of thumb” to interpret it (Higgins et al. 2003):

  • \(I^{2}\) = 25%: low heterogeneity
  • \(I^{2}\) = 50%: moderate heterogeneity
  • \(I^{2}\) = 75%: substantial heterogeneity

Despite its common use in the literature, \(I^2\) is not always an adequate measure of heterogeneity either, because it still depends heavily on the precision of the included studies (Rücker et al. 2008; Borenstein et al. 2017). As said before, \(I^{2}\) is simply the share of variability not caused by sampling error. If our studies become increasingly large, the sampling error tends to zero, and \(I^{2}\) tends to 100% simply because the individual studies have a greater \(N\), even if the between-study variance itself has not changed. Relying only on \(I^2\) is therefore not a good option either.
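This dependence on precision can be illustrated with a short Python sketch. It assumes the approximate relation \(I^{2} \approx \tau^{2} / (\tau^{2} + s^{2})\), where \(s^{2}\) stands for a “typical” within-study sampling variance (a symbol introduced here for illustration, following the reasoning in Higgins and Thompson 2002). All numbers are hypothetical.

```python
def expected_i2(tau2, within_var):
    """Approximate I^2 (as a proportion) for a given between-study variance
    tau2 and a 'typical' within-study sampling variance within_var."""
    return tau2 / (tau2 + within_var)


# The between-study variance is held constant throughout
tau2 = 0.05

# Hypothetical within-study variances: larger studies have smaller variances
scenarios = [
    ("small studies", 0.20),
    ("medium studies", 0.05),
    ("large studies", 0.005),
]

for label, within_var in scenarios:
    i2 = expected_i2(tau2, within_var)
    print(f"{label:>14}: within-study variance = {within_var:.3f}, "
          f"I^2 = {100 * i2:.0f}%")
```

Although the true between-study variance is identical in all three scenarios, \(I^{2}\) moves from “low” to well above the “substantial” threshold as the studies get larger.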

Tau-squared, on the other hand, is insensitive to both the number of studies and their precision. Yet, it is often hard to interpret how relevant our tau-squared is from a practical standpoint.

Prediction intervals (like the ones we automatically calculated in Chapter 4) are a good way to overcome this limitation (IntHout et al. 2016), as they take our between-study variance into account. Prediction intervals give us a range into which we can expect the effects of future studies to fall, based on our present evidence in the meta-analysis. If our prediction interval, for example, lies completely on the “positive” side favoring the intervention, we can be quite confident that, despite varying effects, the intervention will be at least somewhat beneficial in the contexts we expect to study in the future. If the prediction interval includes zero, we can be less sure about this, although it should be noted that broad prediction intervals are quite common, especially in medicine and psychology.
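For reference, a commonly used formula for an approximate 95% prediction interval, as presented for example in IntHout et al. (2016), is given below. The symbols \(\hat\mu\) (the pooled random-effects estimate) and \(SE(\hat\mu)\) (its standard error) are introduced here for illustration and do not appear elsewhere in this chapter; \(t_{K-2,\,0.975}\) is the 97.5% quantile of a \(t\)-distribution with \(K-2\) degrees of freedom.

\[ \hat\mu \pm t_{K-2,\,0.975} \sqrt{\hat\tau^{2} + SE(\hat\mu)^{2}} \]

The term under the square root shows why the prediction interval is wider than the confidence interval of the pooled effect: it adds the between-study variance \(\hat\tau^{2}\) to the uncertainty of the pooled estimate itself.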


Borenstein, Michael, Larry V Hedges, Julian PT Higgins, and Hannah R Rothstein. 2011. Introduction to Meta-Analysis. John Wiley & Sons.

Rücker, Gerta, Guido Schwarzer, James R Carpenter, and Martin Schumacher. 2008. “Undue Reliance on I² in Assessing Heterogeneity May Mislead.” BMC Medical Research Methodology 8 (1). BioMed Central: 79.

Higgins, Julian PT, and Simon G Thompson. 2002. “Quantifying Heterogeneity in a Meta-Analysis.” Statistics in Medicine 21 (11). Wiley Online Library: 1539–58.

Higgins, Julian PT, Simon G Thompson, Jonathan J Deeks, and Douglas G Altman. 2003. “Measuring Inconsistency in Meta-Analyses.” BMJ: British Medical Journal 327 (7414). BMJ Publishing Group: 557.

Borenstein, Michael, Julian PT Higgins, Larry V Hedges, and Hannah R Rothstein. 2017. “Basics of Meta-Analysis: I² Is Not an Absolute Measure of Heterogeneity.” Research Synthesis Methods 8 (1). Wiley Online Library: 5–18.

IntHout, Joanna, John PA Ioannidis, Maroeska M Rovers, and Jelle J Goeman. 2016. “Plea for Routinely Presenting Prediction Intervals in Meta-Analysis.” BMJ Open 6 (7). British Medical Journal Publishing Group: e010247.