5 Quantifying magnitude

In the last chapter, we compared the “testing” and “estimation” approaches to inference. The “estimation” approach treats the task of statistical inference as using data to estimate the values of unknown quantities, expressed as population parameters. Confidence intervals or credible intervals are placed around the estimates. These intervals convey the “precision” in our estimates.

  • Wider intervals mean more uncertainty and less precision.
  • Narrower intervals mean less uncertainty and more precision.

Estimates that quantify the magnitude of a phenomenon of interest can be referred to as “effect sizes”. In this chapter, we’ll look at some popular kinds of effect sizes, some potential sources of bias in effect sizes, and some critiques of how effect sizes are used in practice.

5.1 The case for effect sizes

  • As we saw in chapter 4, the estimation approach to inference can be seen as an alternative to the testing approach. If so, what is it about estimation that we might prefer over testing?

A “negative” argument might be that the problems created by testing are so bad that we may as well try estimation. A “positive” argument would lay out the virtues of estimation directly.

  • The main case for effect sizes is that the magnitude of the effect on our response variable is usually what we care about. For instance, if I tell you that a drug will reduce the duration of your headache by a “statistically significant” amount of time, you will probably want to know what that amount of time is. “Significant” just means “some amount that would be unlikely to occur by chance if the drug didn’t do anything”. That’s not a lot of information. We are going to be much more excited about a drug that makes our headaches 8 hours shorter than about one that makes them 30 minutes shorter.

  • It is, sadly, uncommon to see effect sizes reported in popular media. Most articles written about scientific studies limit themselves to saying that something “worked” or “didn’t work”, or that some outcome got “larger” or “smaller”. The question “by how much?” often goes unanswered.

5.3 Problems with effect size

This chapter has put forth the case for using interpretable effect sizes in statistical inference. But, as always, there are some problems we need to be aware of. This section will cover:

  • How selection for significance biases published effect sizes upward
  • How “standardizing” effect sizes doesn’t automatically make them comparable
  • The “Crud Factor”

5.3.1 Selection for significance biases published effect sizes

“Selection for significance” refers to any procedure where we retain significant results and discard non-significant results.

This takes two common forms:

  • “p-hacking”: data are analyzed in a way that takes advantage of flexibility in analysis. Analyses that produce \(p<0.05\) are retained. No one gets to know how many analyses were tried (see the sketch after this list).
  • “publication bias”: journals and authors choose not to publish papers where the main result of interest is not statistically significant. No one gets to see the studies that “didn’t work”.
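
To make the first of these concrete, here is a minimal Python sketch of one form of p-hacking: under a true null, the same experiment is analyzed several ways, and it counts as a “finding” if any analysis clears \(p<0.05\). The specific setup (five independent outcome measures, 30 subjects per group) is an illustrative assumption, not something from a real study.

```python
import numpy as np
from scipy import stats

# Sketch of "flexibility in analysis": under a true null, analyze the same
# experiment several ways (here, hypothetically, 5 outcome measures) and
# call it a "finding" if ANY analysis gives p < 0.05.
rng = np.random.default_rng(0)
n, n_outcomes, n_sims = 30, 5, 10_000
false_positives = 0

for _ in range(n_sims):
    # The null is true: the two groups do not differ on any outcome.
    a = rng.normal(size=(n, n_outcomes))
    b = rng.normal(size=(n, n_outcomes))
    p_values = [stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(n_outcomes)]
    if min(p_values) < 0.05:
        false_positives += 1

print(f"Chance of at least one 'significant' result: {false_positives / n_sims:.2f}")
# With 5 independent outcomes this is roughly 1 - 0.95**5 ≈ 0.23, not 0.05.
```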

In most disciplines, statistically significant results are much more likely to be published in journals than non-significant results. This creates a biased impression of the state of research. Literature reviews only turn up “successes”; there is little to no information on “failures”.

Publication bias has another serious consequence that isn’t as well known. Because significant results tend to have larger effect sizes than non-significant results, selection for significance amounts to censoring smaller estimates, which biases effect sizes upward.

Another way to think of it: we usually use unbiased estimators to estimate parameter values. If you take an unbiased estimator and then condition it on \(p<0.05\), you get a biased estimator.

The amount of bias is greater when power is lower, because more non-significant results are being thrown away.

Gelman and Carlin call this phenomenon a “Type M” (for magnitude) error. Formally, this is the expected amount by which a significant estimate exaggerates the parameter value, usually expressed relative to the true value. Lower power implies a greater Type M error.

They also define a “Type S” (for sign) error, which is the probability an estimate will have the wrong sign, given that it is significant. Thankfully, Type S errors are only a concern when power is very low.
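
Here is a minimal sketch of this kind of design analysis, in the spirit of Gelman and Carlin’s “retrodesign” calculations: given a hypothesized true effect and the standard error of its estimate, it approximates power, the Type S rate, and the Type M (exaggeration) factor. The function name and the particular effect and standard-error values are illustrative assumptions; the point to notice is how the exaggeration grows as power falls, which is the claim made above.

```python
import numpy as np
from scipy import stats

def design_analysis(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Approximate power, Type S rate, and Type M (exaggeration) factor for a
    normally distributed estimate with known standard error.
    (Illustrative sketch in the spirit of Gelman & Carlin's 'retrodesign'.)"""
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    lam = true_effect / se                     # true effect in SE units
    # Analytic power and Type S rate for a two-sided z test
    power = stats.norm.sf(z_crit - lam) + stats.norm.cdf(-z_crit - lam)
    type_s = stats.norm.cdf(-z_crit - lam) / power
    # Type M: average |significant estimate| relative to the true effect
    estimates = rng.normal(true_effect, se, n_sims)
    significant = np.abs(estimates) > z_crit * se
    type_m = np.mean(np.abs(estimates[significant])) / true_effect
    return power, type_s, type_m

# Lower power (larger SE relative to the effect) => larger exaggeration
for se in (0.1, 0.25, 0.5):
    power, type_s, type_m = design_analysis(true_effect=0.5, se=se)
    print(f"SE={se:.2f}  power={power:.2f}  TypeS={type_s:.3f}  TypeM={type_m:.2f}x")
```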

  • Here is a simulation where the population effect size is d = 0.5 (half a standard deviation difference in means), and power is 30% (a code sketch along these lines appears after this list).

  • Red bars are significant results. Under publication bias, grey bars are discarded.

  • The mean of the significant results is \(d = 0.89\). The Type M error is \(78\%\):

\[ \frac{0.89-0.5}{0.5}=0.78 \]

  • Fun fact: if the sample mean effect size were equal to the true population mean effect size of d = 0.5, the p-value would be greater than 0.05 and the result would not be statistically significant.

  • It turns out that, when power is less than \(50\%\), the expected value of the test statistic will not reach the threshold for statistical significance.

  • To repeat: If power is less than \(50\%\), THE TRUTH IS NOT STATISTICALLY SIGNIFICANT!!
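
Below is a rough Python sketch of the kind of simulation described in this list. The text does not give the sample size, so the sketch assumes 17 observations per group, which puts power for d = 0.5 near 30%; with that assumption the mean of the significant estimates lands near the d = 0.89 reported above, and a test statistic computed at the true effect of d = 0.5 falls short of the critical value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
d_true, n_per_group, n_sims = 0.5, 17, 20_000   # n per group is an assumption (~30% power)

d_hats, p_vals = [], []
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n_per_group)        # control group
    b = rng.normal(d_true, 1.0, n_per_group)     # treatment group, shifted by d_true
    t, p = stats.ttest_ind(b, a)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d_hats.append((b.mean() - a.mean()) / pooled_sd)   # Cohen's d estimate
    p_vals.append(p)

d_hats, p_vals = np.array(d_hats), np.array(p_vals)
sig = p_vals < 0.05
print(f"power ≈ {sig.mean():.2f}")
print(f"mean d over all studies:         {d_hats.mean():.2f}")
print(f"mean d over significant studies: {d_hats[sig].mean():.2f}")

# An estimate exactly equal to the truth would not be significant here:
t_at_truth = d_true / np.sqrt(2 / n_per_group)   # t statistic if the estimated d were 0.5
t_crit = stats.t.ppf(0.975, 2 * n_per_group - 2)
print(f"t if the estimate equalled the true d: {t_at_truth:.2f} (critical value ≈ {t_crit:.2f})")
```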

5.3.2 The tradeoff between avoiding false positives in hypothesis testing and using unbiased methods for reporting effect size estimates

Recall that Fisher’s case for significance testing was that he wanted to avoid treating flukes as though they signified something real. If we require that our statistical estimates attain “significance”, then we will not often report the results of pure random chance as though they signified the existence of real phenomena. That was Fisher’s main argument in support of significance testing.

Fisher’s concern was about what we’d call Type I errors. And, if we are doing work in a field where null hypotheses are plausible and there is well founded concern about Type I errors, his advice might be good. But, we have now seen the tradeoff: if the null hypothesis is false but we only report statistically significant results, we have a Type II error problem that leads to upward bias in reported effect sizes. The only ways to avoid this are:

  • Only report results from very high power studies, so that Type II errors are rare.
  • Do not select for statistical significance.

Both are easier said than done.

5.3.3 Replication studies and registered reports

Can publication bias be avoided? One approach that attempts to prevent it is known as “registered reports”. To understand how it differs, we’ll first look at the standard approach.

Under the traditional approach to publishing research, the authors of a research paper (a.k.a. “manuscript”) send it to a journal for publication. If the journal’s editor thinks the paper might be worth publishing, the editor will assign experts in the relevant field or fields to review it. The reviewers (who are usually anonymous to the manuscript’s authors) will provide feedback to the authors and the editor, and make a recommendation to the editor. Typical recommendations are:

  • “Accept without revision” (this rarely happens)
  • “Accept with minor revisions”
  • “Accept with major revisions”
  • “Revise and resubmit”
  • “Reject”

The editor then takes the feedback and recommendations from the reviewers and makes a decision. If the editor agrees that revisions should be made, the editor will send the reviewers’ feedback to the authors, who can then implement all of the recommendations, or respond to the editor with their reasons for not implementing some or all of them. The final decision on whether to publish rests with the journal editor.

This process is called peer review. “Publication bias” is when the decision to publish is influenced by how the results turned out. A common form of publication bias is the phenomenon by which statistically significant results are more likely to be published than are non-significant results.

Registered reports is an alternative to the traditional publication process. From the website:

Registered Reports is a publishing format that emphasizes the importance of the research question and the quality of methodology by conducting peer review prior to data collection. High quality protocols are then provisionally accepted for publication if the authors follow through with the registered methodology.

This format is designed to reward best practices in adhering to the hypothetico-deductive model of the scientific method. It eliminates a variety of questionable research practices, including low statistical power, selective reporting of results, and publication bias, while allowing complete flexibility to report serendipitous findings.

Under registered reports, the decision of whether or not to publish is made prior to the results being known. The principle here is that the value of a scientific study should be assessed based on the importance of the question being asked and the quality of the proposed methodology. If the question is worth asking, aren’t the results worth knowing, however they turn out?

Papers published under this model are very unlikely to be subject to publication bias. Another kind of paper that is unlikely to be subject to publication bias is the replication study: a study that attempts to replicate previously published results. Because the aim is to assess how well prior science replicates, there should not be an incentive to publish only “successes” (in fact, some have claimed that replication researchers are biased toward publishing failures).

A famous 2015 project attempted to replicate 100 studies from top psychology journals, and found that only about 40% “successfully replicated” (how replication should be defined is a controversial topic; for a critical comment on that project, see Gilbert et al. (2016)).

The “Many Labs” projects took a different approach: instead of performing single replication attempts for lots of studies, they performed lots of replication attempts for a small number of studies.

5.3.4 Comparing replication studies to meta-analyses

So, we have statistical theory that says publication bias should bias estimated effect sizes upward, and that this bias gets worse as power gets lower. The questions then are:

  • How bad is the publication bias in a given body of scientific literature?
  • What was the statistical power for the published studies in a given body of scientific literature?

There are meta-analytic tools for assessing publication bias, some of which also attempt to correct this bias. Most analyze some combination of the distributions of p-values, test statistics, standard errors, effect size estimates, and sample sizes.
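
One of the simplest of these tools is Egger’s regression test for funnel-plot asymmetry: each study’s z-score is regressed on its precision (one divided by its standard error), and an intercept far from zero suggests that small, imprecise studies report systematically larger effects. Here is a minimal Python sketch; the effect estimates and standard errors below are made-up numbers, chosen only to illustrate the computation.

```python
import numpy as np
from scipy import stats

# Hypothetical per-study summaries; in a real meta-analysis these would be
# the published effect estimates and their standard errors.
effects = np.array([0.62, 0.48, 0.55, 0.35, 0.30, 0.28, 0.22, 0.20, 0.18, 0.15])
ses     = np.array([0.30, 0.28, 0.25, 0.20, 0.18, 0.15, 0.12, 0.10, 0.08, 0.06])

# Egger's test: regress z-scores on precision. With no small-study effects the
# intercept should be near zero (since z = effect/SE = effect * precision).
z_scores = effects / ses
precision = 1.0 / ses
res = stats.linregress(precision, z_scores)

t_stat = res.intercept / res.intercept_stderr        # intercept_stderr needs scipy >= 1.6
p_val = 2 * stats.t.sf(abs(t_stat), len(effects) - 2)
print(f"Egger intercept = {res.intercept:.2f}, two-sided p = {p_val:.3f}")
```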

Here is a paper looking descriptively at p-values and noting how many fall just below 0.05.

Here is a paper comparing many of these methods, and using simulation to see how well they perform under various scenarios.

There are meta-analytic tools for assessing average power among a set of studies.

Here is a method known as “z-curve”.

A 2020 study compared published meta-analyses (which combined studies potentially subject to publication bias) to published replication studies (in which there is no publication bias; replication studies will report results whether or not they are statistically significant):

https://www.nature.com/articles/s41562-019-0787-z

Here is the abstract:

Many researchers rely on meta-analysis to summarize research evidence. However, there is a concern that publication bias and selective reporting may lead to biased meta-analytic effect sizes. We compare the results of meta-analyses to large-scale preregistered replications in psychology carried out at multiple laboratories. The multiple-laboratory replications provide precisely estimated effect sizes that do not suffer from publication bias or selective reporting. We searched the literature and identified 15 meta-analyses on the same topics as multiple-laboratory replications. We find that meta-analytic effect sizes are significantly different from replication effect sizes for 12 out of the 15 meta-replication pairs. These differences are systematic and, on average, meta-analytic effect sizes are almost three times as large as replication effect sizes. We also implement three methods of correcting meta-analysis for bias, but these methods do not substantively improve the meta-analytic results.

How bad of a problem is publication bias and effect size bias? It depends on the field of research, but in some fields the problem is severe.

References

Gilbert, D. T., G. King, S. Pettigrew, and T. D. Wilson. 2016. “Comment on "Estimating the Reproducibility of Psychological Science".” Science 351 (March): 1037–37. https://doi.org/10.1126/science.aad7243.
Simpson, Adrian. 2017. “The Misdirection of Public Policy: Comparing and Combining Standardised Effect Sizes.” Journal of Education Policy 32 (January): 450–66. https://doi.org/10.1080/02680939.2017.1280183.

  1. we will also see that this practice is controversial↩︎