6 NHST and Power
6.1 Null Hypothesis Significance Testing
Null hypothesis significance testing (NHST) is a controversial approach to testing hypotheses, yet it remains the most commonly employed approach in psychological science.
Simply put, NHST begins with an assumption, the null hypothesis, about some effect in a population. Data are then collected that are believed to be representative of that population. If the data do not align with the null hypothesis, this is taken as evidence against the null hypothesis.
There are several concepts and misconceptions related to NHST that need to be addressed to facilitate your research.
6.2 p-values
Prior to exploring p-values, let's ensure you have a basic understanding of probability notation. First, the notation p(A) reads as "the probability of A."

For example, the probability of flipping a coin and getting a heads is p(heads) = .5.

Additionally, I could use the notation p(A|B), which reads as "the probability of A given B."

For example, what if I provided more details in the previous coin example: I say now that the coin is not a fair coin. Well, the original probability will only hold if the coin is fair. That is, p(heads|fair coin) = .5. However, with the new information, p(heads|unfair coin) will no longer equal .5.

Importantly, reverse conditional probabilities are not equal. That is, p(A|B) does not generally equal p(B|A). For example (a small R sketch follows the list below):

- p(Canadian|Prime Minister) = 1.00
- p(Prime Minister|Canadian) = .0000000263
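To make the asymmetry concrete, here is a minimal R sketch; the population count is approximate and purely illustrative:

```r
# A minimal sketch of the asymmetry above; the counts are illustrative only
n_canadians <- 38000000 # approximate number of Canadians
n_prime_ministers <- 1  # one sitting Prime Minister, who is Canadian

p_canadian_given_pm <- n_prime_ministers / n_prime_ministers # p(Canadian|Prime Minister)
p_pm_given_canadian <- n_prime_ministers / n_canadians       # p(Prime Minister|Canadian)

p_canadian_given_pm # 1.00
p_pm_given_canadian # ~ .0000000263
```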
A key feature of NHST is that the null hypothesis is assumed to be true. Given this assumption, we can estimate how likely a set of data are. This is what a p-value tells you.
The p-value is the probability of obtaining data as or more extreme than you did, given a true null hypothesis. We can use our notation: p(D|H0), where D is our data (or data more extreme) and H0 is the null hypothesis.
When the p-value falls below some predetermined threshold, the result is often referred to as statistically significant. This threshold (alpha) has typically, and arbitrarily, been set at .05.
…beginning with the assumption that the true effect is zero (i.e., the null hypothesis is true), a p-value indicates the proportion of test statistics, computed from hypothetical random samples, that are as extreme, or more extreme, than the test statistic observed in the current study.
or, stated another way:
The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.
Simply, a p-value indicates the probability of your test statistic, or one more extreme, assuming the null: p(D|H0).
| Research Result | Reality: No Effect (H0 true) | Reality: Effect (H0 false) |
|---|---|---|
| No Effect (fail to reject H0) | Correctly Fail to Reject H0 | Type II Error (β) |
| Effect (reject H0) | Type I Error (α) | Correctly Reject H0 |
When we conduct NHST, we assume that the null hypothesis is indeed true, despite never truly knowing this. As noted, a p-value indicates the likelihood of your data, or data more extreme, given the null.
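As a concrete illustration of p(D|H0), the two-sided p-value for a t statistic is simply the area under the null t distribution at least as far from zero as the observed statistic. The values below are hypothetical, chosen only to show the calculation:

```r
# Hypothetical values for illustration only
t_obs <- 2.1    # observed t statistic
deg_free <- 38  # degrees of freedom

# Two-sided p-value: probability, under the null, of a t statistic
# at least as extreme (in either direction) as the one observed
p_value <- 2 * pt(abs(t_obs), df = deg_free, lower.tail = FALSE)
p_value
```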
6.2.1 If the null hypothesis is true, why would our data be extreme?
In inferential statistics we make inferences about population-level parameters based on sample-level statistics. For example, we infer that a sample mean is indicative of a population mean. In NHST, we assume that population-level effects or associations (i.e., correlations, mean differences, etc.) are zero, null, nothing at the population level (note: this is not always the case). If we sampled repeatedly from a population whose true effect or association was exactly zero, most samples would show effects near zero, but some would show seemingly large effects purely by chance.
The above shows the distribution of 10,000 correlation coefficients derived from simulated samples from a population whose true correlation is zero (ρ = 0).

The magnitude of the tails is arbitrary, for the most part, but has been set to a standard of 5% (corresponding to α = .05, or 2.5% per tail).

The yellow regions represent the extreme 1% of the distribution (0.5% per tail). The red regions mark where the original 5% cut-offs fell.

So, we set the criterion, α, in advance; if our sample statistic lands in one of those extreme regions, we deem it statistically significant.
In short, p-values are the probability of getting a set of data, or one more extreme, given the null. We compare this to a criterion cut-off, α. If our data are very improbable given the null, so much so that the p-value falls below our proposed cut-off, we say the result is statistically significant.
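Here is a sketch of the kind of simulation described above: draw many samples from a population in which the true correlation is zero and find the cut-offs bounding the most extreme 5% and 1% of sample correlations. The sample size per draw is an assumption made only for illustration:

```r
set.seed(1)

n_per_sample <- 84 # assumed sample size for illustration
null_rs <- replicate(10000, {
  x <- rnorm(n_per_sample)
  y <- rnorm(n_per_sample) # independent of x, so the true correlation is 0
  cor(x, y)
})

# Cut-offs bounding the most extreme 5% (red) and 1% (yellow) of the null distribution
quantile(null_rs, probs = c(.025, .975))
quantile(null_rs, probs = c(.005, .995))
```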
6.2.2 Misconceptions
Many of these misconceptions have been described in detail elsewhere (e.g., Nickerson, 2000). I visit only some of them.
6.2.2.1 Odds against chance fallacy
The odds against chance fallacy suggests that a p-value indicates the probability that the null hypothesis is true (i.e., that the result is "due to chance"). For example, someone might conclude that if their p-value is .03, there is only a 3% chance that the null hypothesis is true. This is backwards: the p-value is calculated assuming the null hypothesis is true, so it cannot also tell you the probability that this assumption holds.
Jacob Cohen (1994) outlines a nice example in which he compares the probability of a diagnostic test result given that a person has a condition with the probability of having the condition given the test result; the two conditional probabilities can differ dramatically.
If you want the probability that the null hypothesis is true given your data, p(H0|D), NHST simply cannot provide it.
6.2.2.2 Odds the alternative is true
In NHST, no likelihoods are attributed to hypotheses. All p-values are predicated on the assumption that the null hypothesis is true; they therefore say nothing about the probability that the alternative hypothesis is true.
6.2.2.3 Small p-values indicate large effects
This is not the case. p-values depend on other things, such as sample size, that can lead to statistical significance for minuscule effect sizes. For example, one can achieve a high degree of statistical power (discussed below) for a population effect as small as d = .02, provided the sample is large enough, as the following table shows.
| Power (1 - β) | d (effect size) | Alpha (α) | Required Sample to Achieve Power | Cohen's Effect Size Classification |
|---|---|---|---|---|
| .8 | .02 | .05 | 39,245 | <Small |
| .8 | .05 | .05 | 6,280 | <Small |
| .8 | .2 | .05 | 393 | Small |
| .99 | .02 | .05 | 91,863 | <Small |
From the table above, if the true population effect was tiny (d = .02), a large enough sample would still detect it and routinely produce small, statistically significant p-values. A small p-value therefore does not imply a large effect.
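Values like those in the table can be obtained with the pwr package (introduced more fully below); a brief sketch:

```r
library(pwr)

# A trivially small effect (d = .02) still reaches 80% power with a huge sample
pwr.t.test(d = .02, power = .80, sig.level = .05)

# Compare with a conventionally "small" effect (d = .2)
pwr.t.test(d = .2, power = .80, sig.level = .05)
```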
6.3 Power
Whereas p-values rest on the assumption that the null hypothesis is true, the contrary assumption, that the null hypothesis is false (i.e., an effect exists), is important for determining statistical power. Statistical power is defined as the probability of correctly rejecting the null hypothesis given a true population effect size and sample size, or, more formally:
Statistical power is the probability that a study will find p < α IF an effect of a stated size exists. It's the probability of rejecting H0 when H1 is true. (Cumming & Calin-Jageman, 2016)
See the following figure for a depiction, where α = .05.
We can conclude from this definition that if your statistical power is low, you will not be likely to reject H0 even when a true effect exists in the population.
Before proceeding to the next example, please note that you will often see the term 'rho', represented by the Greek symbol ρ, which denotes a population-level correlation (in contrast to r, a sample correlation).
Consider a researcher who is interested in the association between substance use (SU) and suicidal behaviors (SB) in Canadian high school students. Let's assume that the true association between SU and SB is ρ = .30.
So, it appears that as substance use increases, so do suicidal behaviors. Although we aren't so omniscient in the real world, the population correlation here is known to be ρ = .30.
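The simulations that follow draw samples from a data frame called data with columns SU and SB. That object isn't shown in the text; here is one way (an assumption, not necessarily the original code) such a population of 100,000 students could be generated:

```r
library(MASS)  # for mvrnorm()
library(dplyr)

set.seed(1234)

# 100,000 "students" with standardized SU and SB scores correlating at .30
pop <- mvrnorm(n = 100000,
               mu = c(0, 0),
               Sigma = matrix(c(1, .3, .3, 1), nrow = 2),
               empirical = TRUE)

data <- as.data.frame(pop) %>%
  rename(SU = V1, SB = V2)

cor(data$SU, data$SB) # .30
```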
Before conducting our study, we will conduct a power analysis to determine the sample size required to adequately power it. We want a good probability of rejecting the null if it is false (and here, we know it is). To calculate the required sample size, we need: i) our hypothesized population effect size, ii) our alpha level, and iii) our desired power. We will use the pwr package to conduct the power analysis. This package is very useful: you supply any three of the four required pieces of information and it calculates the missing piece. More details on using pwr are below.
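For our analysis, the exact pwr.r.test() call isn't printed in the text, but it would look something like the following sketch, which produces the output shown next:

```r
library(pwr)

# Solve for n, given the population correlation, alpha, and desired power
pwr.r.test(r = .3, sig.level = .05, power = .80)
```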
approximate correlation power calculation (arctangh transformation)
n = 84.07364
r = 0.3
sig.level = 0.05
power = 0.8
alternative = two.sided
The results suggest that we need a sample of about 84 people (n = 84.07) to achieve our desired power of .80.
We can create a histogram that plots the results of many random studies (here, 2000 samples of n = 84) drawn from our population to determine which meet our criterion for statistical significance.
First, we will plot a histogram of all of the calculated correlation coefficients.
This graph represents the distribution of correlation coefficients for each of our random samples, which were drawn from the 100,000 high schoolers. Notice how they form a seemingly normal distribution around our true population correlation coefficient, ρ = .30.
The results of our correlations suggest that 1610 correlation coefficients were at or beyond the critical value. Do you have any guess what proportion of the total samples that was? Recall that power is the probability that any study will reach p < α when the effect truly exists: here, 1610 of the 2000 samples, or .805, which is almost exactly the power of .80 we planned for.
6.3.1 What if we couldn’t recruit 84 participants?
Perhaps we sampled from one high school in a small Canadian city and could only recruit 32 participants. How do you think this would affect our power? Smaller samples give less precise estimates of population parameters (i.e., the histogram above becomes more spread out), and they also increase the critical correlation coefficient, so the red regions shift outward. Both of these reduce power. Let's rerun our simulation with 2000 random samples of n = 32.
set.seed(372837)

# Draw 2000 random samples of n = 32 from the population and store the
# sample correlation between SU and SB for each
correlations_32 <- pmap(.l = list(sims = 1:2000,
                                  ss = rep(32, times = 2000)),
                        .f = function(sims, ss){
                          df <- sample_n(tbl = data, size = ss, replace = F)
                          r <- cor.test(df$SU, df$SB)$estimate
                          return(r)
                        })

# Bind the list of correlations into a single data frame
correlations_32 <- do.call(rbind, correlations_32) %>%
  as.data.frame()
Next we can plot the resultant correlation coefficients. Note that the critical correlation coefficient for n = 32 is larger than it was for n = 84, so the red regions sit further from zero.
Hopefully, you can see that despite the distribution still centering around the population correlation of .30, it is more spread out and the red regions are further outward than in the previous example. Let's see how many of the 2000 studies reached p < .05.
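The tallies below come from code not shown in the text; one way (an assumption) to compute the critical correlation and count the significant samples is:

```r
# Critical |r| for alpha = .05 (two-tailed) with n = 32
n <- 32
t_crit <- qt(.975, df = n - 2)
r_crit <- t_crit / sqrt(t_crit^2 + (n - 2))

# 0 = not significant, 1 = significant, across the 2000 sample correlations
table(as.numeric(abs(correlations_32[[1]]) >= r_crit))
```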
0 1
1208 792
Out of the 2000 simulated studies, 792 yielded statistically significant results (792/2000 = 39.6%).
A formal power calculation confirms this:
approximate correlation power calculation (arctangh transformation)
n = 32
r = 0.3
sig.level = 0.05
power = 0.3932315
alternative = two.sided
Whoa! Close enough for me. Let’s take it one step further and assume we could only get 20 participants.
Next we can plot the resultant correlation coefficients. Note that the critical correlation coefficient for n = 20 is larger still.
Hopefully, you can see that despite the distribution still centering around our population correlation of .3, the red regions are further outward than in the previous two examples. Again, let's see how many studies resulted in p < .05:
0 1
1504 496
So, 496 studies yielded statistically significant results (496/2000 = 24.8%).
approximate correlation power calculation (arctangh transformation)
n = 20
r = 0.3
sig.level = 0.05
power = 0.2559237
alternative = two.sided
So, out of 2000 random samples from our population, 24.8% had p < .05, closely matching the formal power estimate of about .26.
6.4 Increasing Power
Power is a function of three components: sample size, hypothesized population effect size, and alpha level. The table below shows how power changes as these components vary (a pwr.r.test() sketch follows the table).
| Population (ρ) | Sample Size | Alpha (α) | Power (1 - β) |
|---|---|---|---|
| .1 | 20 | .05 | .0670 |
| .1 | 32 | .05 | .0845 |
| .1 | 84 | .05 | .1482 |
| .1 | 200 | .05 | .2919 |
| .3 | 20 | .05 | .2559 |
| .3 | 32 | .05 | .3932 |
| .3 | 84 | .05 | .9776 |
| .3 | 200 | .05 | .9917 |
| .6 | 20 | .05 | .8306 |
| .6 | 32 | .05 | .9657 |
| .6 | 84 | .05 | .9999 |
| .6 | 200 | .05 | 1.000 |
| … | … | … | … |
| .1 | 782 | .05 | .8000 |
| .05 | 3136 | .05 | .8000 |
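Rows like these can be reproduced with pwr.r.test() by supplying any three of the four quantities; for example:

```r
library(pwr)

pwr.r.test(n = 20, r = .3, sig.level = .05)       # solve for power (.2559, as in the table)
pwr.r.test(n = 200, r = .1, sig.level = .05)      # solve for power (.2919, as in the table)
pwr.r.test(r = .1, power = .80, sig.level = .05)  # solve for n (~782, as in the table)
```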
6.4.1 Effect Size
As the true population effect reduces in magnitude, your power is also reduced, given a constant sample size and alpha level.
6.4.2 Estimating Population Effect Size
There are many ways to estimate the population effect size. Here are some common examples in order of recommendation (i.e., try the higher ones first):
- Existing meta-analysis results: some recommend using the lower-bound estimate of the presented CI.
- Existing studies that have parameter estimates: some recommend using the lower-bound estimate of the presented CI or halving the effect size of a single study.
- Consider the smallest meaningful effect size based on extant theory.
- Use general effect size determinations that are considered small, medium, and large. Use the one that makes more sense for your theory.
6.4.3 Alpha (α) Level
Recall from above that our alpha level is the criterion against which we compare our p-values. Reducing our alpha level results in a larger correlation coefficient criterion (a more extreme cut-off) and, thus, smaller red areas. In short, reducing alpha (i.e., a stricter criterion) will decrease power, holding all other things constant.
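For example, holding the sample size and population correlation constant, a stricter alpha yields lower power (a small sketch using values from the earlier example):

```r
library(pwr)

pwr.r.test(n = 84, r = .3, sig.level = .05)  # alpha = .05  -> power about .80
pwr.r.test(n = 84, r = .3, sig.level = .005) # alpha = .005 -> noticeably lower power
```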
6.5 Power in R
We will focus on two packages for conducting power analysis: pwr and pwr2.
library(pwr)
library(pwr2)
6.5.1 Correlation
Correlation power analysis involves four pieces of information; you need any three to calculate the fourth:
- n is the sample size
- r is the population effect size
- sig.level is your alpha level
- power is the desired power
So, if we wanted to know the required sample size to achieve a power of .8, with an alpha of .05 and a hypothesized population correlation of .25:
## You simply leave out the piece you want to calculate
pwr.r.test(r = .25,
power = .8,
sig.level = .05)
approximate correlation power calculation (arctangh transformation)
n = 122.4466
r = 0.25
sig.level = 0.05
power = 0.8
alternative = two.sided
6.5.2 t-test
With same-sized groups we use pwr.t.test. We now need to specify the type as one of 'two.sample', 'one.sample', or 'paired' (repeated measures). You can also specify the alternative hypothesis as 'two.sided', 'less', or 'greater'. The function defaults to a two-sample t-test with a two-sided alternative hypothesis. It uses Cohen's d as the population effect size estimate (in the following example I estimate the population effect to be d = .3).
pwr.t.test(d = .3,
sig.level = .05,
power = .8)
Two-sample t test power calculation
n = 175.3847
d = 0.3
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
6.5.3 One way ANOVA
One-way ANOVA power analysis requires Cohen's F effect size, which is kind of like the average Cohen's d across all conditions. Because researchers more commonly report other effect sizes (e.g., Cohen's d), you may need to convert them to Cohen's F. The pwr.anova.test() function used below (pwr2 offers a similar pwr.1way.test()) requires the following:
- k = number of groups
- f = Cohen's F
- sig.level is alpha, defaults to .05
- power is your desired power
pwr.anova.test(k = 3,
f = .4,
power = .8,
sig.level = .05)
6.5.4 Alternatives for Power Calculation
6.5.5 G*Power
You can download G*Power here.
6.5.6 Simulations
Simulations can be run for typical designs, as you have seen above in our own simulations demonstrating the general idea of power. For example, we can repeatedly run a t-test on two groups with a specific effect size at the population level. Knowing that Cohen's d is the difference between the two group means divided by the standard deviation, d = (M1 - M2) / SD, we can use rnorm() to specify two groups whose difference in means equals Cohen's d when we keep the SD of both groups at 1.
# One time
sample_size <- 20
cohens_d <- .4 ## our hypothesized effect is .4

# Simulate two groups (both SD = 1) whose population means differ by d, then test
t.test(rnorm(sample_size),
       rnorm(sample_size, mean = cohens_d), var.equal = T)
Two Sample t-test
data: rnorm(sample_size) and rnorm(sample_size, mean = cohens_d)
t = 0.039978, df = 38, p-value = 0.9683
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6019724 0.6262270
sample estimates:
mean of x mean of y
0.2923802 0.2802529
We can use various R capabilities to run 10,000 simulations and determine the proportion of studies that conclude p < .05.
library(tidyverse)

sample_size <- 20
n_simulations <- 10000
cohens_d <- .4 ## our hypothesized effect is .4

# Run the two-group t-test n_simulations times, keeping only the p-value each time
dat_sim <- pmap(.l = list(sims = 1:n_simulations),
                .f = function(sims){
                  t.test(rnorm(sample_size),
                         rnorm(sample_size, mean = cohens_d), var.equal = T)$p.value})

## convert to data frame
dat_sim2 <- do.call(rbind, dat_sim) %>%
  as.data.frame() %>%
  rename("p" = "V1")
This returns a data frame with 10,000 p-values, one per simulation. The results suggest that 2360 samples (23.6%) were statistically significant.
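The count and proportion reported above can be obtained directly from the simulated p-values (a small sketch):

```r
# Empirical power: the proportion of simulated studies with p < .05
sum(dat_sim2$p < .05)  # count of significant results
mean(dat_sim2$p < .05) # proportion (about .24 in the run described above)
```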
You may be thinking: why do this when I have pwr.t.test? Well, the rationale is the same for more complex designs. For complicated designs, it can be difficult to determine the best analytic power calculation to use (e.g., imagine a 4x4x4x3 ANOVA or a SEM), so it sometimes makes sense to run a simulation.
Simulation of SEM in R, which can help with power analysis.
A companion Shiny app regarding statistical power can be found here.
6.6 Recommended Readings:
- Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.
- Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5(2), 241–301. https://doi.org/10.1037/1082-989x.5.2.241
- Pritchard, T. R. (2021). Visualizing Power.