Considerations when designing or interpreting a scientific study.
REQUIRED
- This chapter.
- Video Correlation and Causation from Calling Bullshit.
- From 8:02 to 10:26 in the Correlation Does not Equal Causation video from Crash Course Statistics.
- The whole video on Controlled Experiments from Crash Course Statistics.
Motivating Scenarios:
We want to plan a study or critically evaluate claims made by others.
Learning Goals: By the end of this chapter, you should be able to
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. – R.A. Fisher
We previously introduced the major goals of statistics. In order of importance, they are:
So far, we’ve focused primarily on cases where someone else conducted an experiment, and we analyzed their study design and raw data to meet these goals.
In this chapter, we shift to considering how to design a study ourselves. There are two excellent reasons for this:
Over the past few weeks, we’ve delved into the details of statistical methods, calculating sums of squares and determining how often a sample from the null distribution would be as extreme or more extreme than our estimate. This is a noble endeavor—we know that sample estimates can differ from population parameters due to chance, and we aim to remain skeptical to avoid being misled by data.
However, as we focus on significance and uncertainty, we can sometimes lose sight of the bigger picture. Specifically, we need to consider what we can and cannot conclude from a study, even if it yields a significant result. Another way to put this is that while we’ve explored the “black box” of statistical methods in depth, we’ve spent much less time thinking about how to interpret the results. This is a critical gap to address because all our statistical expertise and efforts are wasted if we don’t critically evaluate the implications of a study. We’ve already addressed this some (see section on causal inference) and will continue to do so here and in the next chapter.
When considering the implications of a study, it’s important to reflect on its goals and overall message:
Watch the video (1) for a discussion on the relationship between correlation and causation.
What does this mean? It’s not just about \(r\). Rather, it’s a reminder that statistical associations between two variables, A and B, do not necessarily mean that B caused A. For example, if lm(A ~ B) %>% anova()
yields a p-value < 0.05, we cannot automatically conclude that B caused A. It could be that A causes B, both are caused by a third variable C, or the result could even be a false positive (Figure 2).
Remember, a p-value only tells us how incompatible the data are with the null model, not what is responsible for this incompatibility. Watch the brief video below for a fun example (3). Note that at the end of the video, she discusses the problem of confounding variables.
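To make this concrete, here is a small made-up simulation (not from any of the studies discussed here): A and B share a common cause, C, but neither causes the other, and yet lm(A ~ B) reports a tiny p-value.

library(tidyverse)
library(broom)

set.seed(42)                                    # for reproducibility
confounded <- tibble(C = rnorm(100),            # a lurking variable, C
                     A = 2 * C + rnorm(100),    # A is caused by C
                     B = 2 * C + rnorm(100))    # B is also caused by C, but not by A (or vice versa)

lm(A ~ B, data = confounded) %>% tidy()         # B "predicts" A even though neither causes the other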
Experiments offer us a potential way to learn about causation. In a well-executed experiment, we randomly assign individuals to treatment groups, to prevent an association between treatments and any confounding variables. Watch the video below (4) to hear a nice explanation of experiments and their limitations.
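To see how randomization helps, here is a minimal sketch continuing the made-up confounded data from above: assigning treatments at random with sample() means the treatment groups end up with similar values of the confounder C, on average.

confounded %>%
  mutate(treatment = sample(rep(c("control", "drug"), each = 50))) %>%  # randomly assign treatments
  group_by(treatment) %>%
  summarise(mean_C = mean(C))   # the confounder is (roughly) balanced across treatments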
Absence of evidence is not evidence of absence.
Say we did an experiment and failed to reject the null. Remember that does not mean that the null is true. There still could be a causal relationship. We revisit this in our discussion of power.
Experiments are amazing, and are among our best ways to demonstrate causation. Still, we have to be careful in interpreting results from a controlled experiment.
A causal relationship in an experiment does not imply a causal relationship in the real world, even for a true positive with a well-executed experiment. Here are some things to consider:
Treatment severity Remember the toxicology adage, “The dose makes the poison.” When an experimental treatment has an effect, we should step back for a minute and think about whether the level of exposure in the experimental treatment was comparable to what we see in our observational study. If not, the treatment may have been causal in the experiment but irrelevant at real-world doses.
Comparable effect sizes Say an experimental treatment had an effect – for example, in an experimental study we find that studying an extra hour for an exam increases test scores by 1.5%. This would show that studying can increase test scores, but it would not explain a 15% difference in test scores between students who studied, on average, an hour longer than those in another group.
Interactions Most experiments happen in a controlled setting in a lab. Most published research studies WEIRD (Western, educated, industrialized, rich, and democratic) populations, etc. We have previously addressed the idea of interactions. So, we might worry if an experimental study is used as causal evidence for a claim concerning a very different context. Similarly, an absence of a causal relationship in an experiment might be misleading if an interaction between the treatment and some other variable that was not studied was the true cause.
Multiple causes In the real world, more than one thing can cause another. Showing that A causes B in an experiment does not mean that C does not also cause B.
Experiments allow us to infer causation because they remove the association between the variable we care about and any confounding variables.
So, we had better be sure that we do not introduce new confounds as we conduct our experiment. For these reasons, the best experiments include
Time heals Whenever I feel terribly sick, I call the doctor and usually get an appointment the following week. Most of the time I get better before seeing the doctor. I therefore joke that the best way for me to get better is to schedule a doctor’s appointment. Now let’s think about this – obviously, calling the doctor didn’t heal me. Rather, I called the doctor when I felt my worst and then got better with time, because most of the time we simply get better on our own.
Regression to the mean A related concept, known as “regression to the mean,” is the idea that the most extreme observations in a study are biased estimates of the true parameter values. That’s because being exceptional requires both an expectation of being exceptional (an extreme \(\widehat{Y_i}\)) AND a large residual in that same direction (i.e., a large positive residual for exceptionally large values, or a large negative residual for exceptionally small ones).
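A small simulation with made-up numbers makes this concrete (still assuming the tidyverse is loaded): give every individual a true value plus measurement noise, pick out the individuals with the most extreme first measurements, and their second measurements fall back toward the mean.

set.seed(1)
rtm <- tibble(true_val = rnorm(1000, mean = 50, sd = 5),       # each individual's true value
              first    = true_val + rnorm(1000, sd = 5),       # first measurement = truth + noise
              second   = true_val + rnorm(1000, sd = 5))       # an independent second measurement

rtm %>%
  filter(first > quantile(first, 0.95)) %>%                    # keep the most extreme first measurements
  summarise(mean_first = mean(first), mean_second = mean(second))  # second measurements sit closer to 50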
Experimental artifacts The experimental manipulation itself, rather than the treatment we care about, could be causing the experimental result. Say we hypothesize that birds are attracted to red, so we glue red feathers onto some birds and see that this increases their mating success. We want to make sure that it is the red color, not the glue or the added feathers, that drives this elevated attractiveness.
Known treatments are a special kind of experimental artifact. Knowledge of the treatment by either the experimental subject or the experimenter can introduce a bias. For example, people who think they have been treated might prime themselves for improvement. Processes like these are known as the placebo effect (listen to the seven-minute clip from Radiolab, below, for examples of the placebo effect and how it may work). Or, if the researchers know the treatment, they may subconsciously bias their measurements or the way they treat their subjects.
We can minimize these biases by
Introducing effective controls It’s usually a good idea to have a “do nothing” treatment as a control, but this is not enough. We should also include a “sham treatment” or “placebo” that is identical to the treatment in every way but the treatment itself. Taking our bird feathers example, we would introduce a treatment in which we glued on blue feathers, and maybe one in which we glued on black feathers, to ensure that the color red was responsible for the elevated attractiveness observed in our study.
“Blinding” If possible, we should do all we can to ensure that neither the experimenter nor the subject knows which treatment was received.
Watch the video below from Crash Course in Statistics for a discussion of controls, placebos and blinding in experimental design.
Experiments are our best way to learn about causation, but we can’t always do experiments. Sometimes they are unethical, other times they are cost-prohibitive, or simply impossible. Do we have any hope of inferring causation in these cases?
The short answer is: sometimes, if we’re lucky. We have already discussed this a bit previously.
One good way to make causal claims from observational studies is to find matches, or “natural experiments,” in which the variable we care about changed in one set of cases but did not in paired cases that are similar in every way except this change.
If we cannot make a rigorous causal claim it is best not to pretend we can. Rather, we should honestly describe what we can and cannot infer from our data.
Sampling bias isn’t our only consideration when planning a study; we would also like to increase our precision by decreasing sampling error.
Recall that sampling error is the chance deviation between an estimate and its true parameter that arises from the process of sampling, and that we can summarize it as the standard deviation of the sampling distribution (aka the standard error). For a mean, the standard error is roughly the standard deviation divided by the square root of the sample size, \(SE \approx \frac{s}{\sqrt{n}}\), so we can minimize sampling error by:
Well, of course! We learned this as the law of large numbers. Just be sure that our samples are independent (e.g., avoid pseudoreplication).
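As a quick check with simulated (made-up) numbers, and assuming the tidyverse is loaded, the spread of sample means shrinks roughly as \(1/\sqrt{n}\) as we add independent samples:

se_by_n <- function(n, reps = 5000){
  sd(replicate(reps, mean(rnorm(n, mean = 0, sd = 2))))   # sd of many sample means ≈ the standard error
}

tibble(n = c(10, 40, 160)) %>%
  rowwise() %>%
  mutate(simulated_se  = se_by_n(n),     # estimated by simulation
         analytical_se = 2 / sqrt(n))    # s / sqrt(n), with s = 2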
We want to minimize the standard deviation, but how? Well, we only have so much control over this because some variability is natural. Still, there are things we can do:
We could maximize our power and precision by having an infinitely large sample, but this is obviously silly. We’d be wasting a bunch of time and resources over-doing one study and would miss out on so many others. So, how do we plan a study that is big enough to satisfy our goals, without overdoing it?
We need to think about the effect size we care about and the likely natural variability.
We use power analyses to plan appropriate sample sizes for a study. A power analysis basically finds the sample size necessary so that the sampling distribution of your experiment has
The power researchers traditionally shoot for is 80%, but in my opinion that is quite low, and aiming for 90% power seems more reasonable.
There are numerous mathematical rules of thumb for power analyses, as well as online tools (e.g. this one from UBC) and R packages (pwr is the most popular).
The pwr package
The pwr package helps you evaluate power for numerous standard statistical procedures. Say you wanted to design a study that you would analyze with a t-test, and you wanted a ninety percent chance of rejecting the null if the true population value of Cohen’s d was 1.
library(pwr)
pwr.t.test(power = .9,
d = 1, # Effect size (Cohen's d)
sig.level = 0.05, # Significance level
alternative = "two.sided")
Two-sample t test power calculation
n = 22.02109
d = 1
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
Or say you wanted to know the power you would have to reject the null if the true Cohen’s d was 2 and you did an experiment with two groups of five samples.
library(pwr)
library(broom)   # for tidy()
library(gt)      # for gt() tables
pwr.t.test(n = 5,                    # Samples per group
           d = 2,                    # Effect size (Cohen's d)
           sig.level = 0.05,         # Significance level
           alternative = "two.sided") %>%
  tidy() %>%
  mutate(n = 5, g = 2, sig_level = 0.05) %>%
  mutate(power = round(power, digits = 3)) %>%
  gt()
| n | sig.level | power | g | sig_level |
|---|-----------|-------|---|-----------|
| 5 | 0.05      | 0.791 | 2 | 0.05      |
See Chapter 20: Sample Size Calculations with {pwr} from Higgins (2024) for more.
Often experimental design is more complex than the off-the-shelf options in the pwr package. Of course, we could try to find a package better suited to our study, but sometimes we will fail. Here I focus on one way we can estimate power and precision – we can simulate!!! There is a bit of new R in here, including writing functions. Enjoy it if you like, skim if you don’t care. I also note that there are more efficient ways to code this in R. I can provide examples if there is enough demand.
Let’s first write our own function to simulate an experiment with two treatments, “a” and “b” (with sample sizes n1 and n2), in which treatment “b” shifts the expected value by x (the minimal value we care about), and observations are drawn from a normal distribution with a standard deviation of s.

library(tidyverse)   # for tibble, case_when, mutate, and the pipe
simTest <- function(n1, n2, x, s){
  sim_id <- runif(1) # pick a random id, in case you want it
  sim_dat <- tibble(treatment = rep(c("a","b"), times = c(n1, n2)),
                    expected_val = case_when(treatment == "a" ~ 0,
                                             treatment == "b" ~ x)) %>%
    mutate(sim_val = rnorm(n = n(), mean = expected_val, sd = s))
tidy_sim_lm <- lm(sim_val ~ treatment, data = sim_dat) %>%
broom::tidy() %>%
mutate(n1 = n1, n2 = n2, x = x, s = s, sim_id = sim_id)
return(tidy_sim_lm)
}
We can see the outcome of one random experiment, with a sd of 2, and a difference of interest equal to one, and a sample size of twenty for each treatment.
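Assuming we save that single simulated experiment as one_sim (the name implied by the filter() call below):

one_sim <- simTest(n1 = 20, n2 = 20, x = 1, s = 2)
one_sim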
We probably want to filter for just treatmentb
, because we don’t care about the intercept
filter(one_sim, term == "treatmentb")
# A tibble: 1 × 10
term estimate std.error statistic p.value n1 n2 x s
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 treatm… 1.66 0.654 2.54 0.0154 20 20 1 2
# ℹ 1 more variable: sim_id <dbl>
We can replicate this many times:
n_reps <- 500
many_sims <- replicate(simTest(n1 = 20, n2 = 20, x = 1, s = 2), n = n_reps, simplify = FALSE) %>%
bind_rows() %>% # shoving output together
filter(term == "treatmentb")
many_sims %>%
  mutate_if(is.numeric, round, digits = 4) %>%
  DT::datatable(options = list(pageLength = 5, lengthMenu = c(5, 25, 50), scrollX = '400px'))
We can summarize this output to look at our power, as well as the standard deviation and the upper and lower 2.5% quantiles of the estimate, to evaluate our precision.
many_sims %>%
summarise(power = mean(p.value < 0.05),
mean_est = mean(estimate),
sd_est = sd(estimate),
lower_95_est = quantile(estimate, prob = 0.025),
upper_95_est = quantile(estimate, prob = 0.975))
# A tibble: 1 × 5
power mean_est sd_est lower_95_est upper_95_est
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.354 1.02 0.652 -0.259 2.29
We can turn this last bit into a function and try it for a bunch of sample sizes.
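Here is one way such a function could look (a sketch; the function and argument names, power_sim and n_per_group, are mine, not fixed anywhere above). It wraps the replicate-and-summarise steps and maps them over several per-group sample sizes:

power_sim <- function(n_per_group, x = 1, s = 2, n_reps = 500){
  replicate(simTest(n1 = n_per_group, n2 = n_per_group, x = x, s = s),
            n = n_reps, simplify = FALSE) %>%
    bind_rows() %>%
    filter(term == "treatmentb") %>%
    summarise(n            = unique(n1),                 # per-group sample size
              power        = mean(p.value < 0.05),       # proportion of sims rejecting the null
              mean_est     = mean(estimate),
              sd_est       = sd(estimate),
              lower_95_est = quantile(estimate, prob = 0.025),
              upper_95_est = quantile(estimate, prob = 0.975))
}

purrr::map_dfr(c(10, 20, 50, 100), power_sim)   # run the simulation for each sample size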
# A tibble: 4 × 6
n power mean_est sd_est lower_95_est upper_95_est
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10 0.204 1.01 0.941 -0.837 2.88
2 20 0.352 1.00 0.644 -0.229 2.21
3 50 0.682 0.990 0.412 0.241 1.81
4 100 0.93 1.00 0.272 0.445 1.51