Chapter 36: Research Design

Considerations when designing or interpreting a scientific study.

REQUIRED
- This chapter.
- The video Correlation and Causation from Calling Bullshit.
- From 8:02 to 10:26 in the Correlation Does not Equal Causation video from Crash Course Statistics.
- The whole video on Controlled Experiments from Crash Course Statistics.

OPTIONAL READING: Power and Sample Size (krzywinski2013) for more information.

Motivating Scenarios:
We want to plan a study or critically evaluate claims made by others.

Learning Goals: By the end of this chapter, you should be able to

Review / Set Up

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

— R.A. Fisher, “Presidential Address to the First Indian Statistical Congress,” 1938.

We previously introduced the major goals of statistics. In order of importance, they are:

So far, we’ve focused primarily on cases where someone else conducted an experiment, and we analyzed their study design and raw data to meet these goals.

In this chapter, we shift to considering how to design a study ourselves. There are two excellent reasons for this:

  1. Many of you will conduct some form of scientific study during your career.
  2. Even if you never conduct a scientific study, learning how to design a good study helps you understand both the reasoning behind and the shortcomings of scientific studies and claims.

Challenges in Inferring Causation

Over the past few weeks, we’ve delved into the details of statistical methods, calculating sums of squares and determining how often a sample from the null distribution would be as extreme or more extreme than our estimate. This is a noble endeavor—we know that sample estimates can differ from population parameters due to chance, and we aim to remain skeptical to avoid being misled by data.

However, as we focus on significance and uncertainty, we can sometimes lose sight of the bigger picture. Specifically, we need to consider what we can and cannot conclude from a study, even if it yields a significant result. Another way to put this is that while we’ve explored the “black box” of statistical methods in depth, we’ve spent much less time thinking about how to interpret the results. This is a critical gap to address because all our statistical expertise and efforts are wasted if we don’t critically evaluate the implications of a study. We’ve already addressed this some (see section on causal inference) and will continue to do so here and in the next chapter.

When considering the implications of a study, it’s important to reflect on its goals and overall message:

Watch the video (1) for a discussion on the relationship between correlation and causation.

Figure 1: Watch this 8 minute video on correlation and causation from Calling Bullshit.

Correlation Does Not Necessarily Imply Causation

What does this mean? It’s not just about \(r\). Rather, it’s a reminder that statistical associations between two variables, A and B, do not necessarily mean that B caused A. For example, if lm(A ~ B) %>% anova() yields a p-value < 0.05, we cannot automatically conclude that B caused A. It could be that A causes B, both are caused by a third variable C, or the result could even be a false positive (Figure 2).

Figure 2: Possible causal relationships underlying significant associations. In this example, we would call C a **confounding variable**.

Remember, a p-value only tells us how incompatible the data are with the null model, not what is responsible for this incompatibility. Watch the brief video below for a fun example (3). Note that at the end of the video, she discusses the problem of confounding variables.

Figure 3: Watch from 8:02 to 10:26 of the Correlation Doesn’t Equal Causation video from Crash Course Statistics.

Recall that a confounding variable is an unmodeled variable that distorts the relationship between explanatory and response variables, potentially biasing our interpretation.
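
To make this concrete, here is a small simulation sketch (my own toy example, not from the readings): a confounding variable C drives both A and B, so a regression of A on B finds a strong association even though neither A nor B causes the other.

library(tidyverse)
library(broom)
set.seed(42)

# Toy simulation: C causes both A and B; A and B have no direct causal link
confounded <- tibble(C = rnorm(1000),
                     A = 2 * C + rnorm(1000),
                     B = 2 * C + rnorm(1000))

lm(A ~ B, data = confounded) %>%
  anova() %>%
  tidy()   # B comes out "significant", but it does not cause A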

One weird trick to infer causation

Experiments offer us a potential way to learn about causation. In a well-executed experiment, we randomly assign individuals to treatment groups, to prevent an association between treatments and any confounding variables. Watch the video below (4) to hear a nice explanation of experiments and their limitations.

Figure 4: Watch the first 4 minutes of this video on manipulative experiments from Crash Course Statistics. We will watch the rest of this video as we move through these concepts.
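
As a concrete sketch (the subject IDs and treatment labels here are made up), random assignment in R can be as simple as shuffling treatment labels with sample(), which on average breaks any link between treatment and potential confounders:

library(tidyverse)
set.seed(123)

# Hypothetical example: randomly assign 20 subjects to two treatments
tibble(subject = 1:20) %>%
  mutate(treatment = sample(rep(c("control", "drug"), each = 10)))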

When experiments are not enough

False negatives and power

Absence of evidence is not evidence of absence.

— Origin unknown

Say we did an experiment and failed to reject the null. Remember, that does not mean that the null is true. There could still be a causal relationship. We revisit this in our discussion of power.

Cause in experiment \(\neq\) cause in the real world

Experiments are amazing, and are among our best ways to demonstrate causation. Still, we have to be careful in interpreting results from a controlled experiment.

A causal relationship in an experiment does not imply a causal relationship in the real world, even for a true positive with a well-executed experiment. Here are some things to consider:

Minimizing bias in study designs

Experiments allow us to infer causation because they remove the association between the variable we care about and any confounding variables.

So we had better be sure that we don't introduce confounding covariates as we conduct our experiment. For this reason, the best experiments include

Potential Biases in Experiments

Eliminating Bias in Experiments

We can minimize these biases by

Watch the video below from Crash Course Statistics for a discussion of controls, placebos, and blinding in experimental design.

Figure 5: Watch from 4:48 to 7:02 of this video on controls, placebos, and blinding from Crash Course Statistics. We will watch the rest of this video as we move through these concepts.

Inferring causation when we can’t do experiments

Experiments are our best way to learn about causation, but we can’t always do experiments. Sometimes they are unethical, other times they are cost-prohibitive, or simply impossible. Do we have any hope of inferring causation in these cases?

The short answer is: sometimes, if we're lucky. We have already touched on this previously.

One good way to make causal claims from observational studies is to find matches, or “natural experiments,” in which the variable we care about changed in one set of cases but not in paired cases that are similar in every other way.

If we cannot make a rigorous causal claim it is best not to pretend we can. Rather, we should honestly describe what we can and cannot infer from our data.

Figure 6: Watch from 10:18 to 11:12 of this video on manipulative experiments from Crash Course Statistics. We will watch the rest of this video as we move through these concepts.

Minimizing sampling error

Sampling bias isn't our only consideration when planning a study; we would also like to increase our precision by decreasing sampling error.

Recall that sampling error is the chance deviation between an estimate and the true parameter that arises from the process of sampling, and that we can summarize it as the standard deviation of the sampling distribution (aka the standard error). For a sample mean, the standard error is roughly the standard deviation divided by the square root of the sample size, \(SE \approx s/\sqrt{n}\), so we can minimize sampling error by:

Increasing the sample size decreases sampling error.

Well, of course! We learned this as the law of large numbers. Just be sure that our samples are independent (e.g., avoid pseudoreplication).
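
As a quick check on this (a simulation sketch with made-up numbers), we can simulate many sample means at a few sample sizes and watch the standard error shrink like \(s/\sqrt{n}\):

library(tidyverse)
set.seed(1)

# Toy simulation: 1000 sample means at each sample size, from a population with sd = 2
tibble(n = rep(c(10, 100, 1000), each = 1000)) %>%
  mutate(sample_mean = map_dbl(n, ~ mean(rnorm(n = .x, mean = 0, sd = 2)))) %>%
  group_by(n) %>%
  summarise(se = sd(sample_mean))   # approximately 2 / sqrt(n)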

How to decrease the standard deviation.

We want to minimize the standard deviation, but how? Well, we only have so much control over this because some variability is natural. Still, there are things we can do:

Figure 7: Watch from 8:57 to 10:06 of this video on manipulative experiments from Crash Course Statistics. We will watch the rest of this video as we move through these concepts.

Planning for power and precision

We could maximize our power and precision by having an infinitely large sample, but this is obviously silly. We'd waste a bunch of time and resources overdoing one study and miss out on so many others. So, how do we plan a study that is big enough to satisfy our goals without overdoing it?

We need to think about the effect size we care about and the likely amount of natural variability.
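
A common way to combine these two ingredients is a standardized effect size such as Cohen's d, the difference between group means divided by the (pooled) standard deviation. A back-of-the-envelope sketch with made-up planning values:

# Hypothetical planning values
smallest_difference_we_care_about <- 2   # in the units of our response
likely_sd                         <- 2   # e.g., from pilot data or the literature

cohens_d <- smallest_difference_we_care_about / likely_sd
cohens_d   # 1 -- an effect size we can feed into a power analysis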

Estimating an appropriate sample size.

We use power analyses to plan appropriate sample sizes for a study. A power analysis basically finds the sample size necessary so that the sampling distribution of your experiment has

The power researchers traditionally shoot for is 80%, but in my opinion that is quite low, and aiming for 90% power seems more reasonable.

There are numerous mathematical rules of thumb for power analyses, as well as online calculators (e.g., this one from UBC) and R packages (pwr is the most popular).

The sample size we start with is rarely the sample size we end with: plants die, people disappear, RNA degrades, and so on. Keep this in mind when designing your experiment, and increase your sample size to accommodate the expected number of lost data points.
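
For example (with hypothetical numbers), if a power analysis calls for 23 samples per group and we expect to lose about 10% of our data points, we should start with 26 per group:

n_needed      <- 23     # per group, from a power analysis (hypothetical)
expected_loss <- 0.10   # expected proportion of lost data points

ceiling(n_needed / (1 - expected_loss))   # start with 26 per group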

Example with the pwr package

The pwr package helps you evaluate power for numerous standard statistical procedures. Say you wanted to design a study that you would analyze with a t-test, and you wanted a ninety percent chance of rejecting the null if the true population value of Cohen's d was 1.

library(pwr)
pwr.t.test(power = .9,                  # Desired power
           d = 1,                       # Effect size (Cohen's d)
           sig.level = 0.05,            # Significance level
           alternative = "two.sided")   # n is omitted, so pwr.t.test solves for it

     Two-sample t test power calculation 

              n = 22.02109
              d = 1
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

Or say you wanted to know the power you would have to reject the null if the true Cohen's d was 2 and you did an experiment with two groups of five samples.

library(pwr)
library(broom)
library(gt)
pwr.t.test(n = 5,                       # Samples per group
           d = 2,                       # Effect size (Cohen's d)
           sig.level = 0.05,            # Significance level
           alternative = "two.sided") %>%
  tidy() %>%
  mutate(n = 5, g = 2, sig_level = 0.05) %>%
  mutate(power = round(power, digits = 3)) %>%
  gt()
n   sig.level   power   g   sig_level
5   0.05        0.791   2   0.05

See Chapter 20, Sample Size Calculations with {pwr}, from Higgins (2024) for more.

Simulating to estimate power and precision

Often experimental design is more complex than the off-the-shelf options in the pwr package. Of course, we could try to find a package better suited to our study, but sometimes we will fail. Here I focus on one way we can estimate power and precision: we can simulate!!! There is a bit of new R in here, including writing functions. Enjoy it if you like, skim it if you don't care. I also note that there are more efficient ways to code this in R; I can provide examples if there is enough demand.

Let’s first write our own function to simulate data from two treatment groups whose means differ by x, fit a linear model to the simulated data, and return a tidy summary.

library(tidyverse)
library(broom)

simTest <- function(n1, n2, x, s){
  sim_id  <- runif(1) # pick a random id, in case you want it
  sim_dat <- tibble(treatment    = rep(c("a","b"), times = c(n1, n2)),
                    expected_val = case_when(treatment == "a" ~ 0,
                                             treatment == "b" ~ x)) %>%
    mutate(sim_val = rnorm(n = n(), mean = expected_val, sd = s)) # add sampling noise
  tidy_sim_lm <- lm(sim_val ~ treatment, data = sim_dat) %>%
    broom::tidy() %>%
    mutate(n1 = n1, n2 = n2, x = x, s = s, sim_id = sim_id) # record the simulation parameters
  return(tidy_sim_lm)
}

We can see the outcome of one random experiment, with a standard deviation of 2, a true difference of 1, and a sample size of twenty for each treatment.

one_sim <- simTest(n1 = 20, n2 = 20, x = 1, s = 2)
one_sim %>%
  mutate_if(is.numeric, round, digits = 4) %>%
  DT::datatable(options = list(scrollX = '400px'))

We probably want to filter for just treatmentb, because we don't care about the intercept.

filter(one_sim, term == "treatmentb")
# A tibble: 1 × 10
  term    estimate std.error statistic p.value    n1    n2     x     s
  <chr>      <dbl>     <dbl>     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 treatm…     1.66     0.654      2.54  0.0154    20    20     1     2
# ℹ 1 more variable: sim_id <dbl>

We can replicate this many times

n_reps <- 500
many_sims <- replicate(simTest(n1 = 20, n2 = 20, x = 1, s = 2), n = n_reps, simplify = FALSE) %>%
  bind_rows() %>% # shoving the output together
  filter(term == "treatmentb")

many_sims %>%
  mutate_if(is.numeric, round, digits = 4) %>%
  DT::datatable(options = list(pageLength = 5, lengthMenu = c(5, 25, 50), scrollX = '400px'))

We can summarize this output to look at our power, and at the standard deviation and the upper and lower 2.5% quantiles of our estimates to gauge our precision.

many_sims %>% 
  summarise(power        = mean(p.value < 0.05),
            mean_est     = mean(estimate),
            sd_est       = sd(estimate),
            lower_95_est = quantile(estimate, prob = 0.025),
            upper_95_est = quantile(estimate, prob = 0.975))
# A tibble: 1 × 5
  power mean_est sd_est lower_95_est upper_95_est
  <dbl>    <dbl>  <dbl>        <dbl>        <dbl>
1 0.354     1.02  0.652       -0.259         2.29

We can turn this last bit into a function and try it for a bunch of sample sizes, as sketched below.
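
The code that generated the table below is not shown; here is a minimal sketch of how such a function could look (the name sim_power_by_n and the use of map_dfr() are my choices, not the original code), producing output in the same format as the table that follows.

library(tidyverse)

# A sketch (not the original code): summarize power and precision for a given per-group sample size
sim_power_by_n <- function(n_per_group, x = 1, s = 2, n_reps = 500){
  replicate(simTest(n1 = n_per_group, n2 = n_per_group, x = x, s = s),
            n = n_reps, simplify = FALSE) %>%
    bind_rows() %>%
    filter(term == "treatmentb") %>%
    summarise(n            = unique(n1),
              power        = mean(p.value < 0.05),
              mean_est     = mean(estimate),
              sd_est       = sd(estimate),
              lower_95_est = quantile(estimate, prob = 0.025),
              upper_95_est = quantile(estimate, prob = 0.975))
}

# Try it for a bunch of per-group sample sizes
map_dfr(c(10, 20, 50, 100), sim_power_by_n)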

# A tibble: 4 × 6
      n power mean_est sd_est lower_95_est upper_95_est
  <dbl> <dbl>    <dbl>  <dbl>        <dbl>        <dbl>
1    10 0.204    1.01   0.941       -0.837         2.88
2    20 0.352    1.00   0.644       -0.229         2.21
3    50 0.682    0.990  0.412        0.241         1.81
4   100 0.93     1.00   0.272        0.445         1.51

Quiz

Figure 8: Quiz on research design here

References

Higgins, Peter D. R. 2024. Reproducible Medical Research with R. Bookdown.