Chapter 32 Reflect, Review, and Relax
Motivating scenarios: We are taking stock of where we are in the term, thinking about stats and science, and making sure we understand the material to date.
Required reading / Viewing:
The Science of Doubt. link. By Michael Whitlock.
32.1 Review / Setup
So much of statistics aims to learn the TRUTH.
We focus so much on our data: how to measure uncertainty around estimates, and how (in)compatible the data are with a null model. We will review and solidify these ideas, but we must also
recognize that much beyond sampling error can mislead us.
32.2 How science goes wrong
Watch the video below. When you do, consider these types of errors that accompany science. You should be able to think about these and ask good questions about them.
- Fraud.
- Wrong models.
- Experimental design error.
- Communication error.
- Statistician error.
- HARKing (Hypothesizing After the Results are Known).
- Coding error.
- Technical error.
- Publication bias.
You should have something to say about
- The “replication crisis,” and
- Whether (and why) preregistration of studies is a good idea.
A brief word on publication bias: Scientists are overworked and have too much to do. They get more rewards for publishing statistically significant results, so those are usually higher on the to-do list. This results in the "file drawer effect," in which non-significant results are less likely to be submitted for publication.
I simulate this below, and then have a web app (basically this code dressed up in sliders) for you to use to explore this.
# Set it up
library(tidyverse)  # tibble(), uncount(), mutate(), summarise(), ggplot(), etc.
library(plotly)     # ggplotly() for the interactive plot

sample_sizes      <- c(2, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 250, 384, 500, 768, 1000)
replicates        <- 10000                              # experiments per sample size
total_experiments <- length(sample_sizes) * replicates
exp_id            <- factor(1:total_experiments)        # unique id for each experiment
mu                <- .2                                 # true mean: a small, real effect
sigma             <- 1                                  # true standard deviation
# Simulate
sim_dat <- tibble(exp_id      = factor(1:total_experiments),
                  sample_size = rep(sample_sizes, each = replicates)) %>%
  uncount(weights = sample_size, .remove = FALSE) %>%
  mutate(sim_val = rnorm(n = n(), mean = mu, sd = sigma))
# Summarize and hypothesis test: a one-sample t-test of H0: mu = 0 for each experiment
sim_summary <- sim_dat %>%
  group_by(exp_id) %>%
  summarise(n        = n(),
            mean_val = mean(sim_val),
            se       = sd(sim_val) / sqrt(n),
            t        = mean_val / se,
            p_val    = 2 * pt(q = abs(t), df = n - 1, lower.tail = FALSE),
            reject   = p_val < 0.05) %>%
  group_by(n) %>%
  mutate(power = mean(reject))  # proportion of experiments rejecting H0 at each sample size
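Even before plotting, the bias shows up in simple summaries. Here is a minimal sketch (my addition; it assumes the sim_summary tibble built above, and bias_check is just an illustrative name) comparing the mean estimate across all experiments to the mean among only the "significant" ones at each sample size:
# For each sample size, compare the mean estimate across ALL experiments
# to the mean estimate among only the "significant" ones.
# With a true mean of 0.2, the significant subset overshoots when power is low.
bias_check <- sim_summary %>%
  group_by(n) %>%
  summarise(power            = mean(reject),
            mean_all         = mean(mean_val),
            mean_significant = mean(mean_val[reject]))
bias_check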
# Plot
sim_plot <- ggplot(sim_summary, aes(x = power, y = mean_val, label = n)) +
  stat_summary(aes(color = reject),
               geom = "text", size = 3,
               position = position_nudge(y = .02, x = -.015),
               show.legend = FALSE) +
  stat_summary(aes(color = reject), geom = "point",
               show.legend = FALSE) +
  stat_summary(aes(color = reject), geom = "line") +
  stat_summary(geom = "line", color = "black") +
  annotate(x = .5, y = mu + .02, geom = "text", label = "mean of all sims", size = 2) +
  theme_light() +
  labs(title = "Significant results are biased. Numbers show n")
ggplotly(sim_plot)
Interact with the app below (basically this code, with widgets) to see how this biases our estimates.
I find this stuff fascinating. If you want more, here are some good resources.
Videos from Calling Bullshit (largely redundant with the video above): 7.2 Science is amazing, but…, 7.3 Reproducibility, 7.4 A Replication Crisis, 7.5 Publication Bias, and 7.6 Science is not Bullshit.
The replication crisis
- Estimating the reproducibility of psychological science (Open Science Collaboration 2015) link,
- A glass half full interpretation of the replicability of psychological science (Leek, Patil, and Peng 2015) link,
- The Persistence of Underpowered Studies in Psychological Research: Causes, Consequences, and Remedies (Maxwell 2004) link.
P-hacking: The Extent and Consequences of P-Hacking in Science (Head 2015) link.
The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time (Gelman and Loken 2013) link.
32.3 Review
You should be pretty comfortable with the ideas of
- Parameters vs Estimates
- Sampling and what can go wrong
- Null hypothesis significance testing
- Common test statistics
- F
- t
- Calculating Sums of Squares
- Interpreting stats output like that below
ToothGrowth <- mutate(ToothGrowth, dose = factor(dose))
tooth_lm    <- lm(len ~ supp * dose, data = ToothGrowth)

summary(tooth_lm)
##
## Call:
## lm(formula = len ~ supp * dose, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.20 -2.72 -0.27 2.65 8.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.230 1.148 11.521 3.60e-16 ***
## suppVC -5.250 1.624 -3.233 0.00209 **
## dose1 9.470 1.624 5.831 3.18e-07 ***
## dose2 12.830 1.624 7.900 1.43e-10 ***
## suppVC:dose1 -0.680 2.297 -0.296 0.76831
## suppVC:dose2 5.330 2.297 2.321 0.02411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.631 on 54 degrees of freedom
## Multiple R-squared: 0.7937, Adjusted R-squared: 0.7746
## F-statistic: 41.56 on 5 and 54 DF, p-value: < 2.2e-16
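A useful check on your read of the coefficient table: with R's default treatment contrasts, the intercept is the mean of the reference group (OJ at dose 0.5), and other cell means are sums of coefficients. A quick sanity check (my addition, using only base R and the coefficients printed above):
# Mean for VC at dose 2 = intercept + suppVC + dose2 + suppVC:dose2
13.230 + (-5.250) + 12.830 + 5.330                          # 26.14
# ...which matches the observed cell mean (exactly, since the model is saturated):
with(ToothGrowth, mean(len[supp == "VC" & dose == "2"]))    # 26.14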
anova(tooth_lm)
## Analysis of Variance Table
##
## Response: len
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205.35 205.35 15.572 0.0002312 ***
## dose 2 2426.43 1213.22 92.000 < 2.2e-16 ***
## supp:dose 2 108.32 54.16 4.107 0.0218603 *
## Residuals 54 712.11 13.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
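Note that anova() reports sequential (Type I) sums of squares; because this design is balanced, they happen to match the Type II sums of squares from car::Anova() below.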
library(car)  # Anova() comes from the car package
Anova(tooth_lm, type = "II")
## Anova Table (Type II tests)
##
## Response: len
## Sum Sq Df F value Pr(>F)
## supp 205.35 1 15.572 0.0002312 ***
## dose 2426.43 2 92.000 < 2.2e-16 ***
## supp:dose 108.32 2 4.107 0.0218603 *
## Residuals 712.11 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
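To tie this output back to the "Calculating Sums of Squares" item above, here is a minimal sketch (my addition) recovering the residual and total sums of squares, and hence the proportion of variance explained, from the fitted model:
# Residual sum of squares: squared deviations of observations from fitted values
ss_residual <- sum(residuals(tooth_lm)^2)                        # 712.11, as in the table
# Total sum of squares: squared deviations of observations from the grand mean
ss_total    <- sum((ToothGrowth$len - mean(ToothGrowth$len))^2)  # 3452.21 = 205.35 + 2426.43 + 108.32 + 712.11
# Proportion of variance explained; matches Multiple R-squared from summary()
(ss_total - ss_residual) / ss_total                              # 0.7937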
32.4 Quiz
Reflection questions on [canvas]