# Chapter 32 Reflect, Review, and Relax

Motivating scenarios: We are taking stock of where we are in the term, thinking about stats and science, and making sure we understand the material to date.

The Science of Doubt. link. By Michael Whitlock.

## 32.1 Review / Setup

• So much of stats aims to learn the TRUTH.

• We focus so much on our data, on how to measure uncertainty around estimates, and on the (in)compatibility of data with a null model. We will review and solidify this.

• But we should also recognize that much beyond sampling error can mislead us.

## 32.2 How science goes wrong

Watch the video below. When you do, consider these types of errors that accompany science. You should be able to think about these and ask good questions about them.

• Fraud.
• Wrong models.
• Experimental design error.
• Communication error.
• Statistician error.
• HARKing (hypothesizing after the results are known).
• Coding error.
• Technical error.
• Publication bias.

You should have something to say about

• The “replication crisis,” and
• If/why preregistration of studies is a good idea.

Figure 32.1: Watch this hour-long video on The Science of Doubt by Michael Whitlock.

A brief word on publication bias: Scientists are overworked and have too much to do. They get more rewards for publishing statistically significant results, so those are usually higher on the to-do list. This results in the file drawer effect, in which non-significant results are less likely to be submitted for publication.

I simulate this below, and then have a web app (basically this code dressed up in sliders) for you to use to explore this.

# Load packages: tidyverse for tibble(), uncount(), dplyr verbs, and ggplot2; plotly for ggplotly()
library(tidyverse)
library(plotly)

# Set it up
sample_sizes      <- c(2,4,6,8,12,16,24,32,48,64,96,128,192,250, 384,500,768, 1000)
replicates        <- 10000
total_experiments <- length(sample_sizes) * replicates
exp_id            <- factor(1:total_experiments)
mu                <- .2
sigma             <- 1

# Simulate: one row per observation, so each experiment gets sample_size rows
# (about 35 million rows in total; lower replicates for a quicker run)
sim_dat <- tibble(exp_id      = exp_id,
                  sample_size = rep(sample_sizes, each = replicates)) %>%
  uncount(weights = sample_size, .remove = FALSE)          %>%
  mutate(sim_val  = rnorm(n = n(), mean = mu, sd = sigma))

# Summarize and hypothesis test: one-sample t-test of the null that the mean is zero
sim_summary <- sim_dat %>%
  group_by(exp_id) %>%
  summarise(n        = n(),
            mean_val = mean(sim_val),
            se       = sd(sim_val) / sqrt(n),
            t        = mean_val / se,
            p_val    = 2 * pt(q = abs(t), df = n-1, lower.tail = FALSE),
            reject   = p_val < 0.05) %>%
  group_by(n) %>%
  mutate(power = mean(reject)) %>%   # power = proportion of experiments of size n that reject
  ungroup()

# plot: mean estimate against power, split by whether the null was rejected
sim_plot <- ggplot(sim_summary,  aes(x = power, y = mean_val, label = n))+
  stat_summary(aes(color = reject),
               geom = "text", size = 3,
               position = position_nudge(y = .02, x = -.015),
               show.legend = FALSE) +
  stat_summary(aes(color = reject), geom = "point",
               show.legend = FALSE)            +
  stat_summary(aes(color = reject), geom = "line")+
  stat_summary(geom = "line", color = "black")  +   # black line: mean across all experiments
  annotate(x = .5, y = mu+.02, geom = "text", label = "mean of all sims", size = 2)  +
  theme_light()+
  labs(title = "Significant results are biased. Numbers show n")
ggplotly(sim_plot)

Figure 32.2: Testing the null that the mean equals zero, when we know the true mean is 0.2.
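
To make the file drawer effect concrete with numbers rather than a plot, here is a minimal base R sketch of the same idea at a single sample size. The choices of `n = 16`, 5000 experiments, the seed, and the helper name `one_experiment()` are my assumptions for illustration, not values from the simulation above. It compares the average estimate across all experiments with the average among only the "publishable" (significant) ones.

```r
# A minimal sketch of the file drawer effect at one sample size.
# Assumed for illustration: n = 16, 5000 experiments, seed = 1.
set.seed(1)
mu            <- 0.2
sigma         <- 1
n             <- 16
n_experiments <- 5000

# Hypothetical helper: run one experiment and test the null that the mean is zero
one_experiment <- function() {
  x    <- rnorm(n, mean = mu, sd = sigma)
  test <- t.test(x, mu = 0)
  c(estimate = mean(x), p_val = test$p.value)
}

# One row per experiment, with columns estimate and p_val
results <- as.data.frame(t(replicate(n_experiments, one_experiment())))
results$significant <- results$p_val < 0.05

# Average estimate across every experiment (close to the true mean)
mean_all <- mean(results$estimate)

# Average estimate among only the significant experiments (the "published" ones)
mean_published <- mean(results$estimate[results$significant])

# Simulated power, alongside the analytic value from power.t.test()
power_sim      <- mean(results$significant)
power_analytic <- power.t.test(n = n, delta = mu, sd = sigma,
                               type = "one.sample")$power

c(mean_all = mean_all, mean_published = mean_published,
  power_sim = power_sim, power_analytic = power_analytic)
```

With low power, the only experiments that reach significance are those that happened to overestimate the effect, so `mean_published` should sit well above the true mean of 0.2 while `mean_all` stays close to it.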

Interact with the app below (basically this code with widgets) to see how this biases our estimates.

I find this stuff fascinating. If you want more, here are some good resources.

• Videos from Calling Bullshit (largely redundant with the video above): 7.2 Science is amazing, but…, 7.3 Reproducibility, 7.4 A Replication Crisis, 7.5 Publication Bias, and 7.6 Science is not Bullshit.

• The replication crisis

  • Estimating the reproducibility of psychological science link.
  • A glass half full interpretation of the replicability of psychological science link.
  • The Persistence of Underpowered Studies in Psychological Research: Causes, Consequences, and Remedies link.

• P-hacking

  • The Extent and Consequences of P-Hacking in Science link.
  • The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time link.

## 32.3 Review

You should be pretty comfortable with the ideas of

• Parameters vs Estimates
• Sampling and what can go wrong
• Null hypothesis significance testing
• Common test statistics
  • F
  • t
• Calculating Sums of Squares
• Interpreting stats output like that below
library(dplyr)   # mutate()
library(car)     # Anova(), used for the Type II table

ToothGrowth <- mutate(ToothGrowth, dose = factor(dose))
tooth_lm <- lm(len ~ supp * dose, data = ToothGrowth)

summary(tooth_lm)
##
## Call:
## lm(formula = len ~ supp * dose, data = ToothGrowth)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -8.20  -2.72  -0.27   2.65   8.27
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)    13.230      1.148  11.521 3.60e-16 ***
## suppVC         -5.250      1.624  -3.233  0.00209 **
## dose1           9.470      1.624   5.831 3.18e-07 ***
## dose2          12.830      1.624   7.900 1.43e-10 ***
## suppVC:dose1   -0.680      2.297  -0.296  0.76831
## suppVC:dose2    5.330      2.297   2.321  0.02411 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.631 on 54 degrees of freedom
## Multiple R-squared:  0.7937, Adjusted R-squared:  0.7746
## F-statistic: 41.56 on 5 and 54 DF,  p-value: < 2.2e-16
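
One way to check that you can read the coefficient table above is to rebuild a group mean and the t values by hand. The sketch below uses base R only; the object name `tooth` is my own (it avoids overwriting `ToothGrowth`), and everything else comes from the model above. From the printed output, the predicted mean for VC at dose 2 is 13.230 − 5.250 + 12.830 + 5.330 = 26.14.

```r
# A sketch, assuming the same model as above, refit here so the block runs on its own.
# ToothGrowth ships with R; 'tooth' is a name chosen for this example.
tooth    <- transform(ToothGrowth, dose = factor(dose))
tooth_lm <- lm(len ~ supp * dose, data = tooth)
b        <- coef(tooth_lm)

# Predicted mean for VC at dose 2:
# intercept (OJ, dose 0.5) + supp shift + dose shift + interaction
vc_dose2 <- unname(b["(Intercept)"] + b["suppVC"] + b["dose2"] + b["suppVC:dose2"])

# Observed mean of that group, for comparison
observed <- mean(tooth$len[tooth$supp == "VC" & tooth$dose == "2"])

# Each t value is the estimate divided by its standard error
coefs     <- summary(tooth_lm)$coefficients
t_by_hand <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-tailed p-values from the t distribution with the residual degrees of freedom
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(tooth_lm), lower.tail = FALSE)

round(c(predicted = vc_dose2, observed = observed), 2)
round(t_by_hand, 3)
```

Because this model has one parameter per supp-by-dose combination, the predicted and observed group means agree. Note that each coefficient is a difference from the reference group (OJ at dose 0.5), not a group mean on its own.
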
anova(tooth_lm)
## Analysis of Variance Table
##
## Response: len
##           Df  Sum Sq Mean Sq F value    Pr(>F)
## supp       1  205.35  205.35  15.572 0.0002312 ***
## dose       2 2426.43 1213.22  92.000 < 2.2e-16 ***
## supp:dose  2  108.32   54.16   4.107 0.0218603 *
## Residuals 54  712.11   13.19
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(tooth_lm, type = "II")   
## Anova Table (Type II tests)
##
## Response: len
##            Sum Sq Df F value    Pr(>F)
## supp       205.35  1  15.572 0.0002312 ***
## dose      2426.43  2  92.000 < 2.2e-16 ***
## supp:dose  108.32  2   4.107 0.0218603 *
## Residuals  712.11 54
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
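
To connect the sums of squares in these tables to their definitions, the following base R sketch recomputes a few of them by hand. The object names (`tooth`, `ss_total`, `ss_model`, `ss_error`, `ss_supp`) are mine, chosen for illustration; the model is the same one fit above.

```r
# A sketch of sums of squares by hand, assuming the same model as above.
tooth    <- transform(ToothGrowth, dose = factor(dose))
tooth_lm <- lm(len ~ supp * dose, data = tooth)

grand_mean <- mean(tooth$len)

# Total, model, and error sums of squares
ss_total <- sum((tooth$len - grand_mean)^2)          # observations vs grand mean
ss_model <- sum((fitted(tooth_lm) - grand_mean)^2)   # predictions vs grand mean
ss_error <- sum(residuals(tooth_lm)^2)               # observations vs predictions

# R-squared is the proportion of total variation explained by the model
r_squared <- ss_model / ss_total

# Whole-model F: mean square for the model over mean square for error
df_model <- length(coef(tooth_lm)) - 1
df_error <- df.residual(tooth_lm)
f_model  <- (ss_model / df_model) / (ss_error / df_error)

# Sum of squares for supp: group size times squared deviation of each supp mean
supp_means <- tapply(tooth$len, tooth$supp, mean)
supp_n     <- tapply(tooth$len, tooth$supp, length)
ss_supp    <- sum(supp_n * (supp_means - grand_mean)^2)
f_supp     <- (ss_supp / 1) / (ss_error / df_error)

round(c(ss_total = ss_total, ss_model = ss_model, ss_error = ss_error,
        ss_supp = ss_supp, r_squared = r_squared,
        f_model = f_model, f_supp = f_supp), 3)
```

The values should line up with the `Residuals` and `supp` rows of the ANOVA tables and with the R-squared and F-statistic lines of `summary(tooth_lm)`. The sequential and Type II tables agree here because the design is balanced, with the same number of observations in every supp-by-dose combination.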

## 32.4 Quiz

Reflection questions on Canvas.

### References

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251). https://doi.org/10.1126/science.aac4716.
Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of p-Hacking in Science.” PLOS Biology 13 (3): 1–15. https://doi.org/10.1371/journal.pbio.1002106.
Leek, Jeffrey T., Prasad Patil, and Roger D. Peng. 2015. “A Glass Half Full Interpretation of the Replicability of Psychological Science.” http://arxiv.org/abs/1509.08968.
Maxwell, Scott E. 2004. “The Persistence of Underpowered Studies in Psychological Research: Causes, Consequences, and Remedies.” Psychological Methods 9 (2): 147–63. https://doi.org/10.1037/1082-989x.9.2.147.