1.23 Good practices in data analysis (X)

1.23.1 Reproducibility & Replicability

Reproducible in 100 years!
Ideally all stages of the workflow
Finish reproduction files after publication
Cache estimations (some contain randomness)!
Use open-source software (normative/practical)
Try out this template for R Markdown (Bauer 2018)
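
The caching advice can be sketched in a few lines of R. This is only an illustration, not the setup from Bauer (2018): the helper cached_estimate() and the file name model_bootstrap.rds are hypothetical, and the bootstrap stands in for any estimation that involves randomness.

```r
# Hypothetical helper: run an estimation once, store the result on disk,
# and reload the stored result on every later call.
cached_estimate <- function(path, estimate) {
  if (file.exists(path)) {
    return(readRDS(path))      # reuse the cached result
  }
  result <- estimate()         # run the (possibly slow, random) estimation
  saveRDS(result, path)        # cache it for later runs
  result
}

# Example estimation with randomness: bootstrap the mean of mpg in mtcars.
bootstrap_mean <- function() {
  set.seed(2018)               # fix the seed so a fresh run is repeatable
  replicate(500, mean(sample(mtcars$mpg, replace = TRUE)))
}

cache_file <- file.path(tempdir(), "model_bootstrap.rds")
first_run  <- cached_estimate(cache_file, bootstrap_mean)
second_run <- cached_estimate(cache_file, bootstrap_mean)  # read from disk
identical(first_run, second_run)
```

The seed makes a fresh run repeatable, and the cached file means the reported numbers do not change even if the estimation code is never rerun.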

1.23.2 Reproducibility & Replicability

Replication, replication (King 1995)
Replication vs. reproduction (terminology!)
- Reproduce with study’s data and replicate with new data!
A crisis… (e.g. Open Science Collaboration 2015)
Revealing errors (e.g. Psychoticism) but just as important…
… the criteria of “good” evidence change⁵
Initiatives such as…
- ROpenSci
- Center for Open Science + Open Science Framework
- Pre-registration (Pros & Cons)
- Harvard dataverse

1.23.3 Why reproducibility?

Access
- Taxpayers (= researchers) pay for research → should have access
- Better: all humans should have access → human progress! (Sci-Hub controversy)
- Implies relying on open-source software
- Access in 100 years... will Stata still exist?
Memory
- You will forget what you did... and think of others, too
- A reproducible document helps you trace your steps
- Ideally all stages of workflow
Errors
- Manual steps (e.g. manual copy/paste) introduce errors
- Reproducible documents allow for automation (counterargument?)
Efficiency
- Automation → paper revisions much faster
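
A small sketch of what replacing copy/paste with automation can look like. The model and the wording of the sentence are illustrative assumptions (built-in mtcars data, not data from the course). In an R Markdown document the same values could be inserted with inline R code.

```r
# Estimate a model and build the sentence reported in the paper from it,
# instead of copying numbers by hand.
model <- lm(mpg ~ wt, data = mtcars)
slope <- coef(model)[["wt"]]
ci    <- confint(model)["wt", ]   # lower and upper 95% bound for wt

sentence <- sprintf(
  "Each additional 1000 lbs changes fuel economy by %.2f mpg (95%% CI: %.2f to %.2f).",
  slope, ci[1], ci[2]
)
cat(sentence, "\n")
```

If the data or the model change during a revision, the sentence updates on the next run, so there is no stale number left to correct by hand.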

1.23.4 Reproducibility: My current approach

Every researcher has their own optimized setup...
Mine is summarized in “Writing a Reproducible Paper in R Markdown” (Bauer 2018)
Tools: R and R Markdown
Final product (e.g. scientific article, statistical report) is produced by a single .rmd file

1.23.5 Reproducibility in practice

One folder for all files of a project
- …the folder can be zipped and shared/uploaded
- I prefer no subfolders
Filenames should be logical
- Main file with text/code: “paper.rmd”, “report.rmd”
- Data files: "data_xxxxxx.*"
- Image files: "fig_xxxxxx.*"
- Table files: "table_xxxx.*"
- etc.
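
One benefit of such a naming scheme is that files can be sorted by role automatically. A minimal sketch, assuming the prefixes listed above; the file names and the helper file_role() are hypothetical.

```r
# Hypothetical project files following the naming scheme above.
files <- c("paper.rmd", "data_survey.csv", "fig_trust.png",
           "table_models.html", "notes.txt")

# Map each file to its role based on its name.
file_role <- function(name) {
  if (grepl("\\.rmd$", name, ignore.case = TRUE)) return("main")
  if (startsWith(name, "data_"))  return("data")
  if (startsWith(name, "fig_"))   return("figure")
  if (startsWith(name, "table_")) return("table")
  "unclassified"
}

roles <- vapply(files, file_role, character(1))
split(names(roles), roles)   # files grouped by role
```

Anything that ends up as "unclassified" is a candidate for renaming.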
Important: Use the document outline in RStudio (Ctrl + Shift + O)
Name R chunks according to what they do or produce
- “fig-…” for chunks producing figures
- “table-…” for chunks producing tables
- “model-…” for chunks producing model estimates
- “import-…” for chunks importing data
- “recoding-…” for chunks in which data is recoded
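
The prefix convention can also be checked mechanically. A minimal sketch with made-up chunk labels; the vector labels is hypothetical, and the regular expression simply encodes the five prefixes listed above.

```r
# Hypothetical chunk labels; inside a knitr session, knitr::all_labels()
# returns the labels of the current document.
labels <- c("import-survey", "recoding-trust", "model-trust-ols",
            "fig-trust-by-age", "table-models", "chunk7")

prefixes <- c("fig", "table", "model", "import", "recoding")
pattern  <- paste0("^(", paste(prefixes, collapse = "|"), ")-")

# Labels that do not say what the chunk does or produces.
uninformative <- labels[!grepl(pattern, labels)]
uninformative
```

Informative labels also show up in the document outline, which makes long files easier to navigate.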
Use “really” informative variable names
- Q: What do you think the variable trstep measures? What could we call this variable instead?
- In part this happens automatically, because those names are used in tables etc.
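
Renaming takes one line in R. The sketch below assumes that trstep is the European Social Survey item on trust in the European Parliament; the new name trust_eu_parliament is only one possible choice, and the values are made up.

```r
# Toy data with a terse variable name (values are made up).
survey <- data.frame(trstep = c(3, 7, 5))

# Rename to something a reader can understand without a codebook.
# Assumption: trstep is the ESS item on trust in the European Parliament.
names(survey)[names(survey) == "trstep"] <- "trust_eu_parliament"
names(survey)
```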
Use unique identifiers in the final document, e.g. for models “M1”, “M2” etc.
- These should also appear in the published paper
- …it will help others if you do the same for figures, tables etc.
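
One way to keep identifiers consistent is to store the models in a named list and carry the names into every table. A minimal sketch using the built-in mtcars data; the two models are placeholders, not models from the course.

```r
# Store models under the identifiers that will appear in the paper.
models <- list(
  M1 = lm(mpg ~ wt, data = mtcars),
  M2 = lm(mpg ~ wt + hp, data = mtcars)
)

# Coefficient table with the model identifier attached to each row.
coef_table <- do.call(rbind, lapply(names(models), function(id) {
  est <- coef(models[[id]])
  data.frame(model = id, term = names(est), estimate = unname(est))
}))
coef_table
```

Because the identifiers live in one place, a reader can match "M2" in the paper to the object models$M2 in the reproduction files.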
ALWAYS store the raw data, even if you scrape it from websites (they might disappear)
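
A minimal sketch of storing raw data so that it cannot be overwritten by accident. The helper store_raw(), the file name, and the stand-in page content are hypothetical; a real project would pass in whatever the scraper returned.

```r
# Hypothetical helper: write raw data once and never overwrite it.
store_raw <- function(lines, path) {
  if (file.exists(path)) {
    stop("Raw file already exists, refusing to overwrite: ", path)
  }
  writeLines(lines, path)
  invisible(path)
}

# Stand-in for scraped content; in practice this would come from the website.
scraped  <- c("<html>", "<body>Example page</body>", "</html>")
raw_file <- file.path(tempdir(), paste0("data_raw_", Sys.Date(), ".html"))
store_raw(scraped, raw_file)

# All later steps read from the stored copy, not from the live website.
page <- readLines(raw_file)
```

Putting the download date in the file name documents when the copy was taken, which matters once the website changes or disappears.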

References

Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2016. “Explaining Causal Findings Without Bias: Detecting and Assessing Direct Effects.” Am. Polit. Sci. Rev. 110 (3): 512–29.

Bauer, Paul. 2018. “Writing a Reproducible Paper in R Markdown,” May.

Gill, Jeff. 1999. “The Insignificance of Null Hypothesis Significance Testing.” Polit. Res. Q. 52 (3): 647–74.

King, Gary. 1995. “Replication, Replication.” PS, Political Science & Politics 28 (3): 444–52.

King, Gary. 2010. “A Hard Unsolved Problem? Post-Treatment Bias in Big Social Science Questions.” In “Hard Problems in Social Science” Symposium, Harvard University. scholar.harvard.edu.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716.

See for instance the discussion surrounding the use of p-values/statistical significance (e.g. Gill 1999) and the current discussion about post-treatment bias (e.g. King 2010; Acharya, Blackwell, and Sen 2016).