1.23 Good practices in data analysis (X)
1.23.1 Reproducibility & Replicability
- Replication vs. reproduction (terminology!)
- A crisis… (e.g. Open Science Collaboration 2015)
- Revealing errors (e.g. Psychoticism) but just as important…
- … the criteria of “good” evidence change (see footnote 4)
- Initiatives such as…
- ROpenSci
- Center for Open Science + Open Science Framework
- Pre-registration (Pros & Cons)
- Harvard dataverse
- Replication, replication (King 1995)
- Reproducible in 100 years!
- Ideally all stages of the workflow
- Finish reproduction files after publication
- Cache estimations (some contain randomness)!
- Use open-source software (normative/practical)
- Try out this template for R Markdown (Bauer 2018)
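The point about caching estimations that contain randomness can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `cache_estimate()`, `boot_se()`, the seed value and the file name are hypothetical and not taken from the Bauer (2018) template. The idea is to fix the seed, estimate once, store the result on disk and reuse the stored result afterwards.

```r
# Hypothetical helper: run `estimate_fun` once and cache its result on disk.
cache_estimate <- function(path, estimate_fun, seed = 2018) {
  if (file.exists(path)) {
    return(readRDS(path))   # reuse the cached result instead of re-estimating
  }
  set.seed(seed)            # fix the random number stream before estimating
  result <- estimate_fun()
  saveRDS(result, path)     # store the result for later runs
  result
}

# Toy "estimation" that involves randomness: a bootstrap standard error.
# The data values are made up for illustration.
boot_se <- function() {
  x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3)
  means <- replicate(500, mean(sample(x, replace = TRUE)))
  sd(means)
}

cache_file <- file.path(tempdir(), "model_boot_se.rds")
first  <- cache_estimate(cache_file, boot_se)  # estimated (or read, if cached)
second <- cache_estimate(cache_file, boot_se)  # read from the cache
identical(first, second)
```

Within an R Markdown document, the knitr chunk option `cache = TRUE` serves a similar purpose. An explicit file such as the one above has the advantage that it can be shared together with the other reproduction files.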
1.23.2 Reproducibility & Replicability
- Terminology: reproduce with the study’s data, replicate with new data!
1.23.3 Why reproducibility?
- Access:
- Taxpayers (= researchers) pay for research → should have access
- Better still: access for all humans → human progress! (Sci-Hub controversy)
- Implies relying on open-source software
- Access in 100 years: will Stata still exist?
- Memory
- You will forget what you did… and think of others who want to build on your work
- A reproducible document helps you trace your steps
- Ideally all stages of workflow
- Errors
- Manual steps (e.g. manual copy/paste) introduce errors
- Reproducible documents allow for automation (counter-argument?)
- Efficiency
- Automation → paper revisions become much faster
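A minimal sketch of what replacing manual copy/paste with automation can look like: the numbers reported in the text are computed from the data instead of being typed in by hand, so they update automatically when the data change during a revision. The data frame `survey` and its values are made up for illustration.

```r
# Toy data standing in for a real dataset (values are invented)
survey <- data.frame(trust = c(3, 7, 5, 8, 6, 4, 9, 5))

# Quantities that would otherwise be copied into the text by hand
n_obs      <- nrow(survey)
mean_trust <- mean(survey$trust)

# Build the sentence from the computed values
sentence <- sprintf(
  "The sample contains %d respondents with a mean trust score of %.2f.",
  n_obs, mean_trust
)
sentence
```

In an R Markdown file the same values could be inserted directly into the running text with inline R code, so no separate `sprintf()` call would be needed.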
1.23.4 Reproducibility: My current approach
- Every researcher has their own optimized setup
- Mine is summarized in Writing a Reproducible Paper in R Markdown
- Tools: R and R Markdown?
- Final product (e.g. scientific article, statistical report) produced by single .rmd file
1.23.5 Reproducibility in practice
- One folder for all files/project
- … the folder can be zipped and shared/uploaded
- I prefer no subfolders
- Filenames should be logical
- Main file with text/code: “paper.rmd”, “report.rmd”
- Data files: "data_xxxxxx.*"
- Image files: "fig_xxxxxx.*"
- Table files: "table_xxxx.*"
- etc.
- Important: Use the document outline in RStudio: Ctrl + Shift + O
- Name R chunks according to what they do or produce
- “fig-…” for chunks producing figures
- “table-…” for chunks producing tables
- “model-…” for chunks producing model estimates
- “import-…” for chunks importing data
- “recoding-…” for chunks in which data is recoded
- Use “really” informative variable names
- Q: What do you think the variable trstep measures? What could we call this variable instead?
- In part this happens automatically, because those names are used in tables etc.
- Use unique identifiers in the final document, e.g. for models “M1”, “M2” etc.
- These should also appear in the published paper
- …it will help others if you do the same for figures, tables etc.
- ALWAYS store the raw data, even if you scrape it from websites (they might disappear)
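The advice to always store the raw data could be implemented roughly as follows. This is a sketch under assumptions, not a prescribed workflow: `store_raw()` is a hypothetical helper, the file name merely follows the "data_xxxxxx.*" convention from above, and a short string stands in for a scraped web page.

```r
# Hypothetical helper: store scraped content untouched, with the retrieval
# date in the file name, and never overwrite an existing raw file.
store_raw <- function(content, name, dir = ".", ext = "html") {
  file_name <- sprintf("data_%s_raw_%s.%s",
                       name, format(Sys.Date(), "%Y%m%d"), ext)
  path <- file.path(dir, file_name)
  if (!file.exists(path)) {
    writeLines(content, path)
  }
  path
}

# In real use `content` might come from readLines() on a URL;
# here a short string stands in for the scraped page.
page <- "<html><body><p>Example page</p></body></html>"
raw_path <- store_raw(page, "example", dir = tempdir())

# All later recoding starts from the stored copy, not from the website
raw_again <- readLines(raw_path, warn = FALSE)
identical(raw_again, page)
```

Keeping the retrieval date in the file name documents when the data were collected, which matters if the website later changes or disappears.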
References
Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2016. “Explaining Causal Findings Without Bias: Detecting and Assessing Direct Effects.” Am. Polit. Sci. Rev. 110 (3): 512–29.
Bauer, Paul. 2018. “Writing a Reproducible Paper in R Markdown,” May.
Gill, Jeff. 1999. “The Insignificance of Null Hypothesis Significance Testing.” Polit. Res. Q. 52 (3): 647–74.
King, Gary. 1995. “Replication, Replication.” PS, Political Science & Politics 28 (3): 444–52.
King, Gary. 2010. “A Hard Unsolved Problem? Post-Treatment Bias in Big Social Science Questions.” In “Hard Problems in Social Science” Symposium, Harvard University. scholar.harvard.edu.
Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716.
4. See for instance the discussion surrounding the use of p-values/statistical significance (e.g. Gill 1999) and the current discussion about post-treatment bias (e.g. King 2010; Acharya, Blackwell, and Sen 2016).