1.23 Good practices in data analysis (X)

1.23.1 Reproducibility & Replicability

1.23.3 Why reproducibility?

  • Access:
    • Taxpayers (= researchers) pay for research → they should have access to it
    • Better: all humans have access → human progress! (Sci-Hub controversy)
    • Implies relying on open-source software
    • Access in 100 years: will Stata still exist?
  • Memory
    • You will forget what you did (and think of others who have to follow your work)
    • A reproducible document helps you retrace your steps
    • Ideally it covers all stages of the workflow
  • Errors
    • Manual steps (e.g. manual copy/paste) introduce errors
    • Reproducible documents allow for automation (counterargument?)
  • Efficiency
    • Automation → paper revisions are much faster

1.23.4 Reproducibility: My current approach

1.23.5 Reproducibility in practice

  • One folder for all files of a project
    • …the folder can be zipped and shared/uploaded
    • I prefer no subfolders
  • Filenames should be logical
    • Main file with text/code: "paper.rmd", "report.rmd"
    • Data files: "data_xxxxxx.*"
    • Image files: "fig_xxxxxx.*"
    • Table files: "table_xxxx.*"
    • etc.
  • Important: Use the document outline in RStudio: Ctrl + Shift + O
  • Name R chunks according to what they do or produce
    • “fig-…” for chunks producing figures
    • “table-…” for chunks producing tables
    • “model-…” for chunks producing model estimates
    • “import-…” for chunks importing data
    • “recoding-…” for chunks in which data is recoded
  • Use “really” informative variable names
    • Q: What do you think the variable trstep measures? What could we call this variable instead?
    • In part this happens automatically, because those names are used in tables etc.
  • Use unique identifiers in the final document, e.g. for models “M1”, “M2” etc.
    • These should also appear in the published paper
    • …it will help others if you do the same for figures, tables etc.
  • ALWAYS store the raw data, even if you scrape it from websites (they might disappear)
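
The advice to always store raw data could be put into practice roughly as follows. This is only a minimal sketch: the helper name save_raw, the "data_raw_" prefix, the date stamp, and the ".html" extension are assumptions made for illustration, not part of the notes. The idea is to write the scraped content to disk untouched, before any cleaning, and never to overwrite an existing raw file.

```r
# Hypothetical helper (name and file pattern are assumptions):
# store scraped content exactly as received and refuse to overwrite it.
save_raw <- function(content, name, dir = ".", date = Sys.Date()) {
  path <- file.path(
    dir,
    sprintf("data_raw_%s_%s.html", name, format(date, "%Y%m%d"))
  )
  if (file.exists(path)) {
    stop("Raw file already exists, not overwriting: ", path)
  }
  writeLines(content, path, useBytes = TRUE)
  invisible(path)
}

# Usage with a made-up page in a throwaway folder; in a real project
# `content` would be the text returned by the scraping step and
# `dir` the project folder.
proj_dir <- tempfile("project_")
dir.create(proj_dir)
page <- c("<html>", "<body>Example page</body>", "</html>")
raw_path <- save_raw(page, "examplepage", dir = proj_dir)

# The stored copy is identical to what was scraped
identical(readLines(raw_path), page)
```

All later recoding would then start from the stored file rather than from the live website, so the analysis still runs if the page disappears.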

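The filename conventions listed above can also be checked automatically. The following sketch assumes the prefixes from the notes ("paper.rmd"/"report.rmd", "data_", "fig_", "table_"); the function name check_filenames and the example file names are made up for illustration.

```r
# Hypothetical helper: return the files that do NOT follow the naming
# scheme (main file, data_*, fig_*, table_*). Matching ignores case,
# so "paper.Rmd" is accepted as well.
check_filenames <- function(files) {
  pattern <- "^(paper|report)\\.rmd$|^(data|fig|table)_[^/]+\\.[A-Za-z0-9]+$"
  files[!grepl(pattern, files, ignore.case = TRUE)]
}

# Made-up example; in a real project one might use list.files(".")
files <- c("paper.rmd", "data_survey.csv", "fig_trust.png",
           "table_models.html", "final_v2_NEW.docx")
check_filenames(files)
```

In this example only "final_v2_NEW.docx" is flagged. The pattern is deliberately strict and would need extending for other file types a project uses (e.g. a bibliography file).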

  1. See for instance the discussion surrounding the use of p-values/statistical significance (e.g. Gill 1999) and the current discussion about post-treatment bias (e.g. King 2010; Acharya, Blackwell, and Sen 2016).

