Chapter 3 Statistical & Data Science Practice

First, an excellent list of data science resources:

3.1 Introduction

How does one approach a statistics or data science project?

3.1.1 A theory of data analysis

Roger Peng, 2018-12-11, “The Role of Theory in Data Analysis”

Stephanie C. Hicks and Roger Peng, 2019-03-18, “Elements and Principles of Data Analysis”, arXiv.org (Hicks and Peng 2019a)

Stephanie C. Hicks and Roger Peng, 2019-04-26, “Evaluating the Success of a Data Analysis”, arXiv.org (Hicks and Peng 2019b)

3.1.2 The function of data science: solving business problems

Emily Robinson, 2017-09-27, Managing Business Challenges In Data Science

3.1.3 Design thinking >> data science context

Roger Peng, 2019-01-09, How Data Scientists Think - A Mini Case Study

3.1.4 Opinionated Analysis Development

An over-arching structure of what a project could (or should?) look like can be boiled down into three features: it is

  1. Reproducible and auditable

  2. Accurate

  3. Collaborative

Hilary Parker, 2017-08-30, Opinionated Analysis Development

3.2 General practice and workflow

Jenny Bryan on workflow:

Jenny Bryan, Project-oriented workflow

Jenny Bryan, Ode to the here package

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

Jenny Bryan (2017) Workflow: you should have one, Keynote talk at EARL London 2017.

Kara Woo (2020-02-13) What They Forgot to Teach You About R at rstudio::conf 2020: all the materials from the workshop delivered by Kara Woo, Jim Hester, and Jenny Bryan

Other authors:

Kieran Healy, The Plain Person’s Guide to Plain Text Social Science (PDF)

Kass RE, Caffo BS, Davidian M, Meng X-L, Yu B, Reid N (2016) “Ten Simple Rules for Effective Statistical Practice”. PLoS Comput Biol 12(6): e1004961. doi:10.1371/journal.pcbi.1004961

Ray Li (2016) “7 habits of highly effective data analysis”. Dataconomy.com, 2016-02-16.

Noble, William Stafford (2009-07-31) A Quick Guide to Organizing Computational Biology Projects, PLoS Comput Biol 5(7): e1000424.

3.3 Version Control with Git & GitHub

Jenny Bryan, the STAT 545 TAs, and Jim Hester, Happy Git and GitHub for the useR

Mine Çetinkaya-Rundel (2019-07-15) R & GitHub sitting in a tree…

3.4 R-specific workflow

3.4.1 Agile

Edwin Thoen, Agile Data Science with R

3.4.2 other

Gabe Becker (2017) Enhancing reproducibility, comparability and discoverability of results in multi-analyst settings, presentation at EARL (Enterprise Applications of R Language), San Francisco, June 5-7, 2017

  • this came to my attention via the Not So Standard Deviations podcast, episode 40 “It’s the CDs All Over Again” (2017-06-13). The discussion of Gabe Becker’s presentation begins at ~13’ 10”.

    • some of Hilary Parker’s observations: the topic doesn’t get much air time; his talk takes a wide view of the issues in an organization and of the trade-offs in a collaborative environment (rather than a single analyst). In a multi-analyst system (e.g. where a researcher/biologist and a statistician are working with the same data), results have to be reconciled, and there might be parallel studies whose results need to be reconciled. This creates a need for the data to be created in similar environments.

    • Concerns: reproducibility, comparability, discoverability (finding the results), and empowerment. Data scientists skew to empowerment! But the concerns are in tension: you can increase reproducibility, but at the cost of empowerment, etc.

    • “Most organizations aren’t making that judgement call ahead of time…any organization … if data scientists were there early, you’re going to be skewed to agility and high empowerment. Because that’s like what we want! … Data scientists are allergic to process…having any template for results, or … even having to put things into a specific tool.”

    • Different systems need different constraints.

    • Roger Peng: most organizations don’t realize that they are explicitly making these trade-offs, and that they can’t maximize all of them. They have to make a choice, and that’s very unsatisfactory for people.

Emily Robinson, Red Flags in Data Science Interviews

Noam Ross, Reproducibility in an Office World: A Brief History of Failures and the Odd Success, presentation to R-Ladies NYC, 2018-11-06

3.5 R packages supporting robust workflow

R project workflows: a GitHub repository compiled by jdblischak

  • “This is a non-exhaustive list of R code that assists with the workflow of R projects. Some are R packages, others are Git repositories meant to be used as templates, and some are guides. Here “workflow” is broadly defined, and includes one or more of the following: file organization, dependency management (e.g. like GNU Make), or report generation.”

3.5.1 {drake}

CRAN: drake: A Pipeline Toolkit for Reproducible Computation at Scale – “A general-purpose computational engine for data analysis, drake rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date. Not every execution starts from scratch, there is native support for parallel and distributed computing, and completed projects have tangible evidence that they are reproducible. Extensive documentation, from beginner-friendly tutorials to practical examples and more, is available at the reference website https://docs.ropensci.org/drake/ and the online manual https://books.ropensci.org/drake/.”
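The idea in that description (rebuild a result only when something it depends on has changed, and skip it otherwise) can be illustrated in a few lines. The following is a minimal, language-agnostic sketch written in Python; it is not drake's API and does not reflect how drake is implemented. The names `Pipeline`, `add`, `make`, and `fingerprint` are hypothetical, and the sketch assumes every intermediate value is JSON-serializable.

```python
import hashlib
import json


def fingerprint(value):
    """Return a stable hash of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class Pipeline:
    """Toy pipeline: each target is a function of named upstream targets."""

    def __init__(self):
        self.steps = {}  # target name -> (function, list of dependency names)
        self.cache = {}  # target name -> (fingerprint of inputs, value)
        self.log = []    # names of targets that were actually (re)built

    def add(self, name, func, deps=()):
        """Register or replace a target; replacing it discards its cached value."""
        self.steps[name] = (func, list(deps))
        self.cache.pop(name, None)

    def make(self, name):
        """Build a target, reusing cached results whose inputs are unchanged."""
        func, deps = self.steps[name]
        inputs = [self.make(dep) for dep in deps]
        key = fingerprint(inputs)
        cached = self.cache.get(name)
        if cached is not None and cached[0] == key:
            return cached[1]  # up to date: skip the work
        value = func(*inputs)
        self.cache[name] = (key, value)
        self.log.append(name)
        return value


pipeline = Pipeline()
pipeline.add("raw", lambda: [1, 2, 3])
pipeline.add("total", lambda raw: sum(raw), deps=["raw"])
pipeline.add("report", lambda total: f"total = {total}", deps=["total"])

pipeline.make("report")                 # first run builds raw, total, report
pipeline.make("report")                 # nothing changed, so nothing is rebuilt
pipeline.add("raw", lambda: [1, 2, 4])  # an upstream change
pipeline.make("report")                 # raw, total, and report are rebuilt
```

After these calls, `pipeline.log` records six builds: three from the first run and three after the upstream change, with none from the second run. A real pipeline tool also has to track changes in code and files, store results on disk, and schedule work in parallel, all of which this sketch leaves out.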

3.5.2 {janitor}

{janitor} – “has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.”

CRAN: janitor: Simple Tools for Examining and Cleaning Dirty Data

GitHub: sfirke/janitor

3.5.4 {usethis}

“usethis is a workflow package: it automates repetitive tasks that arise during project setup and development, both for R packages and non-package projects”

CRAN: usethis: Automate Package and Project Setup – “Automate package and project setup tasks that are otherwise performed manually. This includes setting up unit testing, test coverage, continuous integration, Git, ‘GitHub’, licenses, ‘Rcpp’, ‘RStudio’ projects, and more.”


3.6 Reproducible research

Or:

reproducibleresearch.net

Emily Riederer, 2019-08-29, Resource Round-Up: Reproducible Research Edition – annotated bibliography of some essential resources on the topic.

Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.), The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences – online book with 31 case studies of reproducible research workflows.

Roger Peng (2014-06-06) The Real Reason Reproducible Research is Important

Roger Peng, 2015-06-15, “The reproducibility crisis in science: A statistical counterattack”, Significance

Rich FitzJohn, Matt Pennell, Amy Zanne, Will Cornwell (2014-06-09) Reproducible research is still a challenge (at rOpenSci)

Melissa Assel, MS; Andrew J. Vickers, PhD (2018-02-06) “Statistical Code for Clinical Research Papers in a High-Impact Specialist Medical Journal”, Annals of Internal Medicine

V. Orozco, C. Bontemps, E. Maigné, V. Piguet, A. Hofstetter, A. Lacroix, F. Levert, J.M. Rousselle (2018-07) “How To Make A Pie: Reproducible Research for Empirical Economics & Econometrics”

Jeffrey M. Perkel, “A toolkit for data transparency takes shape”, Nature, 2018-08-20.

Daniel Barron (2018-08-13) How Freely Should Scientists Share Their Data?, Scientific American blog

Karl Broman (2019-02-17) Collaborating reproducibly, slides for talk given at the AAAS meeting in Washington, DC. (See also https://github.com/kbroman/Talk_AAAS2019)

3.6.1 Reproducible research with R

R for Reproducible Scientific Analysis – “an introduction to R for non-programmers using gapminder data”, part of Software Carpentry

Jeremy Anglin, Reproducible analysis with knitr, R Markdown, and RStudio

Ben Marwick, 20 July 2017, Reproducible Research Compendia via R packages, presentation at Berlin R Users

Sharla Gelfand (2020-01-30) Don’t repeat yourself, talk to yourself! Repeated reporting in the R universe – presentation at rstudio::conf 2020

  • see also {redoc}, “a package to enable a two-way R Markdown-Microsoft Word workflow. It generates Word documents that can be de-rendered back into R Markdown, retaining edits on the Word document, including tracked changes.”

  • see also this list

3.6.2 Reproducible data

Greg Finak (2018-09-18) Building Reproducible Data Packages with DataPackageR

Luis Darcy Verde Arregoitia (2018) Good practices for sharing analysis-ready data in mammalogy and biodiversity research, Hystrix It. J. Mamm. 2018;29(2):155–161

3.6.3 spreadsheets: the anti-reproducible research

Karl Broman and Kara Woo, “Data organization in spreadsheets” (Broman and Woo 2017).

Data Organization in Spreadsheets for Social Scientists: Formatting problems – DataCarpentry lesson

Jenny Bryan’s spreadsheets talk given May & June 2016 reframes Shotwell as “Spreadsheets: a dystopian moonscape of unrecorded user actions.”

Ignasi Bartomeus and F Rodriguez-Sanchez, Non-reproducible workflows: a horror movie

  • with more at Reproducibilidad (http://ecoinfaeet.github.io/2016/07/06/reproducibilidad/)

Gordon Shotwell, 2017-02-02, R for Excel Users

Luis A. Apiolaza, 2017-11-11, Reducing friction in R to avoid Excel

3.7 Collaboration

Amit Bhattacharyya, 2017-11-01, Become a Better Statistician by Actively Collaborating (at Amstat News)

Peter Seibel, 2017-11-19, Repo style wars: mono vs multi

3.8 Data Informed or Data Driven? Data Science at Work

Ricardo Bion, Robert Chang, and Jason Goodman (2017-08-23) How R Helps Airbnb Make the Most of Its Data

The Stitch Fix Algorithms Tour

Behavioural Insights Team (UK), 2017-12-14, Using Data Science in Policy

3.9 Agile practice

Agile Scrum Guide, BC Government DevEx

3.10 Data Quality & Context Compatibility

Or, do your data really mean what you hope they do?

Jacob Harris (2014) “Distrust your data”, 2014-05-24.

Roger Peng (2018) Context Compatibility in Data Analysis, 2018-05-30

3.11 File storage and naming conventions

Sustainability of Digital Formats: Planning for Library of Congress Collections

Jenny Bryan, naming things, Reproducible Science Workshop, 2015
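Advice on file naming commonly recommends putting an ISO 8601 date first, so that names sort chronologically, and using one delimiter between fields and another within them, so that names can be split by machine. The following is a minimal Python sketch of such a convention. The function name `make_filename` and the field layout (date, description, status) are assumptions made for illustration and are not taken from the resources listed above.

```python
import re
from datetime import date


def make_filename(day, *parts, ext="csv"):
    """Build a name such as 2015-06-01_survey-results_raw.csv."""
    slugs = []
    for part in parts:
        # Lower-case, then collapse any run of other characters into one hyphen.
        slug = re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")
        if slug:
            slugs.append(slug)
    # Underscores separate fields; hyphens separate words within a field.
    return "_".join([day.isoformat(), *slugs]) + "." + ext


names = [
    make_filename(date(2015, 11, 3), "Survey Results", "raw"),
    make_filename(date(2015, 6, 1), "Survey Results", "raw"),
    make_filename(date(2015, 6, 1), "Survey Results", "clean"),
]

# Plain alphabetical sorting is also chronological, because the date comes first.
ordered = sorted(names)

# The fields can be recovered by splitting on the underscore.
fields = ordered[0].rsplit(".", 1)[0].split("_")
```

Here `ordered` begins with the two files from 2015-06-01, and `fields` is `["2015-06-01", "survey-results", "clean"]`.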


3.12 Data practice

Karl Broman, data organization

Karl Broman and Kara Woo, “Data organization in spreadsheets” (Broman and Woo 2017).

3.12.1 Versioned data

Daniel Falster, Richard G FitzJohn, Matthew W. Pennell, William K. Cornwell (2017-11-10) Versioned data: why it is needed and how it can be achieved (easily and cheaply)


3.13 Coding practice

Coding etiquette, a guide to writing clear, informative and easy-to-use #RStats code by @CodingClub

Joel Lee, 2017-12-22, The Weirdest Programming Principles You’ve Never Heard Of

3.13.1 Style guides for code

Hadley Wickham, The tidyverse style guide

Hadley Wickham, “Style Guide” chapter from Advanced R

Henrik Bengtsson, R Coding Conventions (RCC) - a draft (2009)

Google’s R Style Guide

3.13.2 Version control

Jenny Bryan and the STAT 545 TAs, Happy Git and GitHub for the useR

John D. Blischak, Emily R. Davenport, Greg Wilson (2016-01-19) “A Quick Introduction to Version Control with Git and GitHub”, PLoS Computational Biology.

3.13.3 Documentation

Sébastien Rochette, 2019-07-10, Rmd first: When development starts with documentation

3.13.4 Literate programming

Literate programming is the practice of explaining the program logic in a natural language; it goes beyond what might be called “documentation”. In the world of R, the R Markdown functionality within RStudio, including R Notebooks, is a way to program in this manner.

The concept was introduced by Donald Knuth in 1984; the original article is Knuth, Donald E. (1984). “Literate Programming” (PDF). The Computer Journal. British Computer Society. 27 (2): 97–111.

3.13.5 Functions in R

Colin Fay, Playing with R, infix functions, and pizza

3.13.6 Naming variables

“There are only two hard things in Computer Science: cache invalidation and naming things.” – Phil Karlton

Andy Lester, “The World’s Two Worst Variable Names”

3.13.7 Clean coding

Robert C. Martin, 2008, Clean Code: A Handbook of Agile Software Development, Prentice Hall.

Robert C. Martin, 2011, The Clean Coder: A Code of Conduct for Professional Programmers, Prentice Hall.

3.13.8 Further reading

Aimee Gott, 2015-07-16, Developing a R Validation Framework, BaselR

Dani Marillas, 2017-01-25, “Don’t document your code. Code your documentation.”

3.14 General research practice


3.14.1 things that are no doubt useful and/or interesting but don’t really fit anywhere in the existing typology

Louisa Smith, epi quals study calendar


References

Broman, Karl, and Kara Woo. 2017. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10. https://doi.org/10.1080/00031305.2017.1375989.
Hicks, Stephanie C., and Roger D. Peng. 2019a. “Elements and Principles of Data Analysis.” arXiv.org. https://arxiv.org/abs/1903.07639v1.
———. 2019b. “Evaluating the Success of a Data Analysis.” arXiv.org. https://arxiv.org/abs/1904.11907.