Data-driven publishing: Reproducible research with R, Quarto, and GitHub

Kyiv School of Economics

Author

Arthur Small

Published

June 1, 2022

Course Description

This course introduces tools and concepts of literate programming. Literate programming is an approach to creating documents that smoothly integrates data, code, and narrative writing.

All analysts need to present their results in multiple formats: articles, slide decks, web sites, and so on. Traditional workflows for creating and publishing documents rely heavily on manual steps such as copy-and-paste, which makes them poorly suited to data-intensive analytic projects.

This course introduces an entirely different and better approach to scientific and technical publishing, one that is code-driven and reproducible. Reproducible workflows are well suited to creating and presenting the results of data analysis. They are increasingly indispensable for professionals working in data science and allied data-intensive, analytic roles.

The course will emphasize hands-on project development. On the first day, each student will create and publish a professional webpage for themselves. By the end of the course, students will create and present a project that integrates data analysis in a real application, preferably related to the reconstruction of Ukraine. Prizes will be awarded for the top three presentations.


Why you should care

If you work in business, economics, data science, or any affiliated professional field, the core of your work will likely involve producing documents, presentations, articles, and other such reports. Many of these reports will focus on the results of data analysis. Not to put too fine a point on it: this is a lot of what you will do all day.

The core idea is to encode all the steps in the document-creation workflow in code, rather than in copy-and-paste steps.

Literate programming is especially suited to producing technical articles and reports, particularly reports that depend on the analysis of data. It is also very well suited to producing a series of deliverables that must be updated as the data change.
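
As a minimal sketch of this idea, the short Python script below computes summary statistics from data and substitutes them into a report template, so that no number is ever typed or pasted by hand. It is written in Python only for illustration; the course itself works in R and Quarto. The template text, the function name `render_report`, and the sample numbers are all hypothetical and are not taken from the course materials.

```python
from statistics import mean
from string import Template

# Hypothetical report template: the narrative sits next to placeholders
# that are filled in from the data, never typed in by hand.
REPORT_TEMPLATE = Template(
    "Monthly summary\n"
    "Observations: $n\n"
    "Mean value: $mean_value\n"
    "Largest value: $max_value\n"
)


def render_report(values):
    """Compute summary statistics and substitute them into the template."""
    if not values:
        raise ValueError("values must not be empty")
    return REPORT_TEMPLATE.substitute(
        n=len(values),
        mean_value=f"{mean(values):.2f}",
        max_value=f"{max(values):.2f}",
    )


if __name__ == "__main__":
    # Illustrative numbers only; they are not course data.
    print(render_report([10.0, 12.5, 11.0]))
    # Re-running with updated data regenerates the whole report.
    print(render_report([10.0, 12.5, 11.0, 20.5]))
```

When the data change, re-running the script regenerates every figure in the report. Literate programming tools are designed to give the same property to full documents that combine narrative, code, and output.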

Literate programming provides a foundation for practices of reproducible research.

In addition to learning theory, each student will undertake a semester-long project in the analysis of time series data. The goal of this project is to produce a serious professional report.

The course coverage and the sequence of assignments are structured to walk the student through the process of creating such an analytic report. Three areas of practice are integrated:

1. Theory and methods of time series statistics and forecasting.

2. Coding for data analysis.

3. Literate programming and reproducible research.

Ideally, this project will relate closely to the student’s own dissertation research, professional practice, or another domain application that interests them. My hope is that, for interested students, these projects could form the basis for subsequent research papers, dissertation chapters, or other professional work products.

The course will therefore be structured primarily as a workshop: the ultimate goal is to help you create a professionally presented report. Our workflow will be subject to revision, according to my judgement of how best to use our time toward that goal.

Prerequisites

  1. Probability and statistics: Students should have taken at least one rigorous intermediate-level course in probability and statistics. They should be comfortable with the representation of uncertain information in the form of probability distributions, with conditional probabilities, linear models, regression using ordinary least squares, and with other such foundational concepts.

    Required prerequisite: SYS 4021 Linear Statistical Models, or equivalent. A rigorous prior introduction to probability and statistics with linear models is required. If in doubt, check with the instructor.

  2. Coding for data analysis: To carry out the data analysis, students should be able to code in a general-purpose language such as R or Python. Experience cleaning, wrangling, and structuring data will be especially helpful. Coding experience is required.

    Recommended prerequisite or co-requisite: SYS 2202 Data and Information Engineering.

    For this course, we will work in R. R is an excellent language for statistics and data science. It provides a number of specialized packages for working with time series data and generating forecasts.

Prior exposure to the tools of literate programming and reproducible research is not required. Experience creating documents using R Markdown, Jupyter Notebooks, or Quarto will be helpful and will shorten the spin-up time needed to produce reports. Experience using Git and GitHub for version control will also be helpful.

About these notes

This document contains class notes and other materials related to SYS 5581 Time Series and Forecasting at the University of Virginia. It is not intended to be a complete text.

The course outline is divided into two major sections. First, we will introduce the theory, with examples. In the later part of the semester, we will focus on workshopping your projects in progress.

Readings and references

Time series

FPP3 = Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3

TFS = [these notes]

For more advanced treatments of time series, see Brockwell and Davis (2016).

Statistics with R

Data science with R, general

R4DS = Wickham, Hadley, and Garrett Grolemund, R for Data Science

TSDS = Carrie Wright, Shannon E. Ellis, Stephanie C. Hicks and Roger D. Peng, Tidyverse Skills for Data Science

Tibshirani, Ryan, Statistics 36-350 Statistical Computing, Carnegie Mellon University, Fall 2019

Other references

A variety of other references to resources on time series and forecasting are gathered in the Zotero library for this course.

Acknowledgements

These notes are organized using the bookdown package (Xie 2022), which was built on top of R Markdown and knitr (Xie 2015).