27.1 What does provenance mean?
The word provenance has long been used in the art world. The value of a work of art can depend on its origins and history having been fully authenticated, whether anyone likes it or not. Other fields have adopted the term, and the provenance of datasets is now discussed too: their origin and source, the people and organisations involved, how the data were collected, their reliability and quality, and how they have been processed and used (cf. Meng (2021)). This has become more important as the ease with which data can be shared has increased. Lee Wilkinson told me that he included Anderson’s iris dataset with his Systat software in the early 1980s by copying it in by hand. Forty years on, vastly larger datasets can be transferred across the web with a single click.
Provenance is about establishing a dataset’s credibility and is essential support for reproducing the results of others. Datasets do not emerge fully formed as they are collected. Checks are carried out, corrections are made, various weightings and transformations are applied, some data and variables may be dropped, and derived variables added. Huebner et al. (2020) describe a review of how Initial Data Analysis (IDA) was reported in 25 observational medical studies published by five medical journals in 2018. They concluded that reporting of IDA was sparse and unsystematic. The PISA studies comparing the performance of schoolchildren in different countries are a different example. Complex weighting schemes are used to reflect the sampling methods for each country and to make results comparable. Without detailed help explaining how the data were restructured, it would be impossible to reproduce what has been done.
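To see why documenting the weighting matters, here is a minimal sketch in base R with invented numbers (the scores and weights below are purely illustrative and are not taken from PISA or from any real survey): a weighted estimate can differ noticeably from an unweighted one, so results cannot be reproduced without knowing which weights were applied.

```r
# Illustrative only: toy values, not real PISA data or sampling weights
scores  <- c(480, 505, 530, 450, 610)   # test scores for five sampled students
weights <- c(1.0, 2.5, 0.8, 3.0, 1.2)   # hypothetical sampling weights

# The two estimates differ, so reproducing a reported result
# requires knowing exactly which weights were used and how.
mean(scores)
weighted.mean(scores, weights)
```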
Data arise in many different ways, ranging from meticulously conducted scientific experiments to voluntary online surveys. At the latter extreme, all sorts of fanciful statistics are generated that have little or no meaning or value. At the former extreme, it is still necessary to check how and under what conditions the experiments were carried out and whether the results can be generalised beyond the bounds of that particular study. Testing a theory on a small group of students is at best an indication of more general validity, however conscientiously the research is carried out.
Stigler (2019) discusses the importance of investigating the source and processing of classical datasets. A classical dataset in his definition is one that “has been collected for some scientific or commercial purpose, and has been employed for instruction or exposition by several people”. A few are used in this book, and the pre-checking carried out on them is described below. For around half of the book’s datasets, users can download newer data and carry out fresh analyses themselves. Recent data are both more useful and more motivating than old data.
As an example of datasets collated and offered by public institutions, there is a valuable collection of demographic data by country available from the Human Mortality Database (University of California & Demographic Research (2022)). Data are provided for many countries over many years, with comments on the likely quality of the data for particular country-year combinations and corresponding advice to use them with care. The Gapminder Foundation provides extensive data on many different countries over the years. For the life expectancy data used in Chapter 2, they provide accompanying information on how they estimated values for isolated calamitous years.