28.6 Putting datasets together and reproducibility

28.6.1 Reading in and merging datasets

Datasets come in different shapes, sizes, and formats. Modern software can usually read all of them, but some conversion may be necessary. Spreadsheets offer flexible structures with multiple sheets, and may include background information, annotations, and other non-standard structures. These need special treatment, as with the UN population dataset used in Chapter 8. Only the first and second sheets were required and the initial lines of the first had to be edited. Spreadsheets can be very useful for initial recording of data because of the notes and background information that can be included. Public organisations often distribute data in spreadsheet format. Journal authors sometimes make their data available in spreadsheets. There are disadvantages. Spreadsheets may include summary data along with case data. Additional text and multi-line headers may be employed. Different versions, subsets, or aggregations of a dataset may be provided in individual sheets of a spreadsheet without being linked. In other words, if a data value were to be amended, it would have to be changed separately in each sheet in which it appeared (e.g., the spreadsheet used in Chapter 19).
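The kind of special treatment described above can be sketched in a few lines. The example below is a minimal illustration in Python, not the code used for the book: the sheet contents, the column names, and the function `read_case_rows` are all hypothetical. It assumes a sheet has been exported as text with some annotation lines above the header and a summary row mixed in with the case data.

```python
import csv

# Hypothetical text export of one sheet (values are made up):
# two annotation lines, then the header, the case rows, and a summary row.
raw_sheet = """\
Population estimates (illustrative values only)
Source: hypothetical example
country,year,population
Aland,2020,100
Borduria,2020,250
Total,2020,350
"""


def read_case_rows(text, skip_lines, label_column, summary_labels=("Total",)):
    """Skip the annotation lines, then drop rows that hold summaries, not cases."""
    lines = text.splitlines()[skip_lines:]
    reader = csv.DictReader(lines)  # first remaining line is taken as the header
    return [row for row in reader if row[label_column] not in summary_labels]


rows = read_case_rows(raw_sheet, skip_lines=2, label_column="country")
for row in rows:
    print(row["country"], row["population"])
```

Keeping the number of skipped lines and the summary labels as explicit arguments records the decisions that would otherwise be made silently by hand-editing the file.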

If data are needed from more than one dataset, the datasets must usually be merged. Chapter 8 combined the chess ratings dataset with files for country populations, country codings, and regions. This increased the number of variables from 12 to 20, although many were unused and could have been dropped. The number of players included in this merged dataset dropped from 362502 to 191097, as only active players were considered. Finally, the ratings data for two separate years were merged. The 2020 dataset had 362502 players and 12 variables, while the 2015 dataset had 227960 players and 12 variables. The merged dataset that included all players from both years had 364012 players.
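A merge that keeps all players from both years is a full outer join on a player identifier. The following is a minimal sketch in Python with made-up records. The identifiers, field names, and the function `full_outer_merge` are assumptions for illustration and are not taken from the chess ratings files.

```python
# Hypothetical records keyed by a player identifier (values are made up).
ratings_2015 = {
    101: {"name": "Player A", "rating_2015": 2400},
    102: {"name": "Player B", "rating_2015": 2150},
}
ratings_2020 = {
    102: {"name": "Player B", "rating_2020": 2200},
    103: {"name": "Player C", "rating_2020": 1900},
}


def full_outer_merge(left, right):
    """Keep every key found in either dataset, combining fields where both exist."""
    merged = {}
    for key in sorted(left.keys() | right.keys()):
        record = {}
        record.update(left.get(key, {}))   # fields from the first dataset, if any
        record.update(right.get(key, {}))  # fields from the second dataset, if any
        merged[key] = record
    return merged


merged = full_outer_merge(ratings_2015, ratings_2020)
print(len(ratings_2015), len(ratings_2020), len(merged))
```

The merged dataset has one record per identifier in the union of the two sets, so it is at least as large as the bigger input and no larger than the sum of the two. The player counts quoted above follow this pattern: 364012 is a little more than 362502 and far less than 362502 + 227960.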

In Chapter 4 the voting results file for the 1912 Democratic Primary was summarised by state and merged with files for the numbers of each state’s electoral votes in 1912 and 2020, and with a file for state populations in 2018. The unsummarised data file was also separately merged with a file of estimated times at which the ballots took place.
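Summarising by a grouping variable and then merging with a lookup file can be sketched as follows. This is an illustrative Python example with invented values. The state labels, vote counts, and function names are hypothetical and do not come from the Chapter 4 files.

```python
from collections import defaultdict

# Hypothetical ballot-level records and a per-state lookup (values are made up).
ballots = [
    {"state": "State A", "candidate": "X", "votes": 10},
    {"state": "State A", "candidate": "Y", "votes": 6},
    {"state": "State B", "candidate": "X", "votes": 4},
]
electoral_votes = {"State A": 5, "State B": 8}


def summarise_by_state(records):
    """Aggregate the vote counts to one total per state."""
    totals = defaultdict(int)
    for record in records:
        totals[record["state"]] += record["votes"]
    return dict(totals)


def merge_lookup(totals, lookup, new_name):
    """Left join: keep every summarised state, adding the lookup value if present."""
    return [
        {"state": state, "votes": votes, new_name: lookup.get(state)}
        for state, votes in sorted(totals.items())
    ]


state_table = merge_lookup(summarise_by_state(ballots), electoral_votes, "electoral_votes")
for row in state_table:
    print(row)
```

States missing from the lookup would receive `None` instead of being dropped, which makes incomplete matches visible after the merge.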

Mapped data reveal spatial patterns. Shape files for areas can be merged with data files. This was done in Chapters 7 (Bertin’s French workforce data), 9 (Gay Rights survey), 25 (Titanic), and 26 (German election 2021).

28.6.2 Reproducibility

For scientific publications it would be ideal to have authors provide code as well as data, with code for both their data wrangling and their analyses. Only base data would have to be made available, as other versions could be reconstructed with the code. This would improve the chances of being able to reproduce study results, providing a record of what had been done. Of course, that record would only be understandable to readers who could follow the coding language.

Reproducing published work is easier than it was, but is still not satisfactory, as various surveys in different fields have pointed out (e.g., Gabelica et al. (2022)). Two examples arose in this book. In Chapter 22 numerous tables of percentage success rates of facial recognition software were reported, but the raw data were not made available. Making the reasonable assumption that the underlying results must have been integers, it was possible to reconstruct much of the data from the percentages reported. In the supplementary material to the article on tests for malaria discussed in Chapter 19, there was a multi-sheet Excel spreadsheet giving the raw data and the data used in the article’s plots and tables. The sheets were not linked and formulae were not reported. With a little work the connections could be reconstructed. With other datasets (that were therefore not included in the book) insufficient information was provided or essential details were missing.
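The idea of recovering integer counts from rounded percentages can be illustrated with a short search. The sketch below is written in Python and is not the procedure used for Chapter 22. The function `consistent_counts` and the example percentage are hypothetical. It assumes a percentage reported to a given number of decimal places and lists every pair of success count and group size whose exact percentage lies within half a unit of the last reported digit.

```python
from fractions import Fraction


def consistent_counts(pct, decimals, n_values):
    """Return (successes, total) pairs whose percentage rounds to the reported one.

    pct is passed as a string (e.g. "66.7") so that it is represented exactly.
    """
    target = Fraction(pct)
    half_unit = Fraction(1, 2 * 10**decimals)  # rounding tolerance
    matches = []
    for n in n_values:
        for k in range(n + 1):
            if abs(Fraction(100 * k, n) - target) <= half_unit:
                matches.append((k, n))
    return matches


# Group size unknown: several candidates remain for totals up to 10.
candidates = consistent_counts("66.7", 1, range(1, 11))
print(candidates)

# Group size known: the count is pinned down.
known_total = consistent_counts("66.7", 1, [3])
print(known_total)
```

When group sizes are reported, or can be bounded from other tables, the candidates often reduce to a single pair. When several pairs remain, the data cannot be fully reconstructed from that percentage alone.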

Main points

Data cleaning is a continuing process through all of an investigation.
Datasets may be subset, grouped or aggregated in many different ways, providing many different views.
Layers of related data structures are a good way of building graphics.
Wrangling is an essential support for graphical analysis.