27.2 What problems arise with datasets?

Real datasets bring real problems. The original sources may be unknown. It may be unclear how the data were collected. The datasets may have unusual and awkward structures. They may be badly formatted. Variables may be ambiguously or just poorly defined. There may be duplicate variables and cases. There may be missing values, errors, inconsistencies, roundings, uncertainty.

Graphical displays can help identify many of these problems and sometimes, although not always, help to fix them. That is one important reason for using real datasets for all examples in this book, rather than invented or simulated datasets. Even if problems cannot be fixed, it is advantageous to be aware of them. Interpretations of results have to be qualified by how reliable and sound the underlying data are judged to be. Knowing the data and their possible weaknesses strengthens an analyst’s position.

27.2.1 Data definitions

Sometimes the meaning of the variables recorded is obvious, sometimes not. Sometimes the meaning seems obvious, but turns out not to be. In most countries the number of votes won by a party in an election is clear. In Germany everyone has two votes, the first for an individual, who is usually a member of a party, and the second for a party. There is a complicated system for determining how many seats each party then receives, and there have been numerous attempts over the past few years to find a satisfactory solution that does not involve the total number of seats increasing after each election. Ireland has an electoral system based on proportional representation, so you can talk about how many first preference votes a party receives, but this may not match the number of seats the party wins.

German car sales data do not include details for models with fewer than 5 sales in a year unless they are in the Luxury or Sportscar segments. Crime statistics cover reported crimes, but not all crimes. It is always advisable to have precise definitions for variables in a dataset.

During the Covid pandemic it was difficult to compare the rates of illness and death in different countries partly because of the difficulties in collecting the data and partly because of the different definitions used.

In any dataset collected over time, definitions change. Rules of sporting events change, conditions change, even countries change, as in the Olympics dataset.

27.2.2 Data collection

Even if definitions of variables are agreed upon and unproblematic, it may be impractical to collect the information accurately. Much height and weight information is self-reported, as are eating habits and exercise regimes.

A dataset that is designed for one purpose may not be suitable for another. The number of tickets sold for a soccer game does not tell you how many people were there. Season ticket holders from far away may choose to only go to the top games. The number of cars registered may not be the same as the number of cars sold, as dealers may be encouraged to register more cars than they sell.

Data may have been collected at different times or by different people. Experimenters may improve their technique as they gain experience with a new method. Both these factors are relevant in considering Newcomb’s data on the speed of light.

Car speedometers may or may not give accurate figures. To counteract possible complaints about the accuracy of the speed estimates made by police cameras in Germany, 3 km/h is subtracted from the estimate for speeds up to 100 km/h and 3% is subtracted from higher speeds.
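The tolerance rule just described is simple arithmetic and can be written down as a small function. The following is only a sketch of the rule as stated in the text: the function name `adjusted_speed` is hypothetical, treating exactly 100 km/h as belonging to the lower band is an assumption, and any rounding applied in practice is not modelled.

```python
def adjusted_speed(measured_kmh: float) -> float:
    """Return the speed after subtracting the tolerance described in the text.

    Assumption: a measured speed of exactly 100 km/h falls in the lower band.
    Both rules give 97 km/h at that point, so the choice does not matter there.
    """
    if measured_kmh < 0:
        raise ValueError("speed cannot be negative")
    if measured_kmh <= 100:
        # Fixed deduction of 3 km/h for speeds up to 100 km/h
        return measured_kmh - 3.0
    # Proportional deduction of 3% for higher speeds
    return measured_kmh * 0.97


# Illustrative measured values (invented, not real camera readings)
for measured in (53, 100, 150):
    print(measured, "->", round(adjusted_speed(measured), 1))
```

The point for data analysis is that a recorded speed may already have had a deduction applied, so it helps to know whether a variable holds measured or adjusted values.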

Results in clinical trials can be complicated by unexpected events (Sackett & Gent (1979)). If someone in the treatment arm of a trial for heart disease is run over by a bus, is that anything to do with the trial?

How questions are asked can influence answers to surveys (e.g., Thau et al. (2021) and several of the papers cited in it). In an episode of the BBC comedy programme “Yes, Prime Minister”, the civil servant Sir Humphrey Appleby gave an example of how opinion poll questions and their order may affect results (the clip can be found on YouTube by searching for “yes prime minister opinion polls”).

27.2.3 Data availability

Many datasets are available in R or one of its associated packages. Most have been polished or prepared before being published in R and, while many were real datasets, they were not as real as they might have been. Sometimes the documentation available for them is vague and incomplete.

Big datasets need preparatory work. The stages of checking, editing, and reorganising data (getting data ready for graphics and analysis) are discussed in more detail. The data wrangling code developed for these tasks is also included on the book’s webpage. Several datasets are continually updated and publicly available (e.g., Gapminder, movie ratings, fuel efficiency, the Comrades ultramarathon, chess ratings). Readers can download more recent versions and use the code on the book’s webpage, possibly with minor amendments, to obtain new results. Estimates of how much of a project’s time is spent cleaning up data vary considerably, but everyone agrees it takes a lot of time and effort. Knowing which graphics to draw and how they can be drawn is not enough; the data must be prepared in an appropriate form. This involves data restructuring on top of data cleaning.
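As one illustration of what restructuring can mean, a table with one column per year often has to be turned into one row per observation before it can be plotted. The sketch below uses only the Python standard library rather than the book’s own code; the function `wide_to_long`, the column names, and the values are all invented for illustration.

```python
def wide_to_long(rows, id_col, var_name, value_name):
    """Turn one row per case (wide) into one row per case and variable (long)."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue  # the identifier is repeated on every output row
            long_rows.append(
                {id_col: row[id_col], var_name: col, value_name: value}
            )
    return long_rows


# Invented wide-format data: one column per year
wide = [
    {"country": "A", "2000": 1.0, "2010": 1.5},
    {"country": "B", "2000": 2.0, "2010": 2.5},
]

long_format = wide_to_long(wide, "country", "year", "value")
for record in long_format:
    print(record)
```

The long form has one row for each combination of country and year, which is the shape that many plotting tools expect.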

Although much work may be needed to clean datasets, it is essential to keep a copy of the dataset in its original form and to keep all the code used in cleaning. Auditors or others may have to be shown exactly what was done, and analysts need to be able to remind themselves some time later of what they did. Rerunning cleaning code on the original dataset often takes little time, confirms that the original is being used, and removes the need to store intermediate versions of the data. New information may arise, invalidating some of the data cleaning and requiring new approaches. Original datasets have to be kept available in their original form.
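One way to follow this advice is to record a checksum of the original file when it is first received and to rebuild the cleaned data from the original every time, instead of saving intermediate versions. The sketch below is a minimal illustration under stated assumptions: the data are invented, the helper names (`sha256_of`, `clean`, `rebuild`) are hypothetical, and the two cleaning steps stand in for whatever a real dataset needs.

```python
import csv
import hashlib
import io


def sha256_of(raw: bytes) -> str:
    """Fingerprint of the original bytes, used to detect any change."""
    return hashlib.sha256(raw).hexdigest()


def clean(raw: bytes) -> list:
    """Hypothetical cleaning: trim whitespace and drop exact duplicate rows.

    Assumes a rectangular CSV in which every row has every field.
    """
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    seen = set()
    rows = []
    for row in reader:
        tidy = {key.strip(): value.strip() for key, value in row.items()}
        fingerprint = tuple(tidy.items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            rows.append(tidy)
    return rows


def rebuild(raw: bytes, recorded_hash: str) -> list:
    """Check that the original is unchanged, then rerun the cleaning code."""
    if sha256_of(raw) != recorded_hash:
        raise ValueError("original dataset differs from the recorded version")
    return clean(raw)


# Invented stand-in for the bytes of an original file
ORIGINAL = b"name, height\nAnn, 170\nBob ,182\nAnn, 170\n"
RECORDED_HASH = sha256_of(ORIGINAL)  # stored once, when the data first arrive

cleaned = rebuild(ORIGINAL, RECORDED_HASH)
print(cleaned)
```

Because the cleaned data are always derived from the original, a change to the cleaning rules only requires editing `clean` and rerunning it, and the checksum confirms that the starting point has not been altered.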

Knowing why, how, and by whom datasets were collected is a key part of their value. Sources and contexts should be included in supporting information and reviewed. Standards have become better in general in this regard with Kevin Wright’s package agridat (Wright (2022)) and Jeremy Singer-Vine’s Data is Plural collection (Singer-Vine (2020)) being positive examples. Yet there is always room for improvement.