27.4 How do invented, simulated, and real datasets compare?

Datasets can be invented, simulated or real. Invented datasets were very common in older textbooks before most people had access to computers. The datasets had to be small and the calculations simple, so that everything could be done by hand or perhaps with the help of a calculator. Some researchers claimed this was a good thing, as it kept students closer to their data. Shayle Searle reminisced to me that in his Statistics course at Cambridge in the early 1950s, this was the reason Dennis Lindley gave for not allowing students to use the hand calculators then available. Times change.

Unfortunately, invented datasets are still found in some textbooks, sometimes with flippant background descriptions. Attempting to make statistics attractive to students by using superficial examples with supposedly amusing subjects will not necessarily encourage them to take actual statistics seriously.

Simulated datasets are often used in research to investigate the performance of methods on data with known structure. That can be informative. Using simulated data for exercises is not so helpful. Simulated datasets are just that, simulated, and rarely reflect the issues and problems that arise with real datasets, be they missing values, outliers, inconsistencies, errors or whatever. More importantly, there is no way to use context or additional data to check features discovered in simulated data. Simulated datasets do play a useful role in graphical inference, where they are used to produce line-ups for checking whether a particular graphic display from a real dataset could reasonably be said to have the feature identified (Wickham et al. 2011). This is an interesting theoretical approach, requiring users to compare groups of similar graphics, looking for the one that may be different.
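The line-up idea can be sketched in a few lines of code. The following is a minimal illustration only, not the authors' implementation: the function name and the choice of permutation as the null-generating mechanism are assumptions made here for the example. The real data are hidden among null datasets produced by shuffling one variable, which destroys any genuine relationship; if a viewer can pick out the real panel, that is evidence the pattern is not an artefact.

```python
import random

def lineup(real, n_panels=20, seed=1):
    """Place the real dataset among null datasets made by permuting y.

    `real` is a list of (x, y) pairs. Returns the list of panels and
    the (secret) position of the real data among them.
    """
    rng = random.Random(seed)
    x = [p[0] for p in real]
    y = [p[1] for p in real]
    panels = []
    for _ in range(n_panels - 1):
        y_perm = y[:]
        rng.shuffle(y_perm)          # break any x-y relationship
        panels.append(list(zip(x, y_perm)))
    pos = rng.randrange(n_panels)    # hide the real data at a random slot
    panels.insert(pos, real)
    return panels, pos

# A dataset with a strong linear pattern; each panel would be drawn
# as one small scatterplot in the actual line-up display.
real = [(i, 2 * i + 1) for i in range(10)]
panels, pos = lineup(real)
```

In practice each panel is plotted identically and shown to viewers who do not know `pos`; the sketch above only covers the data-generation step, which is where the simulation enters.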

Using real datasets means more work. You have to find out if you are permitted to use them, you have to check them, and you have to understand them. And, of course, you should cite the source properly to give credit where credit is due. It is a form of plagiarism to use data without acknowledgement. Many apparently simple datasets have only been put together with enormous effort. Think of the experiments on the speed of light by Newcomb and Michelson, think of the data on Darwin’s finches collected by the Grants over many years, think of the researchers measuring penguins on the Palmer Archipelago off the Antarctic Peninsula. Data collection may take a lot of time and effort, as may data preparation. This work should be acknowledged.

What is real about real datasets can be an issue. Many datasets that are publicly available have been cleaned, edited and filtered, even simplified. One example is the datasets on measurements of the speed of light. Another is the Titanic dataset where there are many versions and most are rehashes of early summaries. Modern research has provided more and better detail.

One of the most famous experiments in physics was Millikan’s oil drop experiment to determine the charge on the electron. Millikan carried out his experiments before the First World War and, according to his notebooks, did not report all his results, excluding those where he believed the experiments to have been flawed in some way (Holton 1978). At around the same time, Ehrenhaft carried out experiments with the same aim and, in contrast to Millikan, found a broad range of electron charges (Ehrenhaft 1910). Ehrenhaft apparently reported all his experimental results, but scientific opinion sided with Millikan. Not all data are equal, and you cannot treat all data equally.

Datasets are sometimes treated as objective truths and given too much respect. On the other hand, they are sometimes unjustifiably maligned as unreliable and worthless. The fact that the phrase “the camera doesn’t lie” is not always true does not mean that all photographs are invalidated. It just means that you should take into account the choice of subject, the point of view, the weather, the lighting, the colouring, the lens and so on. A good dataset has clearly defined variables, reliable values, and a verifiable provenance. No dataset is perfect, but many are informative.

Main points

  • It is essential to know the provenance of a dataset, its source, whether it has been edited and amended, and if so, how.
  • Large, documented datasets are much more readily available now than in the past. They require extensive checking.
  • Each dataset has its own issues. There is no uniform approach that will work for each one.
  • Real datasets are better than invented or simulated ones.