10.3 Checking the data with graphics
Checking whether other missions had differing times for their crews uncovered a number of inconsistencies in the dataset. The mission title alone was not enough, as a title sometimes covers multiple flights, so the dataset was grouped by the variables mission_title, ascend_shuttle, in_orbit, descend_shuttle, and year_of_mission. This grouped many missions correctly but not all of them, because different codings (upper or lower case, with or without hyphens) were occasionally used and there were errors. A boxplot of the biggest differences in mission times for the 538 groups found is displayed in Figure 10.8. There are four gross differences and plenty of smaller ones. The biggest difference concerns the American/Russian trio who went to the ISS in 2001. The time for Susan Helms is, correctly, over 160 days, while the time for James Voss is 163 hours. Yuri Usachyov gets a similar time to Susan Helms, but is not in the same group, as the dataset reports he returned on a different spacecraft.
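The grouping check described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for Figure 10.8: the rows are invented toy values, the duration column name `hours_mission` is an assumption, and the helper names `normalise` and `time_spread` are hypothetical. It computes the difference between the longest and shortest recorded time within each group, first with the raw codings and then after stripping case and punctuation from the text keys.

```python
import re
from collections import defaultdict

# The five grouping variables named in the text.
GROUP_VARS = ("mission_title", "ascend_shuttle", "in_orbit",
              "descend_shuttle", "year_of_mission")

# Toy rows invented for illustration; these are NOT values from the dataset.
# `hours_mission` is an assumed name for the mission duration column.
toy_rows = [
    {"name": "A", "mission_title": "Mission X-1", "ascend_shuttle": "S1",
     "in_orbit": "ISS", "descend_shuttle": "S1", "year_of_mission": 2001,
     "hours_mission": 3900.0},
    {"name": "B", "mission_title": "mission x1", "ascend_shuttle": "S1",
     "in_orbit": "ISS", "descend_shuttle": "S1", "year_of_mission": 2001,
     "hours_mission": 3900.0},
    {"name": "C", "mission_title": "Mission X-1", "ascend_shuttle": "S1",
     "in_orbit": "ISS", "descend_shuttle": "S1", "year_of_mission": 2001,
     "hours_mission": 163.0},
    {"name": "D", "mission_title": "Mission Y", "ascend_shuttle": "S2",
     "in_orbit": "S2", "descend_shuttle": "S2", "year_of_mission": 2005,
     "hours_mission": 240.0},
    {"name": "E", "mission_title": "Mission Y", "ascend_shuttle": "S2",
     "in_orbit": "S2", "descend_shuttle": "S2", "year_of_mission": 2005,
     "hours_mission": 240.0},
]


def normalise(value):
    """Lower-case text keys and drop everything except letters and digits."""
    if isinstance(value, str):
        return re.sub(r"[^a-z0-9]", "", value.lower())
    return value  # leave numeric keys such as the year untouched


def time_spread(rows, normalise_keys=False):
    """Return {group key: max time - min time}, largest spread first."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(normalise(row[v]) if normalise_keys else row[v]
                    for v in GROUP_VARS)
        groups[key].append(row["hours_mission"])
    spreads = {key: max(times) - min(times) for key, times in groups.items()}
    return dict(sorted(spreads.items(), key=lambda kv: kv[1], reverse=True))


raw_spread = time_spread(toy_rows)                          # codings as recorded
clean_spread = time_spread(toy_rows, normalise_keys=True)   # codings harmonised
```

In the toy data, harmonising the codings merges the two spellings of the first mission into a single group. Normalisation of this kind is a blunt tool: it can also merge missions that are genuinely different, so any merged groups deserve a manual check.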
Ideally, all typos and errors should be corrected. The very first line of the dataset raises concern. Yuri Gagarin, who went into space in April 1961, cannot have gone up in Vostok 1, been in orbit in Vostok 2, and come down in Vostok 3! It is said of John Hartigan that when he discussed datasets at Yale seminars, he might take 20 minutes to examine the first line. This dataset is an example of why that approach can be valuable.
It would be informative to use exact flight dates, not just the years, and the dates are available on the Wikipedia page for human spaceflights (Wikipedia (2020b)). The dataset illustrates some of the issues that can arise with datasets that are made publicly available and how graphics can help to identify and deal with them. It also shows that even when there may be individual data problems, a dataset can still reveal much interesting information.
Answers: Many more men than women have gone up in space and the men have been older. Initial flights were short. Since the space station programmes began, people have stayed much longer in space. Most of them have been Russians or Americans, but this is changing.
Further questions: How many spaceflights have individuals made? Are there differences between the groups of U.S.S.R./Russians and Americans?
Graphical takeaways
- Graphics uncover problems in datasets, be they typos, errors or more serious issues. (§10.2 and §10.3)
- Scatterplots are excellent, but it is worth checking for duplicate pairs of values and drawing additional complementary plots. (Figure 10.2)
- Colours should be used consistently for the same categories. Different palettes should be used for other groupings to avoid confusion. (Figures 10.2 and 10.3 v. Figure 10.5 or Figure 10.7)
- Log (or other) transformations are effective for skewed distributions, especially with interpretable scaling. (Figure 10.6)
- Faceting disentangles overlapping groups in scatterplots. (Figure 10.7)
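
The takeaway on log transformations with interpretable scaling can be illustrated with a small sketch. It is not necessarily how Figure 10.6 was drawn: the choice of tick labels, the approximations of a month as 30 days and a year as 365 days, and the helper names `log_position` and `nearest_tick` are all assumptions made for this example. The idea is to place axis ticks at log-transformed values that correspond to familiar units, so that positions on the log axis can still be read in hours, days, and so on.

```python
import math

# Interpretable breakpoints for a duration measured in hours.
# Assumption: a month is taken as 30 days and a year as 365 days.
TICKS = {
    "1 hour": 1,
    "1 day": 24,
    "1 week": 24 * 7,
    "1 month": 24 * 30,
    "1 year": 24 * 365,
}


def log_position(hours):
    """Position of a duration on a base-10 log axis."""
    if hours <= 0:
        raise ValueError("log scale needs a positive duration")
    return math.log10(hours)


# Where each labelled tick sits on the transformed axis.
tick_positions = {label: log_position(h) for label, h in TICKS.items()}


def nearest_tick(hours):
    """Label of the tick closest to a duration on the log axis."""
    pos = log_position(hours)
    return min(tick_positions,
               key=lambda label: abs(tick_positions[label] - pos))
```

With ticks placed this way, a value of 30 hours sits closest to the "1 day" tick. In a plotting library, the values of `tick_positions` would be supplied as the tick locations and the keys as the tick labels.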