Before getting our hands dirty — with code, rather than with paint — we briefly reflect on some ways of justifying and evaluating visualizations.
4.2.1 Why visualize?
The urge to visualize data is not a new phenomenon. Although our ability to generate visualizations has greatly increased with the ubiquity of computers, people have always drawn sketches and diagrams to understand natural and statistical phenomena (Friendly, 2008).
But why should we visualize data? Most people intuitively assume that visualizations help our understanding of data by illustrating or emphasizing certain aspects, facilitating comparisons, or clarifying patterns that would otherwise remain hidden. Precisely justifying why visualizations may have these benefits is much harder (see Streeb et al., 2019, for a comprehensive analysis). And it would be naive to assume that visualizations are always helpful or appropriate. Instead, we can easily think of potential problems caused by visualizations and claim that they frequently distract from important aspects, facilitate misleading comparisons, or obscure and hide patterns in data. Thus, visualizations are representations that can be good or bad on many levels. Creating effective visualizations requires a mix of knowledge and skills that include aspects of human perception, psychology, design, and technology.
Whenever theoretical justifications are hard, we can use an example that proves the existence of cases in which visualizations help. The following code snippets define four R objects:
```r
# Get 4 sets of data:
a_1 <- ds4psy::get_set(1)
a_2 <- ds4psy::get_set(2)
a_3 <- ds4psy::get_set(3)
a_4 <- ds4psy::get_set(4)
```
Each of these four objects provides a small data frame (containing 11 rows and 2 columns named x and y). The following functions allow us to examine the object a_1:
```r
# Inspect a_1:
a_1         # print object (data frame)
#>      x     y
#> p01 10  8.04
#> p02  8  6.95
#> p03 13  7.58
#> p04  9  8.81
#> p05 11  8.33
#> p06 14  9.96
#> p07  6  7.24
#> p08  4  4.26
#> p09 12 10.84
#> p10  7  4.82
#> p11  5  5.68
dim(a_1)    # a table with 11 cases (rows) and 2 variables (columns)
#> [1] 11  2
str(a_1)    # see table structure
#> 'data.frame': 11 obs. of  2 variables:
#>  $ x: num  10 8 13 9 11 14 6 4 12 7 ...
#>  $ y: num  8.04 6.95 7.58 8.81 8.33 ...
names(a_1)  # names of the 2 columns (variables)
#> [1] "x" "y"
```
With some training in statistics, we might be tempted to summarize the values of both variables x and y (by computing their means or standard deviations), or to assess their relationship to each other (as a correlation or linear regression).
For the first set a_1, the corresponding values are as follows:
```r
# Analyzing a_1:
mean(a_1$x)  # mean of x
#> [1] 9
mean(a_1$y)  # mean of y
#> [1] 7.500909
sd(a_1$x)    # SD of x
#> [1] 3.316625
sd(a_1$y)    # SD of y
#> [1] 2.031568
cor(x = a_1$x, y = a_1$y)  # correlation between x and y
#> [1] 0.8162365
lm(y ~ x, a_1)  # linear model/regression: y by x
#> 
#> Call:
#> lm(formula = y ~ x, data = a_1)
#> 
#> Coefficients:
#> (Intercept)            x  
#>      3.0001       0.5001
```
Rather than doing this only for a_1, we could examine all four sets in this way.
Table 4.1 provides an overview of the corresponding summary statistics:
What would we conclude from Table 4.1? Given that the means, standard deviations, correlations, and best fitting lines of a linear regression are identical, it is likely that we would assume that the underlying datasets are very similar or identical. Perhaps they even describe the same individuals, but are just presented in different orders?
As this is an example deliberately constructed to make a point, the likely conclusion turns out to be wrong. In fact, the datasets are not identical, but show some systematic differences. Importantly, this is easy to see when visualizing the raw data points. Figure 4.1 contains four scatterplots that show the x-y coordinates of each subset as points:
Inspecting the scatterplots of Figure 4.1 allows us to see what is going on here:
The four sets differ in many aspects, but are constructed in a way that their summary statistics are identical.
The data used in this particular example is known as Anscombe’s quartet (Anscombe, 1973) and is included in R as anscombe in the datasets package (R Core Team, 2021). (The get_set() function of ds4psy only extracts each subset as a data frame.)
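As a sketch of how the comparison of all four sets could be automated, the following base R snippet computes the summary statistics directly from the built-in anscombe data (rather than the a_1 to a_4 objects above; the column layout and rounding are choices made here for illustration):

```r
# Compute the summary statistics for all 4 subsets of anscombe:
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]  # columns x1..x4
  y <- anscombe[[paste0("y", i)]]  # columns y1..y4
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    r_xy = cor(x, y))
})
colnames(stats) <- paste0("set_", 1:4)
round(stats, 2)  # all four columns are (near) identical
```

Printing the rounded matrix shows one column per set, with matching means, standard deviations, and correlations across all four.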
More generally, Figure 4.1 illustrates that detecting and seeing similarities or differences crucially depends on the kinds of measures or visualizations used: Whereas some ways of describing the data reveal similarities (e.g., means, or linear regression curves), others reveal differences (e.g., the distribution of raw data points).11
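For readers who want to reproduce a panel like Figure 4.1 themselves, here is a minimal base R sketch (using the built-in anscombe data; the 2 x 2 layout, point symbol, and panel titles are assumptions, not the book's original code):

```r
# Plot the 4 subsets of anscombe as a 2 x 2 panel of scatterplots:
opar <- par(mfrow = c(2, 2))  # arrange plots in a 2 x 2 grid
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, pch = 19, main = paste0("Set ", i))
  abline(lm(y ~ x))  # add the (identical) regression line
}
par(opar)  # restore previous plotting parameters
```

Despite the four panels looking strikingly different, the abline() call draws essentially the same regression line in each of them.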
4.2.2 Evaluating visualizations
Having accepted that visualizations can be useful, we can shift our attention from justifying to judging visualizations. This raises a new issue:
- How can we create good visualizations?
Although it is relatively simple to spot flaws or misleading elements in many visualizations, providing general principles for good visualizations is challenging.
The lack of formal theory bedevils good graphics.
The only way to make progress is through training in principles and through experience in practice.
Quite often, creating a visualization seems to proceed in a trial-and-error fashion. A data analyst starts by selecting some type of visualization and then adds and tweaks features until a satisfying result is obtained.
The most salient (or “obvious”) features of graphs are their aesthetic properties:
- sizes and widths
- encoding and file formats
Because a visualization is an artificial object designed to serve particular purposes, good questions for evaluating it include:
- What is the message and the audience for which this visualization was designed?
- How does it convey its message? Which aesthetic features does it use? Which perceptual or cognitive operations does it require or enable?
- How successful is it? Are there ways in which its functionality could be improved?
Thus, evaluating a visualization depends not only on the visualization itself, but also on its relation to the data, its message, and its audience.
A matter of ER: the match of a visualization to a particular task (message) and person (audience, viewer). Again, there are multiple levers for effecting change (e.g., design or training).
The good, the bad and the ugly
It is often easier to detect and enumerate mistakes than to provide a set of principles that leads to good visualizations. Hence, it helps to distinguish various ways of being bad.
Chartjunk and other crimes
Reduce non-data ink and chartjunk (see Wikipedia).
The term chartjunk was coined by Edward Tufte in his book The Visual Display of Quantitative Information (1983; 2nd ed. 2001). Tufte wrote:
The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new. The purpose of decoration varies – to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.
Edward R. Tufte (1983)
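To make the idea of reducing non-data ink concrete, the following base R sketch contrasts a default scatterplot with a reduced-ink version of the same data (using the built-in mtcars data; the variable choice and the specific axis() settings are illustrative assumptions):

```r
# Contrast a default plot with a reduced-ink version of the same data:
opar <- par(mfrow = c(1, 2))  # show both versions side by side
x <- mtcars$wt   # car weight
y <- mtcars$mpg  # miles per gallon
# Default plot (full box and axis lines):
plot(x, y, main = "Default", xlab = "Weight", ylab = "MPG")
# Reduced-ink version (no box, lighter axes):
plot(x, y, main = "Reduced ink", xlab = "Weight", ylab = "MPG",
     pch = 16, bty = "n", axes = FALSE)
axis(1, lwd = 0, lwd.ticks = 1)           # x-axis: ticks only, no axis line
axis(2, lwd = 0, lwd.ticks = 1, las = 1)  # y-axis: horizontal labels
par(opar)  # restore previous plotting parameters
```

Both panels convey the same data; the second simply spends less ink on elements that carry no information.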
4.2.3 Types of graphs
There are many different types of visualizations. Many compare or contrast values on one dimension; others combine multiple variables.
Is there a taxonomy of visualizations by task and type of comparison?
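As a preview of such distinctions, the following sketch shows four common graph types in base R (using the built-in mtcars data; the choice of variables is arbitrary and only serves to illustrate each type):

```r
# Four basic graph types for different tasks and comparisons:
opar <- par(mfrow = c(2, 2))
hist(mtcars$mpg, main = "Histogram", xlab = "mpg")  # distribution of 1 variable
boxplot(mpg ~ cyl, data = mtcars, main = "Boxplots",
        xlab = "cyl", ylab = "mpg")                 # 1 variable across groups
barplot(table(mtcars$cyl), main = "Bar plot",
        xlab = "cyl")                               # counts per category
plot(mtcars$wt, mtcars$mpg, main = "Scatterplot",
     xlab = "wt", ylab = "mpg")                     # relation of 2 variables
par(opar)
```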
4.2.4 Plotting in base R
Basic plots (Section 4.3) use pre-packaged plotting functions for specific visualizations:
- Hands-on instructions on visualizing data in base R.
- Distinguishing between different types of visualizations.
- Adding aesthetics: Color, shape, size, etc.
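A minimal sketch of what such a basic plot with aesthetic settings can look like (col, pch, and cex are standard base R graphical parameters; the data here is made up for illustration):

```r
# A basic scatterplot with some aesthetic settings:
x <- 1:10
y <- x^2
plot(x, y,
     col = "steelblue",  # color of points
     pch = 16,           # shape: filled circles
     cex = 1.5,          # size of points
     main = "Adding aesthetics", xlab = "x", ylab = "y")
```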
Complex plots (Section 4.4) allow combining various elements into more elaborate visualizations:
- Composing plots as programs/scripts
- Preparing the canvas prior to plotting
- Adding elements
- Adjusting plotting parameters
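The steps just listed can be sketched as a short base R script (the coordinates and labels here are arbitrary placeholders, not an example from the book):

```r
# Composing a plot as a script:
xs <- c(2, 5, 8)
ys <- c(3, 7, 5)
opar <- par(mar = c(4, 4, 2, 1))     # 1. adjust plotting parameters
plot(0, 0, type = "n",               # 2. prepare an empty canvas
     xlim = c(0, 10), ylim = c(0, 10),
     xlab = "x", ylab = "y", main = "A composed plot")
points(xs, ys, pch = 21, bg = "gold", cex = 2)  # 3. add points
lines(xs, ys, lty = 2)                          #    add lines
text(5, 9, labels = "An added text label")      #    add text
par(opar)                            # restore previous parameters
```

Because each element is added by a separate function call, such scripts can be built up and tweaked incrementally.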
Two important constraints for this chapter are:
- Visualizations often require transforming data into a specific format or shape. We ignore this here, but will return to this topic when wrangling data.
- We occasionally use colors, but do not cover how to specify and select them in R. Where colors are needed, we use colors and color functions from the unikn package (see Appendix D: Using colors for details).
A deeper point made by this example is that assessments of similarity are always relative to some dimension or standard: The sets are similar or even identical with respect to their summary statistics, but different with respect to their x-y-coordinates. Whenever the dimension or standard of comparison is unknown, a statement of similarity is ill-defined.↩︎