3.2 Reflections

Goal of this section: Before getting our hands dirty (not with paint, but with graphical code), we want to reflect on some ways of justifying and evaluating visualizations.

3.2.1 Why visualize?

The urge to visualize data is not a new phenomenon. Although our ability to generate visualizations has greatly increased with the ubiquity of computers, people have always drawn sketches and diagrams to understand natural and statistical phenomena (Friendly, 2008).

But why should we visualize data? Most people intuitively assume that visualizations help our understanding of data by illustrating or emphasizing certain aspects, facilitating comparisons, or clarifying patterns that would otherwise remain hidden. Precisely justifying why visualizations may have these benefits is much harder (see Streeb, El-Assady, Keim, & Chen, 2019, for a comprehensive analysis). And it would be naive to assume that visualizations are always helpful or appropriate. Instead, we can easily think of potential problems caused by visualizations and claim that they frequently distract from important aspects, facilitate misleading comparisons, or obscure and hide patterns in data. Thus, visualizations are representations that can be good or bad on many levels. Creating effective visualizations requires a mix of knowledge and skills that include aspects of human perception, psychology, design, and technology.

An example

Whenever theoretical justifications are hard, we can use an example that proves the existence of cases in which visualizations help:

# Get 4 sets of data:
a_1 <- ds4psy::get_set(1)
a_2 <- ds4psy::get_set(2)
a_3 <- ds4psy::get_set(3)
a_4 <- ds4psy::get_set(4)

Examine sets:

# Inspect a_1:
a_1         # print data frame
#>      x     y
#> p01 10  8.04
#> p02  8  6.95
#> p03 13  7.58
#> p04  9  8.81
#> p05 11  8.33
#> p06 14  9.96
#> p07  6  7.24
#> p08  4  4.26
#> p09 12 10.84
#> p10  7  4.82
#> p11  5  5.68
dim(a_1)    # a table with 11 cases (rows) and 2 variables (columns) 
#> [1] 11  2
str(a_1)    # see table structure
#> 'data.frame':    11 obs. of  2 variables:
#>  $ x: num  10 8 13 9 11 14 6 4 12 7 ...
#>  $ y: num  8.04 6.95 7.58 8.81 8.33 ...
names(a_1)  # names of the 2 columns (variables)
#> [1] "x" "y"

Obtaining basic statistics:

For the first subset a_1, the corresponding values are as follows:

# Analyzing a_1:
mean(a_1$x)  # mean of x
#> [1] 9
mean(a_1$y)  # mean of y
#> [1] 7.500909

sd(a_1$x)    # SD of x
#> [1] 3.316625
sd(a_1$y)    # SD of y
#> [1] 2.031568

cor(x = a_1$x, y = a_1$y)  # correlation between x and y
#> [1] 0.8164205
lm(y ~ x, a_1)             # linear model/regression: y by x
#> 
#> Call:
#> lm(formula = y ~ x, data = a_1)
#> 
#> Coefficients:
#> (Intercept)            x  
#>      3.0001       0.5001

Computing the same statistics for all four subsets yields nearly identical values (see the overview in Table 3.1):

Table 3.1: Summary statistics of the four subsets (rounded to two decimals).
nr   n  mn_x  mn_y  sd_x  sd_y  r_xy  intercept  slope
 1  11     9   7.5  3.32  2.03  0.82          3    0.5
 2  11     9   7.5  3.32  2.03  0.82          3    0.5
 3  11     9   7.5  3.32  2.03  0.82          3    0.5
 4  11     9   7.5  3.32  2.03  0.82          3    0.5
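
The statistics in Table 3.1 can be reproduced with a short loop over the four sets. The following is only a minimal sketch: it assumes that the four sets correspond to Anscombe's quartet as shipped with base R in datasets::anscombe (the values of a_1 shown above match its x1 and y1 columns), and the helper summarize_set() is a hypothetical name introduced here for illustration:

```r
# Assumption: the four sets equal Anscombe's quartet, available in base R
# as datasets::anscombe (columns x1..x4 and y1..y4).
sets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})

# Hypothetical helper: returns one row of summary statistics for a set.
summarize_set <- function(df, nr) {
  fit <- lm(y ~ x, data = df)  # linear regression of y on x
  data.frame(nr = nr,
             n = nrow(df),
             mn_x = mean(df$x), mn_y = mean(df$y),
             sd_x = sd(df$x),   sd_y = sd(df$y),
             r_xy = cor(df$x, df$y),
             intercept = unname(coef(fit)[1]),
             slope     = unname(coef(fit)[2]))
}

# Apply the helper to each set and stack the rows into one table:
tab <- do.call(rbind, Map(summarize_set, sets, 1:4))
print(round(tab, 2))  # rounded to two decimals, as in Table 3.1
```

Building the table from one function, rather than repeating the calls for each set, makes it easy to verify that every statistic was computed in the same way.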

Figure 3.1 shows four scatterplots that show the x-y coordinates of each subset:

Figure 3.1: Scatterplots of the four subsets.
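
A figure of this kind could be drawn with base R graphics. The sketch below is not necessarily how Figure 3.1 was created: it again assumes that the four sets match datasets::anscombe, and the choices of symbols, shared axis limits, and dashed regression lines are illustrative:

```r
# Assumption: the four sets match Anscombe's quartet (datasets::anscombe).
# Shared axis limits make the four panels directly comparable:
x_lim <- range(unlist(anscombe[paste0("x", 1:4)]))
y_lim <- range(unlist(anscombe[paste0("y", 1:4)]))

op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))  # 2 x 2 grid of panels
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, xlim = x_lim, ylim = y_lim, pch = 16,
       main = paste("Set", i))
  abline(lm(y ~ x), lty = 2)  # add the linear regression line
}
par(op)  # restore the previous graphical parameters
```

Using the same limits in all panels is a deliberate design choice: differences between panels then reflect differences in the data, not in the axes.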

Having considered why we visualize data, we now shift to evaluating visualizations: from justifying to judging.

3.2.2 Evaluating visualizations

How can we create good graphical displays?

The lack of formal theory bedevils good graphics.
The only way to make progress is through training in principles and through experience in practice.

(Unwin, 2008, p. 77)

Quite often, creating a visualization seems to proceed in a trial-and-error fashion. A data analyst starts by selecting some type of visualization and then adds and tweaks features until a satisfying result is obtained.

The most salient (or “obvious”) features of graphs are their aesthetic properties:

  • colors
  • shapes
  • sizes and widths
  • encoding and file formats
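
As a rough illustration of how such features appear in code, the following sketch sets them via standard arguments of base R graphics (col, pch, cex, lwd) and a pdf() device. The data are assumed to be the x1 and y1 columns of datasets::anscombe, and the file name is hypothetical:

```r
# Hypothetical output file in a temporary directory:
out_file <- file.path(tempdir(), "aesthetics_demo.pdf")

pdf(out_file, width = 6, height = 4)  # file format: PDF of 6 x 4 inches
plot(anscombe$x1, anscombe$y1,
     col = "steelblue",   # color of symbols
     pch = 17,            # shape (filled triangle)
     cex = 1.5,           # size of symbols
     xlab = "x", ylab = "y")
abline(lm(y1 ~ x1, data = anscombe),
       col = "firebrick", # color of line
       lwd = 2)           # width of line
dev.off()                 # close the device to write the file
```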

As a visualization is an artificial object that is designed to serve particular purposes, good questions for evaluating it include:

  • What is the message and the audience for which this visualization was designed?
  • How does it convey its message? Which aesthetic features does it use? Which perceptual or cognitive operations does it require or enable?
  • How successful is it? Are there ways in which its functionality could be improved?

Thus, evaluating a visualization depends not only on the visualization itself, but also on its relation to the data, its message, and its audience.

This is a matter of ER: the match of a visualization to a particular task (message) and person (audience, viewer). Again, there are multiple levers for effecting change (e.g., design or training).

The good, the bad and the ugly

It is often easier to detect and enumerate mistakes than to provide a set of principles that lead to good visualizations. We can distinguish various ways of being bad.

Chartjunk and other crimes

Reduce non-data ink and chartjunk (see Wikipedia).

The term chartjunk was coined by Edward Tufte in his book The Visual Display of Quantitative Information (1983; 2nd edition 2001). Tufte wrote:

The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new.
The purpose of decoration varies – to make the graphic appear more scientific and precise,
to enliven the display, to give the designer an opportunity to exercise artistic skills.
Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.

Edward R. Tufte (1983)

The notion is similar to Adolf Loos’s claim that — in architecture — ornament is a crime (see Wikipedia) and the design principle that form follows function (see Wikipedia).
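
The idea of reducing non-data ink can be sketched by drawing the same data twice. In the example below, the data (the x1 and y1 columns of datasets::anscombe) and all styling choices are assumptions made for illustration; what counts as chartjunk in a real graph remains a matter of judgment:

```r
op <- par(mfrow = c(1, 2))  # two panels side by side

# Left: heavy decoration that adds no information about the data
plot(anscombe$x1, anscombe$y1, pch = 21, bg = "yellow", cex = 2,
     xlab = "x", ylab = "y", main = "More non-data ink")
grid(col = "black", lty = 1)  # dominant grid lines
box(lwd = 4)                  # thick frame around the plot

# Right: same data, with the frame removed and no grid
plot(anscombe$x1, anscombe$y1, pch = 16,
     bty = "n",  # no box around the plot
     las = 1,    # horizontal axis labels
     xlab = "x", ylab = "y", main = "Less non-data ink")

par(op)  # restore the previous graphical parameters
```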

3.2.3 Types of graphs

There are many different types of graphs. Many compare or contrast values on one dimension; others combine multiple variables.

A taxonomy of graphs can be organized by tasks and types of comparisons.