2.1 Introduction

“Above all else show the data.”
Edward R. Tufte (2001)

2.1.1 Motivation

Why should we visualize data? Consider the following example (which you can copy and evaluate to follow along):

# Load data:
ans <- tibble::as_tibble(with(datasets::anscombe, data.frame(x = c(x1, x2, x3, x4), 
                                                             y = c(y1, y2, y3, y4), 
                                                             nr = gl(4, nrow(anscombe)))))

# Split data into 4 subsets:
a_1 <- ans[ans$nr == 1, 1:2]
a_2 <- ans[ans$nr == 2, 1:2]
a_3 <- ans[ans$nr == 3, 1:2]
a_4 <- ans[ans$nr == 4, 1:2]

This code creates a data table (or “tibble”) ans and splits it into 4 separate subsets (a_1 to a_4). We can use our basic knowledge of R to examine the 4 subsets, starting with a_1:

# Checking a_1:
dim(a_1)    # a table with 11 rows (cases) and 2 columns (variables)
#> [1] 11  2
names(a_1)  # names of the 2 columns (variables)
#> [1] "x" "y"

Checking a_2, a_3 and a_4 in the same way reveals that all 4 subsets have the same shape (i.e., each object contains 11 rows and 2 columns) and the same variable names.
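
Rather than repeating these checks by hand for each subset, all 4 can be inspected in one pass. The following is only a minimal sketch: the names `subsets`, `dims`, and `vars` are our own, and a base R data frame is used instead of a tibble to keep the example free of package dependencies.

```r
# Rebuild the long-format data with base R only (same layout as ans above):
ans <- with(datasets::anscombe,
            data.frame(x  = c(x1, x2, x3, x4),
                       y  = c(y1, y2, y3, y4),
                       nr = gl(4, nrow(datasets::anscombe))))

# Split into a list of 4 data frames (one per level of nr), keeping only x and y:
subsets <- split(ans[, c("x", "y")], ans$nr)

# Dimensions and variable names of every subset:
dims <- sapply(subsets, dim)    # matrix: row 1 = number of rows, row 2 = number of columns
vars <- lapply(subsets, names)  # list with the column names of each subset

dims
vars
```

Each column of `dims` describes one subset, so differences in shape between the subsets would show up as differing columns.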

Basic statistics

What do psychologists normally do to understand data? One possible answer is: They use statistics to summarize and understand it. Consequently, we could compute some basic statistical properties for each set:

  • the averages of both variables (e.g., means of \(x\) and of \(y\));
  • their standard deviations (\(SD\) of \(x\) and of \(y\));
  • the correlations between \(x\) and \(y\);
  • a linear model (predicting \(y\) by \(x\)).

For a_1 the resulting values are as follows:

# Analyzing a_1:
mean(a_1$x)  # mean of x
#> [1] 9
mean(a_1$y)  # mean of y
#> [1] 7.500909

sd(a_1$x)    # SD of x
#> [1] 3.316625
sd(a_1$y)    # SD of y
#> [1] 2.031568

cor(x = a_1$x, y = a_1$y)  # Correlation between x and y
#> [1] 0.8164205
lm(y ~ x, a_1)             # Linear model/regression: y by x
#> 
#> Call:
#> lm(formula = y ~ x, data = a_1)
#> 
#> Coefficients:
#> (Intercept)            x  
#>      3.0001       0.5001

Practice

Compute the same measures for the other three subsets (a_2 to a_4). Do they seem similar or different?

Same stats, but…

Table 2.1 shows that, apart from minor discrepancies (in the 3rd decimal place), all 4 subsets have the same statistical properties:

Table 2.1: Statistics of the 4 subsets (rounded to 2 decimals).

nr   n  mn_x  mn_y  sd_x  sd_y  r_xy  intercept  slope
 1  11     9   7.5  3.32  2.03  0.82          3    0.5
 2  11     9   7.5  3.32  2.03  0.82          3    0.5
 3  11     9   7.5  3.32  2.03  0.82          3    0.5
 4  11     9   7.5  3.32  2.03  0.82          3    0.5
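
A table like Table 2.1 could be computed along the following lines. This is a sketch rather than the code used to build the table: the helper `describe_subset()` and the object `tab` are hypothetical names introduced for illustration, and only base R is used.

```r
# Long-format data (same layout as ans above, as a base R data frame):
ans <- with(datasets::anscombe,
            data.frame(x  = c(x1, x2, x3, x4),
                       y  = c(y1, y2, y3, y4),
                       nr = gl(4, nrow(datasets::anscombe))))

# Hypothetical helper: summary statistics for one subset d (with columns x and y)
describe_subset <- function(d) {
  fit <- lm(y ~ x, data = d)  # linear model predicting y by x
  data.frame(n         = nrow(d),
             mn_x      = mean(d$x),
             mn_y      = mean(d$y),
             sd_x      = sd(d$x),
             sd_y      = sd(d$y),
             r_xy      = cor(d$x, d$y),
             intercept = unname(coef(fit)[1]),
             slope     = unname(coef(fit)[2]))
}

# Apply the helper to each subset and stack the results (row names = nr):
tab <- do.call(rbind, lapply(split(ans, ans$nr), describe_subset))

round(tab, 2)  # rounded to 2 decimals, as in Table 2.1
```

Printing `tab` without rounding reveals the small discrepancies between the subsets that rounding hides.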

Does this imply that the 4 subsets (a_1 to a_4) are identical? Although our evidence so far may suggest so, this impression is deceiving. In fact, each of the 4 subsets contains different values. This could be discovered by examining the actual values of all 4 subsets (e.g., by printing and comparing the 4 tables). By contrast, visually inspecting the data makes it far easier to see what is really going on. Figure 2.2 shows 4 scatterplots that show the x-y coordinates of each subset:

Figure 2.2: Scatterplots of the 4 subsets in the anscombe dataset.
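
A figure of this kind could be drawn with ggplot2 as sketched below. This is not necessarily how Figure 2.2 was produced: the choice of one panel per subset (via `facet_wrap()`), the added regression lines, and the title are our own assumptions, and the ggplot2 package must be installed.

```r
library(ggplot2)

# Long-format data (same layout as ans above, as a base R data frame):
ans <- with(datasets::anscombe,
            data.frame(x  = c(x1, x2, x3, x4),
                       y  = c(y1, y2, y3, y4),
                       nr = gl(4, nrow(datasets::anscombe))))

# One scatterplot per subset, each with its own linear regression line:
p <- ggplot(ans, aes(x = x, y = y)) +
  geom_point() +                                             # raw data points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # fitted line per panel
  facet_wrap(~ nr) +                                         # 4 panels, one per level of nr
  labs(title = "The 4 subsets of the anscombe data", x = "x", y = "y")

p  # print the plot
```

Because the regression line is fitted separately within each panel, the 4 lines look nearly the same while the point patterns around them differ.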

This example illustrates that highly similar statistics can stem from very different datasets. Consequently, we should never rely on a statistical analysis alone. Statistics is a useful tool for testing scientific hypotheses, but to fully understand data, we should always strive to visualize it in meaningful ways.

2.1.2 Objectives

After working through this chapter, you should be able to:

  1. explain why we should always aim for a transparent visualization of data;
  2. know the basic structure of a ggplot command;
  3. distinguish between various types of plots (e.g., scatterplots, histograms, bar plots, line graphs);
  4. use different geoms to create these plots;
  5. adjust the aesthetic properties (e.g., colors, shapes) of plots;
  6. adjust the axes, legends, and titles of plots.

2.1.3 Getting ready

This chapter formerly assumed that you have read and worked through Chapter 3: Data visualization of the r4ds textbook (Wickham & Grolemund, 2017). It can now be read on its own, but reading Chapter 3 of r4ds is still recommended.

Based on this background, we examine some essential commands of the ggplot2 package in the context of examples. Please do the following to get started:

  • Create an R script (.R) or an R Markdown (.Rmd) document (see Appendix E and the templates linked in Section E.2) and load the R packages of the tidyverse and the ds4psy package.

  • Structure your document by inserting headings, meaningful comments, and empty spaces between different parts. Here’s an example that shows how your initial file could look:

## Visualizing data | ds4psy
## Your Name | 2020 March 24
## ----------------------------

## Preparations: ----------

library(tidyverse)
library(ds4psy)

## 1. Topic: ----------

# etc.

## End of file (eof). ----------

  • Save your file (e.g., as 02_visualize.R or 02_visualize.Rmd in the R folder of your current project) and remember to save it regularly as you keep adding content to it.

To learn to use the ggplot2 package, we first need to understand the structure of ggplot() calls.

References

Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Cheshire, CT: Graphics Press.

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz