9 Visualize with ggplot2

This chapter introduces data visualization with the R package ggplot2 (Wickham, Chang, et al., 2023).

Please note: As this chapter is still incomplete, please study Chapter 2: Visualizing data of the ds4psy book (Neth, 2023a) instead.

Preflections


  • What are common elements of visualizations?

  • What is the relation between data and those (functional) elements?

  • What are aesthetic features of visualizations?

9.1 Introduction

The ggplot2 package (Wickham, Chang, et al., 2023) and the corresponding book ggplot2: Elegant graphics for data analysis (Wickham, 2016) provide an implementation of The Grammar of Graphics (Wilkinson, 2005), which develops a systematic way of thinking about data visualization (or a language and philosophy of it). The notion of a grammar is one that we are familiar with (e.g., from studying a foreign language), but its exact meaning remains difficult to define. Wilkinson (2005) notes that a grammar provides the rules that make languages expressive. Knowing the grammar of a language allows us to combine elementary concepts into well-formed sentences. Similarly, learning the grammar of graphics will allow us to turn data into visualizations.

Learning how to use ggplot2 is, just like learning R, a journey rather than a destination. Hence, we should not be surprised if some concepts remain somewhat obscure for a while. Fortunately, there is no need to understand everything about ggplot() to create awesome visualizations with it.

9.1.0.1 Terminology

Distinguish between ggplot2 and ggplot():

  • ggplot and ggplot2 denote R packages (here: ggplot2 in version 3.4.4), whereas

  • ggplot() is the main function of those packages that generates a visualization.

Beyond this technical distinction, the grammar of graphics includes many new terms:

  • mapping variables to parameters (e.g., axes, groups)
  • distinguishing a range of geoms (e.g., areas, bars, lines, points)
  • specifying aesthetic features (e.g., colors, fonts) and descriptive elements (e.g., caption, labels, legend, titles)
  • combining multiple geoms into layers and viewing different facets of a visualization

9.1.1 Contents

9.1.2 Data and tools

This chapter primarily uses the ggplot2 package:
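A minimal sketch of the setup that the examples in this chapter assume. The call itself is an assumption (it is not shown here), and it presumes that ggplot2 has already been installed (e.g., via install.packages("ggplot2")):

```r
# Load the ggplot2 package (assumes it is installed):
library(ggplot2)

# Check which version is in use:
packageVersion("ggplot2")
```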

Additionally, we use data from the datasets and ggplot2 packages.

9.2 Essentials of ggplot2

Note basic structures of ggplot() expressions and explain corresponding elements:

Mapping data to visual elements

When creating visualizations, the main task that beginners tend to struggle with is defining the mapping between data and elements of the visualization. The term mapping is a relational concept that essentially specifies what goes where. The what part typically refers to some part of the data, whereas the where part refers to some aspect or part of the visualization.
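As a small illustration of the what-goes-where idea, the following sketch creates a mapping with aes() and inspects it. The variables displ and hwy belong to the mpg data of ggplot2; the object name my_map is only chosen for this example:

```r
library(ggplot2)

# A mapping pairs visual parameters (where) with variables (what):
my_map <- aes(x = displ, y = hwy)

names(my_map)  # the where part: "x" and "y"
my_map         # printing shows which variable goes where
```

Note that creating a mapping does not yet touch any data: the variable names are only looked up once the mapping is combined with a data set in a plot.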

9.2.1 Minimal template

A minimal template of a ggplot() command can be reduced to the following structure:

# Minimal ggplot template:
ggplot(<DATA>) +              # 1. specify data set to use
  <GEOM_fun>(aes(<MAPPING>))  # 2. specify geom + mappings

The minimal template includes the following elements:

  • <DATA> is a data frame or tibble that contains the data that is to be plotted.

  • <GEOM_fun> is a function that maps data to a geometric object (“geom”) according to an aesthetic mapping that is specified in aes(<MAPPING>). (A mapping specifies a relation between two entities. Here, the mapping specifies the correspondence of variables to graphical elements, i.e., what goes where.)

  • A geom’s visual appearance (e.g., colors, shapes, sizes, …) can be customized

    1. in the aesthetic mapping (when varying visual features according to data properties), or
    2. by setting arguments of the geom function to specific values, as in <arg_1 = val_1, ..., arg_n = val_n>, outside of aes() (when they remain constant).

Data format

The <DATA> provided to the data argument of the ggplot() function must be a rectangular table (i.e., a data.frame or tibble). Beyond this basic type, ggplot() assumes that the data is formatted in a specific way (using factors and in so-called “long format”). Essentially, this format ensures that some variables describe or provide handles on the values of others. At this point, we do not need to worry about this and just work with existing sets of data that happen to be in the right shape.
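To illustrate what such a long format might look like, the following sketch converts a small wide table into a long one with the base R function stack(). The values and the names wide, long, value, and condition are made up for this example:

```r
# A small "wide" table: one column per condition (made-up values)
wide <- data.frame(a = c(1, 2, 3),
                   b = c(4, 5, 6))

# The same values in "long" format: one column of values,
# plus a factor that describes which condition each value belongs to
long <- stack(wide)
names(long) <- c("value", "condition")
long

is.factor(long$condition)  # the describing variable is a factor
```

In the long table, the factor condition provides a handle on the numbers in value, which is what allows a single variable to be mapped to an axis, a color, or a facet.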

Geoms

Basic types of visualizations in ggplot2 involve so-called geometric objects (geoms), which are accessed via functions <GEOM_fun>.

As we began with histograms and scatterplots in Chapter 8, we will begin with geom_histogram() and geom_point(). However, as ggplot2 currently contains over 50 different geoms, these provide only an introductory glimpse of the available options.

Histograms

In Chapter 8, we created our first histogram for a vector of numeric values x as follows:

x <- rnorm(n = 500, mean = 100, sd = 10)
hist(x)

In this example, the data x consisted of a single vector. However, as ggplot() requires its data to be in tabular form, we use data.frame() to convert it into a data frame with one variable x:

# Convert vector x into df:
df <- data.frame(x)
head(df)
#>           x
#> 1 116.13249
#> 2  99.97044
#> 3 109.64114
#> 4  93.20812
#> 5  70.96911
#> 6  95.42659

Now we can fill in the minimal template and use the geom_histogram() function for creating a histogram.

ggplot(data = df) + 
  geom_histogram(aes(x = x))

It is interesting to study the commonalities and differences of the two basic histograms created. Using ggplot() seems a little harder than hist(x), but embeds creating a histogram in a visualization framework that is much more flexible and powerful than the hist() function.

Studying the documentation of geom_histogram() (and geom_bar(), on which geom_histogram() is based) reveals hidden complexity. A difficulty is that both geoms note x and y as required aesthetics, but we succeeded by only providing x. The reason for this lies in the fact that the y-values of our histogram were automatically computed by a default argument stat = "bin". For continuous variables, this counts the number of values that are within a specific interval (a so-called bin). Internally, the histogram above is actually computed as a bar chart:

ggplot(data = df) + 
  geom_bar(aes(x = x), stat = "bin", position = "stack")

Whereas some visualizations merely show existing data values, many visualizations first need to compute something (e.g., count the frequency of values within a specific interval). Specifying how to compute what is the purpose of the stat function. The relation between geoms and their corresponding stat functions is probably the trickiest part of ggplot2. As long as we are using geoms with their default statistics, we do not need to worry about stat. But when visualizations go wrong with ggplot2, it is often due to a mismatch between a geom and a stat function.
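To make the computation behind a histogram visible, the following sketch uses the ggplot2 helper function layer_data() to extract the values that the stat computed for a layer. The seed, the number of bins, and the object names p and binned are arbitrary choices for this example:

```r
library(ggplot2)

# Data (as above, but with a fixed seed for reproducibility):
set.seed(1)
df <- data.frame(x = rnorm(n = 500, mean = 100, sd = 10))

# A histogram with 20 bins:
p <- ggplot(data = df) +
  geom_histogram(aes(x = x), bins = 20)

# Extract the data computed by the stat for the first layer:
binned <- layer_data(p, 1)
head(binned[, c("xmin", "xmax", "count")])

# The counts of all bins add up to the number of observations:
sum(binned$count) == nrow(df)
```

Each row of binned describes one bar: the stat has turned 500 raw values into 20 intervals with their counts, and the geom merely draws these.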

Studying ?geom_histogram() also reveals additional parameters that we can use to adjust our histogram. Whereas hist() used a breaks parameter to adjust the number of categories, geom_histogram() provides two corresponding parameters (called bins and binwidth). The following examples also specify some colors (by setting col and fill to named R colors):

ggplot(data = df) + 
  geom_histogram(aes(x = x), 
                 bins = 10, col = "black", fill = "deepskyblue")

ggplot(data = df) + 
  geom_histogram(aes(x = x), 
                 binwidth = 10, col = "white", fill = "hotpink")

Note that we specified all additional parameters (numeric values for bins or binwidth, and the color values of col and fill) outside of the aes() function. As we will see, it is sometimes possible to use parameters inside of aes(), but then they are used as variables, rather than as constants (i.e., fixed values).
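The difference between setting a constant and mapping a variable can be sketched with the mpg data of ggplot2. The choice of the variables hwy and drv, the binwidth, and the names p_const and p_var are assumptions made for this example:

```r
library(ggplot2)

# Constant: fill is set outside of aes(), so all bars share one color
p_const <- ggplot(data = mpg) +
  geom_histogram(aes(x = hwy),
                 binwidth = 2, fill = "deepskyblue")

# Variable: fill is mapped inside aes(), so colors vary by the levels of drv
p_var <- ggplot(data = mpg) +
  geom_histogram(aes(x = hwy, fill = drv),
                 binwidth = 2)

p_const
p_var
```

In the second plot, ggplot2 chooses the fill colors and adds a legend automatically, because the colors now carry information about the data.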

Scatterplots

In Chapter 8, we created a scatterplot for two vectors of numeric values x and y as follows:

# Data:
x <- 11:43
y <- c(sample(5:15), sample(10:20), sample(15:25))

# Scatterplot (of points):
plot(x = x, y = y,
     main = "A positive correlation")

Using ggplot2:

df <- data.frame(x, y)

ggplot(df) + 
  geom_point(aes(x = x, y = y))

A scatterplot from the mpg data:

In base R (with a transparent color my_col, defined using the unikn package):

# Define some color (from unikn, with transparency): 
my_col <- unikn::usecol(unikn::Bordeaux, alpha = 1/4)

# With aesthetics (see ?par):
plot(x = mpg$displ, y = mpg$hwy, type = "p", 
     col = my_col, pch = 16, cex = 1.5,
     main = "A basic scatterplot", 
     xlab = "Displacement", ylab = "MPG on highway"
     )

We usually want to add labels and titles, as well as modify other aesthetic features of visualizations.

Using ggplot():

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), size = 2, col = my_col) +
  labs(x = "Displacement", y = "MPG on highway", 
       title = "A basic scatterplot")

Note:

  • Option for grouping cases/observations via mappings
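A possible sketch of such a grouping: mapping the categorical variable class of the mpg data to the color aesthetic groups the points by type of car. The choice of class as the grouping variable and the name p_grp are assumptions made for this example:

```r
library(ggplot2)

# Group observations by mapping a categorical variable to color:
p_grp <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class), size = 2) +
  labs(x = "Displacement", y = "MPG on highway",
       color = "Class of car",
       title = "A scatterplot with grouped points")

p_grp
```

As color now appears inside of aes(), it is treated as a variable: each level of class receives its own color and the plot gains a legend.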

9.2.2 Generic template

A generic template for creating a visualization with additional bells and whistles has the following structure:

# Generic ggplot template: 
ggplot(data = <DATA>) +                 # 1. specify data set to use
  <GEOM_fun>(mapping = aes(<MAPPING>),  # 2. specify geom + mappings 
             <arg_1 = val_1, ...>) +    # - optional arguments to geom
  ...                                   # - additional geoms + mappings
  <FACET_fun> +                         # - optional facet function
  <LOOK_GOOD_fun>                       # - optional themes, colors, labels...

The generic template includes the following elements (beyond the <DATA> and <GEOM_fun> of the minimal template):

  • An optional <FACET_fun> uses one or more variable(s) to split a complex plot into multiple subplots.

  • A sequence of optional <LOOK_GOOD_fun> functions adjusts the visual features of plots (e.g., by adding titles and text labels, color scales, plot themes, or setting coordinate systems).
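One way of filling in the generic template is sketched below, using the mpg data of ggplot2. The particular facet variable (drv), the transparency value, the labels, and the theme are arbitrary choices for this example:

```r
library(ggplot2)

p_gen <- ggplot(data = mpg) +                      # 1. data set
  geom_point(mapping = aes(x = displ, y = hwy),    # 2. geom + mappings
             alpha = 1/3, size = 2) +              # - constant arguments to geom
  facet_wrap(~drv) +                               # - facet function (one panel per drv)
  labs(x = "Displacement", y = "MPG on highway",   # - labels
       title = "Highway mileage by displacement and drive train") +
  theme_bw()                                       # - theme

p_gen
```

Each line after the geom is optional: removing facet_wrap(), labs(), or theme_bw() still yields a valid plot, which makes it easy to build up a visualization step by step.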

Additional topics:

  • Aesthetic properties and coordinate systems
  • Combining multiple geoms into layers
  • Different facets of a plot

When using multiple geoms (in layers): Specify common mappings globally, rather than locally.
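The following sketch contrasts both ways of specifying mappings for two layers (points and a linear trend line). The use of geom_smooth() with method = "lm" and the names p_local and p_global are assumptions made for this example:

```r
library(ggplot2)

# Local mappings: repeated in every geom
p_local <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy),
              method = "lm", formula = y ~ x)

# Global mapping: specified once in ggplot() and inherited by all geoms
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

p_global
```

Both expressions should yield the same plot, but the global version avoids repetition and ensures that all layers use the same variables. A geom can still override or extend the global mapping by providing its own aes().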

9.3 Conclusion

9.3.1 Summary

9.3.2 Resources


Books and book chapters

The two main references on ggplot2 and its history are Wilkinson (2005) and Wickham (2016).

Introductory chapters on ggplot2 include:

As the ggplot2 package is a key component and precursor of the so-called tidyverse dialect or movement (Wickham et al., 2019), the corresponding textbooks provide good introductions to ggplot2:

Online resources

Inspiration and tools for using ggplot2:

Cheatsheets

Here are some pointers to related Posit cheatsheets:

  • Data visualization with ggplot2

Figure 9.1: Data visualization with ggplot2 from Posit cheatsheets.

The corresponding online reference provides an overview of key ggplot2 functionality.

9.3.3 Preview

We have now learned to create visualizations in base R (in Chapter 8) and with the ggplot2 package. Irrespective of the tools we use, colors are an important aesthetic for making more informative and pleasing visualizations. Chapter 10 on Using colors introduces the topic of color representation and shows us how to find and manipulate color palettes.

9.4 Exercises


For exercises on using ggplot2,

9.4.1 Advanced ggplot expressions

The following ggplot() expressions are copied from the documentation of the corresponding geoms. Run the code, inspect the result, and then try to explain how they work:

  1. A facet of histograms:
ggplot(economics_long, aes(value)) +
  facet_wrap(~variable, scales = 'free_x') +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)))