9 Visualize with ggplot2

This chapter introduces data visualization with the R package ggplot2 (Wickham, Chang, et al., 2024).

Please note: This chapter is improving, but still incomplete.

Preflections

  • What are common elements of visualizations?

  • What is the relation between data and those (functional) elements?

  • What are aesthetic features of visualizations?

9.1 Introduction

The ggplot2 package (Wickham, Chang, et al., 2024) and the corresponding book ggplot2: Elegant graphics for data analysis (Wickham, 2016) provide an implementation of The Grammar of Graphics (Wilkinson, 2005), which develops a systematic way of thinking about data visualization (i.e., a language and philosophy of visualization). The notion of a “grammar” is one that we are familiar with (e.g., from studying a foreign language), but its exact meaning remains difficult to define. Wilkinson (2005) notes that a grammar provides the rules that make languages expressive. The essence of a grammar is to specify how elementary components can be combined to create well-formed systems. Thus, knowing the grammar of a language allows us to combine elementary concepts and words into sentences. Similarly, learning the grammar of graphics will allow us to turn data into visualizations.

Learning how to use ggplot2 is — just like learning R — a journey, rather than a destination. Hence, we should not be surprised if some concepts and details remain somewhat obscure for a while. Fortunately, there is no need to understand all about ggplot() to create awesome visualizations with it.

9.1.0.1 Terminology

Distinguish between ggplot2 and ggplot():

  • ggplot and ggplot2 denote R packages (this chapter uses ggplot2 in version 3.5.1), whereas

  • ggplot() is the main function of those packages that generates a visualization.

Beyond this technical distinction, the grammar of graphics includes many new terms:

  • mapping data variables to visual aspects or dimensions (e.g., axes, groups)

  • distinguishing a range of geoms (i.e., geometric objects, e.g., areas, bars, lines, points) that may transform data via statistics (stat arguments or stat_* functions)

  • adding aesthetic features (e.g., colors, shapes, sizes) and descriptive elements (e.g., text captions, labels, legend, titles)

  • combining graphical elements into layers and viewing different facets of a visualization

9.1.1 Contents

This chapter provides an introduction to the ggplot2 package (Wickham, Chang, et al., 2024). It covers some basic types of visualizations and shows how they can be improved by adding aesthetic features (e.g., colors, labels, and themes) and advanced functions (e.g., facets, layers, and extensions). Overall, this will provide us with a powerful toolbox for creating informative and beautiful visualizations.

9.1.2 Data and tools

This chapter primarily uses the functions of the ggplot2 package:

library(ggplot2)    # for ggplot() and related functions

but also some related packages:

library(patchwork)  # for combining and arranging plots
library(unikn)      # for colors and color functions 

In addition to using data from the datasets and ggplot2 packages, we use the penguins dataset from the palmerpenguins package (Horst et al., 2022):

library(palmerpenguins)  # for penguins data

Figure 9.1: Meet the penguins of the Palmer Archipelago, Antarctica. (Artwork by @allison_horst.)

9.2 Essentials of ggplot2

Before we can draw our first plots, we will introduce a minimal template for ggplot() commands and explain some related terminology.

9.2.1 Minimal template

Generally speaking, a plot takes some <DATA> as input and creates a visualization by mapping data variables or values to (parts of) geometric objects.

A minimal template of a ggplot() command can be reduced to the following structure:

# Minimal ggplot template:
ggplot(<DATA>) +              # 1. specify data set to use
  <GEOM_fun>(aes(<MAPPING>))  # 2. specify geom + variable mapping(s)

The minimal template includes the following elements:

  • The <DATA> is a data frame or tibble that contains all data that is to be plotted and is shaped in suitable form (see below).
    Its variable names are the levers by which the data values are being mapped to the plot.

  • <GEOM_fun> is a function that maps data to a geometric object (“geom”) according to an aesthetic mapping that is specified in aes(<MAPPING>). A mapping specifies a relation between two entities. Here, the mapping specifies the correspondence of variables to graphical elements, i.e., what goes where.

  • A geom’s visual appearance is controlled by aesthetics (e.g., colors, shapes, sizes, …) and can be customized by keyword arguments (e.g., color, fill, shape, size…). There are two general ways and positions to do this:

    1. within the aesthetic mapping (when varying visual features as a function of data properties), or
    2. by setting its arguments to specific values, i.e., by passing <arg_1 = val_1, ..., arg_n = val_n> to the geom function outside of aes() (when remaining constant).

Note that the functions that make up a ggplot() expression (which are typically positioned on separate lines) are connected by the + operator, rather than by a pipe operator (such as |> or %>%).
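To make the template more concrete, the following sketch fills in its placeholders with the penguins data that is used later in this chapter. It assumes that the ggplot2 and palmerpenguins packages are installed; the object names p_mapped and p_constant are chosen here for illustration only. The first plot varies color as a function of a data variable (inside aes()), whereas the second sets color to a constant value (outside aes()):

```r
library(ggplot2)
library(palmerpenguins)  # assumed to be installed (provides the penguins data)

# 1. Mapping: color varies with a data variable (inside aes()):
p_mapped <- ggplot(penguins) +
  geom_point(aes(x = body_mass_g, y = flipper_length_mm, color = species),
             na.rm = TRUE)

# 2. Setting: color is a constant value (outside aes()):
p_constant <- ggplot(penguins) +
  geom_point(aes(x = body_mass_g, y = flipper_length_mm),
             color = "steelblue", na.rm = TRUE)

p_mapped    # includes a legend (as color encodes species)
p_constant  # no legend (as color encodes no data)
```

Only the mapped version generates a legend, as a constant color does not convey any information about the data.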

9.2.2 Terminology

An obstacle to many technologies is that insiders tend to converse in special terms that appear to obscure rather than reveal insight. In this respect, ggplot2 is no exception. Fortunately, the number of needed terms is limited and the investment is worthwhile.

Two abstract notions that are relevant in the context of the ggplot2 package are geoms and mapping. An implicit requirement of ggplot() is that the to-be-plotted data must be in the right format (shape).

Geometric objects

Basic types of visualizations in ggplot2 involve geometric objects (so-called geoms), which are accessed via dedicated functions (<GEOM_fun>). When first encountering ggplot2, it makes sense to familiarize ourselves with basic geom functions that create basic types of visualizations. Just like other R functions, geoms require specific input arguments to work. As we get more experienced, we will realize that geoms can be combined to create more complex plots and can invoke particular computations (so-called stats).

Mapping data to visual elements

When creating visualizations, the main concept that beginners tend to struggle with is the mapping between data and elements of the visualization. The notion of a mapping is a relational concept that essentially specifies what goes where. The what part typically refers to some part of the data (e.g., a variable), whereas the where part refers to some aspect or part of the visualization (e.g., an axis, geometric object, or aesthetic feature).

Data format

The <DATA> provided to the data argument of the ggplot() function must be a rectangular table (i.e., a data.frame or tibble). Beyond this basic type, ggplot() assumes that the data is formatted in a specific way (in so-called “long” format, using factor variables to describe measurement values). Essentially, this format ensures that some variables describe or provide handles on the values of others. At this point, we do not need to worry about this and just work with existing sets of data that happen to be in the right shape. (We will discuss corresponding data transformations in Chapter 14 on Tidying data.)
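As a rough illustration of what “long” format means, consider a small table of made-up measurements (the data frame wide and its values are hypothetical). In wide format, each measurement occupies its own column; in long format, one variable names the measurement and another variable holds its value. The sketch uses pivot_longer() from the tidyr package, which is only one of several ways to reshape data:

```r
library(tidyr)  # assumed to be installed

# Hypothetical data in wide format (one column per year):
wide <- data.frame(id = c("a", "b"),
                   y_2007 = c(10, 20),
                   y_2008 = c(11, 21))

# Long format (one row per combination of id and year):
long <- pivot_longer(wide, cols = c(y_2007, y_2008),
                     names_to = "year", values_to = "value")
long
```

In the long version, year is a variable that could be mapped to an axis or a color. This is not possible as long as its levels are spread over several column names.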

The data used in the subsequent examples is copied from the penguins object of the palmerpenguins package (Horst et al., 2022). We assign this data to an R object pg and inspect it:

# Data:
pg <- palmerpenguins::penguins

# Inspect data:
dim(pg)
#> [1] 344   8

# Compact structure:
str(pg)
#> tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
#>  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
#>  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
#>  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
#>  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
#>  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
#>  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

# Print some cases:
set.seed(100)  # for reproducible randomness
s <- sample(1:nrow(pg), size = 10)
knitr::kable(pg[s, ], caption = "10 random cases (rows) of the `penguins` data.")
Table 9.1: 10 random cases (rows) of the penguins data.

species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Gentoo     Biscoe               45.2           15.8                215         5300  male    2008
Adelie     Biscoe               45.6           20.3                191         4600  male    2009
Gentoo     Biscoe               50.1           15.0                225         5000  male    2008
Adelie     Torgersen              NA             NA                 NA           NA  NA      2007
Chinstrap  Dream                49.7           18.6                195         3600  male    2008
Chinstrap  Dream                49.8           17.3                198         3675  female  2009
Adelie     Dream                40.3           18.5                196         4350  male    2008
Adelie     Torgersen            38.9           17.8                181         3625  female  2007
Gentoo     Biscoe               47.3           15.3                222         5250  male    2007
Chinstrap  Dream                43.2           16.6                187         2900  female  2007

The table shows the names of the 8 variables in our pg data, which are rather self-explanatory. For instance, the levels of the factor variables species and island can be used to group the other values (e.g., measurements of penguin physiology). Note that each row of data refers to one observation of a penguin and the data contains some missing (NA) values on some variables.

Do not worry if some of these terms remain unclear at this point. The following sections will provide plenty of examples that — hopefully — further explain and illustrate their meaning.

9.2.3 Plotting distributions

In Chapter 8, we used histograms and the hist() function to visualize the distribution of variable values (see Section 8.2.1). The corresponding geom function in ggplot2 is geom_histogram(). The data to be used is pg and the only aesthetic mapping required for geom_histogram() is to specify a continuous variable whose values should be mapped to the \(x\)-axis. Let’s use the flipper_length_mm variable for this purpose and create our first visualization with ggplot() (Figure 9.2):

# Basic histogram: 
ggplot(data = pg) + 
  geom_histogram(mapping = aes(x = flipper_length_mm))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_bin()`).

Figure 9.2: A basic histogram showing a distribution of variable values (created by ggplot2).

Note that we succeeded in creating our first histogram in ggplot2. This visualization is rather basic, but includes the bars of a histogram on a grey background with white grid lines (its signature theme_grey() is based on Tufte, 2006) and two axes with appropriate labels. As with the hist() function (from the base R graphics package), the default behavior of the geom_histogram() function is to categorize the values of the specified variable in discrete bins and display the counts of values per bin (as a bar chart).

Note also that evaluating our ggplot() command printed a message and a warning. Whereas the warning is due to our flipper_length_mm variable containing 2 missing (NA) values, the message suggests that we could specify a numeric value for the bins or the binwidth parameter to override the default setting of bins = 30.
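Both notifications can be addressed in the geom function. The following sketch first counts the missing values and then sets an explicit binwidth (the value of 5 is an arbitrary choice made here) and na.rm = TRUE, which removes missing values without a warning. The object names n_missing and p_hist are used for illustration only:

```r
library(ggplot2)
library(palmerpenguins)  # assumed to be installed

pg <- palmerpenguins::penguins

# Number of missing values (which caused the warning):
n_missing <- sum(is.na(pg$flipper_length_mm))
n_missing

# Explicit binwidth (avoids the message) and na.rm = TRUE (avoids the warning):
p_hist <- ggplot(pg) +
  geom_histogram(aes(x = flipper_length_mm), binwidth = 5, na.rm = TRUE)
p_hist
```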

Just as we can use a natural language to say the same thing in different ways, the grammar of graphics allows for considerable flexibility in creating the same visualization. For instance, we can omit argument names of R functions, as long as the arguments (here data and x) are unambiguous, and we can move aesthetic mappings to the first line of the ggplot() expression. As a consequence, the following variants all create the same visualization:

# Basic histogram variants: 

# A: explicit version:
ggplot(data = pg) + 
  geom_histogram(mapping = aes(x = flipper_length_mm), bins = 30) +
  theme_grey()

# B: short version:
ggplot(pg) + 
  geom_histogram(aes(flipper_length_mm))

# C: moving aesthetic mapping to the 1st line:
ggplot(pg, aes(flipper_length_mm)) + 
  geom_histogram(bins = 30)

Adding colors, text labels, and themes

Before discovering more features of ggplot2, we should learn to improve its default visualizations. The basic histogram of Figure 9.2 is informative, but can be embellished by adding colors, more informative text labels, and choosing a different theme. Colors that do not vary by a data variable can be set as constants by passing them to color-related arguments of the current geom (i.e., outside the aes() function). For the bars of geom_histogram(), the color argument refers to the border of the bars, whereas the bars themselves are colored by a fill argument. We can set these arguments to any of the 657 named R colors, available by evaluating colors(). (More complex color settings involving data variables and color scales will be introduced below.) The best way to change default text labels is by using the labs() function, which allows setting a range of labels by intuitive argument names. A good visualization should usually have a descriptive title and informative labels for its x- and y-axes. The theme() function of ggplot2 allows re-defining almost any aesthetic aspect of a plot. Rather than specifying all of them manually, we can choose one of the theme_*() functions that come with ggplot2.

An improved version of Figure 9.2 can be created as follows (Figure 9.3):

# Adding colors, labels and themes:
ggplot(pg) + 
  geom_histogram(aes(x = flipper_length_mm), binwidth = 2, 
                 color = "grey20", fill = "deepskyblue") + 
  labs(title = "Distribution of penguin flipper lengths",   
       x = "Flipper length (in mm)", y = "Frequency") + 
  theme_bw()

Figure 9.3: A histogram showing a distribution of values (with colors, labels, and a theme).

The code for Figure 9.3 shows that ggplot() commands can be viewed as a sequence of sub-commands, joined by the + operator. A neat feature of ggplot2 is that plots can be stored as R objects and then modified later. For instance, the code of the previous chunk could be decomposed into two steps:

# Adding colors, labels and themes:
pg_1 <- ggplot(pg) + 
  geom_histogram(aes(x = flipper_length_mm), binwidth = 2, 
                 color = "grey20", fill = "deepskyblue")
# pg_1  # basic plot with default settings

pg_2 <- pg_1 + 
  labs(title = "Distribution of penguin flipper lengths",  
       x = "Flipper length (in mm)", y = "Frequency") + 
  theme_bw()
pg_2  # annotated plot (with labels and modified theme)

Figure 9.4: The same histogram as in Figure 9.3, created in two steps from stored plot objects.

When storing a plot as an R object, evaluating the object prints the plot to the visualization area of RStudio. Here, pg_1 provides the basic histogram (plus two color constants), and pg_2 adds text labels and changes the default plot theme. Given the vast range of possible modifications, the best practice and strategy for working with ggplot2 is to first get the basic mechanics of the plot right (i.e., by adjusting geoms and variable mappings) before adding further bells and whistles for creating a more appealing visualization (e.g., by selecting aesthetics, text labels, or themes). The modular structure of ggplot2 objects supports this strategy.
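The modular structure can be sketched as follows: one stored base plot serves as the starting point for several variants and can be saved to a file with ggsave(). The object names (p_base, p_bw, p_minimal, out_file) and the choice of a temporary PDF file are assumptions made for this illustration:

```r
library(ggplot2)
library(palmerpenguins)  # assumed to be installed

pg <- palmerpenguins::penguins

# Store a basic plot as an R object:
p_base <- ggplot(pg) +
  geom_histogram(aes(x = flipper_length_mm), binwidth = 2, na.rm = TRUE)

# Derive two variants from the same base object (p_base itself remains unchanged):
p_bw      <- p_base + theme_bw()
p_minimal <- p_base + theme_minimal() + labs(x = "Flipper length (in mm)")

# Save a stored plot to a file (here: a temporary PDF file, sizes in inches):
out_file <- tempfile(fileext = ".pdf")
ggsave(filename = out_file, plot = p_minimal, width = 6, height = 4)
file.exists(out_file)
```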

Grouping observations by mapping variables to aesthetics

Noting that “colors that do not vary by a data variable can be set as constants” (in the previous section) raises the question of what other functions colors could serve. A prominent function is to distinguish between different groups of observations. This would require that a color element of our visualization is mapped to the levels of a categorical variable. We can easily add this by moving a color argument into the aesthetic mapping function aes() and assigning it to a categorical variable of our data. For instance, the following code maps the factor variable species to the fill color of the histogram bars:

# Grouping by mapping aesthetics (fill color) to a data variable (species):
pg_3 <- ggplot(pg) +
  geom_histogram(aes(x = flipper_length_mm, fill = species), binwidth = 2, 
                 color = "grey10", linewidth = .20)
pg_3

Note that moving fill = species into the aes() function had two effects: First, the counts of observations (penguins) that were expressed in the bars are color-coded for the three different species of penguins. (Note the difference to the constant color = "grey10" setting, which lies outside the scope of the aes() function.) Additionally, a legend that describes the mapping of colors to species appeared to the right of the plotting area. This is very useful default behavior, but we may want to adjust the aesthetic properties (e.g., the fill color) to custom colors.

Adding color scales

Changing colors that are mapped to data variables typically requires specifying a color scale. The range of color scale functions and corresponding palettes can be confusing and usually requires a lookup of the scale_color_* function.

A popular option is to use one of the palettes of the RColorBrewer package (Neuwirth, 2022) that come pre-packaged with ggplot2. The Brewer scales provide sequential, diverging and qualitative color palettes (see https://colorbrewer2.org for more information). Looking up ?scale_colour_brewer reveals that its qualitative scales are labeled as “Accent”, “Dark2”, “Paired”, “Pastel1”, “Pastel2”, “Set1”, “Set2”, and “Set3”. As we aim to change the fill colors, we can select the corresponding palettes by specifying scale_fill_brewer(), e.g.,

# Grouping by aesthetics (and using a different color scale):
ggplot(pg) +
  geom_histogram(aes(x = flipper_length_mm, fill = species), binwidth = 2, 
                 color = "grey10", linewidth = .20) + 
  scale_fill_brewer(palette = "Accent")

When aiming to create a range of visualizations in a uniform style, it is advisable to define one or more palettes of custom colors. There are many R functions and packages supporting this task. We will use the unikn package, as it combines pleasing colors with useful color functions:

library(unikn)  # for colors and color functions

# seecol(pal_unikn_pref)  # view a (categorical) color palette

# Using the penguin species colors (from Figure 9.1):
my_cols <- usecol(pal = c("orange2", "orchid4", "turquoise4"), alpha = .67)

# Using unikn colors:
my_cols <- usecol(pal = c(Seeblau, Pinky, Seegruen), alpha = .67)  # 3 specific colors
my_cols <- usecol(pal = pal_unikn_pref, alpha = .67)               # entire color palette

The usecol() function allows defining a color palette (of a variable length n) and adding transparency (by setting the alpha parameter to a value from 0 to 1). Using color transparency is a primary way to prevent overplotting (see Chapter 8 on Visualize in R and Chapter 10 on Using colors for more details and examples).
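If the unikn package is not available, a similar palette can be sketched with base R functions. The following example is an alternative under stated assumptions, not the approach used in the rest of this chapter: It defines a named color vector (so that each species is linked to a specific color) and adds transparency with adjustcolor() from the grDevices package. The names species_cols, species_cols_t, and p_cols are chosen for illustration:

```r
library(ggplot2)
library(palmerpenguins)  # assumed to be installed

pg <- palmerpenguins::penguins

# A named vector links each factor level to a color (regardless of order):
species_cols <- c(Adelie = "orange2", Chinstrap = "orchid4", Gentoo = "turquoise4")

# Add transparency with base R (alpha.f ranges from 0 to 1):
species_cols_t <- grDevices::adjustcolor(species_cols, alpha.f = .67)
names(species_cols_t) <- names(species_cols)  # re-assign names (adjustcolor() may drop them)

p_cols <- ggplot(pg) +
  geom_histogram(aes(x = flipper_length_mm, fill = species),
                 binwidth = 2, na.rm = TRUE) +
  scale_fill_manual(values = species_cols_t)
p_cols
```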

As we saved our plot as pg_3 above, we can add labels, apply our new custom color palette, and change the default theme as follows (Figure 9.5):

# Adding labels, color scale, and theme (to an existing plot):
pg_4 <- pg_3 +
  labs(title = "Distribution of penguin flipper lengths (by species)", 
       x = "Flipper length (in mm)", y = "Frequency", fill = "Species:") + 
  scale_fill_manual(values = my_cols) + 
  theme_unikn()
pg_4

Figure 9.5: A histogram showing a distribution of values and color-coding a categorical variable.

Having seen some basic ggplot() commands, we should practice what we have learned so far.

Practice

Here are some practice tasks for plotting distributions:

  1. Playing with parameters: Re-create the basic histogram of Figure 9.2 and vary the bins or binwidth parameters.

    • What happens to the values on the \(y\)-axis when varying the parameters and why?
    • What happens when we change the variable mapping from x to y?
    • Which binwidth parameter corresponds to a value of bins = 30?
  2. Bill shapes by island: Create a (series of) histogram(s) to show the distribution of bill depth and bill length for penguins by island.

  3. Alternative distributions: Study the documentation to geom_histogram() and explore its alternatives geom_density() and geom_freqpoly().

    • Create a histogram, density plot, and frequency polygon to show the distribution of body mass (for the 3 species of penguins).
    • What does the \(y\)-axis of a density plot show?

9.2.4 Plotting summaries

In addition to plotting distributions, a common type of visualization aims to show a summary of one or more variables. While there are many ways of doing this, we will focus on bar charts and box plots.

Bar charts

A bar chart seems simple, but is actually quite a complicated plot. To see this, we use a ggplot() expression for our pg data and geom_bar(), mapping the factor variable species to its \(x\)-axis (Figure 9.6):

# A basic bar chart: Showing counts:
ggplot(pg) +
  geom_bar(aes(x = species))

Figure 9.6: A basic bar chart (showing counts of cases).

Figure 9.6 illustrates that geom_bar() does not simply plot given data values, but instead performs some computation. In ggplot2, geoms that compute values are linked to so-called stats (short for statistical transformations). By default, geom_bar() groups observations into the categories specified by the variable levels mapped to x and then counts the number of cases per category. The following expression is a more explicit version of the previous code chunk (and would create the exact same plot as Figure 9.6):

# Explicate stat:
ggplot(pg) +
  geom_bar(aes(x = species), stat = "count")

The relation between geoms and stats

We have seen that geom_histogram() categorized observations in our data into groups (bins) and counted their frequency (Figure 9.2). Similarly, geom_bar() automatically counted the observations in the levels of a variable mapped to x (Figure 9.6).

This illustrates a hidden complexity in creating visualizations: Many types of visualizations require computations or transformations of the input data. If we provide raw data values to a ggplot() command, the geoms aim to guess which transformation we desire by linking geoms to stat options (and corresponding functions).
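One way to make such hidden computations visible is to inspect the data that a layer has computed. The following sketch uses the layer_data() function of ggplot2 for this purpose and compares its counts to a simple frequency table. The object names p_bar and bar_data are chosen for illustration:

```r
library(ggplot2)
library(palmerpenguins)  # assumed to be installed

pg <- palmerpenguins::penguins

p_bar <- ggplot(pg) +
  geom_bar(aes(x = species))

# Data computed by the stat of the first layer:
bar_data <- layer_data(p_bar, i = 1)
bar_data[, c("x", "count", "prop")]

# The computed counts correspond to a simple frequency table:
table(pg$species)
```

The count column contains the values that determine the heights of the bars, even though the pg data contains no such variable.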

The details of possible relations between geoms and stat options are difficult to understand. Rather than aiming to explain them here, we can only emphasize that geoms that compute values are linked to statistical functions that can also be invoked directly. When asking for ggplot2 advice online, experts often provide nifty solutions that perform quite complicated data transformations in variable mappings. Here are some examples that are — spoiler alert — likely to confuse you:

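  • We can instruct ggplot2 to count observations by mapping y to the computed variable count (via after_stat(), as the older ..count.. notation is deprecated since ggplot2 3.4.0):
# Compute counts (in y mapping):
ggplot(pg) +
  geom_bar(aes(x = species, y = after_stat(count)))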

Figure 9.7: A bar chart counting the number of penguin observations by species.

  • Instead of counts, we can also ask for proportions by mapping y to the computed variable prop (but then also need to specify the group level):
# Compute proportions (in y and group mapping):
ggplot(pg) +
  geom_bar(aes(x = species, y = after_stat(prop), group = 1))

Figure 9.8: A bar chart computing proportions of penguin species.

  • In case this cryptic code does not suffice to confuse you, we can even omit the geom_ function altogether and directly ask for the summary of a given variable mapping (and specify the geom as an argument of the stat_summary() function):
# Compute a bar chart of means (by using stat_summary):
ggplot(pg, aes(x = species, y = body_mass_g)) +
    stat_summary(fun = mean, geom = "bar")

Figure 9.9: A bar chart computing penguins’ mean body mass by species from data (without an explicit geom function).

Do not worry if the last three examples remain rather confusing at this point! They are shown here only to illustrate the intimate connection between data visualization and transformation. Computing values from data in visualization commands may be convenient and powerful, but is also error-prone and opaque. A better way of creating visualizations is to first compute all values that we are interested in (e.g., some measures of central tendency and variability) and then visualize these values. Fortunately, novice users of ggplot2 only need to know that some geoms provide stat options and choose an appropriate one (e.g., "count" vs. "identity") if the default option fails.

Better bar charts

Rather than relying on opaque data transformations, a better way to create informative bar charts is to explicitly compute all values that we aim to visualize. This may require more effort, but also provides more control and is more transparent and reproducible.

As an example, we re-create the basic counts of Figure 9.6 and the mean chart of Figure 9.9 in a different way. Interestingly, doing so will not require geom_bar(), but rather geom_col().

  1. We first compute the counts of observations and means of the body_mass_g variable for the pg data:
library(tidyverse)

# Compute summaries for body_mass_g of penguins per species:
tb <- pg %>% 
  group_by(species) %>%
  summarise(n = n(),
            mean_body_mass = mean(body_mass_g, na.rm = TRUE)
            )

# Print tb: 
knitr::kable(tb, caption = "A table of computed summary values.", digits = 2)
Table 9.2: A table of computed summary values.

species    n    mean_body_mass
Adelie     152         3700.66
Chinstrap   68         3733.09
Gentoo     124         5076.02

Note that we used a dplyr pipe to group, count and summarize cases (see Chapter 13 on Transforming data), but could have used base R functions to solve the same task.
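A base R version of this summary could look as follows. This is a minimal sketch of one possible solution; the object names n_by_species, mean_by_species, and tb_base are chosen for illustration:

```r
library(palmerpenguins)  # assumed to be installed

pg <- palmerpenguins::penguins

# Counts per species (including cases with missing values):
n_by_species <- table(pg$species)

# Means per species (the formula interface of aggregate() omits NA values by default):
mean_by_species <- aggregate(body_mass_g ~ species, data = pg, FUN = mean)

# Combine both results into one summary table:
tb_base <- data.frame(
  species = mean_by_species$species,
  n = as.vector(n_by_species[as.character(mean_by_species$species)]),
  mean_body_mass = mean_by_species$body_mass_g
)
tb_base
```

The resulting data frame contains the same values as the summary table tb and could be used in its place in the following ggplot() commands.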

We now can easily re-create Figure 9.6. But as our summary table tb already contains the desired counts (in a variable named n), we no longer want geom_bar() to count anything. Thus, we could map the y variable of geom_bar() to n and switch off the geom’s default counting behavior (by specifying stat = "identity"):

ggplot(tb) +
  geom_bar(aes(x = species, y = n), stat = "identity")

An easier way to achieve the same is to replace geom_bar() by geom_col(), as the latter uses stat = "identity" by default:

ggplot(tb) +
  geom_col(aes(x = species, y = n))

Why is using geom_col() a better way of creating bar charts? The main reason is that the computation of the values in tb was entirely under our control and fully transparent. A related benefit is that we can easily re-create Figure 9.9 by only changing the variable mapping of y:

ggplot(tb) +
  geom_col(aes(x = species, y = mean_body_mass))

This code is arguably much simpler than the one that created Figure 9.9. Note also that the amount of data supplied to the corresponding ggplot() functions is vastly different: Whereas pg is a raw data table of 344 rows and 8 columns, tb is only a small summary table of 3 rows and 3 columns. However, smaller is not always better. If we ever wanted to visualize other aspects of the data, the data argument provided as input to the ggplot() function must provide corresponding variables. To illustrate this point, we will further refine the aesthetic settings of our original bar chart (showing the counts of penguins by species in Figure 9.6) by the sex variable. As we have dropped this variable from tb, we need to use the original pg data for this purpose (or alternatively include sex in tb).

Adjusting aesthetics

As we have seen above, we can easily adjust colors, text labels, and themes to create prettier and more informative bar charts:

ggplot(pg) +
  geom_bar(aes(x = species, fill = species), stat = "count") + 
  labs(title = "Number of penguins by species", 
       x = "Species", y = "Frequency", fill = "Species:") + 
  scale_fill_manual(values = my_cols) + 
  theme_unikn()

Rather than mapping x and fill to the same variable, we could set fill to a different variable of our pg data. For instance, let’s see what happens when we set fill to sex:

# Add sub-category (by mapping fill to sex):
ggplot(pg) +
  geom_bar(aes(x = species, fill = sex)) + 
  scale_fill_manual(values = my_cols) +
  theme_unikn()

By mapping another variable to the fill color of geom_bar() we revealed additional information about our data. Note also that the sub-categories of each bar are stacked. The reason is that the position argument of geom_bar() is set to "stack" by default. As the color mapping of the sex variable may be somewhat un-intuitive, we also create an alternative color vector:

# Choose 3 colors (to be mapped to sex):
my_3_cols <- my_cols[c(2, 1, 8)]

# Show sub-category (as stacked bars):
ggplot(pg) +
  geom_bar(aes(x = species, fill = sex), position = "stack") + 
  scale_fill_manual(values = my_3_cols) +
  theme_unikn()

An alternative position setting is "dodge":

# Show sub-category (as dodged bars):
ggplot(pg) +
  geom_bar(aes(x = species, fill = sex), position = "dodge") +
  scale_fill_manual(values = my_3_cols) + 
  theme_unikn()

However, if not all categories contain the same sub-categories (here sex values), the width of the bars may need further adjustments.
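One possible adjustment (a sketch, not necessarily the only or best solution) is to replace the "dodge" keyword by the position_dodge() function and set its preserve argument to "single". This keeps the width of individual bars constant, even when a category lacks a sub-category (here: the Chinstrap penguins contain no missing values of sex). The object name p_dodge and the default colors are used for illustration:

```r
library(ggplot2)
library(palmerpenguins)  # assumed to be installed

pg <- palmerpenguins::penguins

# Not all species contain all values of sex (including NA):
table(pg$species, pg$sex, useNA = "ifany")

# Keep the width of individual bars constant across categories:
p_dodge <- ggplot(pg) +
  geom_bar(aes(x = species, fill = sex),
           position = position_dodge(preserve = "single"))
p_dodge
```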

Box plots

When aiming to visualize summary information of a continuous variable by the levels of some categorical variable, a good alternative is provided by a box plot.

ggplot(pg) + 
  geom_boxplot(aes(x = species, y = body_mass_g), fill = "gold")

Mapping fill color to an additional variable:

ggplot(pg) + 
  geom_boxplot(aes(x = species, y = body_mass_g, fill = island))
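
Like geom_bar(), geom_boxplot() computes its values from the data: By default, the line within each box marks the median and the box spans the range from the first to the third quartile. The following sketch computes these values per species with base R functions, so that they can be compared to the first box plot above. The object name box_stats is chosen for illustration:

```r
library(palmerpenguins)  # assumed to be installed

pg <- palmerpenguins::penguins

# Quartiles of body mass per species (the box spans the 25% to 75% range):
box_stats <- sapply(split(pg$body_mass_g, pg$species),
                    quantile, probs = c(.25, .50, .75), na.rm = TRUE)
box_stats

# Means per species (for comparison with the medians in the 50% row):
tapply(pg$body_mass_g, pg$species, mean, na.rm = TRUE)
```

Comparing both outputs shows that a box plot describes a distribution by its median and quartiles, rather than by its mean.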

Overall, investing in manual data transformation and computations adds control and transparency to our visualizations and simplifies the code. As an example, we have shown that bar charts showing means of some variable can be created by using geom_col() rather than by using geom_bar(). However, when transforming data to be plotted we must make sure that the data supplied as input to ggplot() contains all the variables and values that we want to visualize.

Better bar plots are often column plots: Pre-compute the values to display. If we had pre-computed the counts, we could map them to y and specify stat = "identity".

A good alternative to many bar charts — if they provide mean information — is provided by box plots.

Practice

Here are some practice tasks on plotting summaries in bar charts or box plots:

  1. Understanding geoms: Using the summary table tb from above, explain the result of the following command:
ggplot(tb) +
  geom_bar(aes(x = species))
  2. Flipping coordinates: Evaluate the following expression and explain its result.
ggplot(pg) +
  geom_bar(aes(x = species)) + 
  coord_flip()
    • How can an identical plot be created without using coord_flip()?
  3. Simple bar charts: Create a bar plot for the pg data showing the counts of penguins observed on each island.

  4. Misleading settings: Explain the output of the following command and find a better solution.

    • Why is it misleading?
    • How could it be fixed?
# Adding a factor variable:
ggplot(pg, aes(x = species, y = body_mass_g, fill = sex)) +
    stat_summary(fun = mean, na.rm = TRUE, geom = "bar", position = "stack")
  5. Create a box plot that shows the distribution of flipper lengths of penguins on each of the three islands.

    • Add aes(fill = island) to geom_boxplot() and explain the result.
    • Change the fill aesthetic of geom_boxplot() to aes(fill = species) and explain the result.
# Flipper length by island (fill color by island):
ggplot(pg) + 
  geom_boxplot(aes(x = island, y = flipper_length_mm, fill = island))

# same as (moving the aesthetic mapping to the 1st line):
ggplot(pg, aes(x = island, y = flipper_length_mm, fill = island)) + 
  geom_boxplot()

# Fill color by species:
ggplot(pg) + 
  geom_boxplot(aes(x = island, y = flipper_length_mm, fill = species))

9.2.5 Plotting relations

Another common type of plot visualizes the relationship between two or more variables. Important types of plots that do this include scatterplots and visualizations of lines or trends. This section will introduce corresponding ggplot2 geoms.

Scatterplots

Scatterplots visualize the relation between two variables for a number of observations by corresponding points that are located in 2-dimensional space. Assuming two orthogonal axes (typically \(x\)- and \(y\)-axes), a primary variable is mapped to the \(x\)-axis, and a secondary variable is mapped to the \(y\)-axis of the plot. The points representing the individual observations then show the value of \(y\) as a function of \(x\).

As an example of a simple scatterplot, we aim to solve the following task:

  • Visualize the relationship between body mass and flipper length for (the 3 species of) penguins.

Solving this task in ggplot2 is simple and straightforward. We provide our pg data to ggplot() and select the geometric object geom_point() with the aesthetic mappings x = body_mass_g and y = flipper_length_mm (Figure 9.10):

ggplot(pg) +
  geom_point(aes(x = body_mass_g, y = flipper_length_mm))

Figure 9.10: A basic scatterplot using geom_point(), but suffering from overplotting.

Overall, this basic scatterplot suggests a positive and possibly linear correlation between penguins’ body mass (mapped to the values on the \(x\)-axis) and their flipper length (mapped to the values on the \(y\)-axis). However, the example also illustrates a typical problem of scatterplots: When many points are clustered near each other or even at the same locations, they overlap or obscure each other, a phenomenon known as overplotting. There are many ways of preventing overplotting in ggplot2. In the context of scatterplots, popular strategies against overplotting include using colors or color transparency, or grouping points into clusters by changing their aesthetic features.

The aesthetic features of points include colors, sizes, and symbol shapes. As we have seen for other geoms, we can map either constant values or variables to aesthetic features of geom_point() (Figure 9.11):

sp_01 <- ggplot(pg) +
  geom_point(aes(x = body_mass_g, y = flipper_length_mm,  # essential mappings 
                 col = species, shape = species           # aesthetic variables
                 ),                                       # vs. 
             alpha = .50, size = 2                        # aesthetic constants
             )
sp_01

Figure 9.11: A scatterplot using geom_point() and an aesthetic grouping variable (species).

Note that Figure 9.11 mapped two aesthetic features (col and shape) to a variable (species), whereas two others (alpha and size) were set to constant values. The effect of this difference is that the species variable is used to group the geom’s visual elements (i.e., point color and shape vary by species), whereas color transparency and size are the same for all points.
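
The difference between mapping a variable and setting a constant is a common source of confusion. The following sketch illustrates it under the assumption that we use the mpg data included in ggplot2 (rather than pg) and inspect the result with layer_data(); the object names p_set and p_map are hypothetical.

```r
library(ggplot2)

# Setting: A constant outside of aes() colors all points blue.
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

# Mapping: Inside of aes(), "blue" is treated as a data variable with one level.
# The points receive a color from the default scale and a legend appears.
p_map <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))

unique(layer_data(p_set)$colour)  # the constant we asked for
unique(layer_data(p_map)$colour)  # a color chosen by the default scale
```

When a plot shows an unexpected color and a legend for a single value, a constant has probably been placed inside of aes().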

Finally, we can further improve our previous plot by choosing custom colors, adding text labels, and selecting another theme. Since the plot of Figure 9.11 was saved as an R object (sp_01), we can adjust it by adding labels, a color scale, and a theme function (Figure 9.12):

sp_01 + 
  labs(title = "Penguin's flipper length by body mass (by species)", 
       x = "Body mass (in g)", y = "Flipper length (in mm)", 
       col = "Species:", shape = "Species:") + 
  scale_color_manual(values = my_cols) + 
  theme_bw()

Figure 9.12: Adjusting our scatterplot’s text labels, color scale, and theme.

As before, tweaking aesthetics and adding text labels to the initial plot improved our visualization by making it both prettier and easier to interpret. (We will later see that faceting, i.e., splitting a plot into several sub-plots, is another way of preventing overplotting in ggplot2.)

Practice

Here are some practice tasks on plotting relationships in scatterplots, lines or trends:

  1. Bill relations: What is the relation between penguins’ bill length and bill depth?

    • Create a scatterplot to visualize the relationship between both variables for the pg data.
    • Does this relationship vary for different species of penguins?
  2. Scattered penguins: The following code builds on our previous scatterplot (Figure 9.11, saved above as sp_01), but maps the aesthetic feature shape to the data variable island, rather than to species.

    • Evaluate the code and explain the resulting scatterplot.
    • Criticize the plot’s trade-offs: What is good or bad about it?
    • Try improving the plot so that the different species and islands become easier to distinguish.
ggplot(pg) +
  geom_point(aes(x = body_mass_g, y = flipper_length_mm, # essential mappings 
                 col = species, shape = island),         # aesthetic variables vs. 
             alpha = .50, size = 2)                      # aesthetic constants
# Possible solutions:

# Good: Mapping 2 variables means that there are many things to see
# Bad:  Complexity makes some things hard to see.

# A: Tweaking aesthetics to improve visibility: ---- 
ggplot(pg) +
  geom_point(aes(x = body_mass_g, y = flipper_length_mm,  # essential mappings 
                 col = species, shape = island            # aesthetic variables
                 ),                                       # vs. 
             alpha = .40, size = 5                        # aesthetic constants
             ) +
  scale_color_manual(values = my_cols) + 
  theme_minimal()

# B: Using 3 facets: ---- 
ggplot(pg) +
  geom_point(aes(x = body_mass_g, y = flipper_length_mm,  # essential mappings 
                 col = species, shape = island            # aesthetic variables
                 ),                                       # vs. 
             alpha = .50, size = 2                        # aesthetic constants
             ) +
  facet_wrap(~island)

# C: Using 3 x 3 faceting: ----
ggplot(pg) +
  geom_point(aes(x = body_mass_g, y = flipper_length_mm,  # essential mappings 
                 col = species, shape = island            # aesthetic variables
                 ),                                       # vs. 
             alpha = .50, size = 2                        # aesthetic constants
             ) +
  facet_grid(species~island)
  3. Plotting mathematical functions: Figure 9.18 visualizes three mathematical functions.

    • Try re-creating each line using geom_function() (without restricting the range of \(x\)-values).
    • Try re-creating Figure 9.18 using stat_function() (with the same ranges of \(x\)-values).

Figure 9.18: Plotting three mathematical functions (i.e., two linear and one quadratic function).

  4. Penguin lines: Create a line plot that uses the pg data to show the development of penguins’ mean body mass by island over the observed period of three years. Note that the steps required for this task are analogous to those leading to Figure 9.13 (above):

    • Create a small summary table that contains all desired variables.
    • Use this table to create a basic line plot.
    • Tweak the line plot (by adjusting its scales, labels, and theme) to provide a clear view of the “development” over time.
    • Turn your line plot into a bar plot.
  5. Bill trends: Add trend lines to your scatterplot showing the relation between penguins’ bill length and bill depth (from Task 1 above).

    • Add trend lines both to the overall scatterplot and to the version distinguishing three species.
    • Explore the effects of different method arguments.

Having learned to use ggplot2 to visualize distributions (e.g., by using geom_histogram() or geom_density()), summaries (geom_bar(), geom_col(), or geom_boxplot()) or relations as sets of points (geom_point()) or lines (geom_function(), geom_line(), geom_smooth()), we are ready to discover its more advanced features.

9.3 Advanced features of ggplot2

Using more advanced features of ggplot2 requires a more general template and some additional terminology:

  • layers are multiple levels of the same plot (behind/in front of each other)
  • facets are multiple variants of the same plot (beside/next to each other)

The additional topics mentioned in this section are:

  • Combining layers of geoms and abstracting aesthetic mappings
  • Splitting up plots into facets
  • Using alternative coordinate systems
  • Combining and saving plots

We will conclude this section by mentioning further ggplot2 extensions.

9.3.1 Generic template

A generic template for creating a visualization with additional bells and whistles has the following structure:

# Generic ggplot template: 
ggplot(data = <DATA>) +                 # 1. specify data set to use
  <GEOM_fun>(mapping = aes(<MAPPING>),  # 2. specify geom + mappings 
             <arg_1 = val_1, ...>) +    # - optional arguments to geom
  ...                                   # - additional geoms + mappings
  <FACET_fun> +                         # - optional facet function
  <LOOK_GOOD_fun>                       # - optional themes, colors, labels...

The generic template includes the following elements (beyond the <DATA> and <GEOM_fun> of the minimal template):

  • Multiple <GEOM_fun> yield layers of geometric elements.

  • An optional <FACET_fun> uses one or more variable(s) to split a complex plot into multiple subplots.

  • A sequence of optional <LOOK_GOOD_fun> adjusts the visual features of plots (e.g., by adding titles and text labels, color scales, plot themes, or setting coordinate systems).

9.3.2 Layers of geoms

Geoms can be combined with each other.

When using multiple geoms (in layers):

  • Specify common mappings globally, rather than locally.
  • Consider the order of geoms: Later geoms appear on top of earlier geoms.

Three typical use cases for combining layers of geoms are:

  • Combine raw data with distributions and summaries
  • Combine bar or line plots with error bars (and annotations)
  • Combine scatterplots with trends and distribution info (rugs)
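
The third combination (a scatterplot with a trend line and rugs) can be sketched as follows. The example is only an illustration under stated assumptions: it uses the mpg data that ships with ggplot2 instead of the pg data, and the object name p_layers is hypothetical. It also shows both recommendations from above: common mappings are specified globally in ggplot(), and later geoms are drawn on top of earlier ones.

```r
library(ggplot2)

# Global mappings in ggplot() are inherited by all geoms:
p_layers <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv), alpha = .50) +      # layer 1: raw data (plus a local mapping)
  geom_smooth(method = "lm", formula = y ~ x,      # layer 2: one overall linear trend,
              se = FALSE, color = "black") +       #          drawn on top of the points
  geom_rug(alpha = .20)                            # layer 3: marginal distribution info

p_layers
```

Because color = drv is mapped locally in geom_point(), the trend line is not split by group. Moving this mapping into ggplot() would yield one trend line per level of drv.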

9.3.3 Faceting

Plots showing a lot of data or multiple geoms can become rather crowded. When geoms no longer suffice, we can easily split plots into sub-plots (so-called panels).

Faceting provides an explicit form of grouping by splitting a plot into subplots (or panels).

Let us reconsider the histogram of Figure 9.5 (saved as pg_4 above):

# Using a color-coded histogram (defined above):
pg_4

Modifying the overall histogram (of Figure 9.3) by mapping the fill color to the species variable (in Figure 9.5) revealed additional information, but also made the frequency counts harder to interpret (as some bars included counts from only one, others from two or three different species). We can disentangle both aspects (frequency of values and species) by splitting the visualization into panels by using the facet_wrap() function:

# Explicit grouping/splitting by 1 faceting variable:
pg_4 + 
  facet_wrap(~species)

The three panels allow comparing frequency counts within and between species (on the \(y\)-axis shared by all panels), as well as their relative positions (on an identical \(x\)-axis for all panels). Note that we provided the species variable as an argument to facet_wrap() in the formula notation ~species (i.e., preceded by the squiggly tilde symbol ~) and that the automatic panel headings render the legend for the fill color redundant (so that we could remove it by adding theme(legend.position = "none")).

We can extend the strategy of splitting a visualization by a variable into panels to additional variables. The facet_grid() function creates a matrix of panels, with one variable defining its rows and another defining its columns (Figure 9.19):

# Explicit grouping/splitting by a faceting grid:
pg_4 + 
  facet_grid(island~species) 

Figure 9.19: A histogram split into 3 x 3 facets.

Note that facets split a visualization into sub-plots that use the same axes. This both de-clutters plots and allows comparisons across rows or columns.
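
Sharing axes across panels is the default behavior, but not the only option. The following sketch assumes the mpg data included in ggplot2 (rather than pg) and uses the scales argument of facet_wrap() to relax this constraint; the object names are hypothetical.

```r
library(ggplot2)

p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Default: All panels share the same x- and y-axes (scales = "fixed"):
p_fixed <- p_hist + facet_wrap(~drv)

# Free y-axes: Each panel uses its own range of counts:
p_free <- p_hist + facet_wrap(~drv, scales = "free_y")

p_fixed
p_free
```

Free scales can reveal the shape of a distribution in sparsely populated panels, but make it harder to compare bar heights between panels. Fixed scales are therefore usually the safer choice.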

A rather complex task:

  • Visualize the relationship between bill length and bill depth for the 3 species and islands (and sex).

Hint: Consider grouping automatically by aesthetics or explicitly by setting group to a factor variable.

Solution: Facet by island and species, and group by sex (via color):

ggplot(pg, aes(x = bill_length_mm, y = bill_depth_mm, color = sex)) + 
  facet_grid(island~species) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  # geom_rug() + 
  scale_color_manual(values = usecol(c(Pinky, Seeblau, Seegruen), alpha = .50)) + 
  labs(x = "Bill length (mm)", y = "Bill depth (mm)", color = "Sex:") + 
  theme_unikn()

9.3.4 Even more features

In case you have not been convinced yet, here are some additional features that make ggplot2 the greatest thing since sliced bread:

  • Alternative coordinate systems

  • Combining plots: Extend notion of a grid from faceting to several independent plots
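
As a brief illustration of alternative coordinate systems, the following sketch applies two coordinate functions of ggplot2 to the same bar chart. It assumes the mpg data included in ggplot2 (rather than pg), and the object name p_bars is hypothetical.

```r
library(ggplot2)

p_bars <- ggplot(mpg) +
  geom_bar(aes(x = class, fill = class), show.legend = FALSE)

p_bars                  # default: Cartesian coordinates
p_bars + coord_flip()   # swap x- and y-axes (horizontal bars)
p_bars + coord_polar()  # polar coordinates: x becomes the angle, bar height the radius
```

Note that the data, geom, and mappings remain unchanged: only the coordinate system that translates positions into locations on the plot differs.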

Combining plots

Above, we have occasionally saved the output of ggplot() expressions as R objects. This allows for getting the basic plot right (i.e., selecting geoms and mapping aesthetics) before adding more bells and whistles (e.g., colors, labels, and a theme).

Another good reason for saving plots as R objects is to combine multiple plots later. Combining plots differs from faceting, as the combined plots do not need to share the same axes or coordinate system. Instead, we can combine and arrange arbitrary plots into the sub-panels of a compound figure.

In this section, we provide examples using the patchwork R package (Pedersen, 2024), but the ggpubr, cowplot and gridExtra packages provide similar functionality. For instance, if we wanted to re-capitulate the journey from our first histogram (Figure 9.2 above) to our final version of it (Figure 9.5) we could re-create and save the former as an R object pg_0 and combine it with our final histogram, which was saved as R object pg_4 (Figure 9.20):

# Re-create basic histogram (from above), but store it as pg_0:
pg_0 <- ggplot(data = pg) + 
  geom_histogram(mapping = aes(x = flipper_length_mm))

library(patchwork)  # for combining plots

# Combine 2 plots:
# pg_0 + pg_4  # beside each other
pg_0 / pg_4    # above each other

Figure 9.20: Combining two ggplot2 plots (using the patchwork package).

Using the gridExtra package (Auguie, 2017), we could have achieved similar results with the grid.arrange() function:

library(gridExtra)  # for combining plots

# Combine 2 plots:
# gridExtra::grid.arrange(pg_0, pg_4, nrow = 1)
gridExtra::grid.arrange(pg_0, pg_4, nrow = 2)

However, a neat aspect of patchwork is that the heights or widths of sub-plots are automatically scaled to the same size.

When combining multiple plots, we usually want to arrange, annotate, or tag them, so that we can easily see and refer to their components. The patchwork package provides rich options for laying out and annotating plots. The following example also shows that it usually pays off to use a uniform color scheme and theme when combining plots (Figure 9.21):

# Create 3 plots (with common colors and theme): 
bx_1 <- ggplot(pg) +  
  geom_boxplot(aes(x = species, y = body_mass_g, fill = species)) + 
  labs(x = "Species", y = "Body mass") +
  scale_fill_manual(values = my_cols) + 
  theme_bw() + 
  theme(legend.position = "none")

bx_2 <- ggplot(pg) + 
  geom_boxplot(aes(x = species, y = flipper_length_mm, fill = species)) + 
  labs(x = "Species", y = "Flipper length") +
  scale_fill_manual(values = my_cols) + 
  theme_bw() + 
  theme(legend.position = "none")

st_1 <- ggplot(pg, aes(x = body_mass_g, y = flipper_length_mm, col = species)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Body mass", y = "Flipper length") +
  scale_color_manual(values = my_cols) + 
  theme_bw() + 
  theme(legend.position = "none")


# Combine plots:
patch_plot <- (bx_1 | bx_2) / st_1  # 2 plots above 1 wide plot

# Annotate: Title(s) and caption
patch_plot <- patch_plot + 
  plot_annotation(title = "Body mass and flipper length in penguins",
                  caption = "Note: Nice, but not too surprising.")

# Tag (basic):
# patch_plot + 
#   plot_annotation(tag_levels = 'A')  # Options: '1', 'a' 'A', 'i' 'I'

# Tag (nested layout): 
patch_plot[[1]] <- patch_plot[[1]] + plot_layout(tag_level = 'new')
patch_plot + plot_annotation(tag_levels = c('A', '1'))

Figure 9.21: Combining annotated and tagged plots (using the patchwork package).

Note that Figure 9.21 omitted all color legends to save space. As a consequence, the scatterplot and linear trend lines of Panel B would not be interpretable when shown in isolation, but the mapping of colors to the three penguin species is explained by the boxplots shown as Panels A1 and A2.

See Chapter 9: Arranging plots of the ggplot2 book (3e) for more patchwork examples.

Saving plots
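
Plots that were stored as R objects can also be written to a file. A minimal sketch, assuming the ggsave() function of ggplot2, the mpg data included in ggplot2 (rather than pg), and a hypothetical file name in a temporary directory:

```r
library(ggplot2)

p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))

# Hypothetical file name in a temporary directory (adjust to your project):
my_file <- file.path(tempdir(), "scatter_mpg.pdf")

# Save a specific plot object with an explicit size:
ggsave(filename = my_file, plot = p,
       width = 16, height = 10, units = "cm")
```

The file format is derived from the file extension. Providing the plot explicitly (rather than relying on the most recently displayed plot) and fixing its width and height makes the saved figure reproducible.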

Extensions

A powerful aspect of ggplot2 is that it can be extended by other packages, which can provide all kinds of elements, including geoms, themes, or fonts. Here are some examples:

  • ggridges provides a geom_density_ridges() that allows visualizations of variable distributions on different levels:
# install.packages('ggridges')
library(ggridges)

ggplot(pg, aes(x = flipper_length_mm, y = species, fill = species)) + 
  # facet_wrap(~species) + 
  geom_density_ridges() + 
  scale_fill_manual(values = my_cols) + 
  labs(title = "Distributions of penguin flipper length by species", 
       x = "Flipper length (mm)", y = "Species", fill = "Species:") + 
  theme_unikn()

Themes:

  • ggthemes for many additional and fancy themes:
# install.packages('ggthemes')
library(ggthemes)

pg_4 +
  theme_fivethirtyeight()

Fonts:

  • The extrafont package provides additional fonts for plotting (e.g., to mimic the visual style of the popular XKCD comic).

Practice

  1. Layering geoms: Layers of geoms

  2. Faceting: Use facets to split Figure 9.17 into three subplots showing the trends and points for each species (and remove the obsolete legend).

Since Figure 9.17 was saved as an R object tp_03 (above), we can easily add faceting by species (by adding facet_wrap()) and remove the plot legend (by a corresponding theme() function):

tp_03 + 
  facet_wrap(~species) + 
  theme(legend.position = "none")

Figure 9.22: A faceted version of Figure 9.17.

  3. Coordinate systems
  4. Combining plots: Use the histograms showing the distributions of bill depth and bill length for penguins by island (from practising distributions, Task 2 above) and combine them into a single plot.

    • Print both plots side-by-side (i.e., in two columns of a single row).
    • If both plots show the same legend, remove one of them to only show one legend (on the right).

Saving plots

This concludes our glimpse into the more advanced features of ggplot2.

9.4 Conclusion

As ggplot2 currently contains over 50 different geoms, the ones discussed in this chapter provide only an introductory glimpse of the available options.

The true power of ggplot2 results from its modular and extensible structure: It provides a set of tools that can be flexibly combined to create many different visualizations.

9.4.1 Summary

The R package ggplot2 provides a comprehensive toolbox for producing data visualizations. Unlike the collection of functions in base R graphics, ggplot2 uses a conceptual framework based on the grammar of graphics (Wilkinson, 2005). This allows us to construct a graph from composable elements, instead of being limited to a predefined set of charts.

Learning ggplot2 first involves getting a grasp on its terminology (e.g., aesthetic mappings, geoms, themes, layers, and facets) and its way of combining functions to create visualizations. A smart strategy when creating visualizations with ggplot2 for some data is to first select appropriate geoms and adjust variable mappings, before tuning aesthetics, labels, and themes.

9.4.2 Resources


Books and book chapters

The two main references on ggplot2 and its history are Wilkinson (2005) and Wickham (2016).

Introductory chapters on ggplot2 include:

As the original ggplot package was a precursor of the so-called tidyverse dialect or movement (Wickham et al., 2019), corresponding textbooks provide good introductions to ggplot2:

Online resources

One of the best starting points for learning ggplot2 is https://ggplot2.tidyverse.org/ and its vignettes:

Note also the helpful FAQ sections in the articles of https://ggplot2.tidyverse.org

Helpful insights into the relation between geoms and stats are provided by the following article:

Further inspirations and tools for using ggplot2 include:

Cheatsheets

Here are some pointers to related Posit cheatsheets:

  • Data visualization with ggplot2

Figure 9.23: Data visualization with ggplot2 from Posit cheatsheets.

The corresponding online reference provides an overview of key ggplot2 functionality.

9.4.3 Preview

We have now learned to create visualizations in base R (in Chapter 8) and with the ggplot2 package. Irrespective of the tools we use, colors are an important aesthetic for making visualizations more informative and pleasing. Chapter 10 on Using colors introduces the topic of color representation and shows us how to find and manipulate color palettes.

9.5 Exercises


Basic exercises

9.5.1 Histograms

In Chapter 8, we created our first histogram for a vector of numeric values x as follows:

x <- rnorm(n = 500, mean = 100, sd = 10)
hist(x)
  • Re-create an analogous histogram in ggplot2.
  • What are the similarities and differences to the base R version?
  • Add some aesthetics and labels to improve your histogram.
  • Discuss the relation between (and the use of stats in) histograms and bar charts.

Solution

In this example, the data x consisted of a single vector. However, as ggplot() requires its data to be in tabular form, we use data.frame() to convert it into a data frame with one variable x:

# Convert vector x into df:
df <- data.frame(x)
head(df)
#>           x
#> 1 105.55735
#> 2 106.71259
#> 3  90.51432
#> 4 111.84809
#> 5  94.10383
#> 6 114.64747

Now we can fill in the minimal template and use the geom_histogram() function for creating a histogram.

ggplot(data = df) + 
  geom_histogram(aes(x = x))

It is interesting to study the commonalities and differences of the two basic histograms created. Using ggplot() seems a little harder than hist(x), but embeds creating a histogram in a visualization framework that is much more flexible and powerful than the hist() function.

Studying the documentation of geom_histogram() (and geom_bar(), on which geom_histogram() is based) reveals hidden complexity. A difficulty is that both geoms note x and y as required aesthetics, but we succeeded by only providing x. The reason for this is that the y-values of our histogram were automatically computed by the default argument stat = "bin". For continuous variables, this stat counts the number of values that fall within a specific interval (a so-called bin). Internally, the histogram above is actually computed as a bar chart:

ggplot(data = df) + 
  geom_bar(aes(x = x), stat = "bin", position = "stack")

Whereas some visualizations merely show existing data values, many visualizations first need to compute something (e.g., count the frequency of values within a specific interval). Specifying what to compute, and how, is the purpose of the stat function. The relation between geoms and their corresponding stat functions is probably the trickiest part of ggplot2. As long as we are using geoms with their default statistics, we do not need to worry about stat. But when visualizations go wrong with ggplot2, it is often due to a mismatch between a geom and a stat function.
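
To see what a stat actually computes, we can inspect the data that a layer passes to its geom. The following sketch assumes the layer_data() and after_stat() functions of ggplot2; the seed and the object names p_hist and p_dens are arbitrary choices made for illustration.

```r
library(ggplot2)

set.seed(42)  # assumption: any seed works, it only makes the sketch reproducible
df <- data.frame(x = rnorm(n = 500, mean = 100, sd = 10))

p_hist <- ggplot(df) +
  geom_histogram(aes(x = x), binwidth = 5)

# Inspect some variables computed by stat = "bin":
head(layer_data(p_hist)[, c("x", "xmin", "xmax", "count", "density")])

# Computed variables can be mapped explicitly by after_stat():
p_dens <- ggplot(df) +
  geom_histogram(aes(x = x, y = after_stat(density)), binwidth = 5)

p_dens
```

The first histogram shows the computed count variable on its \(y\)-axis (the default), whereas the second one shows the computed density, so that the areas of all bars sum to 1.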

Studying ?geom_histogram() also reveals additional parameters that we can use to adjust our histogram. Whereas hist() used a breaks parameter to adjust the number of categories, geom_histogram() provides two corresponding parameters (called bins and binwidth). The following examples also specify some colors (by setting col and fill to named R colors):

ggplot(data = df) + 
  geom_histogram(aes(x = x), 
                 bins = 10, col = "black", fill = "deepskyblue")

ggplot(data = df) + 
  geom_histogram(aes(x = x), 
                 binwidth = 10, col = "white", fill = "hotpink")

Note that we specified all additional parameters (numeric values for bins or binwidth, and the color values of col and fill) outside of the aes() function. As we will see, it is sometimes possible to use parameters inside of aes(), but then they are used as variables, rather than as constants (i.e., fixed values).

9.5.2 Bar charts

  1. Simple bar charts: Create a bar chart showing the average flipper length for each species of penguins.

  2. Complex bar charts: Transform the pg data to compute means, SE values, and corresponding confidence intervals for the flipper_length_mm variable, then use geom_bar() and geom_errorbar() to plot these means with confidence intervals.

  • The following function allows computing the standard error (SE) of a variable:
# SE formula:
std_err <- function(x, na.rm = FALSE) {

  # Handle NA values:
  if (na.rm) {

    nr_na <- sum(is.na(x))

    if (nr_na > 0) {

      x <- stats::na.omit(x)

      message(paste0("Removed ", nr_na, " NA values."))
    }
  }

  # Compute SE:
  sqrt(stats::var(x, na.rm = na.rm) / length(x))

} # std_err().

# Check:
std_err(pg$body_mass_g)
#> [1] NA
std_err(pg$body_mass_g, na.rm = TRUE)
#> [1] 43.36473
std_err(pg$flipper_length_mm, na.rm = TRUE)
#> [1] 0.7603704

# Compute summaries for flipper_length_mm of penguins per species:
tb_2 <- pg %>% 
  group_by(species) %>%
  summarise(n = n(),
            mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
            se_flipper_length = std_err(flipper_length_mm, na.rm = TRUE), 
            mn_conf_min = mean_flipper_length - 1.96 * se_flipper_length,
            mn_conf_max = mean_flipper_length + 1.96 * se_flipper_length
            ) 
tb_2
#> # A tibble: 3 × 6
#>   species       n mean_flipper_length se_flipper_length mn_conf_min mn_conf_max
#>   <fct>     <int>               <dbl>             <dbl>       <dbl>       <dbl>
#> 1 Adelie      152                190.             0.532        189.        191.
#> 2 Chinstrap    68                196.             0.865        194.        198.
#> 3 Gentoo      124                217.             0.585        216.        218.

# Bar plots with error bars:
ggplot(tb_2, aes(x = species, y = mean_flipper_length, fill = species)) +
  geom_bar(stat = "identity") + 
  geom_errorbar(aes(ymin = mn_conf_min, ymax = mn_conf_max), width = .50) + 
  scale_fill_manual(values = my_cols) +
  theme_unikn()

ggplot(tb_2, aes(x = species, y = mean_flipper_length, fill = species)) +
  geom_col() +
  geom_errorbar(aes(ymin = mn_conf_min, ymax = mn_conf_max), width = .50) +
  scale_fill_manual(values = my_cols) +
  theme_unikn()

9.5.3 Scatterplots

In Chapter 8, we created a scatterplot for two vectors of numeric values x and y as follows:

# Data:
x <- 11:43
y <- c(sample(5:15), sample(10:20), sample(15:25))

# Scatterplot (of points):
plot(x = x, y = y,
     main = "A positive correlation")
  • Re-create an analogous scatterplot in ggplot2.
  • Create a scatterplot of fuel consumption on the highway (hwy) by engine displacement (displ) for the mpg data (in ggplot2::mpg) in both base R and ggplot2.

Solution

Using ggplot2:

df <- data.frame(x, y)

ggplot(df) + 
  geom_point(aes(x = x, y = y))

A scatterplot from the mpg data: Note that we usually want to add labels and titles, as well as modify other aesthetic features of visualizations.

  • In base R (with a transparent color my_col defined by using the unikn package):
# Define some color (from unikn, with transparency): 
my_col <- unikn::usecol(unikn::Bordeaux, alpha = 1/4)

# With aesthetics (see ?par):
plot(x = mpg$displ, y = mpg$hwy, type = "p", 
     col = my_col, pch = 16, cex = 1.5,
     main = "A basic scatterplot", 
     xlab = "Displacement", ylab = "MPG on highway"
     )
  • Using ggplot2:
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), size = 2, col = my_col) +
  labs(x = "Displacement", y = "MPG on highway", 
       title = "A basic scatterplot") + 
  theme_minimal()

9.5.4 Line plots

In this exercise, you will create line plots for data tracking the development of five trees over time. Create some line plots using the Orange data (from base R’s datasets package):

  1. Inspect the Orange data and extract (or filter) the lines for one tree to plot a line of its circumference by age.

  2. Create an analogous plot that shows the growth of all five trees (as five different lines).

  3. Adjust the line plot of 2. so that it is legible (i.e., its lines are distinguishable) in black-and-white print.

  4. Adjust your line plot (of 2. or 3.) so that an additional box plot shows the average growth of the five trees. (Hint: The group aesthetic for geom_boxplot() will be different from the group aesthetic of geom_line().)

  5. Use the same Orange data to illustrate the related geoms geom_path(), geom_step(), and geom_smooth(). What are their similarities or differences to geom_line()?

Solution

  • ad 1: Inspecting the Orange data:
# Data:
as_tibble(Orange)
#> # A tibble: 35 × 3
#>    Tree    age circumference
#>    <ord> <dbl>         <dbl>
#>  1 1       118            30
#>  2 1       484            58
#>  3 1       664            87
#>  4 1      1004           115
#>  5 1      1231           120
#>  6 1      1372           142
#>  7 1      1582           145
#>  8 2       118            33
#>  9 2       484            69
#> 10 2       664           111
#> # ℹ 25 more rows

# Note: Tree is a factor variable with a strange order of levels:
Orange$Tree
#>  [1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5 5
#> Levels: 3 < 1 < 5 < 2 < 4

# Relevel Tree factor:
Orange$Tree <- factor(Orange$Tree, levels = 1:5)
  • Plot for 1 tree:
# Filter 1 tree (two equivalent ways):
tree_1 <- Orange[Orange$Tree == 1, ]        # base R
tree_1 <- dplyr::filter(Orange, Tree == 1)  # dplyr

# Plot 1 line:
ggplot(tree_1) +
    geom_line(aes(x = age, y = circumference), color = Seeblau, linewidth = 1)
  • ad 2: Plotting lines for all trees is simple and illustrates the power of grouping variables by aesthetics: Map color to the Tree variable and move color = Tree into the aes() function (see Figure 9.24):
# Plot lines (and adjust color scale):
ggplot(Orange) +
    geom_line(aes(x = age, y = circumference, group = Tree, color = Tree), linewidth = 1) +
    geom_point(aes(x = age, y = circumference, group = Tree, color = Tree), size = 2) + 
    scale_color_manual(values = my_cols) + 
    labs(x = "Age (in days)", y = "Circumference (in mm)", color = "Tree:") + 
  theme_bw()

Figure 9.24: A line plot illustrating the growth of Orange trees.

  • ad 3: Figure 9.24 showed the growth of Orange trees as a line plot with different colors for the five trees. We now re-create the line plot of Orange trees so that it is legible in black-and-white print by adjusting aesthetics.
# Possible solutions:

# A1: Adjust aesthetics:
ggplot(Orange) + 
  geom_line(aes(x = age, y = circumference, linetype = Tree, alpha = Tree), 
            color = "black", linewidth = 1.5) +
  scale_x_continuous(limits = c(0, 1600)) + 
  theme_bw()

# A2: Using shape of geom_point() to differentiate lines:
ggplot(Orange) + 
  geom_line(aes(x = age, y = circumference, group = Tree), color = "grey80", linewidth = 1) +
  geom_point(aes(x = age, y = circumference, group = Tree, shape = Tree), size = 3) + 
  scale_x_continuous(limits = c(0, 1600)) + 
  theme_bw()
  • ad 4: Modifying the plot from Task 2 by adding a boxplot (Figure 9.25):
# Plot boxplots first (behind lines and points):
ggplot(Orange) +
    geom_boxplot(aes(x = age, y = circumference, group = age), fill = "grey90") + 
    geom_line(aes(x = age, y = circumference, group = Tree, color = Tree), linewidth = 1) +
    geom_point(aes(x = age, y = circumference, group = Tree, color = Tree), size = 2) + 
    scale_color_manual(values = my_cols) + 
    labs(x = "Age (in days)", y = "Circumference (in mm)", color = "Tree:") + 
    theme_bw()
Figure 9.25: A box and line plot illustrating the growth of Orange trees.

Note that Figure 9.25 plotted the box plot first (i.e., behind the lines and points) and used the aesthetic mapping group = age for geom_boxplot().
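Why the group = age mapping matters can be checked with a minimal sketch. It assumes that ggplot2 is installed and uses layer_data() to count the boxes that each version of the layer computes; the object names (orange_df, p_grouped, p_ungrouped) are only illustrative. Because age is a continuous variable, omitting the grouping should collapse all observations into a single box (and trigger a warning):

```r
library(ggplot2)

# Orange ships with base R; convert to a plain data frame to be safe:
orange_df <- as.data.frame(Orange)

# With group = age: one box per measurement age
p_grouped <- ggplot(orange_df) +
  geom_boxplot(aes(x = age, y = circumference, group = age))

# Without grouping: the continuous x variable defines no groups
p_ungrouped <- ggplot(orange_df) +
  geom_boxplot(aes(x = age, y = circumference))

# layer_data() returns the computed data of a layer (one row per box):
n_grouped   <- nrow(layer_data(p_grouped))
n_ungrouped <- nrow(suppressWarnings(layer_data(p_ungrouped)))

n_grouped    # number of boxes with grouping
n_ungrouped  # number of boxes without grouping
```

The first count should match the number of distinct age values in the data, whereas the second plot contains only one box.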

  • ad 5: Exploring related geoms:

# Possible solutions:

# Illustrate related geoms:
ggplot(Orange) + 
  # geom_smooth(aes(x = age, y = circumference, group = Tree, color = Tree), method = "lm", se = FALSE) + 
  geom_step(aes(x = age, y = circumference, group = Tree, color = Tree)) + 
  geom_line(aes(x = age, y = circumference, group = Tree), color = "grey", linewidth = 2) +
  geom_path(aes(x = age, y = circumference, group = Tree, color = Tree), linewidth = 1) +
  geom_point(aes(x = age, y = circumference, group = Tree, shape = Tree, color = Tree), size = 3) + 
  scale_x_continuous(limits = c(0, 1600)) + 
  scale_color_manual(values = my_cols) + 
  theme_bw()

# Note that geom_line() and geom_path() are identical in this case (but differ when data is in different order). 

# Using geom_smooth() to compute and show trend lines
# (without a method argument, it defaults to loess for small data sets):
ggplot(Orange) + 
  geom_smooth(aes(x = age, y = circumference, group = Tree, color = Tree, fill = Tree), alpha = .10) + 
  geom_point(aes(x = age, y = circumference, group = Tree, color = Tree), size = 3) + 
  scale_x_continuous(limits = c(0, 1600)) + 
  scale_color_manual(values = my_cols) + 
  scale_fill_manual(values = my_cols) + 
  theme_bw()

# Using geom_smooth() for linear approximations:
ggplot(Orange) + 
  geom_smooth(aes(x = age, y = circumference, group = Tree, color = Tree), method = "lm", se = FALSE) + 
  geom_point(aes(x = age, y = circumference, group = Tree, shape = Tree, color = Tree), size = 3) + 
  scale_x_continuous(limits = c(0, 1600)) + 
  scale_color_manual(values = my_cols) + 
  theme_bw()
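The note on geom_line() and geom_path() above can be illustrated by shuffling the rows of the data. The following is a minimal sketch, assuming that ggplot2 is installed; the names orange_df, shuffled, and is_sorted_by_group() are hypothetical helpers introduced only for this illustration. It inspects the data that each layer would draw rather than the plots themselves:

```r
library(ggplot2)

set.seed(1)  # for a reproducible shuffle
orange_df <- as.data.frame(Orange)
shuffled  <- orange_df[sample(nrow(orange_df)), ]  # same data, random row order

p_line <- ggplot(shuffled, aes(x = age, y = circumference, group = Tree)) +
  geom_line()
p_path <- ggplot(shuffled, aes(x = age, y = circumference, group = Tree)) +
  geom_path()

# Hypothetical helper: are the x-values of every group in increasing order?
is_sorted_by_group <- function(d) {
  all(tapply(d$x, d$group, function(x) !is.unsorted(x)))
}

line_sorted <- is_sorted_by_group(layer_data(p_line))  # geom_line() sorts by x
path_sorted <- is_sorted_by_group(layer_data(p_path))  # geom_path() keeps row order

line_sorted
path_sorted
```

Printing p_line and p_path shows the visual consequence: geom_line() still connects the points of each tree from left to right, whereas geom_path() connects them in the order in which they appear in the data.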

Advanced exercises

9.5.5 The rule of 72

In finance, the rule of 72 is a heuristic for estimating the doubling time of an investment. Dividing 72 by the interest rate in percent per period (usually per year) yields the approximate number of periods required to double the initial investment. For instance, at an interest rate of 6% per year, the rule estimates a doubling time of 72/6 = 12 years. (See Wikipedia for details: en | de.)
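The two quantities to be compared can be sketched numerically before plotting them. The sketch assumes that interest is compounded once per period, so that the true doubling time t solves (1 + rate/100)^t = 2; the interest rates and all object names are illustrative choices:

```r
# Interest rates in percent per period (illustrative values):
rate <- c(1, 2, 4, 6, 8, 12)

# True doubling time: solve (1 + rate/100)^t = 2 for t
t_true <- log(2) / log(1 + rate / 100)

# Heuristic estimate by the rule of 72:
t_rule <- 72 / rate

# Collect both in a data frame (rounded for display):
doubling <- data.frame(rate,
                       t_true = round(t_true, 2),
                       t_rule = round(t_rule, 2))
doubling
```

A data frame of this shape (with a finer grid of rates) could serve as the starting point for the line graph.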

  • Create a line graph that compares the true doubling time with the heuristic estimates for a range of (positive) interest rates.

Hints:

  • Consider wrapping your code into a function.
  • Consider creating an interactive application of your function (think Shiny).

9.5.6 Advanced ggplot expressions

The following ggplot() expressions are copied from the documentation of the corresponding geoms. Run the code, inspect the result, and then try to explain how they work:

  1. A facet of histograms:
ggplot(economics_long, aes(value)) +
  facet_wrap(~variable, scales = 'free_x') +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)))
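As a hint for the explanation, the function passed to binwidth can be examined on its own. It corresponds to the Freedman-Diaconis rule (twice the interquartile range divided by the cube root of the number of values). The following sketch assumes that ggplot2 is installed (for the economics_long data) and applies the rule separately to the values of each variable; the name fd_binwidth is a hypothetical helper, not part of ggplot2:

```r
library(ggplot2)  # provides the economics_long data

# The bin width rule from the expression above, as a named helper:
fd_binwidth <- function(x) 2 * IQR(x) / (length(x)^(1/3))

# Apply the rule separately to the values of each variable:
binwidths <- tapply(economics_long$value, economics_long$variable, fd_binwidth)
binwidths
```

The resulting bin widths differ considerably between variables, which suggests why a single numeric binwidth would not suit all facets.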

9.5.7 More exercises

For more exercises on using ggplot2,


  1. For more detailed explanations of the connection between geoms and stats, see the ggplot2 documentation or the online article Demystifying stat layers in ggplot2 (by June Choe, 2020-09-26).

  2. Note the relation between the mathematical and the computational notion of a function: The scatterplot visualization shows a relation by mapping values on some dimension \(x\) to values on some dimension \(y\). If we view the \(x\)-values as inputs and the \(y\)-values as outputs, the underlying function is the relation that transforms the former into the latter.