20 Communication: plot formatting

20.1 Setup

This chunk of R code loads the packages that we will be using.
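
The chunk itself is not shown in this version of the chapter. A plausible version, inferred from the functions used in the examples that follow (so treat the package list as an assumption rather than the original setup), would be:

```r
# Assumed setup: package list inferred from the functions used in this chapter
library(tidyverse)       # ggplot2, dplyr (filter, mutate, case_when), readr (read_csv)
library(palmerpenguins)  # the penguins data used in the axis examples

# {here} is called later as here::here(), so it only needs to be installed,
# not attached with library()
```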

20.2 Introduction

In the chapters Introduction to data visualization and More data visualization, we started to explore some ways to format our plots.

Cole Nussbaumer Knaflic’s book Storytelling with Data: A Data Visualization Guide for Business Professionals is a good introduction to the principles of data visualization, which is a key part of data analytics. In the book, the point is made that data visualization is always in the service of making a point about what the data tell us. In the context of business, this then translates into influencing decisions.

The book lists six principles (see the Storytelling with Data: quick reference guide):

  1. Understand the context

  2. Choose the right type of display

  3. Eliminate clutter

  4. Draw attention to where you want it

  5. Tell a visual story

  6. Practice makes perfect

The Storytelling with Data: quick reference guide also includes the pre-attentive attributes of a visualization, and the Gestalt principles of visual perception:

  • proximity

  • similarity

  • enclosure

  • closure

  • continuity

  • connection

One of the great things about the {ggplot2} package is that it provides virtually infinite ways to make our already-good plots even better. Here are some ways to start to incorporate formatting and design elements that improve our plots.

20.3 Text labels

Text can also be used to annotate a plot. In the example below, the names of selected teams are added to the 2002 season data (the year in which the Moneyball story is set). First, filter for 2002.

The next step is to identify the teams at the extremes of each salary group: the team with the best record and the team with the worst record among the above-average spenders, and the same among the below-average spenders. For this, we will use the mutate() function to add new variables, based on the other values, using the case_when() function. The code also assigns each team to a quadrant, although only the salary group is used to pick the teams to label.

# mlb_pay_wl is read from "data/mlb_pay_wl.csv" (see the Annotations section)
mlb_2002 <- mlb_pay_wl |> 
  filter(year_num == 2002)

# add salary group
mlb_2002 <- mlb_2002 |> 
  mutate(salary_grp = case_when(
    pay_index >= 100 ~ "above",
    pay_index < 100 ~ "below"
  ))


# add quadrant
mlb_2002 <- mlb_2002 |> 
  mutate(team_quad = case_when(
    pay_index >= 100 & w_l_percent >= 0.5 ~ "I",
    pay_index < 100 & w_l_percent >= 0.5 ~ "II",
    pay_index < 100 & w_l_percent < 0.5 ~ "III",
    pay_index >= 100 & w_l_percent < 0.5 ~ "IV"
  ))

team_for_label <- mlb_2002 |> 
  group_by(salary_grp) |> 
  filter(w_l_percent == max(w_l_percent) |
         w_l_percent == min(w_l_percent))

team_for_label
## # A tibble: 5 × 10
## # Groups:   salary_grp [2]
##   year_num tm    attend_g est_payroll pay_index     w     l w_l_percent salary_grp team_quad
##      <dbl> <chr>    <dbl>       <dbl>     <dbl> <dbl> <dbl>       <dbl> <chr>      <chr>    
## 1     2002 CHC      33248    75690833     112.     67    95       0.414 above      IV       
## 2     2002 DET      18795    55048000      81.4    55   106       0.342 below      III      
## 3     2002 NYY      43323   125928583     186.    103    58       0.64  above      I        
## 4     2002 OAK      26788    40004167      59.2   103    59       0.636 below      II       
## 5     2002 TBD      13157    34380000      50.8    55   106       0.342 below      III

Note that we end up with five teams in the list, since Detroit (“DET”) and Tampa Bay (“TBD”) ended up with identical win-loss records.
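
If only one label per extreme is wanted, ties need to be handled explicitly. The sketch below is one possible approach, not the method used in this chapter: it uses slice_min() from {dplyr}, whose with_ties argument controls whether tied rows are kept. The values are copied from the table printed above.

```r
library(dplyr)

# The five labelled teams, copied from the table printed above
team_for_label <- tibble(
  tm          = c("CHC", "DET", "NYY", "OAK", "TBD"),
  w_l_percent = c(0.414, 0.342, 0.640, 0.636, 0.342),
  salary_grp  = c("above", "below", "above", "below", "below")
)

# Default behaviour (with_ties = TRUE): both tied teams are kept
lowest_with_ties <- team_for_label |>
  group_by(salary_grp) |>
  slice_min(w_l_percent, n = 1) |>
  ungroup()

# with_ties = FALSE keeps only the first matching row in each group
lowest_no_ties <- team_for_label |>
  group_by(salary_grp) |>
  slice_min(w_l_percent, n = 1, with_ties = FALSE) |>
  ungroup()

nrow(lowest_with_ties)  # 3: CHC, plus DET and TBD tied
nrow(lowest_no_ties)    # 2: one team per salary group
```

Which tied team survives with with_ties = FALSE depends on row order, so for a published plot it is probably better to choose deliberately.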

We can now use the team names from that table as annotations, using geom_text().

# the same plot as before, but with just the 2002 teams
ggplot(mlb_2002, aes(x = pay_index, y = w_l_percent)) + 
  geom_point() +
  # add the names from the "team_for_label" table
  geom_text(data = team_for_label, aes(label = tm))

This isn’t entirely satisfactory, since the labels sit on top of the points. In the version below, geom_label() is used instead, along with the nudge_x argument to move each label slightly to the left of its point (that is, -6 units on the x-axis; and yes, I experimented a bit to find the right nudge!).

# the same plot as before, but with just the 2002 teams
ggplot(mlb_2002, aes(x = pay_index, y = w_l_percent)) + 
  geom_point() +
  # add horizontal and vertical lines
  geom_vline(xintercept  = 100, colour = "grey") +
  geom_hline(yintercept = 0.5, colour = "grey") +
  # add the names from the "team_for_label" table
  geom_label(data = team_for_label, aes(label = tm),
             nudge_x = -6)

Another approach would be to omit the points altogether, and have the team abbreviations represent the location of each team on the plot:

ggplot(mlb_2002, aes(x = pay_index, y = w_l_percent)) + 
  # add horizontal and vertical lines
  geom_vline(xintercept  = 100, colour = "grey") +
  geom_hline(yintercept = 0.5, colour = "grey") +
  # plot the team names 
  geom_text(aes(label = tm))

For another example of this sort of labeling, see Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, R for Data Science (2nd ed.), “Graphics for Communication: Annotations”.

20.4 Annotations

The scatterplot we made in [moneyball] is interesting in and of itself. But with some annotations, some of the details can be made explicit.

One way to do that is to add lines that divide the plot into sections. In the example in the previous chapter, vertical and horizontal lines were added by using geom_vline() and geom_hline().

As you will recall, the red line runs vertically at the “100” point on the X axis, which marks the league-average payroll. The teams to the left of the line spent below the league average for that season; the teams to the right spent more. As you can see, there have been cases when some teams spent twice as much as the league average. The blue line runs horizontally at the “0.5” point on the Y axis. Above this line, the teams won more games than they lost. Below the line, they lost more games than they won.

mlb_pay_wl <- read_csv("data/mlb_pay_wl.csv", 
                       col_types = 
                         cols(year_num = col_character()))

# create plot object for repeated use
moneyball_plot <- ggplot(mlb_pay_wl, aes(x = pay_index, y = w_l_percent)) + 
  geom_point() 

moneyball_plot +
  # linewidth sets line thickness in ggplot2 3.4.0 and later (older versions use size)
  geom_vline(xintercept = 100, colour = "red", linewidth = 2) +
  geom_hline(yintercept = 0.5, colour = "blue", linewidth = 2)

In the chart above, the vertical and horizontal lines are plotted on top of the points, obscuring some from view. Here are a couple of approaches that resolve that.

First, it’s useful to understand that the layers are plotted in order…so if we plot the lines first and then the points, the points will be in front of the lines.

ggplot(mlb_pay_wl, aes(x = pay_index, y = w_l_percent)) + 
  # plot the lines first
  geom_vline(xintercept = 100, colour = "red", linewidth = 2) +
  geom_hline(yintercept = 0.5, colour = "blue", linewidth = 2) +
  # then the points
  geom_point() 

A second approach is to use the alpha parameter when describing the lines. “Alpha refers to the opacity of a geom. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.” (From the {ggplot2} reference on “Colour related aesthetics: colour, fill, and alpha”)

For our plot, we will set alpha = 0.5 for both lines.

moneyball_plot +
  geom_vline(xintercept = 100, colour = "red", linewidth = 2, alpha = 0.5) +
  geom_hline(yintercept = 0.5, colour = "blue", linewidth = 2, alpha = 0.5)

Another option would be to add a block of shading for the “Moneyball” teams that spent below the league average but had winning seasons. For this, annotate("rect") is used. The fill colour is the hex code for the specific shade of green used by the Oakland Athletics, as found at usteamcolors.com. (See R Graphics Cookbook, 2nd ed. for applications of this in a line plot.)

moneyball_plot +
  annotate("rect", xmin = 0, xmax = 100, ymin = 0.5, ymax = 0.8,
           alpha = .3, fill = "#003831")

To that plot we might want to add a description of the quadrant that is shaded.

Note that in the label, \n (backslash-n) is used to force a line break in the text string.

# as previously plotted
moneyball_plot +
  annotate("rect", xmin = 0, xmax = 100, ymin = 0.5, ymax = 0.8,
           alpha = .3, fill = "#003831") +
  # text annotation
  annotate("text",             # annotation type
           label = "Teams with\nbelow-average payroll\n& winning records", 
           x = 5, y = 0.75,    # location of annotation
           hjust = 0,
           fontface = "bold",  # text formatting
           colour = "#ffffff") 

20.5 Colour

Another way to tell the “Moneyball” story would be to identify the Oakland Athletics in the mass of dots shown, by using colour. The code is in two steps:

  • first, create a subset of the main table that contains only the Oakland data points

  • second, redraw the plot, adding a new geom_point() layer. Note that the colour used is the yellow-gold in the Athletics’ colour scheme, and the size of the points is specified to be slightly larger than the others.

oakland <- mlb_pay_wl |> 
  filter(tm == "OAK")

moneyball_plot +
  geom_point(data = oakland, aes(x = pay_index, y = w_l_percent),
             colour = "#efb21e", size = 2)

We might want to make other changes in this plot before including it in our publication.

In this version, the plot has a number of variations:

  • a light theme, using theme_bw()

  • pale gray points for all of the team points

  • green points for Oakland, using a named colour (“seagreen”)

  • adding the vertical and horizontal lines, but gray and lighter than the default

# create new version of plot with 
# gray points on white background
ggplot(mlb_pay_wl, aes(x = pay_index, y = w_l_percent)) + 
  geom_point(colour = "gray75") +
  theme_bw() +
  # Oakland points
  geom_point(data = oakland, aes(x = pay_index, y = w_l_percent),
             colour = "seagreen", size = 2) +
  geom_vline(xintercept = 100, colour = "grey50", linewidth = 0.75) +
  geom_hline(yintercept = 0.5, colour = "grey50", linewidth = 0.75)

20.5.1 Colour palettes

There are many pre-defined colour palettes available when plotting in {ggplot2}.

(The {ggplot2} reference page that covers this is under “Scales”: https://ggplot2.tidyverse.org/reference/index.html#section-scales)

An important thing to recognize (and this took me a long time to figure out!) is that how you specify the palette needs to match the type of variable that’s being represented by the colour. In general, they are either discrete (categories, such as factors or character strings) or continuous (a range of numbers).

First, let’s add “year_num” as a colour variable to a subset of our Moneyball plot.

Important: In this example, “year_num” is specified as a “character” variable, so it is a discrete variable. But beware: if it had been read as a numeric variable, it would be continuous!

mlb_pay_wl |> 
  filter(year_num > "2014") |> 
  ggplot(aes(x = pay_index, y = w_l_percent, color = year_num)) + 
  geom_point()
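
The mismatch described above can be checked directly. The sketch below uses a small made-up data frame (not the MLB data) and a hypothetical helper, builds_ok(), to show that a discrete ColorBrewer scale fails on a numeric year but works once the year is a character string.

```r
library(ggplot2)

# Small illustrative data frame (made-up values, not the MLB data)
demo <- data.frame(
  x        = c(1, 2, 3, 4, 5, 6),
  y        = c(2, 4, 3, 6, 5, 7),
  year_num = c(2015, 2015, 2016, 2016, 2017, 2017)  # numeric, so continuous
)

# Hypothetical helper: TRUE if the plot can be built without an error
builds_ok <- function(p) {
  tryCatch({
    ggplot_build(p)
    TRUE
  }, error = function(e) FALSE)
}

# Numeric year with a discrete ColorBrewer scale: building the plot fails
numeric_with_brewer <- builds_ok(
  ggplot(demo, aes(x, y, colour = year_num)) +
    geom_point() +
    scale_colour_brewer(palette = "Set1")
)

# Character year with the same scale: building the plot works
character_with_brewer <- builds_ok(
  ggplot(demo, aes(x, y, colour = as.character(year_num))) +
    geom_point() +
    scale_colour_brewer(palette = "Set1")
)

numeric_with_brewer    # FALSE
character_with_brewer  # TRUE
```

A quick is.numeric() or is.character() check on the column before choosing a scale function avoids the error altogether.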

We can change the default palette in a number of ways. Below, we use one of the “ColorBrewer” palettes. These palettes are designed for discrete scales on maps, but they translate well to data plotting. There are palettes for sequential, diverging, and qualitative colour scales.

A valuable reference is at R Graphics Cookbook, 2nd ed., “Using a Different Palette for a Discrete Variable”, which includes the names and images of a variety of palettes.

(See also http://colorbrewer2.org.)

And for even more information about the use of the ColorBrewer scales, see this page at the {ggplot2} reference: https://ggplot2.tidyverse.org/reference/scale_brewer.html

mlb_pay_wl |> 
  filter(year_num > "2014") |> 
  ggplot(aes(x = pay_index, y = w_l_percent, colour = year_num)) + 
  geom_point() +
  # add ColorBrewer palette
  scale_colour_brewer(palette = "Set1")

In the plots below, the average attendance is plotted as a colour aesthetic. Average attendance is a continuous variable, so we need to make sure our palette can represent that type.

The “ColorBrewer” palettes, while designed for discrete scales, can be adapted to a continuous scale (as we have here) by using one of the distiller scales, which interpolate the values between the discrete colours in the original scale.

ggplot(mlb_pay_wl, aes(x = pay_index, y = w_l_percent, color = attend_g)) + 
  geom_point() +
  scale_color_distiller(palette = "Greens")

Another option is to use one of the viridis palettes. Note that in this case, the default scale has the largest value plotted as the lightest colour, which seems counter-intuitive to me, so I have added the direction = -1 argument to the scale_color_viridis_c() function.

ggplot(mlb_pay_wl, aes(x = pay_index, y = w_l_percent, color = attend_g)) + 
  geom_point() +
  scale_color_viridis_c(direction = -1)

For more information about the viridis palettes, see R Graphics Cookbook, 2nd ed., “Using Colors in Plots: Using a Colorblind-Friendly Palette”.

20.6 Axis modification

Fairly often when we are creating plots, we will want to modify the axes, and related to that, the grid lines.

For this example, we will use the penguins data table from the {palmerpenguins} package. This plot is a step-by-step recreation of the “Flipper length vs. bill length” plot at the {palmerpenguins} reference page, with a couple of small tweaks.

First, we load the package and take a quick peek at the data. For this plot, the length of the bird’s flipper will be on the X axis, and the length of the bill on the Y.
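
The code for this step is not shown; a minimal version that would produce the printed table that follows, assuming {palmerpenguins} is installed, is:

```r
library(palmerpenguins)

# print the first rows of the penguins tibble
penguins
```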

## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
##  2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
##  3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
##  4 Adelie  Torgersen           NA            NA                  NA          NA <NA>    2007
##  5 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
##  6 Adelie  Torgersen           39.3          20.6               190        3650 male    2007
##  7 Adelie  Torgersen           38.9          17.8               181        3625 female  2007
##  8 Adelie  Torgersen           39.2          19.6               195        4675 male    2007
##  9 Adelie  Torgersen           34.1          18.1               193        3475 <NA>    2007
## 10 Adelie  Torgersen           42            20.2               190        4250 <NA>    2007
## # ℹ 334 more rows

20.6.1 Setup

We will plot the penguins’ flipper length on the X axis, and the bill length on the Y. We will also differentiate the three species of penguin using colour and point shape. We will make the points bigger than the default with size = 3 and slightly transparent with alpha = 0.8 (the default alpha is 1, and 0 is fully transparent).

Note that for this example, the theme is set to theme_minimal(). In the {palmerpenguins} package example, the function ggplot2::theme_set(ggplot2::theme_minimal()) is at the top of the page, and all the plots created after that have that theme applied. This is a useful technique if you are making multiple plots in a document, because it allows you to apply a consistent theme to all of them.
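
As a minimal sketch of that technique (using the built-in mtcars data purely for illustration), theme_set() returns the previous default theme, which makes it easy to restore later:

```r
library(ggplot2)

# theme_set() returns the previous default (invisibly), so it can be restored
old_theme <- theme_set(theme_minimal())

# every plot drawn from here on uses theme_minimal() unless overridden
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# restore the earlier default when it is no longer wanted
theme_set(old_theme)
```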

flipper_bill <- ggplot(data = penguins,
                         aes(x = flipper_length_mm,
                             y = bill_length_mm)) +
  geom_point(aes(color = species, 
                 shape = species),
             size = 3,
             alpha = 0.8) +
  theme_minimal()

flipper_bill

# variation: black panel background
flipper_bill +
  theme(
    panel.background = element_rect(fill = "black")
  )

# variation: black panel background, minor grid lines removed
flipper_bill +
  theme(
    panel.background = element_rect(fill = "black"),
    panel.grid.minor = element_blank()
  )

# variation: black panel background, orange grid lines
flipper_bill +
  theme(
    panel.background = element_rect(fill = "black"),
    panel.grid = element_line(colour = "orange")
  )

For the next step we will change the colour palette, using scale_colour_manual().

Remember that in the scale functions, the type of data matters. Here the colour represents species, a categorical (discrete) variable, so a discrete scale such as scale_colour_manual() is the right choice; a continuous variable would need a continuous scale function instead.

flipper_bill_2 <- flipper_bill +
  scale_color_manual(values = c("darkorange","purple","cyan4")) 

flipper_bill_2

To that chart we will add a variety of titles with the labs() function.

In addition to the chart title and subtitle, the function also adds a caption at the bottom with the data source, and changes the text that appears on the axis and legend labels.

flipper_bill_3 <- flipper_bill_2 + 
  labs(title = "Flipper and bill length",
       subtitle = "Dimensions for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
       caption = "Source: {palmerpenguins} R package",
       x = "Flipper length (mm)",
       y = "Bill length (mm)",
       color = "Penguin species",
       shape = "Penguin species") 

flipper_bill_3

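In this step, the legend is moved inside the plot with the legend.position = argument, which gives the coordinates where the legend will be anchored (by default, the centre of the legend is placed at that point). These coordinates are fractions of the plotting area, from 0 to 1: “0.85” is 85% of the way along the X axis, and “0.15” is 15% of the way up the Y axis. (From {ggplot2} 3.5.0 onward, a numeric legend.position is deprecated in favour of legend.position = "inside" together with legend.position.inside = c(0.85, 0.15).)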

flipper_bill_4 <- flipper_bill_3 +
  theme(legend.position = c(0.85, 0.15))

flipper_bill_4

20.6.2 Setting the range

One way to set the range (or limits) of an axis is to use the xlim() and ylim() functions, and specify the values we want for the start and end points.

Note that in this example, the maximum value of the Y axis uses the max() function to get the largest value of the variable. Because bill_length_mm contains missing values, na.rm = TRUE is needed; without it, max() returns NA and the upper limit falls back to the default.

flipper_bill_4 +
  # set the range of the X axis from 100 to 250
  xlim(100, 250) + 
  # set the range of the Y axis from 0 to the maximum value
  ylim(0, max(penguins$bill_length_mm, na.rm = TRUE))
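
A general point about max() and missing values is worth seeing in isolation. The values below are the first five bill lengths from the table printed earlier, where row 4 is missing:

```r
# first five bill lengths from the penguins table (row 4 is NA)
bill_length_mm <- c(39.1, 39.5, 40.3, NA, 36.7)

max(bill_length_mm)                # NA
max(bill_length_mm, na.rm = TRUE)  # 40.3
```

In xlim() and ylim(), an NA limit means “use the default for this end of the axis”, so a plot will still be drawn, but the limit is not the one the max() call appears to set.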

We have much more control over the axes if we use the appropriate scale_() function.

In this plot, both of our axes are continuous variables, so we need scale_y_continuous() (as opposed to scale_y_discrete()). In this example, we will modify only the Y axis.

  • The limits = argument sets the limits in the same way as ylim().

  • The breaks = sets the points for the major gridlines and axis numbering. In this example, the values are stated explicitly.

  • The minor_breaks = sets the minor gridlines. Note the use of the seq() function to create a sequence that runs from 0 to 60 in steps of 5.
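
For reference, this is the sequence that seq() generates for the minor breaks:

```r
minor <- seq(0, 60, 5)  # same as seq(from = 0, to = 60, by = 5)
minor                   # 0, 5, 10, and so on up to 60
length(minor)           # 13
```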

flipper_bill_4 +
  scale_y_continuous(
    limits = c(0, 75),
    breaks = c(0, 30, 40, 55, 60),
    minor_breaks = seq(0, 60, 5)
  ) 

To remove the minor breaks, we can set minor_breaks = NULL.

flipper_bill_4 +
  scale_y_continuous(
    limits = c(0, 75),
    breaks = c(0, 30, 40, 55, 60),
    minor_breaks = NULL
  )

The scale_() functions have even more options, including adding things like commas to your big numbers or percentage symbols, all using the labels = argument. See this example: https://ggplot2.tidyverse.org/articles/faq-axes.html
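
As a brief sketch of the labels = argument, the formatter functions in the {scales} package (installed alongside {ggplot2}) can be passed straight to a scale function. The three teams and their values are copied from the 2002 table earlier in the chapter; the object names are just for illustration.

```r
library(ggplot2)

# formatter functions from {scales}
comma_fmt   <- scales::label_comma()
percent_fmt <- scales::label_percent()

comma_fmt(125928583)  # "125,928,583"
percent_fmt(0.5)      # "50%"

# three 2002 teams, values copied from the table earlier in the chapter
teams <- data.frame(
  tm          = c("NYY", "OAK", "DET"),
  est_payroll = c(125928583, 40004167, 55048000),
  w_l_percent = c(0.640, 0.636, 0.342)
)

payroll_plot <- ggplot(teams, aes(x = est_payroll, y = w_l_percent)) +
  geom_point() +
  # commas in the payroll values, percent signs on the win-loss axis
  scale_x_continuous(labels = scales::label_comma()) +
  scale_y_continuous(labels = scales::label_percent())

payroll_plot
```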

20.6.3 Transform the scales of the axis

When we have scales with both small and large numbers, we may wish to incorporate a transformation of the values in our plots.

One type of transformation is a logarithmic scale, which is useful if you are dealing with a scale where there is a very large range of numbers. You can see this in the Hans Rosling Gapminder presentation “200 Countries, 200 Years”, where the equally spaced breaks in the income variable (on the X axis) are shown as 400, 4,000, and 40,000.

Here we will read a data file containing the population centres in British Columbia.

(Source: Wikipedia, “List of population centres in British Columbia” via Statistics Canada, Census of Population)

population_centres_bc <- read_csv(here::here("data", "population_centres_bc_2021.csv"))
ggplot(population_centres_bc, aes(x = population_2021, y = population_density_2021)) +
  geom_point()

Adding scale_x_log10() transforms how the values on the X axis are displayed. Now the distance between 1,000 (the smallest community) and 10,000 is the same as the distance between 100,000 and 1,000,000.

ggplot(population_centres_bc, aes(x = population_2021, y = population_density_2021)) +
  geom_point() +
  scale_x_log10()
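
The equal spacing follows from the arithmetic: on a log10 scale, position depends on log10(value), so each tenfold increase covers the same distance.

```r
# each tenfold step is one unit apart on a log10 scale
log10(10000)   - log10(1000)    # 1
log10(1000000) - log10(100000)  # 1
```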

For more information about modifying the axes, see the ggplot FAQ: https://ggplot2.tidyverse.org/articles/faq-axes.html


20.7 Reading & reference

Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, R for Data Science

{ggplot2} reference site

20.7.1 {ggplot2} extensions

Depending on the type of plot you want to make, {ggplot2} by itself might not have the functionality to do everything. Developers around the world have been creating extensions that add even more functions that integrate with the plot created using the ggplot() function.

A full list of {ggplot2} extensions can be found here: https://exts.ggplot2.tidyverse.org/

The {ggforce} package is one that adds some powerful extensions, notably in the area of annotations:

20.7.2 Animated plots

If you are interested in exploring animated plots (similar to Hans Rosling’s gapminder visualizations), here are some resources for using the {gganimate} package:

20.7.3 More ggplot2 plotting resources

20.7.4 More general plotting resources

Note: this exercise does not delve into the question of how to design your plot. The structure, use of colour, annotations, and other plot elements can significantly improve the impact of a plot. See Kieran Healy’s book above, as well as

20.7.5 Data visualization books

Unfortunately these don’t have a free online version, but they are worth finding at your local library or bookstore:

  • Stephanie Evergreen, Presenting Data Effectively: Communicating Your Findings for Maximum Impact (Evergreen 2014)

  • Scott Berinato, Good Charts: The HBR Guide to Making Smarter, More Persuasive Data Visualizations (Berinato 2023)
