Applied Biostats 2024: Chapter 11: Better Figures in R

Motivating scenarios: You know what figure you want to make but need the R skills to go beyond the basics.

How to Implement Best Practices and Create Engaging, Attractive Plots in R

You’ve already been introduced to the basics of ggplot and explored the key elements that make for effective figures. But at this point, you might be feeling a bit frustrated. You know how to generate plots in R, and you understand what makes a plot good, yet creating polished, impactful visuals in R still seems challenging. My advice is twofold:

Now let’s get started on creating both informative and visually appealing figures in R!

Review: What Makes a Good Plot

We’ll explore these concepts below, with a particular focus on creating honest, transparent, and clear plots, as this is where R offers the most opportunities for customization.

Combining Plots to Tell a Story

Most analyses require several plots to fully convey the story of your data. Some of these plots may be distributed throughout a document, while others will be combined into multi-panel figures. Here, we focus on creating multi-panel figures; later we will consider how to maintain consistency in visual metaphors across your figures (e.g., colors, shapes, and symbols should represent the same variables or categories across different plots).

There are several ways to make multi-panel figures in ggplot2, but I usually use the patchwork package. The code below shows how we can make a three-panel figure, with two narrow panels on top and one wider panel below, relatively easily. To learn more about patchwork, check out its extensive documentation. Remember that if you can’t do it in patchwork, other packages, such as cowplot and gridExtra, can also make multi-panel figures, and that each of these packages has its own strengths and weaknesses.

# If you have not installed patchwork, you will need to!
library(patchwork)  # After patchwork is installed you still need to load it with the library function  
library(ggplot2)
library(palmerpenguins)

p1 <- ggplot(penguins, aes(x = bill_length_mm, fill = species)) +  geom_density(alpha = .4, show.legend = FALSE)
p2 <- ggplot(penguins, aes(x = species, y= bill_length_mm, color = species))+geom_boxplot(show.legend = FALSE) 
p3 <- ggplot(penguins, aes(x = bill_depth_mm, y= bill_length_mm, color = species))+geom_point()
(p1 + p2)/p3 + plot_annotation(tag_levels = 'A') + plot_layout(heights = c(1,2))
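
For comparison, here is a minimal sketch of the same two-over-one layout built with cowplot instead of patchwork. It assumes cowplot is installed and uses the built-in mtcars data as a stand-in so the block runs on its own; the object names (top_left, top_right, bottom, two_over_one) are hypothetical.

```r
library(ggplot2)
library(cowplot)

# Stand-in plots built from the built-in mtcars data (not the penguins data above)
top_left  <- ggplot(mtcars, aes(x = mpg)) + geom_density()
top_right <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
bottom    <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Build the top row first, then stack it above the wider bottom panel
top_row <- plot_grid(top_left, top_right, labels = c("A", "B"), nrow = 1)
two_over_one <- plot_grid(top_row, bottom, labels = c("", "C"),
                          ncol = 1, rel_heights = c(1, 2))
two_over_one
```

With cowplot the nesting is explicit: each call to plot_grid() produces one row or column, which some people find easier to reason about for irregular layouts.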

Making plots for the medium

Most results are presented in press, in speech, in a poster or online. Most of this chapter and course focuses on presentation in press, but here I have a few resources for alternative modes of presentations.

Digital documents

Digital formats open up opportunities to engage readers with interactive graphs and animations, making data visualization more dynamic and accessible. Here are some powerful tools to consider:

library(highcharter)

hchart(penguins, "scatter", hcaes(x = flipper_length_mm, 
                                  y = bill_length_mm, 
                                  group = species, 
                                  color = species))

Posters

Poster sessions often take place in large halls filled with hundreds of competing posters. Fun and playful plots are especially effective at capturing attention as people move around in this busy setting. Isotype plots, where images represent quantities or categories instead of traditional points, bars, or lines, can help get your poster that attention. For instance, Figure 3 uses the ggtextures package to replace bars with pictures of penguins of each species to depict the distribution of species across three islands.

library(ggtextures)
library(dplyr)  # provides %>%, mutate(), case_when(), group_by(), and tally()

# Add image paths for each penguin species
penguins <- penguins %>% 
  mutate(image = case_when(species == "Adelie"    ~ "images/adelie.jpeg",
                            species == "Chinstrap" ~ "images/chinstrap.jpeg",
                            species == "Gentoo"    ~ "images/gentoo.jpeg"))

penguins%>% group_by(species, island,image) %>% tally()%>%ungroup()%>%
  ggplot(aes(species, n, image = image)) +
  geom_isotype_col(img_width = unit(2, "cm"), img_height = unit(2, "cm")) +
  facet_wrap(~island, labeller = "label_both")+
  geom_hline(yintercept = seq(0,150,25), color = "lightgrey", lty = 2)+
  theme(panel.background = element_rect(fill = "white", size = 0.5, linetype = "solid",color = "black"))

Scientific talks

For talks: When presenting your work in a talk, you can use transitions to gradually introduce elements of your figures, guiding the audience through your findings step by step. This allows you to control how the plot is revealed, creating a dynamic narrative as you walk through the data. While this approach is a great benefit in a talk, there are challenges—namely, the audience may be seated far from the screen, making it difficult to see small details in your plot. To address this, keep your take-home message clear and simple, using large fonts and bold points to ensure readability. You’ll likely need to adjust the theme() function in R to increase the text, label, and point sizes for better visibility. I discuss how to handle these common challenges in the next section.

Adjusting text size

For different presentation modes—such as posters or talks—you’ll need to adjust the text size accordingly. Text should be much larger for posters and talks to ensure readability from a distance. Mastering the theme() function in ggplot2 allows you to fully customize your plots, including text size. Most importantly, you can adjust the size of text using the element_text(size = ...) function, which is applied after specifying the plot element you want to modify (see Fig. 4).

ggplot(penguins, aes(x = species, y = bill_length_mm, color = species))  +  
  geom_point()+
  theme(axis.title.x = element_text(size = 20, color = "orange"),
        axis.text.x  = element_text(size = 15),
        axis.text.y  = element_text(size = 15, color = "firebrick"),
        legend.text  = element_text(color = "purple"),
        legend.title = element_text(size = 30, color = "gold"))

Figure 4: Changing font size by specifying element_text() in theme(). The colors are very bad, and are meant to help you connect our R to the output - you should stay away from doing silly things to your font color.
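
If the goal is simply to scale up all of the text at once for a talk or poster, one shortcut is the base_size argument accepted by ggplot2’s complete themes. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in, and the value 24 is an arbitrary assumption that you should tune for your own room and slides.

```r
library(ggplot2)

# Built-in mtcars data used as a stand-in so the example is self-contained
talk_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 4) +         # larger points for viewing from a distance
  theme_gray(base_size = 24)     # scales titles, axis text, and legend text together

talk_plot
```

Individual elements can still be overridden afterwards by adding a theme() call with element_text(size = ...), as in the example above.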

While there are quick tricks for using the theme() function, it can be quite complex and sometimes frustrating to work with. I highly recommend the ggThemeAssist package, which provides a graphical user interface (GUI) that allows you to point and click your way to the desired figure. It then generates the corresponding R code for you (see Fig. 5). Personally, I’ve become much more proficient with the theme() function over the past few years, thanks to the insights I gained from using ggThemeAssist, as it helped me understand which arguments control which parts of the plot. Of course, now ChatGPT would also be very helpful!

To use ggThemeAssist:

Install and load the package.
Create a ggplot in your R script.
Select ggplot Theme Assistant from the add-ins drop-down menu.
A GUI will appear. Point and click your way through the options, and the corresponding R code will be inserted into your script.

Making Honest, Transparent, Clear, Accessible, and Fun plots

So how do we translate these ideas into R plots? Check out these tips and tricks below!!!

Making and critiquing plots is one of my favorite parts of science — I absolutely love it! However, I know it can be a major time sink, and we want to avoid that. Here are my tips for preventing yourself from getting bogged down by every figure:

Know your goal: Determine whether you’re creating an exploratory or explanatory figure. Don’t waste time perfecting an exploratory plot — it is meant for quick insights, not for publication.
Standardize your process: Develop a few go-to themes and color schemes that you use frequently. Save and reuse these templates so you can produce attractive plots without customizing each one from scratch.
Master the basics: Get comfortable with the most common tasks you’ll perform in ggplot2. Keep the ggplot cheat sheet handy, and bookmark additional resources that suit your workflow. A quick Google search can also be a lifesaver!
Premature optimization is the root of all evil: Save detailed customizations (e.g., annotations, special formatting) for last. This way, you can focus on the essential elements of the plot first without getting bogged down in complex code prematurely.
Get help: Reach out to friends, use Google, consult books, or turn to ChatGPT and other resources to solve problems quickly. Remember, the more specific your question, the better the help you’ll receive!
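
The "standardize your process" tip can be sketched as a small helper that bundles a go-to theme and color scale. The function name theme_my_talk() and the specific choices inside it are hypothetical; they are meant only to illustrate saving and reusing a template, not to recommend these particular settings.

```r
library(ggplot2)

# Hypothetical helper: returns a list of ggplot components that can be added with `+`
theme_my_talk <- function(base_size = 18) {
  list(
    theme_light(base_size = base_size),
    theme(legend.position = "bottom"),
    scale_color_viridis_d()
  )
}

# Reuse the same template on any plot that maps a discrete variable to color
templated_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  theme_my_talk()

templated_plot
```

Keeping a helper like this in a script you source at the top of each analysis means most plots start out presentable with one extra line.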

Making honest figures

The most common ways figures deceive are inappropriate limits, nonlinear (or nonsensical) axis scales, and altered data. Here I show how you can use ggplot to prevent these deceptions.

library(readr)  # provides read_csv()

daphnia_resist <- read_csv("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter15/chap15q17DaphniaResistance.csv")

Set your limits

When working with data on proportions, such as Daphnia resistance to cyanobacteria, it’s often best to present the data on a scale from zero to one, since proportions naturally range between these values. However, a basic plot created with geom_point() only spans the actual range of the data, which may not always include the full 0 to 1 scale. To address this, you can use the limits argument within the scale_y_continuous() function to explicitly set the y-axis to range from 0 to 1. Figure 6 contrasts these two choices.

library(patchwork)
library(ggplot2)
library(readr)

no_set_limit <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance))  + 
  geom_point() +
  labs(title = "ggplot's default limits")

limit_0_1 <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance))  + 
  geom_point() +
  scale_y_continuous(limits = c(0,1))+
  labs(title = "Adding a limit from zero to one")

no_set_limit + limit_0_1 + plot_annotation(tag_levels = 'A')

Figure 6: Set x or y limits in the scale_x_continuous() or scale_y_continuous() functions. Note there are other ways to set limits, but this is the way I do it.
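
One of the other ways to set limits is coord_cartesian(). The sketch below, which uses a small made-up data frame (toy is hypothetical), contrasts the two approaches: limits set in a scale treat observations outside the range as missing, whereas coord_cartesian() only zooms the view and leaves the data untouched.

```r
library(ggplot2)

# Made-up data with one value above 1, purely for illustration
toy <- data.frame(group = c("a", "a", "b", "b"),
                  value = c(0.2, 0.4, 0.6, 1.5))

# Scale limits: the 1.5 is treated as missing and dropped (with a warning)
scale_version <- ggplot(toy, aes(x = group, y = value)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 1))

# Coordinate limits: the 1.5 stays in the data but falls outside the visible window
coord_version <- ggplot(toy, aes(x = group, y = value)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 1))
```

For proportions that truly lie between 0 and 1, the two give the same picture. The difference matters once you add summaries such as stat_summary(), because those are computed after the scale has discarded out-of-range values.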

Ordering Ordinals

In the Daphnia dataset, the explanatory variable on the x-axis represents cyanobacteria density as “high,” “low,” or “medium.” The natural order here is “low”, “medium”, “high”, (or maybe “high”,“medium”, “low”). By default, R does neither of these things and rather presents things in alphabetical order. This can confuse readers and potentially mislead them if they don’t examine the axis carefully.

The forcats package allows us to modify the order R assigns to categorical variables. For example, the fct_relevel() function enables us to reorder factor levels by specifying the desired sequence. As shown below, using fct_relevel() inside the mutate() function lets us present trends more clearly and intuitively (Fig. 7).

library(patchwork)
library(ggplot2)
library(readr)
library(forcats)
library(dplyr)  # provides %>% and mutate(), used below

# Create a plot using the default order of the categorical variable (cyandensity) in R
default_order <- limit_0_1 +  
  stat_summary(aes(group = 1), geom = "line", fun = "mean", lty = 2) +  # Adds a dashed line (lty = 2) connecting the mean values
  labs(title = "x is in R's default order")  # Title indicating that the x-axis follows R's default alphabetical ordering

# Reorder the 'cyandensity' variable so that it appears in a custom order: "high", "med", "low"
daphnia_resist <- daphnia_resist %>%
  mutate(cyandensity = fct_relevel(cyandensity, "high", "med", "low"))  # Use 'fct_relevel' to reorder the factor levels of 'cyandensity'

# Create a plot using the reordered 'cyandensity' variable with a logical order
better_order <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) + 
  geom_point() +  # Adds points for each data point
  scale_y_continuous(limits = c(0,1)) +  # Ensures the y-axis is scaled between 0 and 1
  stat_summary(aes(group = 1), geom = "line", fun = "mean", lty = 2) +  # Adds a dashed line connecting the mean values for each group
  labs(title = "x is in a sensible order")  # Title indicating that the x-axis is in a logical, user-defined order

default_order  + better_order  + plot_annotation(tag_levels = 'A')

Figure 7: Set the order of levels with the fct_relevel function. Be sure to do this inside mutate(), and save this modification. Note that I added a line connecting means to further highlight the pattern.
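
A quick way to confirm that releveling did what you intended is to inspect levels() before and after. This sketch uses a small made-up vector (density is hypothetical) rather than the Daphnia data.

```r
library(forcats)

# Made-up values standing in for the cyandensity column
density <- factor(c("low", "med", "high", "med", "low"))
levels(density)                    # default: alphabetical order

density_ordered <- fct_relevel(density, "low", "med", "high")
levels(density_ordered)            # order now follows the sequence we specified
```

Only the order of the levels changes; the values themselves are untouched, which is why the result must be saved back into the data before plotting.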

Do Not Let jitter Deceive

When visualizing data, it’s common to use geom_jitter() to spread out points and avoid overplotting, especially when there are many overlapping data points. However, a simple jitter can be deceptive in two ways: (a) it can create the false impression that an ordinal variable is continuous, and (b) it can distort the actual values of a continuous response. To guard against both, set the vertical jitter to zero, narrow the width of the jitter, and color the points by category to differentiate groups clearly. Since the coloring is redundant and doesn’t provide new information, you can hide the legend to keep the plot clean.

Below, you’ll see two examples: one where jittering creates a misleading plot, and another where jittering is applied with precautions to ensure clarity (Figure 8).

# Create a "dishonest" jitter plot where jitter is applied randomly to both x and y axes
dishonest_jitter <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) + 
  geom_jitter() +  # Adds random noise to both x and y axes (spreading points around)
  scale_y_continuous(limits = c(0,1)) +  # Ensures the y-axis is scaled between 0 and 1
  labs(title = "Dishonest jitter")  # Adds a title indicating the "dishonest" nature of the plot

# Create an "honest" jitter plot with controlled jitter only on the x-axis
honest_jitter <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance, color = cyandensity)) + 
  geom_jitter(height = 0, width = .2, show.legend = FALSE) +  # Jitters only the x-axis (spreading the points horizontally), no vertical jitter
  scale_y_continuous(limits = c(0,1)) +  # Ensures the y-axis is scaled between 0 and 1
  labs(title = "Honest jitter")  # Adds a title indicating the "honest" nature of the plot

# Combine both plots side by side with plot labels A and B
dishonest_jitter + honest_jitter + plot_annotation(tag_levels = 'A')  # Annotates the plots with labels A (dishonest) and B (honest)

Notify your audience of nonlinear scales

To highlight another common problem in data presentation, let’s explore the variability in mammal lifespan across species as a function of body weight. Because both variables are extremely right-skewed, log-transforming these data can reveal important patterns (compare Fig. 9a with Fig. 9b). However, it’s crucial to make it clear to the reader that the data are presented on a log scale. Figure 9c uses the annotation_logticks() function to clearly indicate the use of a log scale, so that readers do not misinterpret the relationship as linear. Note that we apply the log transformation inside scale_x_continuous() and scale_y_continuous() rather than transforming the variables themselves, so the data are displayed on a log scale while the tick labels remain in the original units.

# Load the mammal lifespan dataset from a tab-delimited (.txt) file online
mammal_lifespan <- read_tsv("http://www.statsci.org/data/general/sleep.txt") # because it is tab-delimited 

# Create a base plot with mammal body weight on the x-axis and lifespan on the y-axis
base_plot <- ggplot(mammal_lifespan, aes(x = BodyWt, y = LifeSpan)) +
  geom_point()  # Add points for each species in the dataset

# Modify the base plot to apply a log10 transformation on both x and y axes
log_10_plot <- base_plot +
  scale_x_continuous(trans = "log10") +  # Log10 scale for the x-axis (Body Weight)
  scale_y_continuous(trans = "log10")    # Log10 scale for the y-axis (Lifespan)

# Add log ticks to the log10 plot on the bottom (x-axis) and left (y-axis)
log_10_plot_w_ticks <- log_10_plot +
  annotation_logticks(sides = "bl", base = 10, size = .2)  # Ticks on bottom "b" and left "l"

base_plot + log_10_plot + log_10_plot_w_ticks + plot_annotation(tag_levels = 'A')

Figure 9: Panel a hides the pattern. Panel b log-transforms both axes to reveal the log-linear relationship, but only a careful reader would notice that the axes increase on a $\log_{10}$ scale. Panel c reveals the pattern and notifies the reader that the plot is on a $\log_{10}$ scale.
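
A complementary, low-tech way to flag a log scale is to say so in the axis titles. The sketch below uses the msleep data that ships with ggplot2 as a stand-in for the mammal lifespan data, and the axis wording is only a suggestion.

```r
library(ggplot2)

# msleep ships with ggplot2; bodywt and brainwt are both recorded in kilograms
labelled_log_plot <- ggplot(msleep, aes(x = bodywt, y = brainwt)) +
  geom_point(na.rm = TRUE) +                 # brainwt has missing values
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  annotation_logticks(sides = "bl") +
  labs(x = "Body weight (kg, log10 scale)",
       y = "Brain weight (kg, log10 scale)")

labelled_log_plot
```

Stating the scale in the title helps readers who do not recognize log tick marks, and it survives if the figure is cropped or shrunk.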

Show Your Data to Be Transparent

When I say “show your data,” I mean SHOW THE FUCKING DATA, THAT IS, THE ACTUAL POINTS, rather than only summarizing the data with means, standard errors, or boxplots. Fortunately, when working with raw data, ggplot2 makes it fairly difficult to hide the data. As long as you’re feeding raw data into ggplot, most geoms will display the data directly. However, there are a few important considerations.

Don’t Hide Your Data by Overplotting

With a modest-sized dataset, like the daphnia_resistance data, overplotting can obscure individual data points. A simple jitter plot can solve this issue by spreading out points that would otherwise overlap (compare Fig. 10a to Fig. 10b). Remember to adjust the width and height arguments in geom_jitter() to create an honest jitter plot.

daphnia_points <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance, color = cyandensity)) + 
  geom_point(show.legend = FALSE) +
  scale_y_continuous(limits = c(0,1)) +
  labs(title = "Points hide data")

daphnia_jitter <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance, color = cyandensity)) + 
  geom_jitter(height = 0, width = 0.2, show.legend = FALSE) +
  scale_y_continuous(limits = c(0,1)) +
  labs(title = "Jitter shows data")

daphnia_points + daphnia_jitter + plot_annotation(tag_levels = 'A')

With larger datasets, overplotting becomes a more significant challenge. For instance, plotting the price of diamonds by clarity can lead to overplotting, and while jittering can help, it doesn’t always fully resolve the issue. Below are three common solutions:

Figure 11 demonstrates several suboptimal solutions (Fig. 11a-d). The optimal solution (Fig. 11e) combines a violin plot (to help guide the eye toward the shape of the distribution) with a sinaplot that uses significant transparency. If these approaches don’t work for your data, you can explore more strategies in Chapter 5 of the R Graphics Cookbook (Chang 2020).

library(ggforce)  # provides geom_sina()
library(dplyr)    # provides %>%, group_by(), summarise(), and n()

base_diamond <- ggplot(diamonds, aes(x = clarity, y = price)) +
  scale_y_continuous(trans = "log10") +
  annotation_logticks(sides = "l")

# Red points mark the harmonic mean price within each clarity level
harmonic_mean_points <- geom_point(
  data = . %>% group_by(clarity) %>% summarise(price = n() / sum(1 / price)),
  color = "red", size = 3)

plot_a <- base_diamond + geom_point() + labs(title = "geom_point") + harmonic_mean_points
plot_b <- base_diamond + geom_jitter() + labs(title = "geom_jitter") + harmonic_mean_points
plot_c <- base_diamond + geom_jitter(alpha = 0.1) + labs(title = "geom_jitter(alpha = 0.1)") + harmonic_mean_points
plot_d <- base_diamond + geom_violin() + labs(title = "geom_violin") + harmonic_mean_points
plot_e <- base_diamond + geom_violin() + geom_sina(alpha = 0.1) + labs(title = "geom_violin() and\ngeom_sina(alpha = 0.1)") + harmonic_mean_points

plot_a + plot_b + plot_c + plot_d + plot_e

Don’t Hide Your Data with Summaries

Boxplots, which show quartiles and are excellent for summarizing data, don’t address overplotting. In fact, when a boxplot is layered on top of raw data, it can obscure individual data points. As seen in Figure 12, boxplots alone do not fully represent the data, and if added after data points, they can hide underlying patterns.

daphnia_boxplot <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) + 
  geom_boxplot(outlier.color = NA) + 
  scale_y_continuous(limits = c(0,1)) + 
  labs(title = "Where are the data?")

daphnia_boxplot_b <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) + 
  geom_jitter(width = 0.2, height = 0) + 
  geom_boxplot(outlier.color = NA) + 
  scale_y_continuous(limits = c(0,1)) + 
  labs(title = "ggplot layers geoms on\ntop of each other")

daphnia_boxplot_c <- daphnia_boxplot + 
  geom_jitter(width = 0.2, height = 0) + 
  labs(title = "Here are the data!")

daphnia_boxplot + daphnia_boxplot_b + daphnia_boxplot_c + plot_annotation(tag_levels = 'A')

Similarly, reducing data to means and error bars can hide variability. If you need to include these summaries, ensure they don’t obscure the raw data.

means_and_errors <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) + 
  stat_summary(size = .2) + 
  scale_y_continuous(limits = c(0,1)) + 
  labs(title = "Means +/- error hides the data")

jitter_plus_means_and_errors <- daphnia_jitter + 
  stat_summary(color = "black", size = .2) + 
  labs(title = "Adding means +/- error can enhance a plot")

means_and_errors + jitter_plus_means_and_errors + plot_annotation(tag_levels = 'A')

Supercharge Your ggplot Skills to Make Clear Plots

ggplot2 is excellent for making clear, explanatory plots, but mastering this requires learning several new tricks. Clear plots are easy to interpret and help readers focus on key patterns. Below are some useful R tricks to achieve these goals. I recommend skimming these to get a sense of what you can do, rather than memorizing each trick. For further exploration, I highly recommend The R Graphics Cookbook (Chang 2020).

Order Categories Sensibly

By default, R orders categorical variables alphabetically, which often doesn’t make sense biologically.

In the previous sections, we used the forcats package to reorder ordinal variables as needed. However, if your categorical variable is nominal (i.e., has no natural order), a great way to highlight patterns is to order categories from the largest to the smallest mean.

The fct_reorder function from the forcats package will reorder a factor based on another value. Note: This transformation happens before plotting, so we use the mutate function.

After generating Figure 13a with the default order of regions, I reorder region by the mean student_ratio in descending order (removing NA values) to create Figure 13b using the same code as Figure 13a.

student_order_a <-  ggplot(df_ratios, aes(x = region, y = student_ratio, color = region)) + 
  geom_jitter(alpha = .5, size = 3, height = 0, width = .15, show.legend = FALSE) +
  scale_y_continuous(limits = c(5,100), trans = "log10") +
  stat_summary(color = "black")

df_ratios <- df_ratios %>%
  mutate(region = fct_reorder(region, student_ratio, mean, na.rm = TRUE, .desc = TRUE))

student_order_b <-  ggplot(df_ratios, aes(x = region, y = student_ratio, color = region)) + 
  geom_jitter(alpha = .5, size = 3, height = 0, width = .15, show.legend = FALSE) +
  scale_y_continuous(limits = c(5,100), trans = "log10") +
  stat_summary(color = "black")

student_order_a + student_order_b + plot_annotation(tag_levels = 'A')
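
To see what fct_reorder() does in isolation, here is a toy sketch. The data frame toy_ratios is made up and is not the df_ratios data used above.

```r
library(forcats)

# Made-up data: three regions with different mean ratios (a = 1.5, b = 15, c = 5)
toy_ratios <- data.frame(
  region = c("a", "a", "b", "b", "c", "c"),
  ratio  = c(1, 2, 10, 20, 5, 5)
)

levels(factor(toy_ratios$region))    # alphabetical: "a" "b" "c"

# Reorder region by its mean ratio, largest first
reordered <- fct_reorder(toy_ratios$region, toy_ratios$ratio,
                         .fun = mean, .desc = TRUE)
levels(reordered)                    # "b" "c" "a"
```

Because ggplot2 draws discrete axes in level order, changing the levels is all that is needed to change the order of categories in the plot.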

Make Clear Labels

Preventing Labels from Overlapping

In our earlier example of student-to-faculty ratios by region, we saw that region names were overlapping, which made them difficult to read. We can solve this by either rotating the axis labels or, even better, flipping the coordinates. Figure 14 demonstrates how to implement these fixes in R.

student_plot_a <- ggplot(df_ratios, aes(x = region, y = student_ratio, color = region)) + 
  geom_jitter(alpha = 0.3, size = 2, height = 0, width = 0.15, show.legend = FALSE) +
  scale_y_continuous(limits = c(5, 100), trans = "log10")

# Rotating axis labels by 90 degrees
student_plot_b <- student_plot_a +
  theme(axis.text.x = element_text(angle = 90))

# Flipping the coordinates to avoid overlapping labels
student_plot_c <- student_plot_a +
  coord_flip()

Make descriptive labels

We often want plots with more descriptive labels than the variable names we type during data analysis. You can override the default labels with the labs() function. You can add math and Greek letters by using the expression() function (Fig. 15).

student_plot_c   + 
labs(y = expression(paste('Students per faculty (', rho, " = ", frac('#  students','#faculty'),")")))

Add labels to facets when ambiguous

Consider Figure 16. In Figure 16a, an educated reader could deduce that the facets designate the penguin species. Maybe a reader who knew a lot about cars could guess that the facets in Figure 16c refer to the number of gears in the car, but I would not have guessed that.

We do not want our readers guessing. At best this is frustrating for them, and at worst they guess wrong. We could note the labels in the legend, but that would add to the reader’s cognitive load. It is best to include this information in the facet strip itself with the labeller argument of facet_wrap(), e.g. facet_wrap(..., labeller = "label_both") (see Fig. 16).
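
A minimal sketch of the cars case, using the built-in mtcars data: the first plot leaves the facet strips ambiguous and the second names the variable in each strip. The object names are hypothetical.

```r
library(ggplot2)

# Without a labeller the strips read "3", "4", "5", so the reader must guess the variable
ambiguous_facets <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~gear)

# label_both prints the variable name and its value, e.g. "gear: 3"
labelled_facets <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~gear, labeller = label_both)

labelled_facets
```

If the variable name itself is cryptic, renaming the column before plotting gives a more readable strip than any labeller can.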

Add Annotations, Text, Summaries, and Lines to Highlight Patterns

This part can be very fun, but there are two common pitfalls to avoid:

It can become a time sink.
You risk adding too much information, overwhelming the reader.

To minimize these risks, I suggest leaving these enhancements for the final stages of your presentation and carefully considering which annotations to include, ensuring each serves a clear purpose.

We can enhance our plots by adding statistical summaries, guiding lines, and annotations to help the reader interpret patterns more easily. In Figure 17, we illustrate several techniques:

ggplot(df_ratios, aes(x = region, y = student_ratio, color = region)) + 
  geom_jitter(alpha = .3, size = 2, height = 0, width = .15, show.legend = FALSE) +
  scale_y_continuous(limits = c(0, 100)) +
  stat_summary(fun.data = mean_cl_boot, color = "black") +
  geom_hline(yintercept = summarise(df_ratios, mean(student_ratio, na.rm = TRUE)) %>% pull(), lty = 2) + # lty = 2 for dashed line
  annotate(x = 3.5, y = 28, label = "mean", geom = "text", size = 3) +
  geom_text(data = . %>%
              filter(region == "Africa") %>%
              filter(student_ratio == max(student_ratio, na.rm = TRUE) | 
                     student_ratio == min(student_ratio, na.rm = TRUE)),
            aes(label = country), size = 3, hjust = 0, show.legend = FALSE)

Use Direct Labeling

There are several ways to add direct labels to your plots. The simplest method is using the annotate() function, as introduced earlier. However, a more refined approach is to use geom_label(), allowing you to summarize your data on the fly, as was done in Figure 18.

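ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  # Summarize on the fly: one label per species, placed at the species means
  geom_label(data = . %>%
               group_by(species) %>%
               summarise_at(c("bill_depth_mm", "bill_length_mm"), mean, na.rm = TRUE),  # na.rm = TRUE because some bill measurements are missing
             aes(label = species), fontface = "bold", size = 3, alpha = .4) +
  theme(legend.position = "none")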

Group Data to Avoid Confusion

R doesn’t automatically recognize which data points belong together in groups. For instance, if we are plotting the weight of baby chicks on different diets over time (as in Figure 19) or bacterial growth rates over time, we need to explicitly tell R to group by chick, bacteria plate, or whatever grouping variable is relevant. Without this, we may end up with nonsensical plots (e.g., Figure 19a).

In Figure 19b, we specify group = Chick in the main aes() call when setting up the ggplot, ensuring that R knows to connect data points for each chick. In Figure 19c, we set group = Chick only in the geom_line() aes(), so the individual chick trajectories stay in the background while stat_summary() shows how the mean chick weight changes over time for each diet, without cluttering the plot.

library(dplyr)    # provides mutate()
library(cowplot)  # provides plot_grid() and get_legend()

ChickWeight <- mutate(ChickWeight , Diet = factor(Diet)) # letting R know diet is categorical  

chick_plot_a <- ggplot(ChickWeight, aes(x = Time, y = weight, color = Diet))+
     geom_line() 
 
chick_plot_b <- ggplot(ChickWeight, aes(x = Time, y = weight, group = Chick, color = factor(Diet)))+
     geom_line(alpha = .8)
 
chick_plot_c <-  ggplot(ChickWeight, aes(x = Time, y = weight, color = factor(Diet)))+
     geom_line(aes(group = Chick), alpha = .3)+
     stat_summary(fun= mean, geom = "line", size = 2,alpha = .7)
 
chick_plots <- plot_grid(chick_plot_a + theme(legend.position = "none") + annotate(x = -Inf, y = Inf, color = "red", label = "bad :(", geom = "text", hjust = -1, vjust = 1) ,
                         chick_plot_b + theme(legend.position = "none") + annotate(x = -Inf, y = Inf, color = "blue", label = "better :|", geom = "text", hjust = -1, vjust = 1),
                         chick_plot_c + theme(legend.position = "none") + annotate(x = -Inf, y = Inf, color = "blue", label = "better :|", geom = "text", hjust = -1, vjust = 1), 
                         labels = c("a","b","c"), nrow = 1)

chick_legend <- get_legend(chick_plot_c + theme(legend.position = "bottom"))

plot_grid(chick_plots,chick_legend, rel_heights = c(8,2), nrow =2)

Think About Colors

Just as your data type should guide your choice of plot, it should also guide your choice of color scheme when mapping color onto a variable:

I’ve spent a lot of time choosing colors for plots, and I suggest simplifying this task by using the viridis package. This package offers a range of color schemes designed to minimize confusion for readers with color vision deficiencies. I demonstrate how to use this package in Figure 20.

color_a <- chick_plot_c +
  scale_color_viridis_d() +
  theme(legend.position = "bottom") +
  labs(title = "Discrete qualitative palettes\nfor nominal variables")

color_b <- mammal_lifespan %>%
  mutate(Danger = factor(Danger)) %>%  # Indicating Danger is categorical
  ggplot(aes(x = BodyWt, y = LifeSpan, color = Danger)) +
  geom_point(size = 3, alpha = .6) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  annotation_logticks(sides = "bl", base = 10, size = .2) +
  scale_color_viridis_d() +
  theme(legend.position = "bottom") +
  labs(title = "Discrete sequential palettes\nfor ordinal variables")

color_c <- ggplot(mammal_lifespan, aes(x = BodyWt, y = LifeSpan, color = Gestation)) +
  geom_point(size = 3, alpha = .6) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  annotation_logticks(sides = "bl", base = 10, size = .2) +
  scale_color_viridis_c() +
  theme(legend.position = "bottom") +
  labs(title = "Continuous sequential palettes\nfor continuous variables")

plot_grid(color_a, color_b, color_c, labels = letters[1:3], ncol = 2)

Also check out the colorspace package! Its documentation lists the available palettes and the types of variables each one suits best.

If you want to specify a few colors yourself, you can use the scale_color_manual(values = <vector of colors>) function. For example, scale_color_manual(values = c("red", "blue")) will color the first category in red and the second in blue. Alternatively, for a bit of fun, you can try the beyonce or wesanderson packages, which offer themed color palettes.
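As a minimal sketch of manual colors, the example below uses R's built-in ChickWeight data as a stand-in for the chapter's chick data. The palette name `diet_colors` and the hex codes are my own illustrative choices, not values from the original figures. Naming the vector by factor level means the colors do not depend on the order of the levels.

```r
library(ggplot2)

# Hypothetical palette: names match the levels of Diet ("1" to "4"),
# so each level gets its color regardless of level order.
diet_colors <- c("1" = "#E69F00", "2" = "#56B4E9",
                 "3" = "#009E73", "4" = "#CC79A7")

manual_color_plot <- ggplot(ChickWeight, aes(x = Time, y = weight, color = Diet)) +
  geom_point(alpha = 0.6) +
  scale_color_manual(values = diet_colors) +   # Map each diet to its chosen color
  theme(legend.position = "bottom")

manual_color_plot
```

An unnamed vector, as in scale_color_manual(values = c("red", "blue")), works too, but then colors are assigned in the order of the factor levels.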

Avoid Distractions

Luckily, R doesn’t make it too easy to add distracting elements like backgrounds, 3D effects, or unnecessary animations. However, with enough creativity and effort, anything is possible. Before adding any fancy decorations to a plot, ask yourself if the addition provides critical information or simply distracts from the data.

Accessibility and Universal Design

Consider Color Vision Deficiency

Choose Accessible Colors

The main reason I recommended the colorspace package is that its color palettes are already optimized for accessibility.

Use Color Vision Deficiency Emulators

Even with accessible color choices, it’s essential to check if your figures are interpretable for people with color vision deficiency. Earlier, we explored an online color vision deficiency emulator. Alternatively, the colorblindr package can emulate color vision deficiencies directly in R.
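As a rough illustration of emulation in R, the sketch below uses the simulation functions in the colorspace package rather than colorblindr (a substitution on my part, chosen because colorspace is on CRAN). It shows how five viridis colors might appear under three types of color vision deficiency. The object names `pal` and `cvd_versions` are illustrative.

```r
library(colorspace)   # deutan(), protan(), tritan(), swatchplot()
library(viridisLite)  # viridis()

# Five colors from the default viridis palette.
# substr() drops the trailing alpha channel so we work with plain #RRGGBB codes.
pal <- substr(viridis(5), 1, 7)

# Simulate each type of color vision deficiency (full severity by default)
cvd_versions <- list(
  original = pal,
  deutan   = deutan(pal),
  protan   = protan(pal),
  tritan   = tritan(pal)
)

# Draw each palette as a row of swatches for side-by-side comparison
swatchplot(cvd_versions)
```

If neighboring swatches in any row become hard to tell apart, that is a hint to change the palette or add redundant coding.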

Modify Text and Point Sizes for Accessibility

We’ve previously discussed adjusting the size of points (e.g., in geom_point()) or text elements (in theme()). You can also explore the ggThemeAssist package to make theme adjustments easier.
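As a quick reminder of what those adjustments look like, here is a minimal sketch using the built-in ChickWeight data as a stand-in. The specific sizes are arbitrary choices for illustration, not recommendations from the chapter.

```r
library(ggplot2)

bigger_plot <- ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_point(size = 3, alpha = 0.5) +        # Larger, semi-transparent points
  theme(
    axis.title   = element_text(size = 16),  # Axis titles
    axis.text    = element_text(size = 13),  # Tick labels
    legend.title = element_text(size = 14),  # Legend title (if a legend is present)
    legend.text  = element_text(size = 12)   # Legend entries (if a legend is present)
  )

bigger_plot
```

A useful check is to view the figure at the size it will appear in your final document or slide and confirm the text is still readable.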

Use Redundant Coding for Better Accessibility

Colors can be difficult to distinguish, especially when printed in black and white. Using redundant coding, such as varying shapes, sizes, or patterns alongside color, ensures that readers can interpret your data in multiple ways. Figure 22 provides examples:

# Load the mammal data (the file is tab-delimited, so we use read_tsv)
mammal_lifespan <- read_tsv("http://www.statsci.org/data/general/sleep.txt")

# Step A: Create plot with log-transformed axes and color/shape based on 'Danger'
redundant_coding_a <- mammal_lifespan %>%
  mutate(Danger = factor(Danger)) %>%                        # Ensure 'Danger' is a factor
  ggplot(aes(x = BodyWt, y = LifeSpan, color = Danger, shape = Danger)) +
  geom_point(size = 3, alpha = 1) +                          # Plot points with size and opacity
  scale_x_continuous(trans = "log10") +                      # Log transform x-axis
  scale_y_continuous(trans = "log10") +                      # Log transform y-axis
  annotation_logticks(sides = "bl", base = 10, size = .2) +  # Add log ticks
  scale_color_viridis_d() +                                  # Use discrete viridis color scale
  theme_light() +                                            # Use light theme (set first, because a complete theme resets earlier theme() tweaks)
  theme(legend.position = "bottom")                          # Position legend at the bottom

# Step B: Modify plot by adding custom shapes and hiding legend text
redundant_coding_b <- redundant_coding_a +                   # Start with previous plot
  scale_shape_manual(values = as.character(1:5)) +           # Use the digits 1-5 as point shapes
  theme(legend.text = element_blank())                       # Hide legend text

# Step C: Create plot with log-transformed axes and color/size based on 'Gestation'
redundant_coding_c <- ggplot(mammal_lifespan, aes(x = BodyWt, y = LifeSpan, color = Gestation, size = Gestation)) +
  geom_point(alpha = .6) +                                   # Plot points with opacity
  scale_x_continuous(trans = "log10") +                      # Log transform x-axis
  scale_y_continuous(trans = "log10") +                      # Log transform y-axis
  annotation_logticks(sides = "bl", base = 10, size = .2) +  # Add log ticks
  scale_color_viridis_c() +                                  # Use continuous viridis color scale
  scale_size(guide = guide_legend(reverse = TRUE)) +         # Reverse the order of the size legend
  theme_light()                                              # Use light theme


plot_grid(redundant_coding_a, redundant_coding_b, redundant_coding_c, labels = letters[1:3], ncol = 2)

A series of three scatter plots displaying the relationship between mammal body weight (x-axis) and lifespan (y-axis) on a log-log scale. In panel (a), data points are color-coded and shaped based on the 'Danger' level of each species, with a discrete viridis color scale. Logarithmic tick marks are visible on the bottom and left sides. In panel (b), the same data is plotted, but custom shapes are applied for the 'Danger' categories, and the legend text is hidden. In panel (c), species are color-coded and sized according to gestation length, using a continuous viridis color scale. The size legend is reversed so that larger sizes appear at the bottom of the legend, and logarithmic tick marks are also shown. All panels use a light theme for clarity.

Figure 22: Redundant coding provides multiple ways to interpret data. In panel (a), species are colored and shaped based on their ‘Danger’ level, using a discrete viridis color scale. In panel (b), the shapes are replaced with numbers (i.e., exactly what the shapes represent) to minimize the cognitive load on the reader, and the legend text is removed for clarity. In panel (c), species are colored and sized according to gestation length, with a continuous viridis color scale. Logarithmic transformations are applied to both axes, and log tick marks are displayed on the bottom and left sides of each plot to enhance interpretability.

Watch this video from Stat 545 about improving your ggplots (7 min 25 sec). Note: they use a slightly different method for log scaling than what we’ve shown, but there are many ways to achieve the same result in R.

Extra Style

In this section, I highlight a few additional stylistic options to further customize your ggplots, some of which I’ve already used in the examples above.

Place Your Legend

In the examples above, we used the legend.position argument in theme() to place the legend at the bottom of the plot. However, you can have much more control over legend placement. Try experimenting with the ggThemeAssist package to explore different options for positioning and styling your legends!
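For instance, the following sketch places a horizontal legend at the top left of a plot and gives it a more descriptive title. It uses the built-in ChickWeight data as a stand-in, and the particular settings are illustrative choices rather than the chapter's own.

```r
library(ggplot2)

legend_plot <- ggplot(ChickWeight, aes(x = Time, y = weight, color = Diet)) +
  geom_point(alpha = 0.6) +
  labs(color = "Diet treatment") +       # A more descriptive legend title
  theme(
    legend.position      = "top",        # "top", "bottom", "left", "right", or "none"
    legend.justification = "left",       # Align the legend with the left edge
    legend.direction     = "horizontal"  # Lay the keys out in a row
  )

legend_plot
```

Legends can also be placed inside the plotting area, but the syntax for doing so has changed across ggplot2 versions, so check the theme() documentation for the version you have installed.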

Pick Your Theme

The default ggplot theme may not suit everyone. You can easily spice up your plots by switching to other themes. For example, I used theme_light() in a previous plot. ggplot comes with several built-in themes, and the ggthemes package provides even more options (see Fig. 24). If you’re feeling creative, you can even design your own custom theme.
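As a small sketch of both ideas, the code below switches to a built-in theme and then wraps a few personal tweaks into a reusable function. The function name `theme_chapter` and its settings are hypothetical, and ChickWeight again stands in for the chapter's data.

```r
library(ggplot2)

base_plot <- ggplot(ChickWeight, aes(x = Time, y = weight, color = Diet)) +
  geom_point(alpha = 0.6)

# Switch to a different built-in theme
classic_plot <- base_plot + theme_classic()

# A simple custom theme: start from a complete theme, then layer tweaks on top
theme_chapter <- function(base_size = 14) {
  theme_light(base_size = base_size) +
    theme(
      panel.grid.minor = element_blank(),  # Remove minor grid lines
      legend.position  = "bottom"          # Legend below the plot
    )
}

themed_plot <- base_plot + theme_chapter()
themed_plot
```

Because the tweaks are added after theme_light() inside the function, they are not overwritten by the complete theme. Adding theme_chapter() to every figure is also an easy way to keep a consistent look across a document.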
