Using R to make your figures honest, clear, and convincing.
Motivating scenarios: You know what figure you want to make but need the R skills to go beyond the basics.
Learning goals: By the end of this chapter you should be able to:
You’ve already been introduced to the basics of ggplot and explored the key elements that make for effective figures. But at this point, you might be feeling a bit frustrated. You know how to generate plots in R, and you understand what makes a plot good, yet creating polished, impactful visuals in R still seems challenging. My advice is twofold:
Start with the right plot for your data. For most exploratory plots—which will make up about 95% of what you create—this is key. Often, a well-chosen plot paired with a couple of quick R tricks can make your visuals clear and informative.
This chapter is here to guide you through the rest. For more advanced plotting and customization, take a look at these excellent resources: The R Graphics Cookbook (Chang 2020), ggplot2: Elegant Graphics for Data Analysis (Wickham 2016), Data Visualization: A Practical Introduction (Healy 2018), and Modern Data Visualization with R (Kabacoff 2024).
Don’t forget about AI tools like ChatGPT! Large language models (LLMs) are excellent for refining and improving your figures, offering tips and tricks that can help you take your plots to the next level.
Now let’s get started on creating both informative and visually appealing figures in R!
We’ll explore these concepts below, with a particular focus on creating honest, transparent, and clear plots, as this is where R offers the most opportunities for customization.
Most analyses require several plots to fully convey the story of your data. Some of these plots may be distributed throughout a document, while others will be combined into multi-panel figures. Here, we focus on creating multi-panel figures; later we will consider how to maintain consistency in visual metaphors across your figures (e.g., colors, shapes, and symbols should represent the same variables or categories across different plots).
There are several ways to make multi-panel figures in ggplot2, but I usually use the patchwork package. Figure 1 shows how we can make a three-paneled figure, with two narrow panels on top and one wider panel below, relatively easily. To learn more about patchwork, check out its extensive documentation. Remember that if you can’t do it in patchwork, other packages such as cowplot and gridExtra can also make multi-paneled figures, and that each of these packages has its own strengths and weaknesses.
# If you have not installed patchwork, you will need to!
library(patchwork) # After patchwork is installed you still need to load it with the library function
library(ggplot2)
library(palmerpenguins)
p1 <- ggplot(penguins, aes(x = bill_length_mm, fill = species)) + geom_density(alpha = .4, show.legend = FALSE)
p2 <- ggplot(penguins, aes(x = species, y= bill_length_mm, color = species))+geom_boxplot(show.legend = FALSE)
p3 <- ggplot(penguins, aes(x = bill_depth_mm, y= bill_length_mm, color = species))+geom_point()
(p1 + p2)/p3 + plot_annotation(tag_levels = 'A') + plot_layout(heights = c(1,2))
In this example:
p1 shows the distribution of bill length for each penguin species using a density plot, p2 compares bill length across species using a boxplot, and p3 shows the relationship between bill depth and bill length with a scatter plot.
We use the / operator to stack p3 below the side-by-side p1 and p2, and the plot_layout() function adjusts the relative heights of the two sections for better balance. Finally, plot_annotation() adds labeled tags (A, B, C) to the individual panels, which is useful when referring to specific parts of a figure.
Most results are presented in press, in speech, in a poster or online. Most of this chapter and course focuses on presentation in press, but here I have a few resources for alternative modes of presentation.
For online documents. Digital formats open up opportunities to engage readers with interactive graphs and animations, making data visualization more dynamic and accessible. Here are some powerful tools to consider:
The Shiny package allows you to build interactive web applications that let users update graphs, tables, and other visual outputs in real-time, facilitating exploratory data analysis and the creation of interactive dashboards.
gganimate enables the creation of animations, bringing your data to life and capturing your audience’s attention more effectively.
Other packages, such as plotly, ggiraph, rbokeh, rCharts, and highcharter, offer additional capabilities for creating interactive and visually engaging graphs. These tools make it easy for users to explore your data in a more hands-on way. For example, Figure 2 shows an interactive plot that focuses on a specific penguin species when you hover over its points.
For more information, check out the Interactive Graphs chapter in Modern Data Visualization with R (Kabacoff 2024). For deeper exploration, see the book Interactive Web-Based Data Visualization with R, plotly, and shiny (Sievert 2020).
library(highcharter)
hchart(penguins, "scatter", hcaes(x = flipper_length_mm,
y = bill_length_mm,
group = species,
color = species))
Poster sessions often take place in large halls filled with hundreds of competing posters. Fun and playful plots are especially effective at capturing attention as people move around in this busy setting. Isotype plots, where images represent quantities or categories instead of traditional points, bars, or lines, can help get your poster that attention. For instance, Figure 3 uses the ggtextures package to replace bars with pictures of penguins of each species to depict the distribution of species across three islands.
library(dplyr) # Provides %>%, mutate(), case_when(), group_by(), and tally()
library(ggtextures)
# Add image paths for each penguin species
penguins <- penguins%>%
mutate(image = case_when(species == "Adelie" ~ "images/adelie.jpeg",
species == "Chinstrap" ~ "images/chinstrap.jpeg",
species == "Gentoo" ~ "images/gentoo.jpeg"))
penguins%>% group_by(species, island,image) %>% tally()%>%ungroup()%>%
ggplot(aes(species, n, image = image)) +
geom_isotype_col(img_width = unit(2, "cm"), img_height = unit(2, "cm")) +
facet_wrap(~island, labeller = "label_both")+
geom_hline(yintercept = seq(0,150,25), color = "lightgrey", lty = 2)+
theme(panel.background = element_rect(fill = "white", size = 0.5, linetype = "solid",color = "black"))
For talks: When presenting your work in a talk, you can use transitions to gradually introduce elements of your figures, guiding the audience through your findings step by step. This allows you to control how the plot is revealed, creating a dynamic narrative as you walk through the data. While this approach is a great benefit in a talk, there are challenges: the audience may be seated far from the screen, making it difficult to see small details in your plot. To address this, keep your take-home message clear and simple, using large fonts and bold points to ensure readability. You’ll likely need to adjust the theme() function in R to increase the text, label, and point sizes for better visibility. I discuss how to handle these common challenges in the next section.
For different presentation modes, such as posters or talks, you’ll need to adjust the text size accordingly. Text should be much larger for posters and talks to ensure readability from a distance. Mastering the theme() function in ggplot2 allows you to fully customize your plots, including text size. Most importantly, you can adjust the size of text using the element_text(size = ...) function, which you assign to the plot element you want to modify (see Fig. 4).
ggplot(penguins, aes(x = species, y = bill_length_mm, color = species)) +
geom_point()+
theme(axis.title.x = element_text(size = 20, color = "orange"),
axis.text.x = element_text(size = 15),
axis.text.y = element_text(size = 15, color = "firebrick"),
legend.text = element_text(color = "purple"),
legend.title = element_text(size = 30, color = "gold"))
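When every text element needs to grow by roughly the same amount, one shortcut is the base_size argument that ggplot2's complete themes accept. The sketch below is only an illustration: it uses the built-in iris data rather than the penguins, and the names talk_theme and talk_plot are made up for this example. Individual elements can still be fine-tuned with element_text() afterwards.

```r
library(ggplot2)

# A complete theme with a larger base font size than the default of 11;
# axis text, titles, and legend text are all scaled relative to base_size.
talk_theme <- theme_gray(base_size = 20)

talk_plot <- ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) +
  geom_point(size = 3) +                          # larger points for distant viewers
  talk_theme +
  theme(legend.title = element_text(size = 24))   # targeted tweak after the complete theme

talk_plot
```

Adding theme() after the complete theme matters, because a complete theme replaces any settings that were added before it.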
While there are quick tricks for using the theme() function, it can be quite complex and sometimes frustrating to work with. I highly recommend the ggThemeAssist package, which provides a graphical user interface (GUI) that allows you to point and click your way to the desired figure. It then generates the corresponding R code for you (see Fig. 5). Personally, I’ve become much more proficient with the theme() function over the past few years, thanks to the insights I gained from using ggThemeAssist, as it helped me understand which arguments control which parts of the plot. Of course, now ChatGPT would also be very helpful!
To use ggThemeAssist:
ggplot in your R script.
In the previous chapter we said good plots are
So how do we translate these ideas into R plots? Check out these tips and tricks below!!!
Making and critiquing plots is one of my favorite parts of science — I absolutely love it! However, I know it can be a major time sink, and we want to avoid that. Here are my tips for preventing yourself from getting bogged down by every figure:
ggplot2. Keep the ggplot cheat sheet handy, and bookmark additional resources that suit your workflow. A quick Google search can also be a lifesaver!
Be sure that your figures do not deceive.
Do not let yourself be deceived by your exploratory figures – you will just waste a lot of your time and energy.
Do not deceive your audience with your explanatory figures, or you will lose their trust.
The most common ways figures deceive are inappropriate limits, nonlinear (or nonsensical) axis scales, and changes to the data. Here I show how you can use ggplot to prevent these deceptions.
The examples below use the daphnia_resist dataset, which shows the resistance of the crustacean Daphnia to poisonous cyanobacteria across years with high, medium, and low concentrations of cyanobacteria.
library(readr) # Provides read_csv()
daphnia_resist <- read_csv("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter15/chap15q17DaphniaResistance.csv")
When working with data on proportions, such as Daphnia resistance to cyanobacteria, it’s often best to present the data on a scale from zero to one, since proportions naturally range between these values. However, a basic plot created with geom_point() only spans the actual range of the data, which may not always include the full 0 to 1 scale. To address this, you can use the limits argument within the scale_y_continuous() function to explicitly set the y-axis to range from 0 to 1. Figure 6 contrasts these two choices.
library(patchwork)
library(ggplot2)
library(readr)
no_set_limit <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) +
geom_point() +
labs(title = "ggplot's default limits")
limit_0_1 <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) +
geom_point() +
scale_y_continuous(limits = c(0,1))+
labs(title = "Adding a limit from zero to one")
no_set_limit + limit_0_1 + plot_annotation(tag_levels = 'A')
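One caveat is worth knowing, sketched here with made-up data. Limits set inside scale_y_continuous() discard values that fall outside them before any statistics are computed, whereas coord_cartesian() only zooms the view. For proportions bounded between 0 and 1 the two behave the same, but with limits narrower than the data they can give different summaries. The data frame toy and the plot names below are hypothetical.

```r
library(ggplot2)

toy <- data.frame(group = "a", value = c(0.5, 0.5, 2))

# Scale limits: the value 2 lies outside c(0, 1), so it is replaced by NA
# (with a warning) before the mean is computed.
mean_scale <- ggplot(toy, aes(x = group, y = value)) +
  stat_summary(fun = mean, geom = "point") +
  scale_y_continuous(limits = c(0, 1))

# Coordinate limits: all three values contribute to the mean; the view is only zoomed.
mean_coord <- ggplot(toy, aes(x = group, y = value)) +
  stat_summary(fun = mean, geom = "point") +
  coord_cartesian(ylim = c(0, 1))

layer_data(mean_scale)$y  # mean of the two values that survive the limits
layer_data(mean_coord)$y  # mean of all three values
```

If a summary layer ever looks different after you set limits, this distinction is a good first thing to check.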
In the Daphnia dataset, the explanatory variable on the x-axis represents cyanobacteria density as “high,” “low,” or “medium.” The natural order here is “low”, “medium”, “high” (or maybe “high”, “medium”, “low”). By default, R does neither of these things and instead presents categories in alphabetical order. This can confuse readers and potentially mislead them if they don’t examine the axis carefully.
The forcats package allows us to modify the order R assigns to categorical variables. For example, the fct_relevel() function enables us to reorder factor levels by specifying the desired sequence. As shown below, using fct_relevel() inside the mutate() function lets us present trends more clearly and intuitively (Figure 7).
library(patchwork)
library(ggplot2)
library(readr)
library(forcats)
library(dplyr) # Provides %>% and mutate()
# Create a plot using the default order of the categorical variable (cyandensity) in R
default_order <- limit_0_1 +
stat_summary(aes(group = 1), geom = "line", fun = "mean", lty = 2) + # Adds a dashed line (lty = 2) connecting the mean values
labs(title = "x is in R's default order") # Title indicating that the x-axis follows R's default alphabetical ordering
# Reorder the 'cyandensity' variable so that it appears in a custom order: "high", "med", "low"
daphnia_resist <- daphnia_resist %>%
mutate(cyandensity = fct_relevel(cyandensity, "high", "med", "low")) # Use 'fct_relevel' to reorder the factor levels of 'cyandensity'
# Create a plot using the reordered 'cyandensity' variable with a logical order
better_order <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) +
geom_point() + # Adds points for each data point
scale_y_continuous(limits = c(0,1)) + # Ensures the y-axis is scaled between 0 and 1
stat_summary(aes(group = 1), geom = "line", fun = "mean", lty = 2) + # Adds a dashed line connecting the mean values for each group
labs(title = "x is in a sensible order") # Title indicating that the x-axis is in a logical, user-defined order
default_order + better_order + plot_annotation(tag_levels = 'A')
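To see what the reordering does without drawing a plot, it can help to inspect the factor levels directly. This is a minimal sketch on a made-up character vector (density and density_ordered are hypothetical names), not the Daphnia data itself.

```r
library(forcats)

density <- c("high", "low", "med", "low", "high", "med")

# factor() sorts levels alphabetically, and ggplot2 uses that order on the axis
levels(factor(density))

# fct_relevel() moves the named levels to the front, in the order given
density_ordered <- fct_relevel(density, "low", "med", "high")
levels(density_ordered)
```

The values themselves are unchanged; only the order of the levels, and therefore the order on the axis, is different.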
When visualizing data, it’s common to use the jitter function to spread out points and avoid overplotting, especially when there are many overlapping data points. However, a simple jitter can be deceptive in two ways: (a) it can create the false impression that an ordinal variable is continuous, and (b) it can distort the actual values of a continuous response. You can avoid both problems by jittering only along the categorical axis (height = 0), narrowing the width of the jitter, and coloring the points by category to differentiate groups clearly. Since the coloring is redundant and doesn’t provide new information, you can hide the legend to keep the plot clean.
Below, you’ll see two examples: one where jittering creates a misleading plot, and another where jittering is applied with precautions to ensure clarity (Figure 8).
# Create a "dishonest" jitter plot where jitter is applied randomly to both x and y axes
dishonest_jitter <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) +
geom_jitter() + # Adds random noise to both x and y axes (spreading points around)
scale_y_continuous(limits = c(0,1)) + # Ensures the y-axis is scaled between 0 and 1
labs(title = "Dishonest jitter") # Adds a title indicating the "dishonest" nature of the plot
# Create an "honest" jitter plot with controlled jitter only on the x-axis
honest_jitter <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance, color = cyandensity)) +
geom_jitter(height = 0, width = .2, show.legend = FALSE) + # Jitters only the x-axis (spreading the points horizontally), no vertical jitter
scale_y_continuous(limits = c(0,1)) + # Ensures the y-axis is scaled between 0 and 1
labs(title = "Honest jitter") # Adds a title indicating the "honest" nature of the plot
# Combine both plots side by side with plot labels A and B
dishonest_jitter + honest_jitter + plot_annotation(tag_levels = 'A') # Annotates the plots with labels A (dishonest) and B (honest)
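Because jitter is random, the same code gives a slightly different picture each time it runs. One option, sketched below on made-up data, is to pass position_jitter() with a seed to geom_point(), which makes the horizontal spread reproducible, and then to confirm that the response values were left untouched. The names toy, reproducible_jitter, and jittered are hypothetical.

```r
library(ggplot2)

toy <- data.frame(group = rep(c("low", "med", "high"), each = 4),
                  value = rep(c(0.2, 0.4, 0.6, 0.8), times = 3))

# height = 0 keeps the response exact; seed fixes the random horizontal offsets
reproducible_jitter <- ggplot(toy, aes(x = group, y = value)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 42))

# Compare the plotted y values with the original data
jittered <- layer_data(reproducible_jitter)
identical(sort(jittered$y), sort(toy$value))

reproducible_jitter
```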
To highlight another common problem in data presentation, let’s explore the variability in mammal lifespan across species as a function of body weight. Because both variables are extremely right skewed, log-transforming these data can reveal important patterns (compare Fig. 9a with Fig. 9b). However, it’s crucial to make it clear to the reader that the data are presented on a log scale. Figure 9 uses the annotation_logticks() function to clearly indicate the use of a log scale and prevent misinterpretation of relationships as linear. Note that we apply the log transformation with the scale_x_continuous() and scale_y_continuous() functions rather than by transforming the data, so the points are positioned on a log scale while the axis labels still show the original, untransformed values.
# Load the mammal lifespan dataset from a tab-delimited (.txt) file online
mammal_lifespan <- read_tsv("http://www.statsci.org/data/general/sleep.txt") # because it is tab-delimited
# Create a base plot with mammal body weight on the x-axis and lifespan on the y-axis
base_plot <- ggplot(mammal_lifespan, aes(x = BodyWt, y = LifeSpan)) +
geom_point() # Add points for each species in the dataset
# Modify the base plot to apply a log10 transformation on both x and y axes
log_10_plot <- base_plot +
scale_x_continuous(trans = "log10") + # Log10 scale for the x-axis (Body Weight)
scale_y_continuous(trans = "log10") # Log10 scale for the y-axis (Lifespan)
# Add log ticks to the log10 plot on the bottom (x-axis) and left (y-axis)
log_10_plot_w_ticks <- log_10_plot +
annotation_logticks(sides = "bl", base = 10, size = .2) # Ticks on bottom "b" and left "l"
base_plot + log_10_plot + log_10_plot_w_ticks + plot_annotation(tag_levels = 'A')
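ggplot2 also offers the shorthand scale_x_log10() and scale_y_log10(), which should give the same result as passing trans = "log10". The sketch below compares the two on a small made-up data frame (toy, long_form, and short_form are hypothetical names). Recent ggplot2 releases prefer the argument name transform over trans, so you may see a deprecation message with the long form.

```r
library(ggplot2)

toy <- data.frame(weight = c(0.01, 1, 100, 10000), lifespan = c(2, 8, 30, 70))

long_form <- ggplot(toy, aes(x = weight, y = lifespan)) +
  geom_point() +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")

short_form <- ggplot(toy, aes(x = weight, y = lifespan)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

# Both versions place the points at log10 of the original values
all.equal(layer_data(long_form)$x, layer_data(short_form)$x)
```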
When I say “show your data,” I mean SHOW THE FUCKING DATA - THAT IS, THE ACTUAL POINTS, not just summaries of the data such as means, standard errors, or boxplots. Fortunately, when working with raw data, ggplot2 makes it fairly difficult to hide the data. As long as you’re feeding raw data into ggplot, most geoms will display the data directly. However, there are a few important considerations.
With a modest-sized dataset, like the daphnia_resist data, overplotting can obscure individual data points. A simple jitter plot can solve this issue by spreading out points that would otherwise overlap (compare Fig. 10a to Fig. 10b). Remember to adjust the width and height arguments in geom_jitter() to create an honest jitter plot.
daphnia_points <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance, color = cyandensity)) +
geom_point(show.legend = FALSE) +
scale_y_continuous(limits = c(0,1)) +
labs(title = "Points hide data")
daphnia_jitter <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance, color = cyandensity)) +
geom_jitter(height = 0, width = 0.2, show.legend = FALSE) +
scale_y_continuous(limits = c(0,1)) +
labs(title = "Jitter shows data")
daphnia_points + daphnia_jitter + plot_annotation(tag_levels = 'A')
With larger datasets, overplotting becomes a more significant challenge. For instance, plotting the price of diamonds by clarity can lead to overplotting, and while jittering can help, it doesn’t always fully resolve the issue. Below are three common solutions:
Use the alpha parameter in geom_jitter() to make overlapping points less prominent.
Summarize the shape of the distribution with a violin plot (geom_violin()).
Use the geom_sina() function from the ggforce package.
Figure 11 demonstrates several suboptimal solutions (Fig. 11a-d). The optimal solution (Fig. 11e) combines a violin plot (to help guide the eye toward the shape of the distribution) with a sinaplot that uses significant transparency. If these approaches don’t work for your data, you can explore more strategies in Chapter 5 of the R Graphics Cookbook (Chang 2020).
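library(ggforce)
library(dplyr) # Provides %>%, group_by(), summarise(), and n()
# The red point in each panel marks the harmonic mean price, n()/sum(1/price), for each clarity level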
base_diamond <- ggplot(diamonds, aes(x = clarity, y = price)) + scale_y_continuous(trans = "log10")+ annotation_logticks(sides = "l")
plot_a <- base_diamond + geom_point() + labs(title = "geom_point") + geom_point(data = . %>% group_by(clarity) %>% summarise(price = n()/sum(1/price,na.rm=TRUE)), color = "red")
plot_b <- base_diamond + geom_jitter() + labs(title = "geom_jitter") + geom_point(data = . %>% group_by(clarity) %>% summarise(price = n()/sum(1/price)), color = "red", size = 3)
plot_c <- base_diamond + geom_jitter(alpha = 0.1) + labs(title = "geom_jitter(alpha = 0.1)") + geom_point(data = . %>% group_by(clarity) %>% summarise(price = n()/sum(1/price)), color = "red", size = 3)
plot_d <- base_diamond + geom_violin() + labs(title = "geom_violin") + geom_point(data = . %>% group_by(clarity) %>% summarise(price = n()/sum(1/price)), color = "red", size = 3)
plot_e <- base_diamond + geom_violin() + geom_sina(alpha = 0.1) + labs(title = "geom_violin() and\ngeom_sina(alpha = 0.1)") + geom_point(data = . %>% group_by(clarity) %>% summarise(price = n()/sum(1/price)), color = "red", size = 3)
plot_a + plot_b + plot_c + plot_d + plot_e
Boxplots, which show quartiles and are excellent for summarizing data, don’t address overplotting. In fact, when a boxplot is layered on top of raw data, it can obscure individual data points. As seen in Figure 12, boxplots alone do not fully represent the data, and if added after data points, they can hide underlying patterns.
daphnia_boxplot <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) +
geom_boxplot(outlier.color = NA) +
scale_y_continuous(limits = c(0,1)) +
labs(title = "Where are the data?")
daphnia_boxplot_b <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) +
geom_jitter(width = 0.2, height = 0) +
geom_boxplot(outlier.color = NA) +
scale_y_continuous(limits = c(0,1)) +
labs(title = "ggplot layers geoms on\ntop of each other")
daphnia_boxplot_c <- daphnia_boxplot +
geom_jitter(width = 0.2, height = 0) +
labs(title = "Here are the data!")
daphnia_boxplot + daphnia_boxplot_b + daphnia_boxplot_c + plot_annotation(tag_levels = 'A')
Similarly, reducing data to means and error bars can hide variability. If you need to include these summaries, ensure they don’t obscure the raw data.
means_and_errors <- ggplot(daphnia_resist, aes(x = cyandensity, y = resistance)) +
stat_summary(size = .2) +
scale_y_continuous(limits = c(0,1)) +
labs(title = "Means +/- error hides the data")
jitter_plus_means_and_errors <- daphnia_jitter +
stat_summary(color = "black", size = .2) +
labs(title = "Adding means +/- error can enhance a plot")
means_and_errors + jitter_plus_means_and_errors + plot_annotation(tag_levels = 'A')
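When stat_summary() is called without a summary function, as above, ggplot2 falls back to mean_se(), which returns the mean plus and minus one standard error. The short sketch below, on made-up numbers, shows what that function returns and reproduces it by hand so you know exactly what the error bars represent.

```r
library(ggplot2)

resistance <- c(0.60, 0.70, 0.80, 0.90)

# What stat_summary() computes by default for each group
mean_se(resistance)

# The same quantities by hand: the mean, and the mean -/+ one standard error
se <- sd(resistance) / sqrt(length(resistance))
c(y = mean(resistance),
  ymin = mean(resistance) - se,
  ymax = mean(resistance) + se)
```

Whatever summary you choose, say in the figure legend what the bars show, since standard errors, standard deviations, and confidence intervals look identical on a plot.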
ggplot2 is excellent for making clear, explanatory plots, but mastering this requires learning several new tricks. Clear plots are easy to interpret and help readers focus on key patterns. Below are some useful R tricks to achieve these goals. I recommend skimming these to get a sense of what you can do, rather than memorizing each trick. For further exploration, I highly recommend The R Graphics Cookbook (Chang 2020).
By default, R orders categorical variables alphabetically, which often doesn’t make sense biologically.
In the previous sections, we used the forcats package to reorder ordinal variables as needed. However, if your categorical variable is nominal (i.e., has no natural order), a great way to highlight patterns is to order categories from the largest to the smallest mean.
The fct_reorder function from the forcats package will reorder a factor based on another value. Note: This transformation happens before plotting, so we use the mutate function.
After generating Figure 13a with the default order of regions, I reorder region by the mean student_ratio in descending order (removing NA values) to create Figure 13b using the same code as Figure 13a.
student_order_a <- ggplot(df_ratios, aes(x = region, y = student_ratio, color = region)) +
geom_jitter(alpha = .5, size = 3, height = 0, width = .15, show.legend = FALSE) +
scale_y_continuous(limits = c(5,100), trans = "log10") +
stat_summary(color = "black")
df_ratios <- df_ratios %>%
mutate(region = fct_reorder(region, student_ratio, mean, na.rm = TRUE, .desc = TRUE))
student_order_b <- ggplot(df_ratios, aes(x = region, y = student_ratio, color = region)) +
geom_jitter(alpha = .5, size = 3, height = 0, width = .15, show.legend = FALSE) +
scale_y_continuous(limits = c(5,100), trans = "log10") +
stat_summary(color = "black")
student_order_a + student_order_b + plot_annotation(tag_levels = 'A')
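As a quick check of what fct_reorder() does, the sketch below applies it to a tiny made-up data frame and prints the levels before and after. The data and the names toy and region_by_mean are hypothetical stand-ins, not the df_ratios data.

```r
library(forcats)

toy <- data.frame(region = c("Asia", "Asia", "Africa", "Africa", "Europe", "Europe"),
                  ratio  = c(38, 42, 20, 24, 12, 14))

# Default: levels are alphabetical
levels(factor(toy$region))

# Reordered: levels sorted from the largest to the smallest mean ratio
region_by_mean <- fct_reorder(toy$region, toy$ratio, .fun = mean, .desc = TRUE)
levels(region_by_mean)
```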
There are several ways in which labels can be unclear:
Let’s explore how to solve these issues in R.
In our earlier example of student-to-faculty ratios by region, we saw that region names were overlapping, which made them difficult to read. We can solve this by either rotating the axis labels or, even better, flipping the coordinates. Figure 14 demonstrates how to implement these fixes in R.
student_plot_a <- ggplot(df_ratios, aes(x = region, y = student_ratio, color = region)) +
geom_jitter(alpha = 0.3, size = 2, height = 0, width = 0.15, show.legend = FALSE) +
scale_y_continuous(limits = c(5, 100), trans = "log10")
# Rotating axis labels by 90 degrees
student_plot_b <- student_plot_a +
theme(axis.text.x = element_text(angle = 90))
# Flipping the coordinates to avoid overlapping labels
student_plot_c <- student_plot_a +
coord_flip()
We often want plots with more descriptive labels than the variable names we want to type during data analysis. You can override the default labels with the labs() function. You can add math and Greek letters by using the expression function (Fig. 15).
student_plot_c +
labs(y = expression(paste('Students per faculty (', rho, " = ", frac('# students','#faculty'),")")))
Consider Figure 16. In Figure 16a an educated reader could deduce that the facets designate the penguin species. Maybe a reader who knew a lot about cars could guess that the facets in Figure 16c refer to the number of gears in the car, but I would not have guessed that.
We do not want our readers guessing. At best this is frustrating for them and at worst they guess wrong. We could note the labels in the legend, but that would add to the reader’s cognitive load. It is best to include this information in the facet itself with the labeller argument of facet_wrap (e.g., facet_wrap(... , labeller = "label_both"); see Fig. 16).
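A minimal sketch of the difference, using the built-in mtcars data (the plot names unlabelled and labelled are made up for this example):

```r
library(ggplot2)

# Default facet strips show only the values: "3", "4", "5"
unlabelled <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~gear)

# label_both adds the variable name: "gear: 3", "gear: 4", "gear: 5"
labelled <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~gear, labeller = label_both)

labelled
```

If the variable name itself is cryptic, consider renaming the column before plotting, so that the strip reads naturally.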
This part can be very fun, but there are two common pitfalls to avoid:
We can enhance our plots by adding statistical summaries, guiding lines, and annotations to help the reader interpret patterns more easily. In Figure 17, we illustrate several techniques:
We use the mean_cl_boot function with the stat_summary() function, allowing us to show both the estimates and their uncertainty.
We add a dashed line with geom_hline() to represent the overall mean. You could similarly use geom_vline() to add a vertical line.
We use the annotate() function to label the dashed line with the word “mean,” making it clear to the reader what the line represents.
We label selected points with the geom_text() function. Instead of labeling every point, we filtered the data to display only the countries in Africa with the largest and smallest student-to-faculty ratios, as Africa is a particularly variable region. This filtering was done using standard dplyr tools.
ggplot(df_ratios, aes(x = region, y = student_ratio, color = region)) +
geom_jitter(alpha = .3, size = 2, height = 0, width = .15, show.legend = FALSE) +
scale_y_continuous(limits = c(0, 100)) +
stat_summary(fun.data = mean_cl_boot, color = "black") +
geom_hline(yintercept = summarise(df_ratios, mean(student_ratio, na.rm = TRUE)) %>% pull(), lty = 2) + # lty = 2 for dashed line
annotate(x = 3.5, y = 28, label = "mean", geom = "text", size = 3) +
geom_text(data = . %>%
filter(region == "Africa") %>%
filter(student_ratio == max(student_ratio, na.rm = TRUE) |
student_ratio == min(student_ratio, na.rm = TRUE)),
aes(label = country), size = 3, hjust = 0, show.legend = FALSE)
There are several ways to add direct labels to your plots. The simplest method is using the annotate() function, as introduced earlier. However, a more refined approach is to use geom_label(), allowing you to summarize your data on the fly, as was done in Figure 18.
ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
geom_point() +
geom_label(data = . %>%
group_by(species) %>%
summarise_at(c("bill_depth_mm", "bill_length_mm"), mean, na.rm = TRUE), # na.rm = TRUE so species with missing measurements still get a label
aes(label = species), fontface = "bold",size = 3,alpha=.4)+
theme(legend.position = "none")
R doesn’t automatically recognize which data points belong together in groups. For instance, if we are plotting the weight of baby chicks on different diets over time (as in Figure 19) or bacterial growth rates over time, we need to explicitly tell R to group by chick, bacteria plate, or whatever grouping variable is relevant. Without this, we may end up with nonsensical plots (e.g., Figure 19a).
In Figure 19b, we specify group = Chick in the main aes call when setting up the ggplot, ensuring that R knows to connect data points for each chick. In Figure 19c, we set group = Chick only in the aes of geom_line, so that stat_summary can draw the mean chick weight over time for each diet on top of the fainter individual trajectories, without cluttering the plot.
ChickWeight <- mutate(ChickWeight , Diet = factor(Diet)) # letting R know diet is categorical
chick_plot_a <- ggplot(ChickWeight, aes(x = Time, y = weight, color = Diet))+
geom_line()
chick_plot_b <- ggplot(ChickWeight, aes(x = Time, y = weight, group = Chick, color = factor(Diet)))+
geom_line(alpha = .8)
chick_plot_c <- ggplot(ChickWeight, aes(x = Time, y = weight, color = factor(Diet)))+
geom_line(aes(group = Chick), alpha = .3)+
stat_summary(fun= mean, geom = "line", size = 2,alpha = .7)
library(cowplot) # Provides plot_grid() and get_legend()
chick_plots <- plot_grid(chick_plot_a + theme(legend.position = "none") + annotate(x = -Inf, y = Inf, color = "red", label = "bad :(", geom = "text", hjust = -1, vjust = 1) ,
chick_plot_b + theme(legend.position = "none") + annotate(x = -Inf, y = Inf, color = "blue", label = "better :|", geom = "text", hjust = -1, vjust = 1),
chick_plot_c + theme(legend.position = "none") + annotate(x = -Inf, y = Inf, color = "blue", label = "better :|", geom = "text", hjust = -1, vjust = 1),
labels = c("a","b","c"), nrow = 1)
chick_legend <- get_legend(chick_plot_c + theme(legend.position = "bottom"))
plot_grid(chick_plots,chick_legend, rel_heights = c(8,2), nrow =2)
Just as your data type should guide your choice of plot, it should also guide your choice of color scheme when mapping color onto a variable:
I’ve spent a lot of time choosing colors for plots, and I suggest simplifying this task by using the viridis package. This package offers a range of color schemes designed to minimize confusion for readers with color vision deficiencies. I demonstrate how to use this package in Figure 20.
color_a <- chick_plot_c +
scale_color_viridis_d() +
theme(legend.position = "bottom") +
labs(title = "Discrete qualitative palettes\nfor nominal variables")
color_b <- mammal_lifespan %>%
mutate(Danger = factor(Danger)) %>% # Indicating Danger is categorical
ggplot(aes(x = BodyWt, y = LifeSpan, color = Danger)) +
geom_point(size = 3, alpha = .6) +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10") +
annotation_logticks(sides = "bl", base = 10, size = .2) +
scale_color_viridis_d() +
theme(legend.position = "bottom") +
labs(title = "Discrete sequential palettes\nfor ordinal variables")
color_c <- ggplot(mammal_lifespan, aes(x = BodyWt, y = LifeSpan, color = Gestation)) +
geom_point(size = 3, alpha = .6) +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10") +
annotation_logticks(sides = "bl", base = 10, size = .2) +
scale_color_viridis_c() +
theme(legend.position = "bottom") +
labs(title = "Continuous sequential palettes\nfor continuous variables")
plot_grid(color_a, color_b, color_c, labels = letters[1:3], ncol = 2)
Also check out the colorspace package! Check out the available palettes and which types of variables they best accompany here!
If you want to specify a few colors yourself, you can use the scale_color_manual(values = <vector of colors>) function. For example, scale_color_manual(values = c("red", "blue")) will color the first category in red and the second in blue. Alternatively, for a bit of fun, you can try the beyonce or wesanderson packages, which offer themed color palettes.
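Relying on the order of categories can be fragile, since reordering a factor silently changes which color goes with which group. A sketch of a sturdier option is to give scale_color_manual() a named vector, so each color is tied to a category by name. The data frame toy and the names diet_colors and named_colors below are made up for illustration.

```r
library(ggplot2)

toy <- data.frame(x = 1:4, y = 1:4,
                  diet = c("control", "treated", "control", "treated"))

# Naming the vector ties each color to a category, regardless of level order
diet_colors <- c(treated = "blue", control = "red")

named_colors <- ggplot(toy, aes(x = x, y = y, color = diet)) +
  geom_point(size = 3) +
  scale_color_manual(values = diet_colors)

layer_data(named_colors)$colour  # the color assigned to each row

named_colors
```

Reusing the same named vector across plots also keeps colors consistent from one figure to the next.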
Luckily, R doesn’t make it too easy to add distracting elements like backgrounds, 3D effects, or unnecessary animations. However, with enough creativity and effort, anything is possible. Before adding any fancy decorations to a plot, ask yourself if the addition provides critical information or simply distracts from the data.
How can we build universal design into R plots?
The main reason I recommended the colorspace package is that its color palettes are already optimized for accessibility.
Even with accessible color choices, it’s essential to check if your figures are interpretable for people with color vision deficiency. Earlier, we explored an online color vision deficiency emulator. Alternatively, the colorblindr package can emulate color vision deficiencies directly in R.
We’ve previously discussed adjusting the size of points (e.g., in geom_point()) or text elements (in theme()). You can also explore the ggThemeAssist package to make theme adjustments easier.
Colors can be difficult to distinguish, especially when printed in black and white. Using redundant coding, such as varying shapes, sizes, or patterns alongside color, ensures that readers can interpret your data in multiple ways. Figure 22 provides examples:
# Step A: Create plot with log-transformed axes and color/shape based on 'Danger'
mammal_lifespan <- read_tsv("http://www.statsci.org/data/general/sleep.txt") # because it is tab-delimited
redundant_coding_a <- mammal_lifespan %>%
mutate(Danger = factor(Danger)) %>% # Ensure 'Danger' is a factor
ggplot(aes(x = BodyWt, y = LifeSpan, color = Danger, shape = Danger)) +
geom_point(size = 3, alpha = 1) + # Plot points with size and opacity
scale_x_continuous(trans = "log10") + # Log transform x-axis
scale_y_continuous(trans = "log10") + # Log transform y-axis
annotation_logticks(sides = "bl", base = 10, size = .2) + # Add log ticks
scale_color_viridis_d() + # Use discrete viridis color scale
theme_light() + # Use light theme first: a complete theme overrides earlier theme() settings
theme(legend.position = "bottom") # Then position legend at the bottom
# Step B: Modify plot by adding custom shapes and hiding legend text
redundant_coding_b <- redundant_coding_a + # Start with previous plot
scale_shape_manual(values = as.character(1:5)) + # Custom shapes
theme(legend.text = element_blank()) # Hide legend text
# Step C: Create plot with log-transformed axes and color/size based on 'Gestation'
redundant_coding_c <- ggplot(mammal_lifespan, aes(x = BodyWt, y = LifeSpan, color = Gestation, size = Gestation)) +
geom_point(alpha = .6) + # Plot points with opacity
scale_x_continuous(trans = "log10") + # Log transform x-axis
scale_y_continuous(trans = "log10") + # Log transform y-axis
annotation_logticks(sides = "bl", base = 10, size = .2) + # Add log ticks
scale_color_viridis_c() + # Use continuous viridis color scale
scale_size(guide = guide_legend(reverse = TRUE))+ # Reverse the order of the size legend
theme_light() # Use light theme
plot_grid(redundant_coding_a, redundant_coding_b, redundant_coding_c, labels = letters[1:3], ncol = 2)
Watch this video from Stat 545 about improving your ggplots (7 min 25 sec). Note: they use a slightly different method for log scaling than what we’ve shown, but there are many ways to achieve the same result in R.
In this section, I highlight a few additional stylistic options to further customize your ggplots, which I’ve already used in the examples above.
In the examples above, we used the legend.position argument in theme() to place the legend at the bottom of the plot. However, you can have much more control over legend placement. Try experimenting with the ggThemeAssist package to explore different options for positioning and styling your legends!
The default ggplot theme may not suit everyone. You can easily spice up your plots by switching to other themes. For example, I used theme_light() in a previous plot. ggplot comes with several built-in themes, and the ggthemes package provides even more options (see Fig. 24). If you’re feeling creative, you can even design your own custom theme.
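One simple way to design your own theme, sketched below, is to wrap a built-in complete theme and a few theme() tweaks in a function. The function name theme_my_lab and the particular choices inside it are hypothetical; swap in whatever suits your figures.

```r
library(ggplot2)

# A hypothetical house theme: start from a complete theme, then add tweaks.
# The tweaks come after theme_light(), because a complete theme replaces
# any theme() settings that were added before it.
theme_my_lab <- function(base_size = 14) {
  theme_light(base_size = base_size) +
    theme(legend.position = "bottom",
          panel.grid.minor = element_blank())
}

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = "cyl") +
  theme_my_lab()
```

Calling theme_set(theme_my_lab()) once at the top of a script would apply it to every later plot in that session.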