5.8 Plotting with ggplot2

ggplot2 is a plotting package that makes it simple to create complex plots from data stored in a data frame. It provides a programmatic interface for specifying which variables to plot, how they are displayed, and general visual properties. As a result, only minimal changes are needed if the underlying data change or if we decide to switch from a bar plot to a scatterplot. This helps in creating publication-quality plots with little adjustment and tweaking.

First, let’s load ggplot2.

library(ggplot2)

ggplot2 functions work best with data in the ‘long’ format, i.e., a column for every dimension and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2.
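To make the ‘long’ format concrete, here is a minimal sketch of reshaping a wide table into a long one. The table, its values, and the use of the tidyr package are assumptions made for illustration; they are not part of the lesson's dataset.

```r
library(tidyr)

# Hypothetical wide table (made-up values): one row per state,
# one column per year
wide_counts <- data.frame(
  State = c("Ohio", "Texas"),
  `1962` = c(120, 340),
  `1963` = c(95, 310),
  check.names = FALSE
)

# Reshape to long format: one row per State-Year observation
long_counts <- pivot_longer(
  wide_counts,
  cols = -State,        # every column except State holds counts
  names_to = "Year",    # former column names become the Year variable
  values_to = "Count"   # former cell values become the Count variable
)

# Column names are text, so convert Year to a number for plotting
long_counts$Year <- as.integer(long_counts$Year)

long_counts
```

In the long version, Year and Count are ordinary columns, so each can be mapped to an aesthetic directly.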

ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

Each chart built with ggplot2 must include the following:

  • Data

  • Aesthetic mapping (aes)

    • Describes how variables are mapped onto graphical attributes
    • Visual attributes of the data include x and y position, color, fill, shape, and alpha (transparency)
  • Geometric objects (geom)

    • Determine how values are rendered graphically: as bars (geom_bar), points in a scatterplot (geom_point), lines (geom_line), etc.

Thus, the template for a graphic in ggplot2 is:

<DATA> %>%
    ggplot(aes(<MAPPINGS>)) +
    <GEOM_FUNCTION>()

Remember from the last lesson that the pipe operator %>% passes the result of the preceding expression to the first argument of the next function. The pipe comes from the magrittr package and is available whenever dplyr is loaded. ggplot() expects a data frame as its first argument, so instead of specifying the data = argument inside ggplot(), we can pipe the data into it.
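As a runnable illustration of the template, the sketch below fills in each placeholder and builds the same plot with and without the pipe. The small data frame is a hypothetical stand-in for avg_count_by_year with made-up values, so that the code runs on its own.

```r
library(dplyr)    # makes the %>% pipe available
library(ggplot2)

# Hypothetical stand-in for avg_count_by_year (values are made up)
avg_count_by_year <- data.frame(
  Year = c(1960, 1961, 1962, 1963, 1964),
  Avg  = c(900, 850, 800, 600, 400)
)

# <DATA> %>% ggplot(aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
piped <- avg_count_by_year %>%
  ggplot(aes(x = Year, y = Avg)) +
  geom_line()

# The same plot with the data passed through the data = argument
explicit <- ggplot(data = avg_count_by_year, aes(x = Year, y = Avg)) +
  geom_line()

piped
```

Both objects are bound to the same data frame; only the way the data reaches ggplot() differs.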

  • Use the ggplot() function and bind the plot to a specific data frame.
avg_count_by_year %>% ggplot()

is the same as:

ggplot(data=avg_count_by_year)

  • Define a mapping with the aesthetic (aes) function by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or as characteristics such as size, shape, or color.
avg_count_by_year %>% ggplot(aes(x=Year, y=Avg))

This can also be written more concisely:

avg_count_by_year %>% ggplot(aes(Year, Avg))

  • Add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms for common graph types. To add a geom to the plot, use the + operator. Note too that you can save plots as objects.
year_avg_line <- avg_count_by_year %>%
  ggplot(aes(x=Year, y=Avg)) + 
  geom_line()

year_avg_line

It might be nice to see where each data point falls on the line. To do this we can add another geometry layer.

year_avg_line <- avg_count_by_year %>% 
  ggplot(aes(x=Year, y=Avg)) + 
  geom_line() + 
  geom_point()

year_avg_line

The + operator in ggplot2 is particularly useful because it also lets you modify existing ggplot objects. This means you can easily set up plot templates and conveniently explore different types of plots, so the above plot can also be generated by saving the line plot first and then adding the points to the stored object:

year_avg_line <- avg_count_by_year %>% ggplot(aes(x=Year, y=Avg)) + geom_line()

year_avg_line <- year_avg_line + geom_point()

year_avg_line
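The template idea can be sketched with a base object that holds only the data and the mappings and is then reused for several plot types. The data frame is again a hypothetical stand-in with made-up values, and the object names are chosen for illustration. Counting the layers shows that every + appends one layer, which is why adding geom_point() to a plot that already contains it draws the points twice.

```r
library(ggplot2)

# Hypothetical stand-in for avg_count_by_year (values are made up)
avg_count_by_year <- data.frame(
  Year = 1960:1964,
  Avg  = c(900, 850, 800, 600, 400)
)

# A reusable template: data and aesthetic mappings, but no geoms yet
base_plot <- ggplot(avg_count_by_year, aes(x = Year, y = Avg))

# Explore different plot types from the same template
line_plot  <- base_plot + geom_line()
point_plot <- base_plot + geom_point()
both_plot  <- line_plot + geom_point()

# Each + appends exactly one layer to the stored object
length(base_plot$layers)  # 0
length(line_plot$layers)  # 1
length(both_plot$layers)  # 2

both_plot
```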

There are many ways to customize your plot, such as changing the color or line type or adding labels and annotations. One thing that would make our graph easier to read is a tick mark at each decade on the x-axis.

year_avg_line <- avg_count_by_year %>% 
  ggplot(aes(Year, Avg)) +
  geom_line() + 
  geom_point() + 
  scale_x_continuous(breaks = seq(1900, 2000, 10))

year_avg_line
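The breaks argument takes an ordinary numeric vector, so it can help to look at what seq() produces before handing it to scale_x_continuous(). The second half of this sketch derives the breaks from the data instead of hard-coding them; the range of years used there is an assumption for illustration.

```r
# The breaks used in the plot above: every tenth year from 1900 to 2000
decade_breaks <- seq(1900, 2000, 10)
decade_breaks
length(decade_breaks)  # 11

# Assumption: a hypothetical Year column spanning 1928 to 2002
years <- 1928:2002

# Round the range outward to whole decades so the breaks follow the data
data_breaks <- seq(floor(min(years) / 10) * 10,
                   ceiling(max(years) / 10) * 10,
                   by = 10)
data_breaks
```

Breaks that fall outside the range of the plotted data are not drawn, so a generous hard-coded range is harmless.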

Perhaps we would like to add some more information to our graph for context. Let’s draw a reference line at 1963 to clearly indicate on the graph when the vaccine was introduced. To do this we add geom_vline() for the line and annotate() for the label. There are multiple ways of adding lines and text to a plot, but these will serve us well for this case. Note that you can change features of lines such as color, type, and width (the linewidth argument in recent versions of ggplot2).

year_avg_line <- avg_count_by_year %>% 
  ggplot(aes(Year, Avg)) + 
  geom_line() + 
  geom_point() + 
  scale_x_continuous(breaks = seq(1900, 2000, 10)) +
  geom_vline(xintercept = 1963, color = "red", linetype= "dashed") +
  annotate(geom = "label", x=1963, y=18000, label="1963: vaccine introduced")

year_avg_line
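The label above sits at a hard-coded y value of 18000, which only works if the data reach that height. One possible alternative, sketched here on hypothetical data with made-up values, is to take the label position from the data so that it stays inside the plot when the numbers change.

```r
library(ggplot2)

# Hypothetical stand-in for avg_count_by_year (values are made up)
avg_count_by_year <- data.frame(
  Year = 1960:1966,
  Avg  = c(900, 850, 800, 600, 400, 300, 250)
)

# Place the label at the highest plotted value instead of a fixed number
label_y <- max(avg_count_by_year$Avg)

annotated <- ggplot(avg_count_by_year, aes(x = Year, y = Avg)) +
  geom_line() +
  geom_vline(xintercept = 1963, color = "red", linetype = "dashed") +
  annotate(geom = "label", x = 1963, y = label_y,
           label = "1963: vaccine introduced")

annotated
```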

Finally, let’s add a title and axis labels to our plot with the labs() function. Note that axis labels will automatically be supplied from the column names, but you can use this function to override those defaults.

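year_avg_line <- avg_count_by_year %>% 
  ggplot(aes(Year, Avg)) +
  geom_line() + 
  geom_point() + 
  scale_x_continuous(breaks = seq(1900, 2000, 10)) +
  geom_vline(xintercept = 1963, color = "red", linetype= "dashed") +
  annotate(geom = "label", x=1963, y=18000, label="1963: vaccine introduced") +
  labs(title = "Average number of measles cases per year",
       x = "Year",
       y = "Average")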

year_avg_line

Because ggplot2 is part of the tidyverse along with dplyr, you can clean data and plot it in a single pipeline. Below we create a new plot that takes a closer look at measles cases in 1963. First we filter the dataset to the rows for 1963, and then we use geom_bar() to draw one bar per state. The stat = "identity" argument tells geom_bar() to use the TotalCount values as the bar heights rather than counting rows.

# Combine data cleaning and plotting in one step.
# Plot the count for each state as a bar graph for 1963, the year the vaccine was introduced.
count_1963 <-
  non_cumulative_year %>% 
  filter(Year == 1963) %>% 
  ggplot(aes(x = State, y = TotalCount)) + 
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 90))

count_1963
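ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity"). The sketch below repeats the filter-then-plot pipeline with geom_col() on a hypothetical stand-in for non_cumulative_year. The states and counts are made up, and the object name count_1963_col is chosen for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for non_cumulative_year (values are made up)
non_cumulative_year <- data.frame(
  State      = c("OHIO", "TEXAS", "UTAH", "OHIO", "TEXAS", "UTAH"),
  Year       = c(1962, 1962, 1962, 1963, 1963, 1963),
  TotalCount = c(5200, 9100, 700, 4300, 8800, 650)
)

# Filter to 1963 and plot in one pipeline; geom_col() uses the
# y values as bar heights, like geom_bar(stat = "identity")
count_1963_col <- non_cumulative_year %>%
  filter(Year == 1963) %>%
  ggplot(aes(x = State, y = TotalCount)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90))  # rotate state labels

count_1963_col
```

Only the filtered rows reach ggplot(), so the stored plot object carries just the 1963 data.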