This chapter presents fundamentals of data visualization in R using the ggplot2 package. One of R’s strengths is that it empowers a large global community of developers to create and disseminate new functionality through user-contributed packages. As a result, there are many packages that improve on the data manipulation and plotting capabilities included with the base R installation. Each package provides a set of functions with accompanying documentation and datasets. In this chapter, we will begin to explore several packages that are part of the tidyverse collection of data science tools. For simplicity, we will start by working with non-spatial datasets. In upcoming chapters, these techniques will be extended to generate maps and analyze geospatial data.
library() function is used to load R packages. If these packages are not yet on your computer, you will need to install them using the
install.packages() function or the installation tools available in RStudio under Tools > Install Packages.
library(ggplot2) library(dplyr) library(readr) library(readxl)
Nearly every script requires loading one or more packages. Although a package only needs to be installed once, it must be loaded with
library() every time a new R session is started. Therefore, it is good practice to include the necessary code at the beginning of the script file. This approach ensures that the packages are loaded at the beginning of your session and makes it easier to see which packages are being used in the script.
The data used in this chapter are meteorological observations from the Oklahoma Mesonet, a network of environmental monitoring stations distributed throughout Oklahoma (https://weather.ok.gov/). The data are provided in a comma-separated values (CSV) file, which can be read and stored as a data frame object using the
read_csv() function from the readr package. This function does essentially the same thing as the base
read.csv() function that was used in the last tutorial, but has a few added features and is faster for reading large csv files.
<- read_csv("mesodata_small.csv", show_col_types = FALSE) mesosm class(mesosm) ##  "spec_tbl_df" "tbl_df" "tbl" "data.frame" mesosm## # A tibble: 240 × 9 ## MONTH YEAR STID TMAX TMIN HMAX HMIN RAIN ## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 2014 HOOK 49.5 17.9 83.0 29.0 0.17 ## 2 2 2014 HOOK 47.2 17.1 88.3 39.9 0.3 ## 3 3 2014 HOOK 60.7 26.1 79.0 25.4 0.31 ## 4 4 2014 HOOK 72.4 39.3 81.8 21.0 0.4 ## 5 5 2014 HOOK 84.4 48.3 75.4 18.8 1.25 ## 6 6 2014 HOOK 90.7 61.9 90.9 28.8 3.18 ## 7 7 2014 HOOK 90.6 64.7 88.2 33.1 2.58 ## 8 8 2014 HOOK 95.8 64.5 85.4 21.8 0.95 ## 9 9 2014 HOOK 84.3 58.0 91.2 36.4 1.48 ## 10 10 2014 HOOK 76.1 44.9 85.7 28.1 1.72 ## # … with 230 more rows, and 1 more variable: DATE <date>
Note that the
mesosm data frame has multiple classes including
tbl_df. It is an enhanced version of a data frame called a tibble that is part of the tidyverse, a collection of R packages for data science. A tibble is identical to a data frame but includes some additional features. For example, the default print method for a tbl object provides an abbreviated view of the first few rows of the data frame rather than trying to print all the data to the screen. In most cases, tbl objects can be used in exactly the same way as data frame objects, and these object types will be treated as synonymous throughout the book.
There are functions in R for importing data from just about any external file type. For example, the
read_excel() function from the readxl package can be used to import data from XLS and XLSX files. Note that these spreadsheet files can contain multiple sheets, so it may be necessary to specify the sheet containing the data. In this example, the first sheet contains a data dictionary and the second sheet contains the data. Additional arguments can also be provided to extract data from a specific range of cells.
<- read_excel("mesodata_small.xlsx", sheet=2) mesosm2 class(mesosm2) ##  "tbl_df" "tbl" "data.frame" mesosm2## # A tibble: 240 × 9 ## MONTH YEAR STID TMAX TMIN HMAX HMIN RAIN ## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 2014 HOOK 49.5 17.9 83.0 29.0 0.17 ## 2 2 2014 HOOK 47.2 17.1 88.3 39.9 0.3 ## 3 3 2014 HOOK 60.7 26.1 79.0 25.4 0.31 ## 4 4 2014 HOOK 72.4 39.3 81.8 21.0 0.4 ## 5 5 2014 HOOK 84.4 48.3 75.4 18.8 1.25 ## 6 6 2014 HOOK 90.7 61.9 90.9 28.8 3.18 ## 7 7 2014 HOOK 90.6 64.7 88.2 33.1 2.58 ## 8 8 2014 HOOK 95.8 64.5 85.4 21.8 0.95 ## 9 9 2014 HOOK 84.3 58.0 91.2 36.4 1.48 ## 10 10 2014 HOOK 76.1 44.9 85.7 28.1 1.72 ## # … with 230 more rows, and 1 more variable: DATE <dttm>
The data contain monthly summaries of several meteorological variables from 2014-2018. We will work with the following data columns.
- STID: Station ID code
- TMAX: Monthly mean of maximum daily temperature (°F)
- TMIN: Monthly mean of minimum daily temperature (°F)
- RAIN: Cumulative monthly rainfall (inches)
- DATE: Date of observation
There are a couple of new classes in these imported data frames. In the
mesosm data frame, the
DATE column belongs to the
date class. In the
mesosm2 data frame, the
DATE column belongs to the
dttm (date/time) class. Don’t worry about these details for now. The functions in ggplot2 will know how handle these classes automatically. An upcoming chapter will provide more information about how to import and manipulate date objects.
One of the most common types of scientific graphics is a time series plot, in which the date or time element is on the x-axis, and the measured variable of interest is on the y-axis. The data values are usually connected by a line to indicate progression through time.
The following code uses the
filter() function from the dplyr package to extract the rows containing meteorological data for the Mount Herman station (MTHE). More details about
filter() and other dplyr functions will be provided in Chapter 4. Here, the output of
filter() is assigned to a new object called
mesomthe. Then, a time series graph of monthly rainfall is generated (Figure 2.1). By default, the graphical output is written to the Plot tab in the RStudio GUI.
<- filter(mesosm, STID =="MTHE") mesomthe ggplot(data = mesomthe) + geom_line(mapping = aes(x = DATE, y = RAIN))
ggplot() function creates a coordinate system onto which data can be plotted. The first argument to
data, the dataset to plot. Running only the first line,
ggplot(data = mesomthe), would create just the blank coordinate system. To add data and modify the appearance of the graph, additional functions from the ggplot2 package are used. In the previous example
geom_line() adds lines to the plot. The
+ symbol indicates that the line will be added to the coordinate system created by
ggplot() using the
mesomthe data frame
mapping argument specifies the aesthetic mapping to use. Aesthetic mappings indicates which columns in the dataset get used for (or “mapped to”) various features of the plot. The mapping is always specified by the
aes() function. In this example, the code specifies that the
DATE column contains x-axis values and the
RAIN column contains y-axis values.
In the previous example, data columns were mapped to the x and y axes. To visualize a third column of data, it needs to be mapped to some other aspect of the plot. To illustrate, we will use the full
mesosm dataset, which contains five years of monthly meteorological data from four sites: HOOK (Hooker in western OK), MTHE (Mount Herman in southeastern OK), SKIA (Skiatook in northeastern OK), and SPEN (Spencer in central OK).
Because there are four sites in this dataset, there needs to be a way to distinguish lines for the sites. A common choice is to show different lines with different colors. Here the
STID column is mapped to the
color aesthetic (Figure 2.2).
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID))
Taking a critical look at 2.2, it is possible to differentiate the colored lines. However, they overlap considerably, and the time series of the different stations are difficult to identify and compare. We can experiment with different aesthetics to see if the graph can be improved. One idea is to try mapping multiple aesthetics to the same column. In this example, each site has a different line pattern as well as a different color (Figure 2.3).
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID, linetype = STID))
Mapping aesthetics based on patterns in addition to (or instead of) color is important in many situations. For example, pattern-based aesthetics can be interpreted by color-blind individuals and reproduced in black and white. However, in this situation, they don’t really help with distinguishing the overlapping lines in the graph.
Other aspects of the graph’s aesthetics can be manipulated. An aesthetic can be set to a fixed value by defining that aesthetic outside of the
aes() function. The following code increases the width of every line slightly to make them easier to see. In this example, the default line size of 0.5 is increased to 1.0 (Figure 2.4). The colors of the thicker lines are a bit easier to distinguish from one another.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID), size = 1.0)
In this sample dataset, the rainfall values from all the stations fall within the same range of values. No matter how the plot aesthetics are modified, they will be crowded and difficult to view on a single set of plot axes. An alternative is to organize these data into multiple plots using facets. Faceting splits the data into subsets based on one or more columns in the data frame and creates a separate chart for each subset.
To facet by a single variable, the
facet_wrap() function is used. Another
+ operator is used at the end of the
geom_line() function to indicate that all of these functions are combined to generate the plot. In this example, the
facets argument takes the name of a single column. The
vars() function is also needed to convert the column name into a format recognized by
STID column contains a character vector with four different station codes, so four facets are generated (Figure 2.5).
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(facets = vars(STID))
Because these are time series data, it is often helpful to arrange the subplots on top of each other instead of side-by-side. This format makes it easy to compare values from the different stations at a given time point by scanning vertically across the subplots (Figure 2.6). The layout can be changed by using the
nrow argument to specify the number of columns or rows.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(facets = vars(STID), ncol = 1)
By default, the
facet_wrap() function uses the same data range for each of the subplots. This approach clearly shows differences in the overall magnitude of rainfall across the subplots. In particular, rainfall at the Hooker Mesonet station in western Oklahoma is considerably lower than at the other stations. The rainfall values at Hooker are compressed within a relatively small portion of the y-axis, which makes it difficult to discern variation through time. Setting the
scales argument to
"free_y" allows the data in each facet to take up the entire vertical space. Alternate values for
"free_x" for scales to vary freely in the x dimension and
"free" to allow them to vary freely in both dimensions. The resulting plot more clearly shows the relative variations in precipitation across the four stations (Figure 2.7).
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(facets = vars(STID), ncol = 1, scales = "free_y")
One problem with allowing the scales to vary across facets is that readers may not carefully examine the y-axis of each plot. A naive viewer might look at Figure 2.7 and assume that rainfall is similar at all four locations based on the aesthetics alone. This approach should therefore be used with caution, and the differences in axis ranges should be explicitly noted in the figure caption and other accompanying text.
geom is the geometrical object that a plot uses to represent data. There are often multiple ways to represent the same data visually. For example, we have been using the
geom_line() function to visualize our time series of weather data as line geometries. Line graphs can be very useful for mapping time series such as weather data because the lines that connect the data points highlight the changes that occur between time periods. Another common technique for graphing data is to use a point for each combination of x and y values with no connecting lines. Replacing the
geom_line with the
geom_point() function generates the following graph (Figure 2.8).
ggplot(data = mesosm) + geom_point(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(facets = vars(STID), ncol = 1)
In most cases, using points alone is not a great choice for time series graphs because the connections between subsequent time periods are not as apparent as in the line graph. Conversely, lines should not be used with data that do not have a natural ordering because they will falsely imply connections between adjacent values in the graph. In some situations, it makes sense to plot data using multiple geometries. For example, with line graphs, it can be difficult to identify values at specific points in time based only on changes in line direction.
Including both lines and points can be effective when it is important to emphasize the sequential nature of time series data and clearly see the individual measurements at each time interval (Figure 2.9). However, the graph is also more crowded and complicated than the simple line graph, and the additional point symbols may become a distraction if they are not really needed. In general, the key to effective scientific graphics is to include just enough details to effectively communicate the important patterns in the data while resisting the urge to add unnecessary embellishments.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN)) + geom_point(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(facets = vars(STID), ncol = 1, scales = "free_y")
Note that some aesthetics can only be used with certain geoms. For example, points can have a
shape aesthetic, but lines cannot. Conversely, lines can have a
linetype aesthetic, but points cannot.
Scales control how data values are translated to visual properties. The default scale setting can be overridden to adjust details like axis labels and legend keys or to use a completely different translation from data to aesthetic.
labs() function can be used to change the axis, legend, and plot labels. Additional arguments to
tag, and any other aesthetics that have been mapped such as
linetype. This example shows precipitation data from the four stations on a single plot with labels that specify the x-axis values, y-axis values, title for the legend of station colors, and an overall plot title (Figure 2.10).
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID), size = 1.0) + labs(x = "Date", y = "Rainfall (in)", color = "Station ID", title = "Rainfall from Four Stations in OK")
Including a graph title is not necessary in many cases. The title can usually be added later in a presentation or document file, and descriptive information can be provided in a separate caption. However, it is good practice to always add descriptive axis and legend titles with measurement units where appropriate. By default, these labels are just the column names from the data frame. Although the analyst may be familiar with these codes, they will usually not be interpretable by other people viewing the graph.
Tick marks and associated text labels indicate how the data are scaled along the axes of the graph. The
ggplot() function uses an algorithm to optimize the default number and placement of the ticks along each axis. In general, there should be enough ticks to make it easy to associate plot aesthetics with axis values, but not so many that the axis becomes cluttered and difficult to interpret. Ticks are associated with round numbers to make the labels easier to read.
In some situations, it is necessary to change the placement of the tick marks and their labels. In the time series graphs produced so far, there is one tick mark every two years on the x-axis, with each tick located on the first day of the year. The wide spacing between ticks makes it difficult to determine what year we are looking at when examining the precipitation patterns. It would be helpful to change the x-axis so that there are ticks and labels for every year.
scale_x_date() function allows modification of the tick marks and labels on the x-axis. There are also
scale_x_discrete() functions that are appropriate when there are continuous or discrete variables mapped to the axis, as well as versions of all these functions for the y-axis. Two arguments are provided to
breaks argument indicates where tick marks and labels will be placed, and the
date_label argument indicates how the dates will be formatted. The
%Y code indicates that only the year will be displayed. The resulting graph now shows one tick mark and label per year (Figure 2.11).
<- c("2014-01-01", "2015-01-01", "2016-01-01", datebreaks "2017-01-01", "2018-01-01", "2019-01-01") <- as.Date((datebreaks)) datebreaks ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(facets = vars(STID), ncol = 1) + labs(x = "Date", y = "Rainfall (in)", color = "Station ID") + scale_x_date(breaks = datebreaks, date_label = "%Y")
Remember that the
DATE variable plotted on the x-axis is a
Date object. This is why it is necessary to use the
scale_x_date() function, and why the
datebreaks vector needs to be converted from a
character vector to a
date vector using the
as.Date() function. Chapter 4 will provide more information about date variables and how to manipulate them.
Scale functions are frequently used to specify the colors and symbols that will be mapped to the data values. For example, the
scale_color_manual() function can be used to manually specify the colors for each value of the
STID variable associated with the
color aesthetic (Figure 2.12). Note that other aesthetics have their own scale functions, such as
scale_size(). Additional examples of scale functions will be provided as more complex plots and maps are developed throughout the book.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID), size = 1.0) + labs(x = "Date", y = "Rainfall (in)", color = "Station ID") + scale_x_date(breaks = datebreaks, date_label = "%Y") + scale_color_manual(values = c("firebrick", "steelblue", "olivedrab", "gold"))
Colors can be specified in several ways in R. The previous example uses a character string with the name of each color name. A list of the 657 color names available in R can be obtained by running the function
colors(). Colors can also be specified directly in terms of their red, green, and blue (RGB) components with a hexadecimal string of the form “#RRGGBB”. With so many possibilities, selecting effective colors for scientific graphics is a formidable challenge. There are several resources available online that list and display all of the named R colors, including the “R color cheatsheet” provided by the National Center for Ecological Analysis and Synthesis at https://www.nceas.ucsb.edu/r-spatial-guides. There are also R packages that can be used to automatically generate color palettes for graphs and maps. We will use several of these packages in upcoming chapters.
The default “look” of graphs created with
ggplot() has some distinctive and recognizable features. The plot background is light gray with white gridlines that align with the axis ticks. There are no axis lines and the tick marks extent outward from the edge of the background grid. This default theme is relatively simple, and many types of graphs are easy to read against this background. However, users may prefer an alternative graph design or need to modify the plot elements to meet publication requirements. One way to change the theme is by specifying an alternative theme function. For example, the
theme_bw() function uses a black grid on a white background (Figure 2.13).
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID), size = 1.0) + labs(x = "Date", y = "Rainfall (in)", color = "Station ID") + scale_color_manual(values = c("firebrick", "steelblue", "olivedrab", "gold")) + scale_x_date(breaks = datebreaks, date_label = "%Y") + theme_bw()
theme() function allows detailed formatting of plot components, including text, lines, and the plot area. The following example shows how to modify various text elements including axis text, legend text, and the main title. The
element_text() function is used to provide formatting details to each argument of the
theme() function. Arguments to
element_text() control text
vjust arguments control horizontal text justification along the x-axis and vertical text justification along the y-axis (0 = left, 0.5 = center, 1 = right). The
face argument changes the font type from the default of
bold.italic. Compared to Figure 2.13, the resulting plot has slightly larger text (Figure 2.14). The plot, axis, and legend titles are dark blue and bolded. The plot title is also left justified. The x-axis title is suppressed by specifying
element_blank(). The x-axis tick label text is angled to avoid crowding the larger text and justified to increase separation from the axis.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID), size = 1.0) + labs(x = "Date", y = "Rainfall (in)", color = "Station ID", title = "Rainfall from Four Stations in OK") + scale_color_manual(values = c("firebrick", "steelblue", "olivedrab", "gold")) + scale_x_date(breaks = datebreaks, date_label = "%Y") + theme_bw() + theme(axis.text.x = element_text(angle = 45, size = 14, hjust = 1, vjust = 1), axis.text.y = element_text(size = 14), axis.title.x = element_blank(), axis.title.y = element_text(color = "darkblue", size = 16), legend.text = element_text(size = 14), legend.title = element_text(color = "darkblue", size = 16, face = "bold"), plot.title = element_text(color = "darkblue", size = 18, hjust = 0.5, face = "bold"), )
A common problem with plots and maps is text that is too small to be easily readable. Always check the size of your text and consider the size and manner in which your plot will be displayed. Will it be embedded in a PDF document? Displayed on a website that will be viewed on a computer screen? Projected on a large screen in front of a lecture hall? Different text sizes may be required for each of these examples. You can use arguments to
theme() like in the previous example to control text size and appearance in different sections of the graph. There are many more arguments to the
theme() function that can be used to control other aspects of plot appearance, and additional examples will be provided in later chapters.
Building a scientific graphic with ggplot involves a number of steps that are implemented with different types of functions.
- Create a ggplot using
- Add geometric representations of data to a plot using geoms.
- Map data columns to plot aesthetics.
- Split your dataset into subplots using facets.
- Control the visual properties of your plot using scales.
- Control other aspects of plot appearance by modifying themes .
Throughout the rest of the book, various combinations of these steps will be used to create charts and maps. By learning this process, you will be able to generate a wide variety of graphs and maps using diverse datasets. The upcoming chapters will provide many more examples of how to use
ggplot() for data visualization. In addition to covering the technical aspects of coding, they will continue to explore how to design scientific graphics so that they summarize data effectively and communicate the resulting information clearly and accurately to viewers.
The following subsections provide examples of how to generate other types of standard plots with
ggplot(), including scatterplots, bar charts, histograms, and boxplots. Each example also uses a different theme function to show some of the options that are available in the ggplot2 package.
Scatterplots are used to show the relationship between two variables. They are similar to time series plots, but they use points for the geometry, and both axes are measured variables rather than just the y-axis. This example shows the relationship between monthly summaries of daily minimum and maximum temperatures (Figure 2.15). There is one point for each monthly record at each of the four meteorological stations. The
scale_color_manual() function is used to assign a different color to each of the four stations. The plot shows that there is a very strong linear relationship between the monthly minimum and maximum temperatures at each station. The temperatures at Hooker are much higher than those at the other three stations.
ggplot(data = mesosm) + geom_point(mapping = aes(x = TMIN, y = TMAX, color = STID)) + labs(x = "Minimum Temperature (\u00B0F)", y = "Maximum Temperature (\u00B0F)", color = "Station ID") + scale_color_manual(values = c("firebrick", "steelblue", "olivedrab", "gold")) + theme_minimal()
Bar charts represent data values using the heights of horizontal or vertical bars, and the color and arrangement of the bars can be used to represent groupings within the data. The following example creates a simple bar chart to display the precipitation times series as bars rather than lines or dots (Figure 2.16). Other examples of grouped bar charts will be provided in later chapters. Because the monthly precipitation values represent cumulative sums for each month, the relative heights of the bars provide an intuitive representation of the month-to-month variation.
ggplot(data = mesosm) + geom_col(mapping = aes(x = DATE, y = RAIN)) + labs(x = "Date", y = "Precipitation (in)") + facet_wrap(facets = vars(STID)) + theme_bw()
A histogram is a graphical representation of the distribution of numerical data. The heights of the bars are proportional to the frequencies of observations within different value ranges. The example below plots the frequency distribution of rainfall at the Mount Herman station (Figure 2.17). Custom labels are added to the axes, and the text is bolded and increased in size to make it more readable. The graphs shows that most months in the dataset have rainfall between about 2-7 inches, but there are a few outlying months with rainfall between 10-20 inches.
ggplot(data = mesomthe) + geom_histogram(aes(x = RAIN), bins = 10) + labs(x = "Precipitation (in)", y = "Count of observations") + theme_linedraw() + theme(axis.text.x = element_text(size = 12, face = "bold"), axis.text.y = element_text(size = 12, face = "bold"), axis.title.x = element_text(size = 14, face = "bold"), axis.title.y = element_text(size = 14, face = "bold"))
Like histograms, boxplots also display the distributions of data. Boxplots provide a more simplified representation of the distribution than histograms, and it is often easier to compare multiple variables using boxplots. This example shows the rainfall distributions for the four mesonet stations (Figure 2.18). For boxplots generated with
ggplot(), the horizontal line represents the median value, and the box represents the inter-quartile range (the 25th through 75th percentiles). The upper and lower whiskers extend to the largest and smallest value no further than 1.5 times the hinges (the edges of the box). Data with higher or lower values than the whiskers are called “outliers” and are plotted individually.
ggplot(data = mesosm) + geom_boxplot(aes(x = STID, y = RAIN)) + labs(x = "Station Code", y = "Precipitation (in)") + theme_classic()
Although this chapter has covered a lot of information, it has only begun to scratch the surface of what is possible with the ggplot2 package (Wickham 2016). For a complete listing of ggplot2 functions, you can check out the online reference at https://ggplot2.tidyverse.org. Another helpful reference is the ggplot2 “cheatsheet”, which is available at https://www.rstudio.com/resources/cheatsheets/ and provides an overview of the most important ggplot2 functions.
Write code to generate the following graphs based on the
mesosm dataset. You can start with the examples provided in the text and modify them as needed.
Create a graph that displays four scatterplots of TMIN versus TMAX - one for each site.
Create a boxplot that compares the distribution of TMAX for each site. Make the axis text and labels bold so that they are easier to see.
Create a graph that displays four histograms of RAIN, one for each site. Experiment with changing the number of bins in the histograms to see how this affects the visualization.