Chapter 2 Graphics with ggplot2
This chapter will present the fundamentals of data visualization in R using the ggplot2 package. One of R’s strengths has always been that anybody can join the R community and contribute to R’s capabilities by creating new packages. As a result, there are many packages that improve on the data manipulation and plotting functions included with the basic R installation. Each package provides a set of functions with accompanying documentation and datasets. In this chapter, we will begin to explore several packages that are part of the “Tidyverse” collection of data science tools. For simplicity, we will start by working with non-spatial datasets, and in upcoming sections we will explore how these approaches can be extended to spatial data.
The library() function is used to load R packages. If these packages are not yet on your computer, you will need to install them using the install.packages() function or the installation tools available in RStudio under Tools > Install Packages.
library(ggplot2)
library(dplyr)
library(readr)
library(readxl)
The data for this chapter are stored in a csv (comma-delimited text) file, which can be read and stored as a data frame object using the read_csv() function from the readr package. This function does essentially the same thing as the read.csv() function that we used in the last tutorial, but it has a few added features and is faster for reading large csv files.
##
## -- Column specification --------------------------------------------------------
## cols(
##   MONTH = col_double(),
##   YEAR = col_double(),
##   STID = col_character(),
##   TMAX = col_double(),
##   TMIN = col_double(),
##   HMAX = col_double(),
##   HMIN = col_double(),
##   RAIN = col_double(),
##   DATE = col_date(format = "")
## )
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
## # A tibble: 240 x 9
##    MONTH  YEAR STID   TMAX  TMIN  HMAX  HMIN  RAIN DATE
##    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>
##  1     1  2014 BUTL   52.0  21.6  76.3  27.2  0.01 2014-01-01
##  2     2  2014 BUTL   47.5  22.3  88.8  41.2  0.26 2014-02-01
##  3     3  2014 BUTL   61.1  31.3  79.5  27.1  0.59 2014-03-01
##  4     4  2014 BUTL   74.9  45.3  80.2  25.1  1.33 2014-04-01
##  5     5  2014 BUTL   84.8  56.0  79.1  28.4  2.76 2014-05-01
##  6     6  2014 BUTL   89.0  67.1  89.7  42.8  2.97 2014-06-01
##  7     7  2014 BUTL   91.4  67.8  87.2  37.7  3.85 2014-07-01
##  8     8  2014 BUTL   96.5  68.0  82.6  27.9  0.22 2014-08-01
##  9     9  2014 BUTL   85.9  61.2  89.3  39.1  2.39 2014-09-01
## 10    10  2014 BUTL   78.8  48.5  92.3  32.3  2.62 2014-10-01
## # ... with 230 more rows
Note that the mesosm data frame has multiple classes, including tbl_df. This class is an enhanced version of a data frame called a “tibble”, which is part of the “Tidyverse” collection of R packages for data science. A tibble behaves like a data frame but includes some added features. For example, the default print method for a tbl object provides an abbreviated view of the first few rows of the data frame rather than trying to print the entire data frame to the screen. In most cases, tbl objects can be used in exactly the same way as base R data.frame objects, and we will treat these object types as synonymous throughout the book.
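As a minimal sketch of the import step described above, the following code writes a few made-up rows to a temporary csv file and reads them back with read_csv(). The toy data frame, its values, and the temporary file are assumptions for illustration only and are not the chapter's dataset.

```r
library(readr)

# Hypothetical toy data: three made-up monthly records for one station
toy <- data.frame(
  MONTH = 1:3,
  STID = "BUTL",
  RAIN = c(0.5, 1.2, 2.4),
  DATE = c("2014-01-01", "2014-02-01", "2014-03-01")
)

# Write the rows to a temporary csv file so the example is self-contained
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

# read_csv() returns a tibble and guesses a type for each column;
# col_types = cols() keeps the guessing but silences the column message
toy_tbl <- read_csv(toy_path, col_types = cols())

class(toy_tbl)        # includes "tbl_df" and "data.frame"
class(toy_tbl$DATE)   # ISO-formatted dates are parsed as the Date class
```

Because the result also inherits from data.frame, base R functions such as nrow() and summary() should work on it without conversion.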
There are functions available in R for importing data from just about any external format. For example, we can use the read_excel() function from the readxl package to import data from xls and xlsx files. Note that Excel files can contain multiple sheets, so we may need to specify the sheet containing our data (in this example, the second sheet). Additional arguments can also be provided to extract data from a specific range of cells.
mesosm2 <- read_excel("mesodata_small.xlsx", sheet = 2)
class(mesosm2)
mesosm2
## [1] "tbl_df"     "tbl"        "data.frame"
## # A tibble: 240 x 9
##    MONTH  YEAR STID   TMAX  TMIN  HMAX  HMIN  RAIN DATE
##    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm>
##  1     1  2014 BUTL   52.0  21.6  76.3  27.2  0.01 2014-01-01 00:00:00
##  2     2  2014 BUTL   47.5  22.3  88.8  41.2  0.26 2014-02-01 00:00:00
##  3     3  2014 BUTL   61.1  31.3  79.5  27.1  0.59 2014-03-01 00:00:00
##  4     4  2014 BUTL   74.9  45.3  80.2  25.1  1.33 2014-04-01 00:00:00
##  5     5  2014 BUTL   84.8  56.0  79.1  28.4  2.76 2014-05-01 00:00:00
##  6     6  2014 BUTL   89.0  67.1  89.7  42.8  2.97 2014-06-01 00:00:00
##  7     7  2014 BUTL   91.4  67.8  87.2  37.7  3.85 2014-07-01 00:00:00
##  8     8  2014 BUTL   96.5  68.0  82.6  27.9  0.22 2014-08-01 00:00:00
##  9     9  2014 BUTL   85.9  61.2  89.3  39.1  2.39 2014-09-01 00:00:00
## 10    10  2014 BUTL   78.8  48.5  92.3  32.3  2.62 2014-10-01 00:00:00
## # ... with 230 more rows
There are a couple of new classes in these imported data frames. In the mesosm data frame, the DATE column belongs to the Date class (abbreviated as date in the tibble printout). In the mesosm2 data frame, the DATE column belongs to the POSIXct date/time class (abbreviated as dttm). Don’t worry about these details for now - the functions in ggplot2 will know how to handle these classes automatically. A later chapter will provide more detailed information about how to import and manipulate date objects.
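The difference between the two classes can be seen with a short base R sketch. The values below are made up for illustration, and the time zone is set explicitly as an assumption to keep the result reproducible.

```r
# A calendar date with no time component
d <- as.Date("2014-01-01")

# A date/time value; the UTC time zone is an assumption for this example
dt <- as.POSIXct("2014-01-01 00:00:00", tz = "UTC")

class(d)    # "Date"
class(dt)   # "POSIXct" "POSIXt"

# Dropping the time component converts a date/time back to a Date
d_from_dt <- as.Date(dt)
d == d_from_dt
```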
2.1 Creating a Simple Plot
One of the most common types of scientific graphics is a time series plot, in which the date or time element is on the x-axis and the measured variable of interest is on the y-axis. The data values are usually connected by a line to indicate progression through time.
The following code uses the filter() function from the dplyr package to extract meteorological data for the Mount Herman station (MTHE). The data contain monthly summaries of several meteorological variables from 2014-2018. The code then plots the time series of monthly rainfall.
mesomthe <- filter(mesosm, STID == "MTHE")
ggplot(data = mesomthe) +
  geom_line(mapping = aes(x = DATE, y = RAIN))
The ggplot() function creates a coordinate system onto which data can be plotted. The first argument to ggplot() is data, the dataset to plot. You could run ggplot(data = mesomthe) to create just the blank coordinate system, but this is not very interesting so it is not shown here.
Once you have a blank plot, you can add layers to it. For example, geom_line() in the example above adds lines to the plot. The + symbol indicates that the line will be added to the coordinate system created by ggplot() using the mesomthe dataset.
The mapping argument specifies the aesthetic mapping to use. Aesthetic mappings tell ggplot which columns in the dataset get used for (or “mapped to”) which features of the plot. The mapping is always specified by the aes() function. In this example, the mapping tells ggplot that the DATE column contains x-axis values and the RAIN column contains y-axis values.
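The two steps described above, creating a blank coordinate system and then adding a layer, can be sketched separately. The toy data frame and its values are made up for illustration, and the names p_blank and p_line are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: six made-up monthly rainfall values
toy <- data.frame(
  DATE = seq(as.Date("2014-01-01"), by = "month", length.out = 6),
  RAIN = c(0.4, 1.1, 2.3, 3.0, 2.2, 0.9)
)

# Step 1: a blank coordinate system with no layers
p_blank <- ggplot(data = toy)

# Step 2: add a line layer, mapping DATE to x and RAIN to y
p_line <- p_blank + geom_line(mapping = aes(x = DATE, y = RAIN))

length(p_blank$layers)  # no layers yet
length(p_line$layers)   # one layer after adding geom_line()
p_line                  # printing the object draws the plot
```

Storing the plot in an object before printing it makes it easy to build a figure up one layer at a time.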
2.2 Aesthetic Mappings
In the previous example, data columns were mapped to the x and y axes. To visualize a third column of data, it will need to be mapped to some other part of the plot.
To illustrate, we will use the full mesosm dataset, which contains five years of monthly meteorological data from four sites: BUTL (Butler in western OK), MTHE (Mount Herman in southeastern OK), SKIA (Skiatook in northeastern OK), and SPEN (Spencer in central OK).
Because there are four sites in this dataset, we need some way to tell the lines for the sites apart. One common choice is to show different lines with different colors. Here the STID column is mapped to the color aesthetic.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID))
Multiple aesthetics can be mapped to the same column. In this example, each site has a different line type as well as a different color.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID, linetype = STID))
An aesthetic can also be set to a fixed value by defining that aesthetic outside of the
aes() function. For example, the following code uses a different pattern for each line but makes all the lines blue.
ggplot(data = mesosm) + geom_line(aes(x = DATE, y = RAIN, linetype = STID), color = "blue")
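The distinction between setting and mapping an aesthetic is easy to get wrong, so here is a small sketch of what happens when the color is placed inside aes() by mistake. The toy data and the station codes AAAA and BBBB are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data for two made-up stations
toy <- data.frame(
  DATE = rep(seq(as.Date("2014-01-01"), by = "month", length.out = 4), times = 2),
  RAIN = c(0.4, 1.1, 2.3, 3.0, 0.8, 0.6, 1.9, 2.5),
  STID = rep(c("AAAA", "BBBB"), each = 4)
)

# Setting: color is outside aes(), so every line is drawn in blue
p_set <- ggplot(data = toy) +
  geom_line(aes(x = DATE, y = RAIN, linetype = STID), color = "blue")

# Mapping: color is inside aes(), so "blue" is treated as a data value
# and is assigned a color from the default palette, with a legend entry
p_map <- ggplot(data = toy) +
  geom_line(aes(x = DATE, y = RAIN, linetype = STID, color = "blue"))

# Inspect the colors that ggplot2 computed for each layer
unique(ggplot_build(p_set)$data[[1]]$colour)
unique(ggplot_build(p_map)$data[[1]]$colour)
```

If a plot shows an unexpected color along with a legend entry labeled with the color name, a misplaced argument inside aes() is a likely cause.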
In some situations, we may want to plot groups of data without mapping the groups to any particular aesthetic such as color or linetype. In this case, the group argument can be used inside the aes() function.
# group by station without mapping it to color or linetype
ggplot(data = mesosm) + geom_line(aes(x = DATE, y = RAIN, group = STID), color = "blue")
The sites are still plotted correctly, but it is no longer possible to tell which site is which. This type of graph may be desirable if the goal is to show the variability among locations without identifying individual sites.
For a complete overview of all the possible aesthetic specifications, run the command
vignette('ggplot2-specs') to view the Aesthetic specifications vignette.
2.3 Facets
Another way to view additional variables on a plot is to use facets. Faceting splits the data into subsets and creates a separate chart for each subset. For example, the plots above are crowded and difficult to read with data from four sites. An alternative visualization approach is to plot each site on its own subplot, or facet.
To facet by a single variable, use the
facet_wrap() function. Another
+ operator is used at the end of the
geom_line() function to indicate that all of these functions are combined to generate the plot.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(~ STID)
Because these are time series data, it would be most effective to stack the subplots on top of each other instead of having them side-by-side. The subplot layout can be changed by using the ncol or nrow arguments to specify the number of columns or rows.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(~ STID, ncol = 1)
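To see how ncol and nrow change the arrangement, the panel layout can be inspected directly. This is a sketch using made-up data for four hypothetical stations. The ROW and COL columns come from the layout table that ggplot_build() returns, which is an internal detail that may differ between ggplot2 versions.

```r
library(ggplot2)

# Hypothetical toy data for four made-up stations
toy <- data.frame(
  DATE = rep(seq(as.Date("2014-01-01"), by = "month", length.out = 3), times = 4),
  RAIN = c(0.4, 1.1, 2.3, 3.0, 0.8, 0.6, 1.9, 2.5, 1.4, 0.2, 3.3, 2.1),
  STID = rep(c("AAAA", "BBBB", "CCCC", "DDDD"), each = 3)
)

base_plot <- ggplot(data = toy) +
  geom_line(mapping = aes(x = DATE, y = RAIN))

# One column: panels are stacked vertically
p_stack <- base_plot + facet_wrap(~ STID, ncol = 1)

# One row: panels are placed side by side
p_row <- base_plot + facet_wrap(~ STID, nrow = 1)

# Each row of the layout table describes where one panel is drawn
layout_stack <- ggplot_build(p_stack)$layout$layout
layout_row <- ggplot_build(p_row)$layout$layout
layout_stack[, c("PANEL", "ROW", "COL")]
layout_row[, c("PANEL", "ROW", "COL")]
```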
By default, the facet function uses the same data range for each of the subplots. The scales argument can be used to allow the data in each facet to take up the entire vertical space. In this case, set it to "free_y".
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(~ STID, ncol = 1, scales = "free_y")
Notice how the range of the y axis now varies between subplots. Alternative values are "free_x" for scales to vary freely in the x dimension and "free" to allow them to vary freely in both dimensions. The default value is "fixed".
2.4 Geometric objects
A geom is the geometrical object that a plot uses to represent data. There are often multiple ways to represent the same data visually. For example, we have been using line geometries to visualize our time series of weather data. An alternative might be to use points for each value. The function for a point geom is geom_point().
ggplot(data = mesosm) + geom_point(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(~ STID, ncol = 1, scales = "free_y")
In most cases, using points alone is not a great choice for time series graphs because the connections between subsequent time periods are not as apparent.
Data can also be represented using multiple geoms. For example, including both lines and points can be effective in situations where it is important to emphasize the sequential nature of time series data, but also clearly see the individual measurements at each time period.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN)) + geom_point(mapping = aes(x = DATE, y = RAIN)) + facet_wrap(~ STID, ncol = 1, scales = "free_y")
Some aesthetics can only be used with certain geoms. For example, points can have a
shape aesthetic, but lines cannot. Conversely, lines can have a
linetype aesthetic, but points cannot.
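As a small sketch of geom-specific aesthetics, the following code maps a station column to linetype in the line layer and to shape in the point layer. The toy data and the station codes are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data for two made-up stations
toy <- data.frame(
  DATE = rep(seq(as.Date("2014-01-01"), by = "month", length.out = 4), times = 2),
  RAIN = c(0.4, 1.1, 2.3, 3.0, 0.8, 0.6, 1.9, 2.5),
  STID = rep(c("AAAA", "BBBB"), each = 4)
)

p <- ggplot(data = toy) +
  # linetype is a line aesthetic, so it is mapped in the line layer
  geom_line(mapping = aes(x = DATE, y = RAIN, linetype = STID)) +
  # shape is a point aesthetic, so it is mapped in the point layer
  geom_point(mapping = aes(x = DATE, y = RAIN, shape = STID), size = 3)

# Each layer carries its own computed aesthetics
built <- ggplot_build(p)
unique(built$data[[1]]$linetype)  # one line type per station
unique(built$data[[2]]$shape)     # one point shape per station
p
```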
2.5 Scales
Scales control how data values are translated to visual properties. The default scale setting can be overridden to adjust details like axis labels and legend keys, or to use a completely different translation from data to aesthetic.
The labs() function can be used to change the axis, legend, and plot labels. Arguments to labs() include title, subtitle, caption, tag, and any other aesthetics that have been mapped such as x, y, or color.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID)) + labs(x = "Date", y = "Rainfall (in)", color = "Station ID", title = "Precipitation Patterns from Four Locations in OK")
In the next example, the scale_x_date() function allows modification of the tick marks and labels on the x axis. There are also scale_x_continuous() and scale_x_discrete() functions that are appropriate when there is a continuous or discrete variable mapped to the axis, as well as versions of all these functions for the y axis. Two arguments are provided to scale_x_date(). The breaks argument indicates where tick marks and labels will be placed, and the date_labels argument indicates how the dates will be formatted. The %Y code indicates that only the year will be displayed.
datebreaks <- c("2014-01-01", "2015-01-01", "2016-01-01", "2017-01-01", "2018-01-01", "2019-01-01")
datebreaks <- as.Date(datebreaks)
ggplot(data = mesosm) +
  geom_line(mapping = aes(x = DATE, y = RAIN, color = STID)) +
  labs(x = "Date", y = "Rainfall (in)", color = "Station ID", title = "Precipitation Patterns from Four Locations in OK") +
  scale_x_date(breaks = datebreaks, date_labels = "%Y")
Remember that the DATE variable plotted on the x-axis is a Date object. This is why it is necessary to use the scale_x_date() function, and why the datebreaks vector needs to be converted from a character vector to a Date vector using the as.Date() function. The as.Date() function is used in a similar way in the next example.
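The conversion from character to Date can be checked with base R alone. As an alternative sketch, the same yearly breaks can also be generated with seq() instead of typing each date. The variable names below are hypothetical.

```r
# Breaks typed out as character strings, then converted to dates
datebreaks_chr <- c("2014-01-01", "2015-01-01", "2016-01-01",
                    "2017-01-01", "2018-01-01", "2019-01-01")
datebreaks_typed <- as.Date(datebreaks_chr)

# The same breaks generated as a yearly sequence
datebreaks_seq <- seq(as.Date("2014-01-01"), as.Date("2019-01-01"), by = "year")

class(datebreaks_chr)    # "character"
class(datebreaks_typed)  # "Date"
all(datebreaks_typed == datebreaks_seq)

# format() accepts the same codes that are used for the axis labels
format(datebreaks_seq, "%Y")
```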
The lims() function can be used to adjust the limits of the x and y axes. In this example the x-axis is limited to dates within a single year (2017).
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID)) + lims(x = as.Date(c('2017-01-01','2017-12-01')))
## Warning: Removed 192 row(s) containing missing values (geom_path).
Other types of scale functions allow even more adjustments to the x and y axes. For example, to plot the y axis on the log scale, which is often desirable when dealing with data that have a heavily skewed distribution, the
scale_y_log10() function can be used.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID)) + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
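One likely cause of a warning like this is months with zero rainfall, because the logarithm of zero is negative infinity. Whether mesosm contains zeros is an assumption here. The sketch below uses made-up values to show the arithmetic, along with one common workaround of adding 1 before taking the logarithm.

```r
# Made-up rainfall values, including one month with no rain
rain <- c(0, 0.01, 0.26, 2.76)

log10(rain)               # the zero becomes -Inf
is.infinite(log10(rain))  # flags the value that cannot be drawn

# Adding 1 before the transformation keeps every value finite
log10(rain + 1)
```

Adding a constant changes how the axis is interpreted, so the choice should be stated in the axis label.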
Scale functions are also used to specify the colors and symbols that will be mapped to the data values. For example, the scale_color_manual() function can be used to manually specify the colors to be used for the variable associated with the color aesthetic.
ggplot(data = mesosm) + geom_line(mapping = aes(x = DATE, y = RAIN, color = STID)) + scale_color_manual(values = c("red", "blue", "brown", "purple"))
Colors can be specified in several ways in R. The simplest way is with a character string giving the color name (e.g., “red”). A list of the possible colors can be obtained with the function colors(). Alternatively, colors can be specified directly in terms of their RGB components with a hexadecimal string of the form “#RRGGBB”. For more information see the “Color Specification” section under help("par"). You can also consult this handy reference: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf
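The relationship between color names and hexadecimal strings can be explored with a few base R functions. This is a brief sketch and the variable names are hypothetical.

```r
# Check that a name is one of R's built-in colors
"red" %in% colors()

# Red, green, and blue components (0-255) of a named color
col2rgb("red")

# Build a hexadecimal string from RGB components
rgb(255, 0, 0, maxColorValue = 255)   # "#FF0000"

# Convert a named color to its hexadecimal form
brown_rgb <- col2rgb("brown")
brown_hex <- rgb(brown_rgb[1], brown_rgb[2], brown_rgb[3], maxColorValue = 255)
brown_hex
```

Either form can be passed to the values argument of scale_color_manual().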
Each aesthetic has its own scale function, such as
scale_size(). More examples of scale functions will be provided as more complex plots and maps are developed in later chapters.
2.6 Themes
The theme() function allows detailed formatting of various plot elements, including text, lines, and the plot area. The following example shows how to modify various text elements including axis text, legend text, and the main title. The element_text() function is used to provide formatting details to each argument of the theme() function. The arguments to element_text() control text color, size, angle, horizontal and vertical justification (0 = left, 0.5 = center, 1 = right) and font face (plain, bold or italic). The x axis title is suppressed by specifying element_blank() as the argument.
ggplot(data = mesosm) +
  geom_line(mapping = aes(x = DATE, y = RAIN, color = STID)) +
  labs(x = "Date", y = "Rainfall (in)", color = "Station ID", title = "Precipitation Patterns from Four Locations in OK") +
  theme(axis.text.x = element_text(color = "blue", size = 12, angle = 45, hjust = .5, vjust = .5, face = "plain"),
        axis.text.y = element_text(color = "blue", size = 12, angle = 0, hjust = .5, vjust = .5, face = "plain"),
        axis.title.x = element_blank(),
        axis.title.y = element_text(color = "blue", size = 16, angle = 90, hjust = 0.5, vjust = 1, face = "bold"),
        legend.text = element_text(color = "blue", size = 14, angle = 0, hjust = 0, vjust = .5, face = "plain"),
        legend.title = element_text(color = "blue", size = 16, angle = 0, hjust = 0, vjust = .5, face = "bold"),
        plot.title = element_text(color = "blue", size = 18, angle = 0, hjust = 0, vjust = .5, face = "bold"))
A common problem with plots and maps is text that is too small to be readable. Always check the size of your text and consider the size and manner in which your plot will be displayed. Will it be embedded in a PDF document? Displayed on a website that will be viewed on a computer screen? Projected on a screen in front of a large lecture hall? Different text sizes may be required for each of these examples.
There are many more arguments to the
theme() function that can be used to control other aspects of plot appearance, and additional examples will be provided in later chapters.
2.7 The generalized ggplot template
Up to this point, you have learned how to:
- create a ggplot using ggplot()
- add geometric representations of data to a plot using geoms
- map data columns to plot aesthetics
- split your dataset into subplots using facets
- control the visual properties of your plot using scales
- control other aspects of plot appearance by modifying themes
These steps can be combined to create generalized code for plotting in ggplot.
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) + <FACET_FUNCTION> + <SCALE_FUNCTION> + <THEME_FUNCTION>
This is the template that we use for most of the plots in this book. Once you master it, the plots you can create will allow you to learn a lot about your data. This process is part of data exploration. We will see more examples of how to use ggplot() for data exploration in the upcoming chapters.
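To make the template concrete, here is one way it might be filled in. The toy data, the station codes, and the particular facet, scale, and theme choices are all assumptions for illustration. Each comment names the template slot that the line fills.

```r
library(ggplot2)

# Hypothetical toy data for two made-up stations
toy <- data.frame(
  DATE = rep(seq(as.Date("2014-01-01"), by = "month", length.out = 4), times = 2),
  RAIN = c(0.4, 1.1, 2.3, 3.0, 0.8, 0.6, 1.9, 2.5),
  STID = rep(c("AAAA", "BBBB"), each = 4)
)

p <- ggplot(data = toy) +                                         # <DATA>
  geom_line(mapping = aes(x = DATE, y = RAIN, color = STID)) +    # <GEOM_FUNCTION>, <MAPPINGS>
  facet_wrap(~ STID, ncol = 1) +                                  # <FACET_FUNCTION>
  scale_color_manual(values = c("red", "blue")) +                 # <SCALE_FUNCTION>
  theme(axis.title = element_text(face = "bold"))                 # <THEME_FUNCTION>

p
```

Not every plot needs every slot. The facet, scale, and theme lines can be dropped when the defaults are acceptable.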
2.8 Other types of plots
Scatterplots are used to show the relationship between two variables. They are like time series plots, but they always use points for the geometry and both axes are measured variables rather than just the y axis. A scale function is used to manually assign colors for each of the four stations.
ggplot(data = mesosm) + geom_point(mapping = aes(x = TMIN, y = TMAX, color = STID)) + labs(x = "Minimum Temperature (\u00B0F)", y = "Maximum Temperature (\u00B0F)", color = "Station ID") + scale_color_manual(values = c("red", "blue", "green", "orange"))
A histogram is a graphical representation of the distribution of numerical data. It usually has boxes whose areas are proportional to the frequency of observations within different value ranges. The example below plots the frequency distribution of rainfall at the Mount Herman station. Custom labels are added to the axes and the text is bolded and increased in size to make it more readable.
ggplot(data = mesomthe) + geom_histogram(aes(x = RAIN), bins = 10) + labs(x = "Precipitation (in)", y = "Count of observations") + theme(axis.text.x = element_text(size = 10, face = "bold"), axis.text.y = element_text(size = 10, face = "bold"), axis.title.x = element_text(size = 14, face = "bold"), axis.title.y = element_text(size = 14, face = "bold"))
Like histograms, boxplots also display the distributions of data. For boxplots generated with ggplot(), the horizontal line represents the median value and the box represents the interquartile range (the 25th through 75th percentiles). The upper and lower whiskers extend to the largest and smallest values no further than 1.5 times the interquartile range from the hinges (the edges of the box). Data with higher or lower values than the whiskers are called “outliers” and are plotted individually.
ggplot(data = mesosm) + geom_boxplot(aes(x = STID, y = RAIN)) + labs(x = "Station Code", y = "Precipitation (in)")
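The boxplot components described above can be worked out by hand for a small vector. This is a sketch with made-up values. It assumes the hinges are computed with R's default quantile() method, which matches my understanding of the geom_boxplot() defaults.

```r
# Made-up values with one extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

hinges <- quantile(x, c(0.25, 0.75))   # edges of the box
iqr <- unname(hinges[2] - hinges[1])   # interquartile range

# Whiskers reach the most extreme observations within 1.5 * IQR of the hinges
upper_limit <- unname(hinges[2] + 1.5 * iqr)
lower_limit <- unname(hinges[1] - 1.5 * iqr)
upper_whisker <- max(x[x <= upper_limit])
lower_whisker <- min(x[x >= lower_limit])

# Observations beyond the whiskers are plotted individually
outliers <- x[x > upper_whisker | x < lower_whisker]

median(x)                        # line inside the box: 5.5
c(lower_whisker, upper_whisker)  # whisker ends: 1 and 9
outliers                         # plotted as a separate point: 30
```

Note that the whiskers end at actual observations (1 and 9), not at the computed limits (-3.5 and 14.5).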
Although this chapter has covered a lot of information, it has only begun to scratch the surface of what is possible with the ggplot2 package. For a complete listing of ggplot2 functions, you can check out the following online reference: https://ggplot2.tidyverse.org/reference/ Another helpful reference is this PDF “cheatsheet” that provides a condensed overview of
ggplot2 functions: https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf
Create a graph that displays four scatterplots of TMIN versus TMAX - one for each site.
Create a boxplot that compares the distribution of TMAX for each site. Make the axis text and labels bold so that they are easier to see.
Create a graph that displays four histograms of RAIN, one for each site. Experiment with changing the number of bins in the histograms to see how this affects the resulting visualization.