4.1 Basic Plotting With ggplot2
The ggplot2 package allows you to quickly plot attractive graphics and to visualize and explore data. Objects created with ggplot2 can also be extensively customized with ggplot2 functions (more on that in the next subsection), and because ggplot2 is built using grid graphics, anything that cannot be customized using ggplot2 functions can often be customized using grid graphics. While the structure of ggplot2 code differs substantially from that of base R graphics, it offers a lot of power for the required effort. This first subsection focuses on useful rather than attractive graphs, since its goal is exploring data rather than presenting it. Later sections will give more information about making more attractive or customized plots, as you’d want to do for final reports, papers, etc.
To show how to use basic
ggplot2, we’ll use a dataset of Titanic passengers, their characteristics, and whether or not they survived the sinking. This dataset has become fairly famous in data science, because it’s used, among other things, for one of Kaggle’s long-term “learning” competitions, as well as in many tutorials and texts on building classification models.
Kaggle is a company that runs predictive modeling competitions, with top competitors sometimes winning cash prizes or interviews at top companies. At any time, Kaggle is typically hosting several competitions, including some with no cash reward that are offered to help users get started with predictive modeling.
Doing Kaggle competitions is a great way to practice working with data and building models. Kaggle also has forums, repositories of code scripts for different competitions, and a blog that includes tips from past winners.
To get this dataset, you’ll need to install and load the titanic package, and then you can load and rename the training dataset, which includes data on about two-thirds of the Titanic passengers:
# install.packages("titanic") # If you don't have the package installed
library(titanic)
data("titanic_train", package = "titanic")
titanic <- titanic_train
The other data example we’ll use in this subsection is some data on players in the 2010 World Cup. This is available from the faraway package:
# install.packages("faraway") # If you don't have the package installed
library(faraway)
data("worldcup")
Unlike most data objects you’ll work with, the data that comes with an R package will often have its own help file. You can access this using the ? operator. For example, try running:
?worldcup
All of the plots we’ll make in this subsection will use the
ggplot2 package (another member of the tidyverse!). If you don’t already have that installed, you’ll need to install it. You then need to load the package in your current session of R:
# install.packages("ggplot2") ## Uncomment and run if you don't have `ggplot2` installed
library(ggplot2)
The process of creating a plot using ggplot2 follows conventions that are a bit different from most of the code you’ve seen so far in R (although it is somewhat similar to the idea of piping we introduced in an earlier course). The basic steps behind creating a plot with ggplot2 are:
- Create an object of the ggplot class, typically specifying the data and some or all of the aesthetics;
- Add on geoms and other elements to create and customize the plot, using +.
You can add on one or many geoms and other elements to create plots that range from very simple to very customized. We’ll focus on simple geoms and added elements first, and then explore more detailed customization later.
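The two steps above can be sketched as follows. This is only a minimal illustration: the data frame `toy` and its columns `height` and `weight` are made up for the example and are not one of the datasets used in this section.

```r
library(ggplot2)

# Hypothetical toy data, invented purely for illustration
toy <- data.frame(height = c(70, 85, 91, 99, 104),
                  weight = c(7.5, 11.8, 12.9, 14.1, 15.2))

# Step 1: create a ggplot object, specifying the data and aesthetics.
# It has no layers yet, so printing it would draw only an empty panel.
base_plot <- ggplot(toy, aes(x = height, y = weight))

# Step 2: add a geom (and any other elements) with `+`
scatter <- base_plot + geom_point()

inherits(base_plot, "ggplot")  # both objects are ggplot objects
length(base_plot$layers)       # number of layers before adding a geom
length(scatter$layers)         # number of layers after adding a geom
```

Saving the intermediate object, as done here, is optional; it is mainly useful when you want to try several geoms on the same data and aesthetics.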
4.1.1 Initializing a ggplot object
The first step in creating a plot using
ggplot2 is to create a ggplot object. This object will not, by itself, create a plot with anything in it. Instead, it typically specifies the data frame you want to use and which aesthetics will be mapped to certain columns of that data frame (aesthetics are explained more in the next subsection).
Use the following conventions to initialize a ggplot object:
## Generic code
object <- ggplot(dataframe, aes(x = column_1, y = column_2))
## or, if you don't need to save the object
ggplot(dataframe, aes(x = column_1, y = column_2))
The dataframe is the first parameter in a ggplot call and, if you like, you can use the parameter definition with that call (e.g., data = dataframe). Aesthetics are defined within an aes function call that typically is used within the ggplot call. In ggplot2, life is much easier if everything you want to plot is included in a dataframe as a column, and the first argument to ggplot must be a dataframe. This format has been a bit hard for some base R graphics users to adjust to, since base R graphics tends to plot based on vector, rather than dataframe, inputs. Trying to pass in a vector rather than a dataframe is a common cause of ggplot2 errors for all R users.
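If your values start out as separate vectors, one way to follow this convention is to collect them into a dataframe first. The sketch below uses made-up vectors (`minutes` and `passes` are hypothetical names and values) to show the idea.

```r
library(ggplot2)

# Two stand-alone vectors, as you might pass to a base R plotting function
# (made-up values, for illustration only)
minutes <- c(90, 180, 270, 270, 360)
passes <- c(25, 60, 110, 95, 150)

# Collect the vectors as columns of a dataframe before calling ggplot
players <- data.frame(minutes = minutes, passes = passes)

# The dataframe goes first; the aesthetics refer to its column names
p <- ggplot(data = players, aes(x = minutes, y = passes))
```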
4.1.2 Plot aesthetics
Aesthetics are properties of the plot that can show certain elements of the data. For example, in Figure 4.1, color shows (i.e., is mapped to) gender, x-position shows height, and y-position shows weight in a sample data set of measurements of children in Nepal.
Any of these aesthetics could also be given a constant value, instead of being mapped to an element of the data. For example, all the points could be red, instead of showing gender. We’ll discuss how to code this later in this section.
Which aesthetics are required for a plot depends on which geoms (more on those in a second) you’re adding to the plot. You can find out the aesthetics you can use for a geom in the “Aesthetics” section of the geom’s help file (e.g., ?geom_point). Required aesthetics are in bold in this section of the help file and optional ones are not. Common plot aesthetics you might want to specify include:
|x|Position on x-axis|
|y|Position on y-axis|
|color|Color of border of elements|
|fill|Color of inside of elements|
|alpha|Transparency (1: opaque; 0: transparent)|
|linetype|Type of line (e.g., solid, dashed)|
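Besides the help file, the aesthetics a geom understands can also be looked up from within R. The sketch below relies on `GeomPoint`, the object that ggplot2 exports as the implementation behind geom_point; treat it as a convenience for exploring, not as a replacement for the help file.

```r
library(ggplot2)

# Aesthetics that must be supplied for a point geom
required <- GeomPoint$required_aes

# Optional aesthetics that fall back to a default value when not supplied
# (ggplot2 stores these under the British spelling, e.g. "colour")
optional <- names(GeomPoint$default_aes)

required
optional
```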
4.1.3 Creating a basic ggplot plot
To create a plot, you need to add one or more geoms to the ggplot object. The system of creating a ggplot object, mapping aesthetics to columns of the data, and adding geoms makes more sense once you try a few plots. For example, say you’d like to create a histogram showing the fares paid by passengers in the example Titanic data set. To plot the histogram, you’ll first need to create a ggplot object, using a dataframe with the Fare column you want to show in the plot. In creating this ggplot object, you only need one aesthetic (x, which in this case you want to map to Fare), and then you’ll need to add a histogram geom. In code, this is:
ggplot(data = titanic, aes(x = Fare)) + geom_histogram()
This code sets the dataframe as the
titanic object in the user’s working session, maps the values in the
Fare column to the
x aesthetic, and adds a histogram geom to generate a histogram.
If R gets to the end of a line and there is no indication that the call is unfinished (e.g., %>% for piping or + for ggplot2 plots), R interprets that as a signal to run the call without reading in further code. A common error when writing ggplot2 code is to put the + to add a geom or element at the beginning of a line rather than the end of the previous line; in this case, R will try to execute the call too soon. To avoid errors, be sure to end lines with +, rather than starting lines with it.
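The placement rule for + can be illustrated with a short sketch. The data frame `toy` and its `fare` column are made up for the example; the incorrect version is left commented out so that the block runs as written.

```r
library(ggplot2)

# Made-up fares, for illustration only
toy <- data.frame(fare = c(7.25, 8.05, 13, 26, 31.3, 71.3, 80, 151.6))

# Correct: each unfinished line ends with `+`, so R keeps reading
fare_hist <- ggplot(toy, aes(x = fare)) +
  geom_histogram(bins = 4)

# Incorrect (left commented out): the first line is already a complete call,
# so R runs it by itself and then cannot make sense of a line starting with `+`
# ggplot(toy, aes(x = fare))
#   + geom_histogram(bins = 4)
```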
There is some flexibility in writing the code to create this plot. For example, you could specify the aesthetic for the histogram in an aes statement when adding the geom (geom_histogram) rather than in the ggplot call:
ggplot(data = titanic) + geom_histogram(aes(x = Fare))
Similarly, you could specify the dataframe when adding the geom rather than in the ggplot call:
ggplot() + geom_histogram(data = titanic, aes(x = Fare))
Finally, you can pipe the
titanic dataframe into a
ggplot call, since the
ggplot function takes a dataframe as its first argument:
library(dplyr) # provides the %>% pipe
titanic %>%
  ggplot() +
  geom_histogram(aes(x = Fare))
# or
titanic %>%
  ggplot(aes(x = Fare)) +
  geom_histogram()
While all of these work, for simplicity we will use the syntax of specifying the data and aesthetics in the ggplot call for most examples in this subsection. Later, we’ll show how this flexibility can be used to draw on different dataframes for different geoms or to change aesthetic mappings between geoms.
A key thing to remember, however, is that ggplot is not flexible about whether you specify aesthetics within an aes call or not. We will discuss later in the book what happens if you do not, but it is very important that if you want to show values from a column of the data using aesthetics like color, size, shape, or position, you remember to make that specification within
aes. Also, be sure that you specify the dataframe before or when you specify aesthetics (i.e., you can’t specify aesthetics in the
ggplot statement if you haven’t specified the dataframe yet), and if you specify a dataframe within a geom, be sure to use
data = syntax rather than relying on parameter position, as
data is not the first parameter expected for geom functions.
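The point about naming the data argument in a geom can be checked directly. The sketch below inspects the parameter order of geom_histogram and then passes a made-up dataframe (`toy`, hypothetical) by name.

```r
library(ggplot2)

# Made-up data, for illustration only
toy <- data.frame(fare = c(7.25, 8.05, 13, 26, 31.3, 71.3))

# The first two parameters of a geom function: the mapping comes before the data
names(formals(geom_histogram))[1:2]

# So the dataframe has to be named explicitly when it is given to a geom
p <- ggplot() +
  geom_histogram(data = toy, aes(x = fare), bins = 3)
```

Writing geom_histogram(toy, aes(x = fare)) instead would hand the dataframe to the mapping parameter, which is why the data = syntax matters here.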
When you run the code to create a plot in RStudio, the plot will be shown in the “Plots” tab in one of the RStudio panels. If you would like to save the plot, you can do so using the “Export” button in this tab. However, if you would like to use code in an R script to save a plot, you can do so (and it’s more reproducible!).
To save a plot using code in a script, take the following steps: (1) open a graphics device (e.g., using the function png); (2) run the code to draw the plot; and (3) close the graphics device using the dev.off function. Note that the function you use to open a graphics device will depend on the type of device you want to open, but you close all devices with the same function (dev.off).
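These three steps might look like the following sketch. It uses the pdf device rather than png, simply because pdf is available in every R installation; png is called in the same way. The data and the output file name are made up for the example.

```r
library(ggplot2)

# Made-up data, for illustration only
toy <- data.frame(fare = c(7.25, 8.05, 13, 26, 31.3, 71.3))
fare_hist <- ggplot(toy, aes(x = fare)) +
  geom_histogram(bins = 3)

# Hypothetical output location; a temporary directory keeps the example tidy
out_file <- file.path(tempdir(), "fare_histogram.pdf")

pdf(out_file, width = 6, height = 4)  # (1) open a graphics device
print(fare_hist)                      # (2) draw the plot on that device
dev.off()                             # (3) close the device to finish the file

file.exists(out_file)
```

Calling print explicitly is a deliberate choice: a ggplot object is only drawn automatically when it is evaluated at the top level, so inside a function or loop the explicit call is needed.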
4.1.4 Geoms
Geom functions add the graphical elements of the plot; if you do not include at least one geom, you’ll get a blank plot space. Each geom function has its own arguments to adjust how the graph is created. For example, when adding a histogram geom, you can use the bins argument to change the number of bins used to create the histogram. Try:
ggplot(titanic, aes(x = Fare)) + geom_histogram(bins = 15)
As with any R function, you can find out more about the arguments available for a geom function by reading the function’s help file (e.g., ?geom_histogram).
Geom functions differ in the aesthetic inputs they require. For example, the geom_histogram function only requires a single aesthetic (x). If you want to create a scatterplot, you’ll need two aesthetics, x and y. In the worldcup dataset, the Time column gives the amount of time each player played in the World Cup 2010 and the Passes column gives the number of passes he made. To see the relationship between these two variables, you can create a ggplot object with the dataframe, mapping the x aesthetic to Time and the y aesthetic to Passes, and then adding a point geom:
ggplot(worldcup, aes(x = Time, y = Passes)) + geom_point()
All geom functions have both required and accepted aesthetics. For example, the geom_point function requires x and y, but the function will also accept alpha, color, fill, group, shape, size, and stroke aesthetics. If you try to create a geom without one of its required aesthetics, you will get an error:
ggplot(worldcup, aes(x = Time)) + geom_point()
Error: geom_point requires the following missing aesthetics: y
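It can help to know that this error is raised when the plot is built for drawing, not when the object is created. The sketch below, which uses made-up data and ggplot_build to build the plot without drawing it, captures the error with tryCatch so that the block runs to the end.

```r
library(ggplot2)

# Made-up data, for illustration only
toy <- data.frame(minutes = c(90, 180, 270), passes = c(25, 60, 110))

# Creating the object succeeds even though `y` has not been mapped
incomplete <- ggplot(toy, aes(x = minutes)) +
  geom_point()

# Building (or printing) the plot is what triggers the check
result <- tryCatch(ggplot_build(incomplete), error = function(e) e)
inherits(result, "error")

# Supplying the missing aesthetic resolves the problem
complete <- ggplot(toy, aes(x = minutes, y = passes)) +
  geom_point()
built <- ggplot_build(complete)
```

The exact wording of the error message differs between ggplot2 versions, so the sketch only checks that an error occurred.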
You can, however, add accepted aesthetics to the geom if you’d like; for example, to use color to show player position and size to show shots on goal for the World Cup data, you could call:
ggplot(worldcup, aes(x = Time, y = Passes, color = Position, size = Shots)) + geom_point()
The following table gives some of the geom functions you may find useful in ggplot2, along with the required aesthetics and some of the most useful specific arguments for each geom function (there are other useful arguments that can be applied to many different geom functions, which will be covered later). The elements created by these geom functions are usually clear from the function names (e.g., geom_point plots points; geom_segment plots segments).
|Function||Common aesthetics||Common arguments|
4.1.5 Using multiple geoms
Several geoms can be added to the same ggplot object, which allows you to build up layers to create interesting graphs. For example, you could make a scatterplot of shots versus passes for the World Cup 2010 data, and make that plot more interesting by adding text labels for noteworthy players with those players’ team names and positions. First, you can create a subset of data with the information for noteworthy players and add a column with the text to include on the plot. Then you can add a text geom to the scatterplot:
library(dplyr)
noteworthy_players <- worldcup %>%
  filter(Shots == max(Shots) | Passes == max(Passes)) %>%
  mutate(point_label = paste(Team, Position, sep = ", "))
ggplot(worldcup, aes(x = Passes, y = Shots)) +
  geom_point() +
  geom_text(data = noteworthy_players, aes(label = point_label),
            vjust = "inward", hjust = "inward")
In this example, we’re using data from different dataframes for different geoms. We’ll discuss how that works more later in this section.
As another example, there seemed to be some horizontal clustering in the scatterplot we made of player time versus passes made for the
worldcup data. Soccer games last 90 minutes each, and different teams play a different number of games at the World Cup, based on how well they do. To check if horizontal clustering is at 90-minute intervals, you can plot a histogram of player time (
Time), with reference lines every 90 minutes. First initialize the ggplot object, with the dataframe to use and appropriate mapping to aesthetics, then add geoms for a histogram as well as vertical reference lines:
ggplot(worldcup, aes(x = Time)) + geom_histogram(binwidth = 10) + geom_vline(xintercept = 90 * 0:6, color = "blue", alpha = 0.5)
Based on this graph, players’ times do cluster at 90-minute marks, especially at 270 minutes, which would be approximately after three games, the number played by all teams that fail to make it out of the group stage.
4.1.6 Constant aesthetics
Instead of mapping an aesthetic to an element of your data, you can use a constant value for it. For example, you may want to make all the points green in the World Cup scatterplot. You can do that by specifying the color aesthetic outside of an
aes call when adding the points geom. For example:
ggplot(worldcup, aes(x = Time, y = Passes)) + geom_point(color = "darkgreen")
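To make the distinction concrete, the sketch below compares a constant color with the same string placed inside aes by mistake. It uses made-up data and the layer_data function, which returns the values a layer will be drawn with.

```r
library(ggplot2)

# Made-up data, for illustration only
toy <- data.frame(minutes = c(90, 180, 270), passes = c(25, 60, 110))

# Constant: the color is set outside aes(), so every point is dark green
constant <- ggplot(toy, aes(x = minutes, y = passes)) +
  geom_point(color = "darkgreen")

# Mapped by mistake: inside aes(), "darkgreen" is treated as a data value
# (a single category), and the color scale chooses the actual color
mapped <- ggplot(toy, aes(x = minutes, y = passes)) +
  geom_point(aes(color = "darkgreen"))

# Compare the colors that each layer will actually use
unique(layer_data(constant)$colour)
unique(layer_data(mapped)$colour)
```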
You can do this with any of the aesthetics for a geom, including color, fill, shape, and size. If you want to change the shape of points, in R, you use a number to specify the shape you want to use. Figure 4.2 shows the shapes that correspond to the numbers 1 to 25 in the
shape aesthetic. This figure also provides an example of the difference between the color aesthetic (black for all these example points) and fill aesthetic (red for these examples). If a geom has both a border and an interior, the color aesthetic specifies the color of the border while the fill aesthetic specifies the color of the interior. You can see that, for point geoms, some shapes include a fill (21 for example), while some are either empty (1) or solid (19).
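The difference between color and fill for point shapes can be tried out with a sketch like the one below, which again uses made-up data. Shape 21 has both a border and an interior, while shape 19 is solid.

```r
library(ggplot2)

# Made-up data, for illustration only
toy <- data.frame(minutes = c(90, 180, 270), passes = c(25, 60, 110))

# Shape 21: color sets the border, fill sets the interior
filled <- ggplot(toy, aes(x = minutes, y = passes)) +
  geom_point(shape = 21, color = "black", fill = "red", size = 3)

# Shape 19: a solid shape, so only the color aesthetic has a visible effect
solid <- ggplot(toy, aes(x = minutes, y = passes)) +
  geom_point(shape = 19, color = "black", size = 3)

# The constant values are carried through to the data used for drawing
filled_values <- layer_data(filled)
filled_values[, c("shape", "colour", "fill")]
```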
If you want to set color to be a constant value, you can do that in R using character strings for different colors. Figure 4.3 gives an example of a few of the different blues available in R. To find images that show all these named choices for colors in R, google “R colors” and search by “Images” (for example, there is a pdf here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf).
Later we will cover additional ways of handling colors in R, including different color palettes you can use. However, the “named” colors just shown can be a fast way to customize constant colors in R plots.
4.1.7 Other useful plot additions
There are also a number of elements besides geoms that you can add onto a
ggplot object using
+. A few that are used very frequently are:
|xlab, ylab|x- and y-axis labels|
|xlim, ylim|Limits of x- and y-axis|
You can also use this syntax to customize plot scales and themes, which we will discuss later in this section.
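As a minimal sketch of how these additions combine, the example below adds a title, axis labels, and axis limits to a scatterplot of made-up data (the dataframe and label text are hypothetical).

```r
library(ggplot2)

# Made-up data, for illustration only
toy <- data.frame(minutes = c(90, 180, 270, 360),
                  passes = c(25, 60, 110, 150))

labeled <- ggplot(toy, aes(x = minutes, y = passes)) +
  geom_point() +
  ggtitle("Passes by playing time") +  # plot title
  xlab("Time played (minutes)") +      # x-axis label
  ylab("Number of passes") +           # y-axis label
  xlim(c(0, 400)) +                    # x-axis limits
  ylim(c(0, 200))                      # y-axis limits
```

Keep in mind that observations falling outside the limits you set are dropped from the plot, with a warning, so it is worth choosing limits that cover the data you want to show.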
4.1.8 Example plots
In this subsection, we’ll show a few more examples of basic plots created with
ggplot2. For the example plots in this subsection, we’ll use a dataset in the
faraway package called
nepali. This gives data from a study of the health of a group of Nepalese children. You can load this data using:
# install.packages("faraway") ## Uncomment if you do not have the faraway package installed
library(faraway)
data(nepali)
Each observation in this dataframe represents a measurement for a child, including some physiological measurements like height and weight. Some children were measured multiple times and so have multiple observations in this data. Before plotting this data, we cleaned it a bit. We used tidyverse functions to select a subset of the columns: child id, sex, weight, height, and age. We also used the distinct function from dplyr to limit the dataset to the first measurement for each child.
nepali <- nepali %>%
  select(id, sex, wt, ht, age) %>%
  mutate(id = factor(id),
         sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))) %>%
  distinct(id, .keep_all = TRUE)
After this cleaning, the data looks like this:
      id    sex   wt    ht age
1 120011   Male 12.8  91.2  41
2 120012 Female 14.9 103.9  57
3 120021 Female  7.7  70.1   8
4 120022 Female 12.1  86.4  35
5 120023   Male 14.2  99.4  49
6 120031   Male 13.9  96.4  46
We’ll use this cleaned dataset to show how to use
ggplot2 to make histograms, scatterplots, and boxplots.
Histograms show the distribution of a single variable. Therefore,
geom_histogram() requires only one main aesthetic,
x, which should be numeric. For example, to create a histogram of children’s heights for the Nepali dataset (Figure 4.4), create a ggplot object with the data
nepali and with the height column (
ht) mapped to the ggplot object’s x aesthetic. Then add a histogram geom:
ggplot(nepali, aes(x = ht)) + geom_histogram()
If you run the code with no arguments for geom_histogram, you will get a message saying “stat_bin() using bins = 30. Pick better value with binwidth.” This message is just saying that a default number of bins was used to create the histogram. You can use arguments to change the number of bins used, but often this default is fine. You may also get a message that observations with missing values were removed.
You can add some elements to this plot to customize it a bit. For example (Figure 4.5), you can add a figure title (ggtitle) and clearer labels for the x-axis (xlab). You can also change the range of values shown by the x-axis (xlim):
ggplot(nepali, aes(x = ht)) + geom_histogram(fill = "lightblue", color = "black") + ggtitle("Height of children") + xlab("Height (cm)") + xlim(c(0, 120))
Note that these additional graphical elements are added on by adding function calls to ggtitle, xlab, and xlim to our ggplot object.
A scatterplot shows the association between two variables. To create a scatterplot, add a point geom (geom_point) to a ggplot object. For example, to create a scatterplot of weight versus height for the Nepali data (Figure 4.6), you can run the following code:
ggplot(nepali, aes(x = ht, y = wt)) + geom_point()
Again, you can use some of the options and additions to change the plot appearance. For example, to add a title, change the x- and y-axis labels, and change the color and size of the points on the scatterplot (Figure 4.7), you can run:
ggplot(nepali, aes(x = ht, y = wt)) + geom_point(color = "blue", size = 0.5) + ggtitle("Weight versus Height") + xlab("Height (cm)") + ylab("Weight (kg)")
You can also try mapping a variable to the color aesthetic of the plot. For example, to use color to show the sex of each child in the scatterplot (Figure 4.8), you can add an additional mapping of this optional aesthetic to the sex column of the nepali dataframe with the following code:
ggplot(nepali, aes(x = ht, y = wt, color = sex)) + geom_point(size = 0.5) + ggtitle("Weight versus Height") + xlab("Height (cm)") + ylab("Weight (kg)")
Boxplots are one way to show the distribution of a continuous variable. You can add a boxplot geom with the
geom_boxplot function. To plot a boxplot for a single, continuous variable, you can map that variable to
y in the
aes call and map
x to the constant
1. For example, to create a boxplot of the heights of children in the Nepali dataset (Figure 4.9), you can run:
ggplot(nepali, aes(x = 1, y = ht)) + geom_boxplot() + xlab("")+ ylab("Height (cm)")
You can also create separate boxplots, one for each level of a factor (Figure 4.10). In this case, you’ll need to map columns in the input dataframe to two aesthetics (x and y) when initializing the ggplot object. The y variable is the variable for which the distribution will be shown, and the x variable should be a discrete (categorical or TRUE/FALSE) variable, which will be used to group the variable.
ggplot(nepali, aes(x = sex, y = ht)) + geom_boxplot() + xlab("Sex")+ ylab("Height (cm)")
4.1.9 Extensions of ggplot2
There are a number of packages that extend
ggplot2 and allow you to create a variety of interesting plots. For example, you can use the
ggpairs function from the
GGally package to plot all pairs of scatterplots for several variables (Figure 4.11).
library(GGally)
ggpairs(nepali %>% select(sex, wt, ht, age))
Notice how this output shows continuous and binary variables differently. For example, the center diagonal shows density plots for continuous variables, but a bar chart for the categorical variable.
See https://www.ggplot2-exts.org to find more
ggplot2 extensions. Later in this course, we will give an overview of how to make your own extensions.