5 Data Visualization
In this chapter we discuss how to make different types of data visualization in R. We will use the ggplot2 package to visualize data. It is part of the tidyverse, so loading the tidyverse ensures that ggplot2 is loaded.
We shall also discuss a little about when to make different types of graphs, and what each type is best suited for. We will also give a few pieces of advice about how to make your visualizations as readable and interpretable as possible. For much more information on the theory of data visualization with excellent examples, please refer to Fundamentals of Data Visualization by Claus Wilke. To understand the power behind ggplot2 and for more data visualization examples, see ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham.
5.1 Introduction to ggplot2
The first thing to do when we want to make a visualization with ggplot2 is to load the tidyverse:
library(tidyverse)
Next, let’s load in some data. We’ll pick the BlueJays.csv data:
df <- read_csv("data/BlueJays.csv")
head(df)
## # A tibble: 6 x 9
## BirdID KnownSex BillDepth BillWidth BillLength Head Mass Skull Sex
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0000-00000 M 8.26 9.21 25.9 56.6 73.3 30.7 1
## 2 1142-05901 M 8.54 8.76 25.0 56.4 75.1 31.4 1
## 3 1142-05905 M 8.39 8.78 26.1 57.3 70.2 31.2 1
## 4 1142-05907 F 7.78 9.3 23.5 53.8 65.5 30.3 0
## 5 1142-05909 M 8.71 9.84 25.5 57.3 74.9 31.8 1
## 6 1142-05911 F 7.28 9.3 22.2 52.2 63.9 30 0
In the next few steps, we’ll slowly build up a plot using ggplot2. This is not how you would typically write the code. However, it is worth going step by step, just to show you the logic behind the code.
If we just run the function ggplot(), notice that all we get is a blank gray canvas. R knows that we want to make a plot, but it has no idea what type of plot or with what data, so it just draws the empty canvas:
ggplot()
Now, if we add the dataset to ggplot(), it still only gives us the blank canvas. It now knows we want to make a graph from the dataset called df, but it doesn’t plot anything yet because we haven’t told it what to plot:
ggplot(df)
For R to ‘know’ what you are trying to plot, you need to use aes(). Most of the time you put it inside ggplot(), after your dataframe name. (There are exceptions to this, but let’s not worry about that yet.) Inside aes() we say which columns contain our data for the x and y axes. We may also refer to other columns inside aes() if we wish to modify the color, shape or something else about our data based on the values in some column.
For our first example, let’s make a scatterplot of body mass against head size of these Blue Jays. If you look at the original dataset, you’ll notice that both the Mass and Head columns contain continuous numeric data (i.e. they are numbers).
In the code below, we are telling aes() to plot the Mass data on the x-axis and the Head data on the y-axis.
ggplot(df, aes(x=Mass, y=Head) )
Something did change this time. We get a plot with labels on the x- and y-axes. It recognizes that we wish to plot Mass and Head data. It even knows the range of our data on each axis. For instance, it knows that the Mass data lie somewhere between 55 and 85. However, we haven’t yet told it precisely what type of plot we want to make (it doesn’t just assume that we wanted to make a scatterplot - it can’t read our minds).
So our next step is to tell it to make a scatterplot by adding points to the graph. We tell ggplot() what we are adding to the chart by using different geoms. For a scatterplot, the geom we require is geom_point(), which means add datapoints. It is hard to remember all the different geoms, but you can just look them up.
Here is how we add datapoints to our graph with + geom_point().
ggplot(df, aes(x=Mass, y=Head) ) + geom_point()
That is our first ggplot graph! It looks pretty good. The amazing thing about ggplot is that almost everything you see on that graph can be customized to your preferred design. We’ll discuss several of these customizations throughout this chapter. First, let’s talk about changing the color of the datapoints. Inside geom_point() we can change the color of all the points like this:
ggplot(df, aes(x=Mass, y=Head) ) + geom_point(color="red")
This made the points red. Just make sure you use a recognized color name (you can look them up here) or a recognized hex code. Notice that the color name must be put inside quotes.
What if we want to color the points based on another variable? For example, instead of having all of our data points be red, say we want them to be colored based on whether the birds are male or female. The column that has this information is KnownSex. Because we are basing the color on a column, we put that information inside aes() with color = KnownSex. We don’t put it inside geom_point(). The code looks like this:
ggplot(df, aes(x=Mass, y=Head, color = KnownSex) ) + geom_point()
5.1.1 Assigning plots
When we make plots, our code can start to get quite long as we make more and more additions or changes to the plot. One very useful thing to know is that we can assign our plot to an object, just as we would with a vector or a dataframe. For instance, let’s remake the plot above, but this time we’ll assign it the name p. We do that using p <-.
p <- ggplot(df, aes(x=Mass, y=Head, color = KnownSex) ) + geom_point()
Now, whenever we type and run p we will get our plot, e.g.
p
5.1.2 Titles and Axes Titles
The advantage of assigning our plots to a short name is that we can add things with less code. In R, if we wish to add a title to a plot, we do this with + ggtitle(). So, here is how we add a title to our above plot:
p + ggtitle("Our first scatterplot")
The above plot is basically the same as writing:
ggplot(df, aes(x=Mass, y=Head, color = KnownSex) ) +
geom_point() +
ggtitle("Our First Scatterplot")
You’ll notice that we are chaining together commands with the +. This is similar to how we chain together commands with the %>% when doing data carpentry; ggplot() instead chains with the +. Be careful not to start a row with a +, and you must end a row with a + unless it’s the very last row.
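The rule about where the + goes is easiest to see side by side. The sketch below uses a small hand-typed data frame called birds (a hypothetical stand-in built from the first four Blue Jay rows shown earlier) so that it runs without the csv file. The version that does not work is left commented out.

```r
library(ggplot2)

# Hypothetical stand-in for the Blue Jay data: the first four rows shown earlier
birds <- data.frame(
  Mass = c(73.3, 75.1, 70.2, 65.5),
  Head = c(56.6, 56.4, 57.3, 53.8)
)

# Works: every row except the last one ends with a +
p_ok <- ggplot(birds, aes(x = Mass, y = Head)) +
  geom_point() +
  ggtitle("Rows end with a +")
p_ok

# Does not work as intended: R treats the first row as a complete command
# and draws an empty plot, then fails on the second row because it starts with a +
# ggplot(birds, aes(x = Mass, y = Head))
#   + geom_point()
```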
To change the title of the x-axis or the y-axis, we use xlab and ylab respectively. We can do it like this:
p + xlab("Body Mass (g)") + ylab("Head Size (mm)")
5.1.3 Colors, Shapes and Sizes
R recognizes many default color names. These can be found at either of these places:
Color names 1
Color names 2
Or, you can use a hex code
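If you would rather check color names from inside R, the base function colors() lists every built-in name. The following is a small sketch of how you might use it; the misspelled name is just an illustration.

```r
# colors() is a base R function that returns every built-in color name
all_colors <- colors()
head(all_colors)

# Check whether a name is recognized before using it in a plot
"dodgerblue" %in% all_colors   # TRUE
"dodgerblu" %in% all_colors    # FALSE, because it is misspelled

# List the color names that contain the word "blue"
grep("blue", all_colors, value = TRUE)

# Look up the red, green and blue values behind a name (30, 144, 255 here)
col2rgb("dodgerblue")
```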
Here we use the color dodgerblue to change all the points to that color:
ggplot(df, aes(x=Mass, y=Head) ) + geom_point(color="dodgerblue")
Here we change the points to the color #ababcc using a hex code. Note that hex codes need to have # in front of them:
ggplot(df, aes(x=Mass, y=Head) ) + geom_point(color="#ababcc")
You can also change the shape of the points you plot with geom_point(pch = ). You need to insert the appropriate number according to this guide:
For example, to have dodgerblue asterisks, we add pch = 8, separating the color and shape commands by a comma:
ggplot(df, aes(x=Mass, y=Head) ) + geom_point(color="dodgerblue", pch = 8)
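If you don’t have a shape guide to hand, one option is to draw your own. The sketch below plots the shapes numbered 0 to 25 in a grid, each labelled with its number. The data frame shapes and the grid layout are our own choices for illustration; scale_shape_identity() tells ggplot2 to use the numbers in the column directly as shapes.

```r
library(ggplot2)

# One row per point shape; 0 to 25 are the standard numbered shapes
shapes <- data.frame(
  shape = 0:25,
  x = 0:25 %% 7,       # column position in the grid
  y = -(0:25 %/% 7)    # row position (negative so that shape 0 starts at the top)
)

shape_chart <- ggplot(shapes, aes(x = x, y = y)) +
  geom_point(aes(shape = shape), size = 4, fill = "dodgerblue") +
  geom_text(aes(label = shape), nudge_y = -0.35, size = 3) +
  scale_shape_identity() +
  theme_void()
shape_chart
```

Shapes 21 to 25 are the only ones that use fill; the others are drawn in the color given to color=.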
Finally, we can change the size of our datapoints (or whatever other shape we choose) with the size = argument:
ggplot(df, aes(x=Mass, y=Head) ) + geom_point(color="purple", size=2)
5.1.4 Themes
Default Themes
You may have noticed that every plot we have made so far has the same gray background with faint white gridlines. This is the default look of ggplot() graphs. There are several other themes available to us that change the overall appearance of our plots. Some of these are listed below:
theme_bw(): A variation on theme_grey() that uses a white background and thin grey grid lines.
theme_linedraw(): A theme with only black lines of various widths on white backgrounds, reminiscent of a line drawing.
theme_light(): Similar to theme_linedraw() but with light grey lines and axes, to direct more attention towards the data.
theme_dark(): The dark cousin of theme_light(), with similar line sizes but a dark background. Useful to make thin colored lines pop out.
theme_minimal(): A minimalistic theme with no background annotations.
theme_classic(): A classic-looking theme, with x and y axis lines and no gridlines.
theme_void(): A completely empty theme.
Let’s show a couple of these different themes. The theme that we use the most in this course is theme_classic(). This is how you would apply this theme to your plot:
ggplot(df, aes(x=Mass, y=Head) ) +
geom_point() +
theme_classic()
It creates a very sleek, simple graph. The downside is that it gets rid of the gridlines, which can sometimes be helpful.
Another theme that we use often is theme_minimal(). Here is how we would add this:
ggplot(df, aes(x=Mass, y=Head) ) +
geom_point() +
theme_minimal()
This theme is also simple, but it keeps the gridlines.
Custom themes
Rather than changing many different aspects of the graph at once, we can change individual things one by one with theme(). We don’t propose to cover this in more detail in this book (for more information about themes look here). However, here is one quick example.
Let’s say we wanted to make the panel background light blue instead of gray. We could do it like this:
ggplot(df, aes(x=Mass, y=Head) ) +
geom_point() +
theme(panel.background = element_rect(fill = "lightblue"))
Again, this can get quite complicated - so stick with the default themes if you want to change your plots up a bit, or go to other help guides for more fine detail on customization.
5.2 Histograms
Histograms are very common data visualizations. A histogram plots the frequency (on the y-axis) of the values of a continuous variable (on the x-axis). For instance, let’s say we had the following data, which we’ll call d1:
d1 <- data.frame(vals = c(1, 3, 4, 3, 6, 7, 2, 9, 3, 2, 2, 3, 1, 5, 4, 4))
d1
## vals
## 1 1
## 2 3
## 3 4
## 4 3
## 5 6
## 6 7
## 7 2
## 8 9
## 9 3
## 10 2
## 11 2
## 12 3
## 13 1
## 14 5
## 15 4
## 16 4
If we wanted to know how many of each number we have in the column vals, we could use table():
table(d1$vals)
##
## 1 2 3 4 5 6 7 9
## 2 3 4 3 1 1 1 1
The table above represents the frequency table or frequency count of the data. We can plot these data like this:
In this histogram, the height of each bar represents how many times the value on the x-axis occurs in the data. So, the height of the bar at x=9 is one, which means we have one 9 in our distribution. The height of the bar at x=3 is four, so we have four 3s in our distribution.
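A histogram like the one described here could be drawn along the following lines. This is a sketch rather than the code behind the original figure: the fill and border colors are arbitrary choices, and scale_x_continuous(breaks = 1:9) is only there so that every bar gets a label.

```r
library(ggplot2)

d1 <- data.frame(vals = c(1, 3, 4, 3, 6, 7, 2, 9, 3, 2, 2, 3, 1, 5, 4, 4))

# binwidth = 1 gives one bar per whole number
hist_width1 <- ggplot(d1, aes(x = vals)) +
  geom_histogram(binwidth = 1, color = "white", fill = "dodgerblue") +
  scale_x_continuous(breaks = 1:9)
hist_width1
```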
In the example above, the width of the bars is precisely 1. We could change the width to say two. This is illustrated below:
Here, the first bar is at height 9. It spans the values of x from 1 to 3. The second bar is at height 4 and includes values from 3.01 to 5, and so on. What we did here was to adjust the binwidth. When we have large distributions, adjusting the binwidth helps us to interpret the data more easily.
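To check those bar heights, we can count the values in bins of width two ourselves with the base R function cut(), and then draw the matching histogram. This is a sketch under our own assumptions: the original figure’s code is not shown, and boundary = 1 is our choice so that the bin edges fall at 1, 3, 5, 7 and 9.

```r
library(ggplot2)

d1 <- data.frame(vals = c(1, 3, 4, 3, 6, 7, 2, 9, 3, 2, 2, 3, 1, 5, 4, 4))

# Count the values falling in bins of width two: [1,3], (3,5], (5,7], (7,9]
bins <- cut(d1$vals, breaks = seq(1, 9, by = 2), include.lowest = TRUE)
table(bins)   # the counts are 9, 4, 2 and 1

# The same bins drawn as a histogram; boundary = 1 asks for a bin edge at x = 1
hist_width2 <- ggplot(d1, aes(x = vals)) +
  geom_histogram(binwidth = 2, boundary = 1, color = "white", fill = "dodgerblue")
hist_width2
```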
5.2.1 Histograms with ggplot2
To describe how to make histograms with the ggplot() function, let’s look at the films.csv dataset.
film <- read_csv("data/films.csv")
head(film)
## # A tibble: 6 x 5
## film year rottentomatoes imdb metacritic
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Avengers: Age of Ultron 2015 74 7.8 66
## 2 Cinderella 2015 85 7.1 67
## 3 Ant-Man 2015 80 7.8 64
## 4 Do You Believe? 2015 18 5.4 22
## 5 Hot Tub Time Machine 2 2015 14 5.1 29
## 6 The Water Diviner 2015 63 7.2 50
This dataset contains 146 rows of data. Each row is a unique film, with the final three columns giving three different ratings of how good the film was. These are their respective rottentomatoes, imdb and metacritic scores.
If we wish to plot the distribution of imdb scores, we need to put x=imdb inside the aes() part of the ggplot code. That tells it to plot these scores on the x-axis. We do not need to put a y= inside this, as we are not plotting anything from our dataset on the y-axis. Instead, ggplot2 will count how many scores fall within regular intervals of imdb scores.
We then add + geom_histogram() to tell it to make a histogram. All together it looks like this:
ggplot(film, aes(x=imdb)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, this doesn’t look great and we have several problems with it. The two major problems that we get with our first histograms are:
1) The binwidth is almost never appropriate. We need to tell ggplot exactly what we want the binwidth on the x-axis to be, that is, what interval we want our scores to be counted over. Looking at the graph, our scores range from just below 4 to about 8.6. Perhaps a better interval would be 0.2, so we count how many films had scores between 3.6-3.8, 3.8-4.0, 4.0-4.2, 4.2-4.4, ... 8.4-8.6, 8.6-8.8 etc.
2) Having black bars makes it really hard to distinguish the bars when they are close in height. We need to fix the color scheme.
OK, let’s make the bars dodgerblue and border them white. Inside geom_histogram() we use color="white" to set the outline of the bars. We use fill="dodgerblue" to indicate that the color inside the bars should be dodgerblue.
ggplot(film, aes(x=imdb)) +
geom_histogram(color="white", fill="dodgerblue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now let’s fix that binwidth. To do this, inside geom_histogram() we write binwidth = 0.2.
ggplot(film, aes(x = imdb)) +
geom_histogram(binwidth = 0.2, color="white", fill="dodgerblue")
This looks a lot better. Now we can see that the majority of films have ratings in the 6.2-7.8 range, with relatively few above 8 and below 5. It’s not always easy to know what size interval to choose for the x-axis in histograms. It’s worth just playing around with that number and seeing how it looks.
When we set the interval to be some value (here, we chose 0.2), R doesn’t automatically place the intervals between easy to interpret numbers such as 4.0-4.2, 4.2-4.4 etc. It could just as easily have chosen 3.874-4.074, 4.074-4.274. Obviously, the latter is hard for us to interpret when looking at the axes. You can see in the above plot that the vertical lines of the histogram bars don’t neatly fall on top of whole numbers. To fix this, you can adjust the boundaries by picking a value that one of the bin edges should fall on. So, if we pick boundary=4, then 4 will be a boundary marker, and the intervals will go 4.0-4.2, 4.2-4.4 etc.
ggplot(film, aes(x = imdb)) +
geom_histogram(binwidth = 0.2, color="white", fill="dodgerblue",boundary=4)
Just be careful with using the boundaries that it does not crop your histogram incorrectly. Changing histograms too much can lead to misrepresenting the data. We would recommend that you don’t use the boundary feature unless you have a real need to do so - just be careful!
Like with all ggplot figures, you can add as much customization as you wish. Here, we add a new theme, title and x- and y-axis labels:
ggplot(film, aes(x = imdb)) +
geom_histogram(binwidth = 0.2, color="white", fill="dodgerblue") +
theme_classic() +
ggtitle("Histogram of IMDB Ratings") +
xlab("Rating") +
ylab("Frequency")
This looks really nice!
5.2.2 Density Curves
Instead of plotting the frequency or counts of values on the y-axis, we can instead plot density. Here, we essentially convert the histogram to a smooth line that estimates the overall shape of the distribution. We call this line a density curve. You can make this plot with ggplot() by using + geom_density() instead of + geom_histogram().
In the code below we do this for the imdb ratings, and we make the line color navy and the fill of the density curve dodgerblue:
ggplot(film, aes(x = imdb)) +
geom_density(color = "navy", fill = "dodgerblue")
Usually the fill of these plots is too much, so it’s nice to add some transparency. You can do that by providing a number between 0 and 1 to the alpha argument. Here we choose alpha = .4:
ggplot(film, aes(x = imdb)) +
geom_density(color = "navy", fill = "dodgerblue", alpha=.4)
The useful thing about density plots is that they give you a quick visual aid as to the overall shape of the distribution. You can easily see where the bulk of the data lie (here, ratings between 6 and 8), and whether the distribution is symmetrical or not.
5.2.3 Comparing Distributions
Instead of just plotting one histogram or one density curve, we often are interested in comparing two or more distributions. This means we are interested in comparing two or more histograms or density curves. To do this, we first need to ensure that our data are all measured in the same units.
Overlaid Histograms
To illustrate this, let’s use the lifeexp.csv data, which contains life expectancy data for many countries.
life <- read_csv("data/lifeexp.csv")
head(life)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia year_1952 28.8 8425333 779.
## 2 Afghanistan Asia year_2007 43.8 31889923 975.
## 3 Albania Europe year_1952 55.2 1282697 1601.
## 4 Albania Europe year_2007 76.4 3600523 5937.
## 5 Algeria Africa year_1952 43.1 9279525 2449.
## 6 Algeria Africa year_2007 72.3 33333216 6223.
You can see that one of the columns is called lifeExp, which is the life expectancy of each country in either 1952 or 2007. The year is shown in the year column, and the country is shown in the country column. You’ll notice that these data are in long format (see section 4.6).
Perhaps we are interested in the distribution of life expectancies across all countries in the year 1952 compared to the distribution of life expectancies in the year 2007. We have a few options to do this.
The first option does not look good for this example (although it may work in other situations). This is an overlaid histogram. To do this, inside aes(), as well as saying which column our distribution data is in with x=lifeExp, we also tell it to make separate histograms based on the year column with fill=year. This ensures it uses different fill colors for the two different years. Putting position="identity" inside geom_histogram() makes the two histograms overlap each other rather than being stacked on top of one another, which is the default. Putting color="black" and alpha=.7 inside geom_histogram() also helps distinguish the two histograms.
ggplot(life, aes(x=lifeExp, fill=year)) +
geom_histogram(binwidth=2, position="identity", color="black", alpha=.7) +
theme_minimal()
This plot is still pretty bad though. This method of plotting is better when the histograms are quite distinct from one another and there isn’t much overlap in the distributions.
Choosing two colors that contrast more strongly than the default colors can help. Here, we are using hex codes to pick a gray and a mustard yellow color. We manually define our fill colors using + scale_fill_manual(values = c("#999999", "#E69F00")). To change the colors, just change the hex codes to different ones or to the names of colors you’d like. Just make sure that you have the same number of colors as groups in your data. Here, we have two groups (1952 and 2007) so we need two colors. Also, notice that it says scale_fill_manual and not scale_color_manual. Because we are dealing with the inside color, this is considered to be a fill in ggplot2 terms. We used fill=year inside aes(), so we need to match that with fill when manually choosing colors.
ggplot(life, aes(x=lifeExp, fill=year)) +
geom_histogram( binwidth=2, position="identity", color="black", alpha=.7) +
theme_minimal() +
scale_fill_manual(values = c("#999999", "#E69F00"))
Overlaid Density Plots
Comparing distributions can also be done with geom_density. Overlaid density curves are usually easier to compare than overlaid histograms.
The default plot for this would be to include fill=year inside the aes() code, as the year column contains the groups that we wish to make separate plots for.
ggplot(life, aes(x=lifeExp, fill=year)) +
geom_density(alpha = 0.4)
We can add custom fill colors with + scale_fill_manual(values = c("#999999", "#E69F00")) and a different theme with + theme_classic().
ggplot(life, aes(x=lifeExp, fill=year)) +
geom_density(aes(fill = year), alpha = 0.4) +
scale_fill_manual(values = c("#999999", "#E69F00")) +
theme_classic()
This plot is now very easy to interpret. It’s clear that in 2007 most countries had life expectancies of over 70, with a tail towards lower life expectancies. In 1952, the opposite pattern is found, with most countries having life expectancies around 40 and the tail going towards higher life expectancies.
5.2.4 Stem-and-Leaf Plots
Stem-and-leaf plots are a simple version of histograms. Before the advent of computers, this kind of plot would sometimes be easier to make than a histogram. Their heyday was quite a few decades ago! In fact, nowadays, these types of plots are almost never made by researchers or data scientists in the real world. They are pretty much exclusive to introductory statistics courses. This is a bit of a shame because we think they are pretty cute.
Here is an example. Imagine we have the following numbers in a distribution. They may represent temperatures:
20, 20, 23, 28, 29, 31, 32, 39, 40, 41, 42, 44, 44, 45, 48, 49, 55, 55, 56, 58, 59, 61, 62, 65, 66, 67, 70, 71, 75, 82, 86
We can represent these in a stem-and-leaf plot as below. The first column represents the “tens” and the second column represents the “ones”. So the “6” at the end of the last row represents a temperature of 86. We put the second column data in ascending order. The lengths of the rows give a kind of sideways histogram.
2 | 00389
3 | 129
4 | 01244589
5 | 55689
6 | 12567
7 | 015
8 | 26
The columns do not have to be tens and ones. For instance, if our data had been in seconds, and the distribution was 2.0, 2.0, 2.3, 2.8 ... 7.5, 8.2, 8.6, we could have made the same stem-and-leaf plot, with the first column representing whole seconds and the second column tenths of a second.
There isn’t a simple ggplot way of making stem-and-leaf plots, but there is a built-in function called stem() that can make them.
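For instance, the temperatures listed earlier in this section can be passed straight to stem(). This is a minimal sketch; temps is simply a name we chose for the vector, and depending on the scale that stem() picks, the rows it prints may be grouped differently from a hand-drawn plot.

```r
# The temperatures listed earlier in this section
temps <- c(20, 20, 23, 28, 29, 31, 32, 39, 40, 41, 42, 44, 44, 45, 48, 49,
           55, 55, 56, 58, 59, 61, 62, 65, 66, 67, 70, 71, 75, 82, 86)

# Print the stem-and-leaf plot in the console
stem(temps)

# Count how many values share each tens digit; this is the length of each
# row in a plot that has one row per ten degrees
table(temps %/% 10)
```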
As another example, let’s return to our imdb ratings:
head(film)
## # A tibble: 6 x 5
## film year rottentomatoes imdb metacritic
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Avengers: Age of Ultron 2015 74 7.8 66
## 2 Cinderella 2015 85 7.1 67
## 3 Ant-Man 2015 80 7.8 64
## 4 Do You Believe? 2015 18 5.4 22
## 5 Hot Tub Time Machine 2 2015 14 5.1 29
## 6 The Water Diviner 2015 63 7.2 50
We can make a stem-and-leaf plot of the imdb column like this. The scale=0.6 parameter controls how long the stem-and-leaf plot should be. You can adjust it to your liking. Lower numbers make the plot shorter:
stem(film$imdb, scale=0.6)
##
## The decimal point is at the |
##
## 4 | 0234
## 4 | 6699
## 5 | 01224444
## 5 | 555556678999
## 6 | 0011112333333333444444
## 6 | 5555666666666777777789999999
## 7 | 0000111111122222222223333344444444
## 7 | 555555666777788888888899
## 8 | 012222344
## 8 | 6
Here, the lowest rating we have is 4.0, and the highest is 8.6.
5.3 Scatterplots
In the introduction to ggplot2, we already demonstrated how to make a scatterplot. Here we will show a few extra features of these plots. Scatterplots plot continuous variables on the x- and y-axes, and can be very useful for examining the association between two continuous variables. We use them a lot when plotting data related to correlation (see section 12) or regression (see section 13).
As we showed earlier, geom_point is used to add datapoints to scatterplots. We’ll do this for the cheese.csv dataset, which contains nutritional information about various cheeses:
cheese <- read_csv("data/cheese.csv")
head(cheese)
## # A tibble: 6 x 9
## type sat_fat polysat_fat monosat_fat protein carb chol fiber kcal
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 blue 18.7 0.8 7.78 21.4 2.34 75 0 353
## 2 brick 18.8 0.784 8.60 23.2 2.79 94 0 371
## 3 brie 17.4 0.826 8.01 20.8 0.45 100 0 334
## 4 camembert 15.3 0.724 7.02 19.8 0.46 72 0 300
## 5 caraway 18.6 0.83 8.28 25.2 3.06 93 0 376
## 6 cheddar 21.1 0.942 9.39 24.9 1.28 105 0 403
We’ll start with a simple scatterplot looking at the association between saturated fat on the x-axis and cholesterol on the y-axis.
ggplot(cheese, aes(x=sat_fat, y=chol) ) +
geom_point()
We can change the color of the points by adding a color inside geom_point, making sure that the color name is in quotes:
ggplot(cheese, aes(x=sat_fat, y=chol) ) +
geom_point(color = "purple")
To add a straight trendline through the data we use + stat_smooth(method = "lm"). The stat_smooth bit tells it to add a trendline, and the method = "lm" bit tells it to make the line straight (lm stands for linear model):
ggplot(cheese, aes(x=sat_fat, y=chol) ) +
geom_point(color = "purple") +
stat_smooth(method = "lm")
Here you can see it automatically puts a shaded area around your trendline, which represents a confidence interval around the trendline. You can remove it by adding se = FALSE or se = F inside stat_smooth():
ggplot(cheese, aes(x=sat_fat, y=chol) ) +
geom_point(color = "purple") +
stat_smooth(method = "lm", se = FALSE)
You can also change the color of the trendline by adding color= to stat_smooth:
ggplot(cheese, aes(x=sat_fat, y=chol) ) +
geom_point(color = "purple") +
stat_smooth(method = "lm", se= F, color = "black")
As with all ggplot2 graphs, you can customize the plot, for example by changing the theme or adding a title and axis titles:
ggplot(cheese, aes(x=sat_fat, y=chol) ) +
geom_point(color = "purple") +
stat_smooth(method = "lm", se= F, color = "black") +
xlab("Saturated Fat") +
ylab("Cholesterol") +
ggtitle("Saturated Fat vs Cholesterol") +
theme_minimal()
If you wish to change the color of the points based on a grouping variable, then you need to put color= inside aes() and give it the column that contains the grouping variable. For example, to change the color of points in our plot of body mass against head size in Blue Jays based on the sex of the birds:
head(df)
## # A tibble: 6 x 9
## BirdID KnownSex BillDepth BillWidth BillLength Head Mass Skull Sex
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0000-00000 M 8.26 9.21 25.9 56.6 73.3 30.7 1
## 2 1142-05901 M 8.54 8.76 25.0 56.4 75.1 31.4 1
## 3 1142-05905 M 8.39 8.78 26.1 57.3 70.2 31.2 1
## 4 1142-05907 F 7.78 9.3 23.5 53.8 65.5 30.3 0
## 5 1142-05909 M 8.71 9.84 25.5 57.3 74.9 31.8 1
## 6 1142-05911 F 7.28 9.3 22.2 52.2 63.9 30 0
ggplot(df, aes(x=Mass, y=Head, color = KnownSex) ) +
geom_point()
If you wish to customize the colors of your datapoints, then you need to add scale_color_manual() like this:
ggplot(df, aes(x=Mass, y=Head, color = KnownSex) ) +
geom_point() +
scale_color_manual(values = c("darkorange", "steelblue2")) +
theme_classic()
If you have a lot of points on your scatterplot, it can get quite hard to see all the datapoints. One way to deal with this is to change the transparency of the points. You can do this by adjusting the alpha level inside geom_point(). alpha= ranges from 0 to 1, with 0 being fully transparent and 1 being fully solid.
ggplot(df, aes(x=Mass, y=Head, color = KnownSex) ) +
geom_point(alpha=.4) +
scale_color_manual(values = c("darkorange", "steelblue2")) +
theme_classic()
5.3.0.1 Multiple Groups on a Scatterplot
We can add a separate trendline for each group of datapoints plotted on a scatterplot. Let’s look at the following data on the chemical components of different olive oils produced in Italy. This is what the data look like:
olives <- read_csv("data/olives.csv")
head(olives)
## # A tibble: 6 x 10
## macro.area region palmitic palmitoleic stearic oleic linoleic linolenic
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South Apulia.north 1075 75 226 7823 672 36
## 2 South Apulia.north 1088 73 224 7709 781 31
## 3 South Apulia.north 911 54 246 8113 549 31
## 4 South Apulia.north 966 57 240 7952 619 50
## 5 South Apulia.north 1051 67 259 7771 672 50
## 6 South Apulia.north 911 49 268 7924 678 51
## # ... with 2 more variables: arachidic <dbl>, eicosenoic <dbl>
If we use table(), we can see how many olive oils come from each area. There are three unique Italian areas where the olives come from:
table(olives$macro.area)
##
## Centre.North Sardinia South
## 151 98 323
Say we are interested in looking at how oleic and linoleic acid contents are related to each other by macro.area:
ggplot(olives, aes(x=oleic, y=linoleic, color=macro.area)) +
geom_point() +
theme_classic()
If we wanted to add a trendline for each area, all we need to do is add our stat_smooth(method="lm") line to the code. It already knows to plot separate trendlines for each group because inside aes() we have color=macro.area. As long as there is a group= or color= inside aes(), it knows to do things like adding trendlines separately for each group:
ggplot(olives, aes(x=oleic, y=linoleic, color=macro.area)) +
geom_point() +
stat_smooth(method="lm", se=F) +
theme_classic()
5.3.1 Bubble Charts
Bubble charts are an extension of scatterplots. In scatterplots we plot two continuous variables against each other. With a bubble chart we add a third continuous variable and vary the size of our datapoints according to this variable. For example, say we wish to also show skull size on our Blue Jay scatterplot. We could increase the size of the points for individuals with larger skull sizes. We do this by adding size=Skull into our aes() part:
ggplot(df, aes(x=Mass, y=Head, color = KnownSex, size = Skull) ) +
geom_point(alpha=.4) +
scale_color_manual(values = c("darkorange", "steelblue2")) +
theme_classic()
The issue with bubble charts is that they can start to look very cluttered, making it hard to actually see any patterns. They should probably be used sparingly. One trick you can employ to make them a little easier to read is to add scale_size() to the plot. Here, you enter two numbers to tell it the smallest and largest point sizes to scale to. In our example below, we used scale_size(range = c(.1, 4)), which makes our points range between sizes 0.1 and 4. This makes the plot a little less busy:
ggplot(df, aes(x=Mass, y=Head, color = KnownSex, size = Skull) ) +
geom_point(alpha=.4) +
scale_color_manual(values = c("darkorange", "steelblue2")) +
theme_classic() +
scale_size(range = c(.1, 4))
5.4 Line Graphs
Line graphs connect continuous values on the y-axis over time on the x-axis. They are very useful for showing patterns of change over time.
Let’s look at the jennifer.csv dataset:
jennifer <- read_csv("data/jennifer.csv")
head(jennifer)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1916 Female Jennifer 5 0.00000461
## 2 1919 Female Jennifer 6 0.00000511
## 3 1920 Female Jennifer 7 0.00000563
## 4 1921 Female Jennifer 5 0.00000391
## 5 1922 Female Jennifer 7 0.00000561
## 6 1923 Female Jennifer 9 0.00000719
This dataset shows the number n of children born each year (year) in the United States with the name Jennifer. In 1916 there were five children born with the name Jennifer. In 1917 there were 0. In 1923 there were 9.
This dataset goes up to 2017, when there were 1042 children born with the name Jennifer:
tail(jennifer)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 2012 Female Jennifer 1923 0.000993
## 2 2013 Female Jennifer 1689 0.000878
## 3 2014 Female Jennifer 1521 0.000779
## 4 2015 Female Jennifer 1283 0.000660
## 5 2016 Female Jennifer 1159 0.000601
## 6 2017 Female Jennifer 1042 0.000556
Therefore, we have a continuous variable (n) and a time variable (year). We can plot these as we would plot a scatterplot by supplying year to our x-axis and n to our y-axis. We could then add datapoints with geom_point(), essentially making a scatterplot:
ggplot(jennifer, aes(x=year, y=n) ) + geom_point()
But we aren’t dealing with just a scatterplot. These datapoints can be connected to each other as they are ordered in time. Instead of using geom_point(), we can use geom_line() to draw a line:
ggplot(jennifer, aes(x=year, y=n) ) + geom_line()
If you so desired, you could plot both the points and lines together:
ggplot(jennifer, aes(x=year, y=n) ) +
geom_point() +
geom_line()
You can adjust the colors of the lines and the points independently by supplying color=
inside of each geom:
e.g. Changing the color of the line, but not the points:
ggplot(jennifer, aes(x=year, y=n) ) +
geom_point() +
geom_line(color = "purple")
Changing the color of both the points and the line:
ggplot(jennifer, aes(x=year, y=n) ) +
geom_point(color = "violet") +
geom_line(color = "purple")
You can also change the width of lines by adding lwd=
to geom_line()
:
ggplot(jennifer, aes(x=year, y=n) ) +
geom_line(color = "purple", lwd=2)
There are also several different styles of lines. You can change these by adjusting the number you provide to lty=
inside of geom_line()
. Here are a few examples:
ggplot(jennifer, aes(x=year, y=n) ) + geom_line(lty=2)
ggplot(jennifer, aes(x=year, y=n) ) + geom_line(lty=3)
This illustration shows some of the linetype options:
Just a quick reminder: Please only connect datapoints into a line if it is meaningful to do so! This is almost always when your x-axis is some measure of time.
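As a small aside, line types can also be given by name instead of by number, and recent versions of ggplot2 (3.4.0 onwards, to the best of our knowledge) call the width argument linewidth=, with lwd= and lty= accepted as shorthand. The sketch below uses a made-up data frame called toy rather than the jennifer data, so treat it as an illustration and not as part of the chapter's analysis:

```r
library(ggplot2)

# Made-up stand-in for the jennifer data: one count per year.
toy <- data.frame(year = 2010:2015,
                  n    = c(50, 42, 37, 30, 28, 21))

# linetype = "dashed" is the named form of lty = 2;
# linewidth plays the role that lwd plays in the text above.
p_dashed <- ggplot(toy, aes(x = year, y = n)) +
  geom_line(color = "purple", linetype = "dashed", linewidth = 1)

# Type p_dashed at the console to draw the plot.
```

Named line types such as "solid", "dashed" and "dotted" can be easier to read back later than the numbers.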
5.4.1 Multiple Line Graphs
Often we wish to compare the patterns over time of different groups. We can do that by plotting multiple lines on the same graph.
Let’s look at this example dataset.
jenlinda <- read_csv("data/jenlinda.csv")
tail(jenlinda)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 2015 Female Jennifer 1283 0.000660
## 2 2015 Female Linda 425 0.000218
## 3 2016 Female Jennifer 1159 0.000601
## 4 2016 Female Linda 436 0.000226
## 5 2017 Female Jennifer 1042 0.000556
## 6 2017 Female Linda 404 0.000215
Here, we have data in long format. We still have our continuous outcome variable of n
in one column. We also have year
in another column. So we can plot these two against each other. Importantly, we can split our lines based on our grouping variable, which is the name
column. In that column we have two different groups - Jennifer and Linda.
To plot separate lines based on the name
column, we need to add group=name
to our aes()
. We’ve also added some custom labels, titles and a theme.
ggplot(jenlinda, aes(x=year, y=n, group=name)) +
geom_line()+
xlab("Year") +
ylab("Number of Children Born") +
ggtitle("Popularity of Names Jennifer & Linda in USA") +
theme_minimal()
You may notice that both lines are the same color! To make the lines have different colors, we insert color=name
into the aes()
instead of group=name
:
ggplot(jenlinda, aes(x=year, y=n, color=name)) +
geom_line()+
xlab("Year") +
ylab("Number of Children Born") +
ggtitle("Popularity of Names Jennifer & Linda in USA") +
theme_minimal()
Again, we could customize these colors if we did not like them with scale_color_manual()
like this:
ggplot(jenlinda, aes(x=year, y=n, color=name)) +
geom_line()+
xlab("Year") +
ylab("Number of Children Born") +
ggtitle("Popularity of Names Jennifer & Linda in USA") +
theme_classic() +
scale_color_manual(values=c("#ffadf3", "#800f4f"))
Just insert your favorite colors, and make sure you provide the same number of colors as you have separate groups/lines.
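To see the difference between group= and color= more concretely, here is a minimal sketch. It rebuilds only the last three years from the tail() output shown earlier into a small data frame called toy (an assumption made so the example runs without the csv file). It also shows one optional habit that is not part of the code above: naming the colors in scale_color_manual() so that each color is tied to a specific group.

```r
library(ggplot2)

# Last three years of each name, typed in by hand from the tail() output.
toy <- data.frame(
  year = rep(2015:2017, times = 2),
  name = rep(c("Jennifer", "Linda"), each = 3),
  n    = c(1283, 1159, 1042, 425, 436, 404)
)

p_group <- ggplot(toy, aes(x = year, y = n, group = name)) + geom_line()
p_color <- ggplot(toy, aes(x = year, y = n, color = name)) + geom_line()

# Both versions split the data into two lines ...
length(unique(layer_data(p_group)$group))
length(unique(layer_data(p_color)$group))

# ... but only the color version gives each line its own color.
length(unique(layer_data(p_group)$colour))
length(unique(layer_data(p_color)$colour))

# Optional: a named vector pins each color to a group by name.
p_named <- p_color +
  scale_color_manual(values = c(Jennifer = "#ffadf3", Linda = "#800f4f"))
```

With a named vector, the order in which you list the colors no longer matters.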
5.5 Comparing Distributions across Groups
One of the most important data visualizations that we make is to compare the distribution of data across groups. Here we have a categorical variable on the x-axis, and a continuous variable on the y-axis. For some reason, the most common way to represent these data in most of the scientific literature is to plot bar graphs with error bars - so-called dynamite plots. However, in our very strong opinion these plots are dreadful and you should never use them. Fortunately others agree. Instead, please choose from strip plots, boxplots or violin plots, or a combination, depending upon your data.
In this section we’ll use the wheels1.csv
dataset. These data show the number of revolutions of a running wheel made by mice over a four day period. The mice vary by their strain (type). Here we just select the id
, strain
and day4
columns for this example:
wheels <- read_csv("data/wheels1.csv")
wheels1 <- wheels %>% select(id, strain, day4)
head(wheels1)
## # A tibble: 6 x 3
## id strain day4
## <chr> <chr> <dbl>
## 1 692ao B6 12516.
## 2 656aa B6 7404.
## 3 675ag B6 3761
## 4 675ai B6 11684
## 5 656af B6 8468.
## 6 656al B6 9291
The day4
column represents how many wheel revolutions the mice made on their fourth day running in the wheel. Some mice really like running in the wheel, others aren’t as bothered!
Let’s have a look at how many datapoints we have in each strain:
table(wheels$strain)
##
## B6 F1-129B6 F1-B6129 S129 Swiss
## 14 22 15 16 13
We have 80 mice in five different strains.
5.5.1 Strip Plots
Strip plots essentially just plot the raw data. It’s like plotting a scatterplot, except we plot a categorical variable on the x-axis.
So in our example, inside aes()
we’ll put strain on the x-axis with x=strain
, and on the y-axis we put our outcome variable with y=day4
. We’ll add datapoints with + geom_point()
:
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_point() +
theme_classic()
The major issue with this plot is that all the points are in a very straight line, and it can be difficult to distinguish between different points. To change this, instead of using geom_point()
we use geom_jitter()
which bounces the points around a bit:
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_jitter() +
theme_classic()
Whoops! The points exploded. Now it’s not possible to know which points belong to which group. To constrain this, we can set width=
inside of geom_jitter()
which tells the points how far they are allowed to bounce around:
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_jitter(width = .15) +
theme_classic()
This looks a lot better!
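One detail worth knowing: as far as we understand the defaults, geom_jitter() nudges points a little in the vertical direction as well as sideways. If you want the y values to stay exactly where they were recorded, you can set height = 0. The sketch below uses a made-up data frame (toy), and calls set.seed() so that the random jitter is repeatable:

```r
library(ggplot2)

# Made-up running data: three strains, four mice each.
toy <- data.frame(
  strain = rep(c("B6", "S129", "Swiss"), each = 4),
  day4   = c(12516, 7404, 3761, 11684,
             2100, 1800, 2500, 900,
             9000, 8700, 10200, 9900)
)

set.seed(1)  # jitter is random; a seed makes the picture repeatable
p_jit <- ggplot(toy, aes(x = strain, y = day4)) +
  geom_jitter(width = 0.15, height = 0) +  # spread sideways only
  theme_classic()

# With height = 0 the plotted y values should match the recorded ones.
all(sort(layer_data(p_jit)$y) == sort(toy$day4))
```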
5.5.2 Boxplots
Boxplots are a very useful way of summarizing the distribution of data. The image below summarizes what each line in the boxplot represents. For more details on all of these descriptive measures see section 6:
The middle horizontal line is at 6. This represents the median of the distribution, which is the middle value: 50% of the distribution lies above this value and 50% below it. The higher horizontal line at the top of the box represents the upper quartile. This is approximately the median of the upper 50% of the data, so it is approximately the 75th percentile. The lower horizontal line at the bottom of the box represents the lower quartile. This is approximately the median of the lower 50% of the data, so it is approximately the 25th percentile. Therefore, the middle 50% of the data (from the 25th percentile to the 75th percentile) lies inside the box. The long vertical lines represent the range of the data. The top of that line is the maximum value in the data, and the bottom of that line is the minimum value.
The above is a basic boxplot. However, ggplot2
does things a little bit differently. It turns out there is more than one way to calculate the lower and upper quartiles (see section 6.4.2). Also, R doesn’t necessarily extend the vertical lines (whiskers) all the way to the minimum and maximum values. If there are datapoints that are too far away from the upper or lower quartile, then it truncates the whisker and shows datapoints outside of this range as dots. Here is an illustration of a ggplot boxplot:
OK, let’s have a look at some boxplots using ggplot()
. We provide the same x=strain
and y=day4
values as we do with strip plots. Instead of geom_jitter()
we use geom_boxplot()
:
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_boxplot() +
theme_classic()
You can see in this example, that the strain “F1-129B6” and the strain “S129” both have two datapoints that are shown as outliers beyond the whiskers.
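To make the idea of "too far away" concrete, here is a worked sketch in base R. It assumes the usual rule that a point is flagged as an outlier when it lies more than 1.5 times the interquartile range beyond the box, which is our understanding of the geom_boxplot() default (controlled by its coef argument). The ten values are made up for illustration:

```r
# Ten made-up values, one of them unusually large.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- unname(quantile(x, 0.25))  # lower quartile (R's default method)
q3  <- unname(quantile(x, 0.75))  # upper quartile
iqr <- q3 - q1                    # height of the box

# Points further than 1.5 * IQR beyond the box are treated as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers    <- x[x < lower_fence | x > upper_fence]

# Each whisker stops at the most extreme value that is not an outlier.
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

c(q1 = q1, median = median(x), q3 = q3)  # 3.25, 5.50, 7.75
outliers                                  # 30
whiskers                                  # 1 9
```

Here the box runs from 3.25 to 7.75, so the upper cutoff is 7.75 + 1.5 × 4.5 = 14.5. The value 30 is beyond that, so it would be drawn as a dot, and the upper whisker would stop at 9.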
To change the colors of the boxplots, you can change color=
and fill=
inside geom_boxplot()
. Remember that color
refers to the color of the lines, and fill
refers to the filled in color of the shape:
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1") +
theme_classic()
You can change the size, color and shape of the outliers. For instance, to remove them completely, we do outlier.shape=NA
inside geom_boxplot()
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1", outlier.shape = NA) +
theme_classic()
To change the size and color, you can use outlier.size
and outlier.color
respectively inside geom_boxplot()
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1", outlier.size = .5, outlier.color = "gray66") +
theme_classic()
5.5.2.1 Overlaying points
It can often be helpful to overlay your raw datapoints over the top of boxplots, providing that you don’t have too much data. To do this, just add your points with either geom_point()
or preferably geom_jitter()
. But one warning - make sure you remove any outliers with outlier.shape=NA
otherwise those datapoints will show up twice:
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1", outlier.shape = NA) +
geom_jitter(width=0.15, color="navy") +
theme_classic()
Sometimes this can look a bit too busy. One way to contrast things is to set either the points or the boxplots themselves to have some transparency with alpha=
.
Here we make the points a bit transparent:
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1", outlier.shape = NA) +
geom_jitter(width=0.15, color="navy", alpha = .3) +
theme_classic()
Here we leave the points solid, but make the boxplot fill transparent:
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1", outlier.shape = NA, alpha = .3) +
geom_jitter(width=0.15, color="navy") +
theme_classic()
And finally, making both a bit transparent, adding some custom titles and labels. You can just choose what you think looks best!
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1", outlier.shape = NA, alpha = .3) +
geom_jitter(width=0.15, color="navy", alpha=.3) +
theme_classic() +
xlab("Mouse Strain") +
ylab("Total Revolutions") +
ggtitle("Mouse Wheel Running")
5.5.2.2 Reordering categorical x-axes
The boxplot that we made looks ok, but one thing is visually annoying. The boxes are plotted in alphabetical order on the x-axis (B6, F1-129B6….. Swiss). There is no reason why they should be in this order. A more visually appealing way would be to order the boxplots from the group with the highest median to the lowest median.
To do this, instead of putting x=strain
inside of aes()
we put x = reorder(strain, -day4, median)
inside instead. This is a bit of a mouthful. To break it down, it’s saying plot strain on the x-axis, but reorder the groups based on the median of the day4 values within each strain. The minus sign in ‘-day4’ reverses the usual increasing order, so the group with the highest median comes first.
ggplot(wheels1, aes(x = reorder(strain, -day4, median), y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1", alpha = .3) +
theme_classic()
The output looks pretty good - it is now really easy to notice that one of the groups has a much lower distribution than the others. The major issue is that the label of the x-axis is terrible. So let’s fix that:
ggplot(wheels1, aes(x = reorder(strain, -day4, median), y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1", alpha = .3) +
theme_classic() +
xlab("Mouse Strain") +
ylab("Total Revolutions") +
ggtitle("Mouse Wheel Running")
Much nicer!
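If you want to check what reorder() is doing without making a plot, you can look at the order of the levels it returns. This is a minimal sketch on a made-up data frame (toy), with three invented strains whose medians are easy to read off:

```r
# Made-up data: three strains with clearly different medians.
toy <- data.frame(
  strain = rep(c("A", "B", "C"), each = 3),
  day4   = c(1, 2, 3,  7, 8, 9,  4, 5, 6)
)

# Median of day4 within each strain.
tapply(toy$day4, toy$strain, median)  # A = 2, B = 8, C = 5

# reorder() sorts groups by increasing summary value, so negating day4
# puts the strain with the highest median first.
ordered_strain <- reorder(toy$strain, -toy$day4, median)
levels(ordered_strain)                # "B" "C" "A"
```

The first level is the one that gets plotted furthest left on the x-axis.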
5.5.2.3 Flipping Axes
Often boxplots look perfectly fine with the categorical grouping variable on the x-axis and the continuous variable on the y-axis. If you start to have many groups, then sometimes the boxplots look too cluttered when placed along the x-axis. In this situation, it might look better to flip the axes, and have the boxplots stacked vertically. To do this, you write your plot code exactly as you would normally, but you just add + coord_flip()
to the end of the code.
Let’s add this to the reordered boxplots we just made in the previous section:
ggplot(wheels1, aes(x = reorder(strain, -day4, median), y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1", alpha = .3) +
theme_classic() +
xlab("Mouse Strain") +
ylab("Total Revolutions") +
ggtitle("Mouse Wheel Running") +
coord_flip()
This is OK, but it would be nicer if the highest values were at the top. There may well be a more straightforward way of doing this, but a quick solution is to wrap reorder(strain, -day4, median)
with fct_rev
, so you now have x = fct_rev(reorder(strain, -day4, median))
. It’s a whole lot of code, but it does make the graph really pretty, so it’s worth it:
ggplot(wheels1, aes(x = fct_rev(reorder(strain, -day4, median)), y = day4)) +
geom_boxplot(color="navy", fill="lightsteelblue1", alpha = .3) +
theme_classic() +
xlab("Mouse Strain") +
ylab("Total Revolutions") +
ggtitle("Mouse Wheel Running") +
coord_flip()
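One possibly simpler route, offered as a suggestion rather than as the method used above: reversing a decreasing order gives an increasing order, so dropping the minus sign should produce the same arrangement as wrapping the call in fct_rev(), at least when no two groups share the same median. The sketch below checks this on a made-up data frame (toy):

```r
library(forcats)

# Made-up data: three strains with clearly different medians.
toy <- data.frame(
  strain = rep(c("A", "B", "C"), each = 3),
  day4   = c(1, 2, 3,  7, 8, 9,  4, 5, 6)
)

# The approach from the text: order by decreasing median, then reverse.
reversed  <- fct_rev(reorder(toy$strain, -toy$day4, median))

# The shorter alternative: order by increasing median directly.
ascending <- reorder(toy$strain, toy$day4, median)

levels(reversed)
levels(ascending)
identical(levels(reversed), levels(ascending))
```

Either version can go inside aes() as the x variable before adding coord_flip().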
5.5.3 Violin Plots
A disadvantage of boxplots, especially when you have large distributions, is that the box does not tell you much about the overall shape of the distribution. An alternative are violin plots, where the width of the shape reflects the shape of the distribution. To make these plots, instead of using geom_boxplot()
we use geom_violin()
.
ggplot(wheels1, aes(x = strain, y = day4)) +
geom_violin() +
theme_classic()
You can do all the customizing, reordering, coloring, transparency-ing, etc that you do with boxplots:
ggplot(wheels1, aes(x = fct_rev(reorder(strain, -day4, median)), y = day4)) +
geom_violin(color="navy", fill="lightsteelblue1", alpha = .3) +
theme_classic() +
xlab("Mouse Strain") +
ylab("Total Revolutions") +
ggtitle("Mouse Wheel Running") +
coord_flip()
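Because each geom is just another layer, violins and boxplots can also be combined on the same plot. The following is a sketch on simulated data (the toy data frame and its numbers are invented), with a narrow boxplot drawn inside each violin using width=:

```r
library(ggplot2)

# Simulated running data for three invented strains.
set.seed(42)
toy <- data.frame(
  strain = rep(c("A", "B", "C"), each = 30),
  day4   = c(rnorm(30, 8000, 1500),
             rnorm(30, 3000, 800),
             rnorm(30, 10000, 2000))
)

p_combo <- ggplot(toy, aes(x = strain, y = day4)) +
  geom_violin(color = "navy", fill = "lightsteelblue1", alpha = 0.3) +
  geom_boxplot(width = 0.15, color = "navy", outlier.shape = NA) +  # narrow box inside the violin
  theme_classic()

# Type p_combo at the console to draw the plot.
```

The violin shows the overall shape, while the box marks the median and quartiles.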
5.5.4 Stacked Boxplots
Sometimes, we want to compare distributions for the same group side by side. For instance, we may not just want to plot the day4 wheel running data, but also plot the day1 data.
Below, we have data in wide format. We have ids, strain, day1 running and day4 running.
wheels2 <- wheels %>% select(id, strain, day1, day4)
head(wheels2)
## # A tibble: 6 x 4
## id strain day1 day4
## <chr> <chr> <dbl> <dbl>
## 1 692ao B6 12853 12516.
## 2 656aa B6 2644 7404.
## 3 675ag B6 4004. 3761
## 4 675ai B6 11754. 11684
## 5 656af B6 6906. 8468.
## 6 656al B6 6517 9291
We need to turn this to long data to be able to make the stacked boxplot graph. See section 4.6 for more on how to switch between wide and long data formats:
wheels2.long <- wheels2 %>% pivot_longer(cols = 3:4, names_to = "day")
wheels2.long
## # A tibble: 160 x 4
## id strain day value
## <chr> <chr> <chr> <dbl>
## 1 692ao B6 day1 12853
## 2 692ao B6 day4 12516.
## 3 656aa B6 day1 2644
## 4 656aa B6 day4 7404.
## 5 675ag B6 day1 4004.
## 6 675ag B6 day4 3761
## 7 675ai B6 day1 11754.
## 8 675ai B6 day4 11684
## 9 656af B6 day1 6906.
## 10 656af B6 day4 8468.
## # ... with 150 more rows
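The reshaping step can be sketched on a tiny made-up table. This version differs from the code above in two optional ways, both of which are our own choices: it selects the columns by name rather than by position, and it uses values_to= to give the new column a more descriptive name than the default value. If you rename the column like this, you would need to use that name in aes() as well.

```r
library(tidyr)

# Two made-up mice in wide format: one column per day.
wide <- data.frame(
  id     = c("m1", "m2"),
  strain = c("B6", "S129"),
  day1   = c(12853, 2644),
  day4   = c(12516, 7404)
)

# One row per mouse per day: column names go to "day",
# the numbers go to "revolutions".
long <- pivot_longer(wide,
                     cols      = c(day1, day4),
                     names_to  = "day",
                     values_to = "revolutions")
long
```

Each mouse now appears on two rows, one for each day.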
Now the wheel running data is in its own column - value
. So we use y=value
. The grouping variable is in the day
column, so we use fill=day
to make separate boxplots based on the day. This will also make them different fill colors:
ggplot(wheels2.long, aes(x = strain, y = value, fill=day)) +
geom_boxplot() +
theme_classic()
This looks ok, but the colors are yucky. Let’s add custom titles, labels, and we’ll customize the fill colors using scale_fill_manual
. We have two groups (day1 and day4) so we need to provide two colors:
ggplot(wheels2.long, aes(x = strain, y = value, fill=day)) +
geom_boxplot() +
theme_classic() +
scale_fill_manual(values = c("#9381e3", "#faff75")) +
xlab("Mouse Strain") +
ylab("Total Revolutions") +
ggtitle("Mouse Wheel Running")
It turns out all the strains increase their overall running in wheels from day 1 to day 4, except the S129 strain who get bored with wheel running - probably similar to how you’re bored of seeing graphs about wheel running.
5.5.5 Ridgeline Plots
Another useful way of displaying distributions of data across groups is using ridgeline plots. These are essentially density histogram plots for each categorical group plotted side by side. To do this we need to use a package called ggridges
. This can be installed by going to the Packages
tab, selecting Install
and typing in ggridges
in the box.
Let’s go back to the olives
data. Say we are interested in displaying the distribution of oleic
acid content by macro.area
. We plot the categorical group of interest (here macro.area
) on the y-axis, and the continuous variable whose distribution we are interested in (oleic
) on the x-axis. We then use stat_density_ridges()
to plot the ridgeline plots.
library(ggridges)
ggplot(olives, aes(x = oleic, y = macro.area)) +
stat_density_ridges() +
theme_classic()
You can add color by adding in a fill=
to the aes()
.
ggplot(olives, aes(x = oleic, y = macro.area, fill = macro.area)) +
stat_density_ridges() +
theme_classic()
… and perhaps we can manually override the default color scheme - here I’m using hex codes to pick a very purpley color scheme:
ggplot(olives, aes(x = oleic, y = macro.area, fill = macro.area)) +
stat_density_ridges() +
theme_classic() +
scale_fill_manual(values=c("#D1B8D0", "#F78EF2", "#AC33FF"))
A nice thing about these ridgeline plots is that we can easily add on lines that represent the lower quartile, median and upper quartile by adding in the argument quantile_lines = TRUE
like this:
ggplot(olives, aes(x = oleic, y = macro.area, fill = macro.area)) +
stat_density_ridges(quantile_lines = TRUE) +
theme_classic() +
scale_fill_manual(values=c("#D1B8D0", "#F78EF2", "#AC33FF"))
The final ridgeline plot below plots the distributions of oleic acid by region. There are 9 regions. It’s best in these plots to try and plot the categories from highest to lowest median, as it looks nicer. The following code is a bit tricky, and if you’re not interested - then you can safely ignore. However, just in case it is of interest to anyone: to do that you need to make sure ggplot
recognizes the categorical variable region
in this case to be a factor (a grouped variable) and that they are in the right order. It can be done using this line: fct_reorder(region, -oleic, .fun = median)
. Essentially this says, make the region
variable a factor, and reorder it to be from highest median of oleic acid to lowest. One final thing - you have to do this for both the y
axis category, and the fill
- otherwise your colors won’t match your y-axis categories.
In the below code, I also added quantiles, x-axis and y-axis titles, a title and I removed the legend as it didn’t add any extra information that isn’t already on the plot.
ggplot(olives, aes(x = oleic,
y = fct_reorder(region, -oleic, .fun = median),
fill = fct_reorder(region, -oleic, .fun = median)
)) +
stat_density_ridges(quantile_lines = TRUE) +
theme_classic() +
scale_fill_manual(values=c("#0000FF", "#2000DF", "#4000BF", "#60009F", "#800080", "#9F0060", "#BF0040", "#DF0020", "#FF0000")) +
theme(legend.position = "none") +
ylab("Region") +
xlab("Oleic Acid Content") +
ggtitle("Oleic Acid Content of Italian Olives by Region")
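To check what fct_reorder() does to the order of the groups, you can inspect the levels it returns. The sketch below uses a made-up data frame (toy) with three invented regions. It also shows the .desc = TRUE argument of fct_reorder(), which we believe gives the same ordering as negating the values when no two groups have the same median:

```r
library(forcats)

# Made-up oleic values for three invented regions.
toy <- data.frame(
  region = rep(c("North", "South", "West"), each = 3),
  oleic  = c(70, 72, 74,  80, 82, 84,  60, 62, 64)
)

# Medians are North = 72, South = 82, West = 62.
by_negation <- fct_reorder(toy$region, -toy$oleic, .fun = median)
by_desc     <- fct_reorder(toy$region, toy$oleic, .fun = median, .desc = TRUE)

levels(by_negation)  # "South" "North" "West"
levels(by_desc)
```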
For more information about these plots, you can look up the help documentation for the ggridges package.
5.6 Bar Graphs
A common form of data that we wish to show are the amounts of different categories. Often these data could be presented in a table format. For instance, the table below shows the total number of number 1 hits by six different artists in the UK.
A table is a completely legitimate way to present data. A graphical way of presenting these same data would be to make a bar graph. In these plots we have a categorical grouping variable on the x-axis and a numerical value (either continuous or discrete) on the y-axis. An advantage of bar graphs over tables is that it is often easier to visualize the proportional differences between categories in their values when looking at a bar graph compared to a table. Bar graphs are therefore especially useful when the differences between groups are larger.
If we wish to make a bar graph using ggplot()
our data may come in two different ways. First, we may already have the totals that we wish to plot - that is our dataset already contains the values that the bar heights will be at. Second, we may not have these counts but need R to calculate them for us. These two different initial data setups require different geoms to create bar graphs.
geom_col()
Let’s first describe the situation when you have a dataset where you have already counted the number that applies to each group. We will use the number1s.csv
data which contains the same data as the table above.
df1 <- read_csv("data/number1s.csv")
head(df1)
## # A tibble: 6 x 2
## name total
## <chr> <dbl>
## 1 Elvis 21
## 2 The Beatles 17
## 3 Cliff Richard 14
## 4 Westlife 14
## 5 Madonna 13
## 6 Take That 12
When data look like this and you have one column that is the category (x = name
) and one column containing the numerical data (y = total
), you can use geom_col()
.
ggplot(df1, aes(x = name, y = total) ) +
geom_col() +
theme_classic()
Notice that the default order is alphabetical. You can reorder by putting reorder
around the x-axis column. If you put reorder(name, total)
this is telling it to reorder the name
variable by their respective increasing values of total:
ggplot(df1, aes(x = reorder(name, total), y = total) ) +
geom_col() +
theme_classic()
Alternatively, if you put reorder(name, -total)
, with the -
sign in front of ‘total’, this is telling it to reorder the name
variable by their respective decreasing values of total:
ggplot(df1, aes(x = reorder(name, -total), y = total) ) +
geom_col() +
theme_classic()
To change the color of the bars, you need to put fill=
inside geom_col()
as we are dealing with filling in a shape:
ggplot(df1, aes(x = reorder(name, -total), y = total) ) +
geom_col(fill = "#32b5ed") +
theme_classic()
If you wish to add a different color border around the bars, then you can add color=
inside the geom_col()
:
ggplot(df1, aes(x = reorder(name, -total), y = total) ) +
geom_col(fill = "#32b5ed", color="#193642") +
theme_classic()
And, as per usual, all other customizations are acceptable, including rotating the chart using coord_flip()
:
ggplot(df1, aes(x = reorder(name, total), y = total) ) +
geom_col(fill = "#32b5ed", color="#193642") +
xlab("") +
ylab("Total Number 1's") +
ggtitle("Number 1 hits in UK") +
theme_classic() +
coord_flip()
In the code for this flipped bar graph, notice that we removed the -
from next to -total
when reordering. If we’d left it in, it would have plotted the bars in the opposite order. If you are unsure with your own data whether to use it or not - just see what happens with and without it. Bar graphs look better when the highest value is at the top.
One other key thing about bar graphs, is that they should technically start at 0. As you are visualizing amounts, it would be misleading to start the graph at e.g. 10 in the above example. That would distort the relationship of the length of the bars to each other. Some people extend this rule to all graphs, but this is a misconception. Often we don’t need to know where 0 is for boxplots for instance. However, it is generally important to know where 0 is for bar graphs if we wish to compare bars between groups.
geom_bar()
Often we want to make bar graphs to visualize how many we have of each group, but we don’t yet know how many we have! For example, take the following dataset which is found in pets.csv
.
pets <- read_csv("data/pets.csv")
head(pets)
## # A tibble: 6 x 2
## name pet
## <chr> <chr>
## 1 Leon Cat
## 2 Lesley Dog
## 3 Devon Dog
## 4 Timothy Dog
## 5 Paul None
## 6 Jody Cat
These data show different individuals in a class in the name
column and what their favorite pet is in the pet
column. Perhaps we want to visualize which pets are the most popular. We’d like to get the total number of people who put ‘cat’ as their favorite, the total number of people that put ‘dog’ down and so on.
One quick way to visually inspect how many we have of each pet in the pet
column is to use the function table()
:
table(pets$pet)
##
## Bird Cat Dog None
## 2 6 11 6
To make the bar graph of these data using ggplot()
, we need to use geom_bar()
. Fortunately, geom_bar()
counts how many we have of each for us. We do not need to supply a y column. We just need to supply x=pet
to indicate that that column will be our grouping variable.
ggplot(pets, aes(x = pet)) +
geom_bar() +
theme_classic()
Once we have the basic plot down, all the other customizations can be added:
ggplot(pets, aes(x = pet)) +
geom_bar(color="black", fill="plum3") +
theme_classic()+
xlab("Pet")+
ylab("Total")+
ggtitle("Popularity of Pets in a Class")
You can also reorder your factor. With geom_bar()
we reorder in a similar way to how we did with geom_boxplot()
. We use x = reorder(pet, pet, table)
to tell it to reorder the pet category according to the frequency count of each as calculated by the table()
function. Using coord_flip()
makes it easier to read and compare bars.
ggplot(pets, aes(x = reorder(pet, pet, table))) +
geom_bar(color="black", fill="plum3") +
theme_classic()+
xlab("Pet")+
ylab("Total")+
ggtitle("Popularity of Pets in a Class") +
coord_flip()
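The relationship between the two geoms in this section can be sketched as follows. Because the csv file is not needed for the idea, the example rebuilds a pets-like column from the counts printed by table() above (the pets_toy and totals names are our own). Counting first and then using geom_col() should give the same bar heights as letting geom_bar() do the counting:

```r
library(ggplot2)

# Rebuild a pets-like column from the counts printed by table() above.
pets_toy <- data.frame(
  pet = rep(c("Bird", "Cat", "Dog", "None"), times = c(2, 6, 11, 6))
)

# Route 1: let geom_bar() count the rows for us.
p_bar <- ggplot(pets_toy, aes(x = pet)) + geom_bar()

# Route 2: count first, then draw the totals with geom_col().
totals <- as.data.frame(table(pet = pets_toy$pet), responseName = "total")
p_col  <- ggplot(totals, aes(x = pet, y = total)) + geom_col()

layer_data(p_bar)$y  # bar heights computed by geom_bar()
layer_data(p_col)$y  # bar heights supplied to geom_col()
```

So the choice between the two geoms depends only on whether your data are already counted.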
5.7 Small Multiples
Often we want to compare graphs across multiple categories. One good strategy to do this is to make small multiples, which is essentially replicating the same graph for each group several times in different panels. This is probably best explained by doing an example.
Scatterplot small multiple
Here, we load in the penguins.csv
dataset. These data show the culmen (beak) and flipper sizes of various penguins:
penguins <- read_csv("data/penguins.csv")
head(penguins)
## # A tibble: 6 x 7
## species island culmen_length_mm culmen_depth_mm flipper_length_~ body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgers~ 39.1 18.7 181 3750
## 2 Adelie Torgers~ 39.5 17.4 186 3800
## 3 Adelie Torgers~ 40.3 18 195 3250
## 4 Adelie Torgers~ 36.7 19.3 193 3450
## 5 Adelie Torgers~ 39.3 20.6 190 3650
## 6 Adelie Torgers~ 38.9 17.8 181 3625
## # ... with 1 more variable: sex <chr>
The dataset contains three different species:
table(penguins$species)
##
## Adelie Chinstrap Gentoo
## 146 68 119
We might be interested in examining how body mass is associated with flipper length across species and across sex. Here, we have two different columns containing categorical variables. We have sex
and species
. If we wanted to show all of this on just one scatterplot, we could change the color of the points to represent species, and the shape of the points to represent sex. We change the shape by a column using shape=
inside of aes()
:
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, color = species, shape = sex)) +
geom_point() +
theme_classic()
The problem with this sort of graph is that it is far too cluttered. Using shape to distinguish categories isn’t that useful or helpful. You really have to squint at the graph to work out what is a circle and what is a triangle.
An alternative approach is to make small multiples. We create a separate scatterplot for each species. Here, we color our points by sex with color=sex
inside aes()
. We add to our code the line facet_wrap(~species)
to tell ggplot()
to make separate scatterplots for each species. Please note the ~
that comes before the column name that you wish to make separate plots for:
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, color = sex)) +
geom_point() +
theme_minimal() +
facet_wrap(~ species)
You may notice that all the scatterplots have the same range of values on the x-axis. Technically, this is the most appropriate approach as it enables you to make comparisons across groups more easily. However, if you want to fit the data on each scatterplot to cover the whole canvas, you can make the axes unfixed by adding scales="free"
to your facet_wrap()
command:
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, color = sex)) +
geom_point() +
theme_minimal() +
facet_wrap(~ species, scales = "free")
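The scales= argument does not have to be all or nothing. As a sketch, assuming a made-up data frame (toy) in place of the penguins data, scales = "free_x" frees only the x-axis while keeping a shared y-axis, and ncol= controls how the panels are laid out:

```r
library(ggplot2)

# Made-up measurements for three invented groups.
toy <- data.frame(
  species = rep(c("A", "B", "C"), each = 4),
  mass    = c(3000, 3200, 3400, 3600,
              3500, 3700, 3900, 4100,
              4500, 5000, 5500, 6000),
  flipper = c(180, 185, 188, 192,
              190, 194, 197, 201,
              210, 216, 222, 230)
)

p_free_x <- ggplot(toy, aes(x = mass, y = flipper)) +
  geom_point() +
  theme_minimal() +
  facet_wrap(~ species, scales = "free_x", ncol = 1)  # shared y-axis, one column of panels

# Type p_free_x at the console to draw the plot.
```

Similarly, scales = "free_y" frees only the y-axis. Keeping at least one axis fixed preserves some of the comparability across panels.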
Line graph small multiple
We can also make small multiples for line graphs. Let’s illustrate this using the lifeexp_all.csv
dataset.
le <- read_csv("data/lifeexp_all.csv")
head(le)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
In this dataset we have a column giving the life expectancy (lifeExp
) of various countries that are in the country
column. We also have a year
column that goes from 1952 to 2007 at five-year intervals. Consequently, we could plot a line graph of year on the x-axis and life expectancy on the y-axis. We could make separate lines for each country. As there are far too many countries to plot, it is not worth making each one a separate color. Because of this, rather than putting color=country
into aes()
to indicate to make separate lines for each country, we’ll put group=country
. This will make separate lines for each country, but make them all the same color. If we make them a light color and a bit transparent, it will look best:
ggplot(le, aes(x = year, y = lifeExp, group = country)) +
geom_line(color="cornflowerblue", alpha=0.2) +
theme_minimal()
This gives us a sense of the overall pattern of life expectancies from 1952 to 2007. The trend for most countries is generally upwards, though there are some countries that have big crashes.
We also have another categorical variable in our dataset. There is a column called continent
. We could replot this line graph, but separate the plots based on which continent the lines/countries belong to. We do that again using facet_wrap(~continent)
.
ggplot(le, aes(x = year, y = lifeExp, group = country)) +
geom_line(color="cornflowerblue", alpha=0.5) +
theme_minimal() +
facet_wrap(~continent)
Because there are fewer lines on each graph, we upped the alpha to 0.5 to make the lines a bit darker on this plot.
If you wish to make the lines belonging to each panel different colors from each other, you can add color=continent
to your aes()
. You have to remove the color from geom_line()
to make this work:
ggplot(le, aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line( alpha=0.5) +
theme_minimal() +
facet_wrap(~continent)+
xlab("Year") +
ylab("Life Expectancy")
5.8 Saving and Exporting ggplot2 graphs
How do we save the nice plots that we have made using ggplot2? There are some quite advanced ways of saving high resolution images. Here, we’ll just run through some quick and easy options.
First, you could just hit zoom
in RStudio to make your plot bigger. Resize the window to your preferred graph size and then take a screenshot. Paste your screenshot into a program such as paint and crop away. This is a very crude method - but it’s fast and reliable if you just want to have an image to insert into some other program.
A second option is after you have made your plot, you can hit the ‘export’ tab on the plot viewer. Choose either “Save as Image” or “Save as PDF” and then choose how and where you want to save the image.
A more flexible option is to use a function from ggplot2 called ggsave()
. The first step you should do is to assign your plot to an object name. In the code below, we are making a scatterplot that we save to the object plot1
:
plot1 <- ggplot(cheese, aes(x = chol, y = kcal)) +
geom_point(color='purple', size=2) +
theme_classic() +
xlab("Cholesterol") +
ylab("Calories in kcal") +
ggtitle("Cheese")
plot1
Next, run a line of code that will save your plot. You type ggsave()
. The first thing you put inside this is the location where you want your plot to be stored. You need to write a location on your computer. If you are using an Rproject such as with this course, you could put your plot in a folder called img
. Remember to type the file extension .png
or .pdf
after the name of your new plot. The second thing you need to write is the name of the graph object you wish to save. Here our graph is called plot1
.
ggsave("img/cheese_plot.png", plot1) # save as a png
ggsave("img/cheese_plot.pdf", plot1) # save as a pdf
You can also play around with the width and height of your saved image. You probably need to trial and error this a few times to get the proportions that you really like. Here we are making an image that is 10 inches wide and 8 inches high.
ggsave("img/cheese_plot2.png", plot1, width = 10, height = 8) #(in inches, though can be in cm)
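To work in centimeters instead of inches, ggsave() has a units= argument. The sketch below is a hedged example: it uses a small made-up data frame (toy) in place of the cheese data, and saves into a temporary folder so that it does not touch your project. It also creates the folder first, since the folder you save into generally needs to exist already:

```r
library(ggplot2)

# Made-up stand-in for the cheese data.
toy <- data.frame(chol = c(10, 25, 40, 55),
                  kcal = c(120, 210, 330, 400))

plot1 <- ggplot(toy, aes(x = chol, y = kcal)) +
  geom_point(color = "purple", size = 2) +
  theme_classic()

# Make sure the target folder exists before saving into it.
out_dir <- file.path(tempdir(), "img")
dir.create(out_dir, showWarnings = FALSE)

# Save a 20 cm x 16 cm PDF.
out_file <- file.path(out_dir, "cheese_plot_cm.pdf")
ggsave(out_file, plot1, width = 20, height = 16, units = "cm")

file.exists(out_file)
```

For image formats such as png, ggsave() also accepts a dpi= argument that controls the resolution of the saved file.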