Making our own plots in R.
Motivating scenarios: Motivating scenarios: you have a fresh new data set and want to check it out. How do you go about looking into it?
Learning goals: By the end of this chapter you should be able to:
Recall that as bio-statisticians, we bring data to bear on critical biological questions, and communicate these results to interested folks. A key component of this process is visualizing our data.
They say “a picture is worth a thousand words,” similarly a clear graph can communicate complex patterns in our data.
We generally think of two extremes of the goals of data visualization
The ggplot2
package in R
is well suited for both purposes. Today we focus on exploratory visualization in ggplot2 because
ggplot2
knowledge.Later in the term we will show how we can use ggplot2
to make high quality explanatory plots.
Whether developing an explanatory or exploratory plot, you should think hard about the biology you hope to convey before jumping into a plot. Ask yourself
The answers to these questions should guide our data visualization strategy, as this is a key step in our statistical analysis of a dataset. The best plots should evoke an immediate understanding of the (potentially complex) data. Put another way, a plot should highlight both the biological question and its answer.
Before jumping into making a plot in R, it is often useful to take this step back, think about your main biological question, and take a pencil and paper to sketch some ideas and potential outcomes. I do this to prepare my mind to interpret different results, and to ensure that I’m using R to answer my questions, rather than getting sucked in to so much Ring that I forget why I even started. With this in mind, we’re ready to get introduced to ggplot
ing!
My approach to figure-making in #ggplot ALWAYS begins with sketching out what I want the final product to look like. It feels a bit analog but helps me determine which #geom or #theme I need, what arrangement will look best, & what illustrations/images will spice it up. #rstats pic.twitter.com/GUjeEgqZxj
— Shasta E. Webb, PhD (@webbshasta) May 22, 2020
ggplot
As described in the video above ggplot
is built on a framework for building plots called the grammar of graphics. A major idea here is that plots are made up of data that we map onto aesthetic attributes.
Lets unpack this sentence, because there’s a lot there. Say we wanted to make a very simple plot e.g. observations for categorical data, or a simple histogram for a single continuous variable. Here we are mapping this variable onto a single aesthetic attribute – the x-axis.
ggplot
in one place, check out the ggplot2 book (Wickham 2016) and/or the socviz book (Healy 2018).
So first we need to think of the variables we hope to map onto an aesthetic. We will first explore this idea by building a scatterplot.
Two of the most familiar and commonly used aesthetics are x and y. When we have two continuous variables we usually map the explanatory variable onto the x axis and the response variable onto the y. We then tell ggplot
that we want to display the data as points with geom_point()
.
Before doing this in ggplot, let us revisit the plots we can make in GWalkR. Figure 2 shows a GWalkR plot with body mass on the x and flipper length on the y, with data as points (“scatter” from the pull down menu).
Figure 3 does the same thing in ggplot syntax. Here we map body mass and flipper length onto the x and y aesthetics, respectively, and show the data as points with geom_point()
.
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm))+
geom_point()
This mapping approach can get pretty powerful, as it can allow us to visualize many dimensions. For example, we can map a categorical variable to shape and/or color to show additional aspects of our data. In GWalkR we would this by dragging species to color (e.g. Figure 4).
In ggplot, we do this by mapping species onto the color aesthetic (e.g. Figure 5).
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, color = species))+
geom_point()
We can use the same aesthetic mapping scheme to map a categorical variable onto the x-axis, as I show in Figure 6).
However, these figures are are often difficult to interpret because many points on top of each other can all blur together (aka over-plotting). To avoid over-plotting, we can make the data somewhat transparent (controlled by the attribute alpha
). There are a few solutions to this.
A jitterplot: We can spread the data across the x-axis (by changing to geom_jitter()
, as in 7A). When making a jitterplot, use the width
argument to control the spread of points on the x-axis.
A boxplot: As discussed previously, boxplots can show the spread of the data. We can make them in ggplot with geom_boxplot()
(Figure 7B). While boxplots do show the spread of data, they do not show individual data points and provide only a superficial view into the shape of the data.
A histogram: Histograms clearly display the shape of our data. You can make them with geom_histogram()
as in Figure 7C). R
will make some default decisions about the number of bins to display in a histogram, but you can make your own decision by either specifying the binwidth (geom_histogram(binwidth = THE_SIZE_OF_THE_BINS
),or the number of bins (geom_histogram(bins = THE_NUMBER_OF_BINS
). Note that when making a histogram we map our categorical variable onto the fill aesthetic – as opposed to color – as fills the bars, rather than coloring the lines around them. However, when we have more the distributions of categorical variables overlap, histograms can be hard to interpret.
A density plot density plots (made with geom_density()
as in 7D) show a continuous distribution fit by our data. These both preserve the shape of the data, and allow us to overlay distributions for different categories (so long as we introduce some transparency into our density plots with the alpha
argument). However, always remember that density plots are not our actual data but rather a distribution made to fit it.
I note that of these solutions, only the boxplot could be easily made in GWalkR (we could change the “geom”, by clicking on the boxplot logo in Marktype
).
# Code to make the plots below (minus some bells and whistles)
# The actual code is here
# https://raw.githubusercontent.com/ybrandvain/code4biostats/main/categorical_explanatory.R
library(palmerpenguins)
library(ggplot2)
# A) Jitter plot
ggplot(data = penguins, aes(x = species, y = flipper_length_mm, color = species))+
geom_jitter(height = 0, width = .2, size = 3,alpha = .4)
# B) Boxplot
ggplot(data = penguins, aes(x = species, y = flipper_length_mm, fill = species))+
geom_boxplot()
# C) Histogram
ggplot(data = penguins, aes(x = flipper_length_mm, fill = species))+
geom_histogram()
# D) Density plot
ggplot(data = penguins, aes(x = flipper_length_mm, fill = species))+
geom_density(alpha = .6)
At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.
Edward Tufte (Tufte 1990)
Edward Tufte, a major figure in the field of data visualization - popularized the concept of “small multiples” – showing data with the same structure across various comparisons. He argued that such visualizations can help our eyes make powerful comparisons (See quote above). The idea of small multiples redacted Tufte – among the most famous such figures, The Horse in Motion (Figure 8), shows a horse as it runs.
Similarly the lunar phase can be well visulaized by using small multiples (Figure 9).
We can easily harness the power of small multiples in ggplot with the facet_wrap()
and facet_grid()
functions.
Let’s revisit our species differences in bill length in penguins and see how this differs by sex. Lets try two different ways to do this – first faceting by sex (Figure 10), and then by species (Figure 11). I show this both ways, because there is not a “right” way, and it is usually best to try multiple visualizations to see which best highlights the patterns in the data. In my view, faceting by sex better highlights the fact that most of the difference in flipper length is between species, but in some species, there is a modest difference between sexes.
ggplot(filter(penguins, !is.na(sex)), # lets remove individuals of unknow sex
aes(x = species, y = flipper_length_mm, color = species))+
geom_jitter(height = 0 , width = .3, show.legend = FALSE)+ # No need for a legend - its on the x
facet_wrap(~ sex)
ggplot(filter(penguins, !is.na(sex)), # lets remove individuals of unknow sex
aes(x = sex, y = flipper_length_mm, color = sex))+
geom_jitter(height = 0 , width = .3, show.legend = FALSE)+ # No need for a legend - its on the x
facet_wrap(~ species)
We can add even more information here! We can compare across years to show the flipper length seems pretty similar, but perhaps modestly increasing, each year and that this patterns is shared for sexes and species (Figure 12).
ggplot(filter(penguins, !is.na(sex)), # lets remove individuals of unknow sex
aes(x = factor(year), y = flipper_length_mm, color = factor(year)))+
geom_jitter(height = 0 , width = .1, show.legend = FALSE)+ # No need for a legend - its on the x
facet_grid(sex~species, labeller = "label_both")+
scale_color_grey(start=0.8, end=0.2) +
theme_light()
The are a few ways to save plots in R.
The quick, dirty, and unprofessional way is to simply take a screen-grab. Note that this is imperfect because sometime weird shadows etc show up, and because it not vectorized, so it can get blurry etc.
Clicking the export button in the plot pane (Figure 13) is another approach. This can work well, and it allows you to easily pck formats etc. After saving in this way, be sure to open up the saved image to ensure that it looks like how you want it to.
The ggsave()
function allows you to save a plot with a script. You can specify file type, dimensions etc in the arguments to this function.
Often during data exploration and data cleaning, we are interested in learning more about specific data points. Whether we are trying to understand an outlier, identify patterns, or verify the integrity of our data, having an interactive tool can be incredibly valuable. The plotly
package is an easy way to make interactive plots in R for data exploration.
These interactive plots allow us to hover over individual data points to reveal additional information, zoom into areas of interest, and filter or highlight subsets of the data on the fly. This can greatly enhance exploratory data analysis (EDA) and quality control (QC).
For example, when examining a scatter plot of two variables, we can hover over a point to immediately see its exact values, helping us understand how it contributes to the overall trend or why it might be an outlier. In plotly, we can include additional information as part of the hover text without altering the visual aspects of the plot itself. This allows our interactive plot to tell us more about our data points, including attributes that aren’t directly plotted.
Returning to our penguins example in Figure 12, I add five new aesthetics (a through e) to include information about the penguins’ island of origin, bill depth and length, and body mass. While these are not plotted, they will be shown in the hover text, providing additional insights into the data (Figure 14).
library(plotly)
p <- ggplot(filter(penguins, !is.na(sex)),
aes(x = factor(year), y = flipper_length_mm, color = factor(year),
a = island,
b = bill_length_mm,
c = bill_depth_mm,
d = body_mass_g,
e = sex)) +
geom_jitter(height = 0, width = .1) +
facet_grid(sex ~ species, labeller = "label_both") +
scale_color_grey(start = 0.8, end = 0.2) +
theme_light()+
theme(legend.position = "none")
ggplotly(p)
There is no need to memorize anything, check out this handy cheat sheet!
Throughout the term we will continue to make more plots, and will revisit the idea of making explanatory plots in ggplot2 near the term’s end.