Chapter 5: Intro to ggplot

Making our own plots in R.

Motivating scenarios: You have a fresh new data set and want to check it out. How do you go about looking into it?

Learning goals: By the end of this chapter you should be able to:

There is no external reading for this chapter, but watch the embedded videos and complete all embedded exercises.

A quick intro to data visualization.

Recall that as bio-statisticians, we bring data to bear on critical biological questions, and communicate these results to interested folks. A key component of this process is visualizing our data.

They say “a picture is worth a thousand words.” Similarly, a clear graph can communicate complex patterns in our data.

Exploratory and explanatory visualizations

We generally think of two extremes of the goals of data visualization

The ggplot2 package in R is well suited for both purposes. Today we focus on exploratory visualization in ggplot2 because

  1. They are the starting point of all statistical analyses.
  2. You can do them with less ggplot2 knowledge.
  3. They take less time to make than explanatory plots.

Later in the term we will show how we can use ggplot2 to make high quality explanatory plots.

Centering plots on biology

Whether developing an explanatory or exploratory plot, you should think hard about the biology you hope to convey before jumping into a plot. Ask yourself

The answers to these questions should guide our data visualization strategy, as this is a key step in our statistical analysis of a dataset. The best plots should evoke an immediate understanding of the (potentially complex) data. Put another way, a plot should highlight both the biological question and its answer.

Before jumping into making a plot in R, it is often useful to take this step back, think about your main biological question, and take a pencil and paper to sketch some ideas and potential outcomes. I do this to prepare my mind to interpret different results, and to ensure that I’m using R to answer my questions, rather than getting sucked into so much R-ing that I forget why I even started. With this in mind, we’re ready to get introduced to ggplotting!

The idea of ggplot

Figure 1: Watch this video about getting started with ggplot2 (It is 7 min and 17 sec long), from STAT 545.

As described in the video above, ggplot is built on a framework for building plots called the grammar of graphics. A major idea here is that plots are made up of data that we map onto aesthetic attributes.

Let’s unpack this sentence, because there’s a lot there. Say we wanted to make a very simple plot, e.g. counts of observations for a categorical variable, or a simple histogram for a single continuous variable. Here we are mapping this variable onto a single aesthetic attribute: the x-axis.
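As a minimal sketch of this one-aesthetic idea, assuming the penguins data from the palmerpenguins package (used later in this chapter), we could map a single variable onto x and let the geom do the counting. The object names p_bar and p_hist, and the choice of 30 bins, are illustrative only.

```r
library(palmerpenguins)
library(ggplot2)

# One categorical variable mapped onto x: geom_bar() counts the rows in each category
p_bar <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# One continuous variable mapped onto x: geom_histogram() counts the rows in each bin
p_hist <- ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(bins = 30)

p_bar
p_hist
```

In both cases we only supply x; the y-axis (the count) is computed by the geom.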

There’s way more nuance / awesomeness / power / ideas behind ggplot2 which we will reveal just in time over the term, but here we get going with the essential concepts. If you want to learn more about ggplot in one place, check out the ggplot2 book (Wickham 2016) and/or the socviz book (Healy 2018).

Mapping aesthetics to variables

So first we need to think of the variables we hope to map onto an aesthetic. We will first explore this idea by building a scatterplot.

Scatterplots

Two of the most familiar and commonly used aesthetics are x and y. When we have two continuous variables we usually map the explanatory variable onto the x axis and the response variable onto the y. We then tell ggplot that we want to display the data as points with geom_point().

Before doing this in ggplot, let us revisit the plots we can make in GWalkR. Figure 2 shows a GWalkR plot with body mass on the x and flipper length on the y, with data as points (“scatter” from the pull down menu).

Figure 2: A GWalkR plot with continuous variables on the x and y axis, and showing the data as points (note the black dot next to scatter).

Figure 3 does the same thing in ggplot syntax. Here we map body mass and flipper length onto the x and y aesthetics, respectively, and show the data as points with geom_point().

ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm))+
  geom_point()

Figure 3: A ggplot mapping continuous variables onto the x and y axis, and showing the data as points.

This mapping approach can get pretty powerful, as it allows us to visualize many dimensions at once. For example, we can map a categorical variable to shape and/or color to show additional aspects of our data. In GWalkR we would do this by dragging species to color (e.g. Figure 4).

Figure 4: A GWalkR plot with continuous variables on the x and y axes, and a categorical variable (species) shown by color, with data as points.

In ggplot, we do this by mapping species onto the color aesthetic (e.g. Figure 5).

ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, color = species))+
  geom_point()

Figure 5: A ggplot mapping continuous variables onto the x and y axes, and a categorical variable (species) onto the color, and showing the data as points.
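The same variable can be mapped onto more than one aesthetic. As a small sketch (not one of the chapter's figures, and with p_shape as an illustrative object name), here species is mapped onto both color and shape, which can help keep groups distinguishable when color alone is hard to read.

```r
library(palmerpenguins)
library(ggplot2)

# Map species onto both color and shape
p_shape <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm,
                                color = species, shape = species)) +
  geom_point()

p_shape
```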

A categorical explanatory variable

We can use the same aesthetic mapping scheme to map a categorical variable onto the x-axis, as I show in Figure 6.

Figure 6: A ggplot mapping a categorical variable (sex) onto the x-axis, and a continuous variable (flipper length) onto the y axis, and showing the data as points.
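The code behind Figure 6 is not shown here. A minimal sketch of a plot of this kind, assuming the penguins data and using the illustrative object name p_sex, might look like the following. The actual figure may differ, for example in how penguins of unknown sex are handled.

```r
library(palmerpenguins)
library(ggplot2)

# Categorical variable (sex) on x, continuous variable (flipper length) on y
p_sex <- ggplot(penguins, aes(x = sex, y = flipper_length_mm)) +
  geom_point()

p_sex
```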

However, these figures are often difficult to interpret because many points on top of each other can all blur together (aka over-plotting). To reduce over-plotting, we can make the data somewhat transparent (controlled by the attribute alpha). There are a few solutions to this.

I note that of these solutions, only the boxplot could be easily made in GWalkR (we could change the “geom”, by clicking on the boxplot logo in Marktype).

# Code to make the plots below (minus some bells and whistles)
# The actual code is here
# https://raw.githubusercontent.com/ybrandvain/code4biostats/main/categorical_explanatory.R
library(palmerpenguins)
library(ggplot2)

# A) Jitter plot
ggplot(data = penguins, aes(x = species, y = flipper_length_mm, color = species)) +
  geom_jitter(height = 0, width = .2, size = 3, alpha = .4)

# B) Boxplot
ggplot(data = penguins, aes(x = species, y = flipper_length_mm, fill = species))+
  geom_boxplot()

# C) Histogram
ggplot(data = penguins, aes(x = flipper_length_mm,  fill = species))+
  geom_histogram()

# D) Density plot
ggplot(data = penguins, aes(x = flipper_length_mm, fill = species))+
  geom_density(alpha = .6)

Figure 7: Four ways to show data with a categorical explanatory variable. A) A jitter plot. B) A boxplot. C) A histogram. D) A density plot.

Adding small multiples

At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.

Edward Tufte (Tufte 1990)

Edward Tufte, a major figure in the field of data visualization, popularized the concept of “small multiples”: showing data with the same structure across various comparisons. He argued that such visualizations can help our eyes make powerful comparisons (see quote above). The idea of small multiples predates Tufte. Among the most famous such figures, The Horse in Motion (Figure 8) shows a horse as it runs.

Figure 8: [The Horse in Motion](https://en.wikipedia.org/wiki/The_Horse_in_Motion), the first example of chronophotography, is a fantastic example of using small multiples.

Similarly, the lunar phase can be well visualized by using small multiples (Figure 9).

Figure 9: Using small multiples to show the phases of the moon over a month. From [this link](https://medium.com/@blakewilliford/information-design-9-29-ee995efb584e)

We can easily harness the power of small multiples in ggplot with the facet_wrap() and facet_grid() functions.

Let’s revisit our species differences in flipper length in penguins and see how this differs by sex. Let’s try two different ways to do this: first faceting by sex (Figure 10), and then by species (Figure 11). I show this both ways because there is not a “right” way, and it is usually best to try multiple visualizations to see which best highlights the patterns in the data. In my view, faceting by sex better highlights the fact that most of the difference in flipper length is between species, but in some species, there is a modest difference between sexes.

library(dplyr) # provides filter()

ggplot(filter(penguins, !is.na(sex)), # remove individuals of unknown sex
       aes(x = species, y = flipper_length_mm, color = species)) +
  geom_jitter(height = 0, width = .3, show.legend = FALSE) + # no need for a legend: species is on the x-axis
  facet_wrap(~ sex)

Figure 10: Flipper length by sex and species (faceting by sex).

ggplot(filter(penguins, !is.na(sex)), # remove individuals of unknown sex
       aes(x = sex, y = flipper_length_mm, color = sex)) +
  geom_jitter(height = 0, width = .3, show.legend = FALSE) + # no need for a legend: sex is on the x-axis
  facet_wrap(~ species)

Figure 11: Flipper length by sex and species (faceting by species).

We can add even more information here! We can compare across years to show that flipper length seems pretty similar, but perhaps modestly increasing, each year, and that this pattern is shared across sexes and species (Figure 12).

ggplot(filter(penguins, !is.na(sex)), # remove individuals of unknown sex
       aes(x = factor(year), y = flipper_length_mm, color = factor(year))) +
  geom_jitter(height = 0, width = .1, show.legend = FALSE) + # no need for a legend: year is on the x-axis
  facet_grid(sex ~ species, labeller = "label_both") +
  scale_color_grey(start = 0.8, end = 0.2) +
  theme_light()

Figure 12: Flipper length by sex, species, and year (faceting by sex and species, with year on the x-axis).

Saving plots

There are a few ways to save plots in R.

Figure 13: Click on the export tab in the plots pane to save a plot made in R.
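Plots can also be saved from code, which keeps the saving step reproducible. Below is a minimal sketch using ggplot2's ggsave(). The object name p_flipper, the temporary file path, and the dimensions are illustrative choices, not something prescribed in this chapter.

```r
library(palmerpenguins)
library(ggplot2)

# Store the plot in an object so we can pass it to ggsave()
p_flipper <- ggplot(penguins, aes(x = species, y = flipper_length_mm, color = species)) +
  geom_jitter(height = 0, width = .2, alpha = .4)

# Write the plot to a PNG file; the file type is taken from the file extension
out_file <- file.path(tempdir(), "flipper_by_species.png")
ggsave(filename = out_file, plot = p_flipper,
       width = 6, height = 4, units = "in", dpi = 300)
```

In practice you would likely replace the temporary path with a file name inside your project folder.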

Interactive plots in plotly.

Often during data exploration and data cleaning, we are interested in learning more about specific data points. Whether we are trying to understand an outlier, identify patterns, or verify the integrity of our data, having an interactive tool can be incredibly valuable. The plotly package is an easy way to make interactive plots in R for data exploration.

These interactive plots allow us to hover over individual data points to reveal additional information, zoom into areas of interest, and filter or highlight subsets of the data on the fly. This can greatly enhance exploratory data analysis (EDA) and quality control (QC).

For example, when examining a scatter plot of two variables, we can hover over a point to immediately see its exact values, helping us understand how it contributes to the overall trend or why it might be an outlier. In plotly, we can include additional information as part of the hover text without altering the visual aspects of the plot itself. This allows our interactive plot to tell us more about our data points, including attributes that aren’t directly plotted.

Returning to our penguins example in Figure 12, I add five new aesthetics (a through e) to include information about the penguins’ island of origin, bill length and depth, body mass, and sex. While these are not plotted, they will be shown in the hover text, providing additional insights into the data (Figure 14).

library(plotly)
p <- ggplot(filter(penguins, !is.na(sex)), 
            aes(x = factor(year), y = flipper_length_mm, color = factor(year),
                a = island,
                b = bill_length_mm,
                c = bill_depth_mm,
                d = body_mass_g,
                e = sex)) +
    geom_jitter(height = 0, width = .1) +
    facet_grid(sex ~ species, labeller = "label_both") +
    scale_color_grey(start = 0.8, end = 0.2) +
    theme_light() +
    theme(legend.position = "none")

ggplotly(p)

Figure 14: Plotly enables interactivity that facilitates both quality control and exploratory data analysis.
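A variation on the same idea, sketched here under the assumption that we want to control exactly what the hover text says: we can build the label ourselves, map it onto a text aesthetic, and ask ggplotly() to show only that label through its tooltip argument. The column name hover_label and the object names are hypothetical.

```r
library(palmerpenguins)
library(ggplot2)
library(plotly)

# Keep penguins of known sex and build one hover label per penguin
# (hover_label is an illustrative column name)
penguins_known_sex <- penguins[!is.na(penguins$sex), ]
penguins_known_sex$hover_label <- paste0(
  "Island: ", penguins_known_sex$island,
  "<br>Body mass (g): ", penguins_known_sex$body_mass_g
)

p_hover <- ggplot(penguins_known_sex,
                  aes(x = factor(year), y = flipper_length_mm, text = hover_label)) +
  geom_jitter(height = 0, width = .1) +
  facet_grid(sex ~ species, labeller = "label_both")

# tooltip = "text" shows only the label we built, rather than every mapped aesthetic
p_interactive <- ggplotly(p_hover, tooltip = "text")
p_interactive
```

This trades the convenience of showing everything for a hover box that is easier to read.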

Quiz.

Figure 15: The accompanying quiz link

ggplot2: cheat sheet

There is no need to memorize anything. Check out this handy cheat sheet!

And so much more

Throughout the term we will continue to make more plots, and will revisit the idea of making explanatory plots in ggplot2 near the term’s end.

References

Healy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton University Press.
Tufte, Edward R. 1990. Envisioning Information. Cheshire, Conn.: Graphics Press.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.