# Chapter 6 Visualizing data in R – An intro to ggplot

Motivating scenarios: Motivating scenarios: you have a fresh new data set and want to check it out. How do you go about looking into it?

**Learning goals: By the end of this chapter you should be able to:**

- Build a simple ggplot.

- Explain the idea of mapping data onto aesthetics, and the use of different geoms.

- Match common plots to common data type.

- Use geoms in ggplot to generate the common plots (above).

`mpg`

data.
## 6.1 A quick intro to data visualization.

Recall that as bio-statisticians, we bring data to bear on critical biological questions, and communicate these results to interested folks. A key component of this process is visualizing our data.

### 6.1.1 Exploratory and explanatory visualizations

We generally think of two extremes of the goals of data visualization

- In
**exploratory visualizations**we aim to identify any interesting patterns in the data, we also conduct quality control to see if there are patterns indicating mistakes or biases in our data, and to think about appropriate transformations of data. On the whole, our goal in exploratory data analysis is to understand the stories in the data.

- In
**explanatory visualizations**we aim to communicate our results to a broader audience. Here are goals are communication and persuasion. When developing explanatory plots we consider our audience (scientists? consumers? experts?) and how we are communicating (talk? website? paper?).

The `ggplot2`

package in `R`

is well suited for both purposes. **Today we focus on exploratory visualization in ggplot2 because**

- They are the starting point of all statistical analyses.

- You can do them with less
`ggplot2`

knowledge.

- They take less time to make than explanatory plots.

Later in the term we will show how we can use `ggplot2`

to make high quality explanatory plots.

### 6.1.2 Centering plots on biology

Whether developing an exploratory or exploratory plot, you should think hard about the biology you hope to convey before jumping into a plot. Ask yourself

- What do you hope to learn from this plot?

- Which is the response variable (we usually place that on the y-axis)?

- Are data numeric or categorical?
- If they are categorical are they ordinal, and if so what order should they be in?

The answers to these questions should guide our data visualization strategy, as this is a key step in our statistical analysis of a dataset. The best plots should evoke an immediate understanding of the (potentially complex) data. Put another way, a plot should highlight both the biological question and its answer.

**Before jumping into making a plot in R, it is often useful to take this step back, think about your main biological question, and take a pencil and paper to sketch some ideas and potential outcomes.** I do this to prepare my mind to interpret different results, and to ensure that I’m using R to answer my questions, rather than getting sucked in to so much Ring that I forget why I even started. With this in mind, we’re ready to get introduced to `ggplot`

ing!

My approach to figure-making in #ggplot ALWAYS begins with sketching out what I want the final product to look like. It feels a bit analog but helps me determine which #geom or #theme I need, what arrangement will look best, & what illustrations/images will spice it up. #rstats pic.twitter.com/GUjeEgqZxj

— Shasta E. Webb (@webbshasta) May 22, 2020

### Remembering out set up from last chapter

```
<- msleep %>%
msleep mutate(log10_brainwt = log10(brainwt),
log10_bodywt = log10(bodywt))
<- ggplot(data = msleep, aes(x = log10_brainwt)) # save plot
msleep_plot1
<- msleep_plot1 +
msleep_histogram geom_histogram(bins =10, color = "white")
```

## 6.2 Common types of plots

As we saw in the section, Centering plots on biology, we want our biological questions and the structure of the data to guide our plotting choices. **So, before we get started on making plots, we should think about our data.**

- What are the variable names?

- What are the types of variables?

- What are our motivating questions and how do the data map onto these questions?

- Etc…

Using the `msleep`

data set below, we briefly work through a rough guide on how the structure of our data can translate into a plot style, and how we translate that into a geom in ggplot. So the first step you should look at the data – either with the `view()`

function, or a quick `glimpse()`

and reflect on your questions before plotting. This also helps us remember the name and data type of each variable.

`glimpse(msleep)`

```
## Rows: 83
## Columns: 13
## $ name [3m[38;5;246m<chr>[39m[23m "Cheetah", "Owl monkey", "Mountain beaver", "Greater short-tailed shrew", …
## $ genus [3m[38;5;246m<chr>[39m[23m "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bradypus", "Callorhi…
## $ vore [3m[38;5;246m<chr>[39m[23m "carni", "omni", "herbi", "omni", "herbi", "herbi", "carni", NA, "carni", …
## $ order [3m[38;5;246m<chr>[39m[23m "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Artiodactyla", "Pilo…
## $ conservation [3m[38;5;246m<chr>[39m[23m "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "domesticated", "lc", …
## $ sleep_total [3m[38;5;246m<dbl>[39m[23m 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5.3, 9.4, 10.0, 12…
## $ sleep_rem [3m[38;5;246m<dbl>[39m[23m NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, 0.7, 1.5, 2.2, 2.…
## $ sleep_cycle [3m[38;5;246m<dbl>[39m[23m NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, NA, 0.3333333, NA,…
## $ awake [3m[38;5;246m<dbl>[39m[23m 11.90, 7.00, 9.60, 9.10, 20.00, 9.60, 15.30, 17.00, 13.90, 21.00, 18.70, 1…
## $ brainwt [3m[38;5;246m<dbl>[39m[23m NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0.09820, 0.11500, …
## $ bodywt [3m[38;5;246m<dbl>[39m[23m 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.045, 14.000, 14.800…
## $ log10_brainwt [3m[38;5;246m<dbl>[39m[23m NA, -1.8096683, NA, -3.5376020, -0.3736596, NA, NA, NA, -1.1549020, -1.007…
## $ log10_bodywt [3m[38;5;246m<dbl>[39m[23m 1.6989700, -0.3187588, 0.1303338, -1.7212464, 2.7781513, 0.5854607, 1.3115…
```

Now we’re nearly ready to get started, but first, some caveats

These are vey preliminary exploratory plots – and you may need more advanced plotting R talents to make plots that better help you see patterns. We will cover these in Chapters YB ADD, where we focus on explanatory plots.

There are not always cookie cutter solutions, with more complex data you may need more complex visualizations.

That said, the simple visualization and R tricks we learn below are the essential building blocks of most data presentation. So, let’s get started!

### 6.2.1 One variable

With one variable, we use plots to visualize the relative frequency (on the y-axis) of the values it takes (on the x-axis).

gg-plotting one variable We map our one variable of interest onto x `aes(x = <x_variable>)`

, where we replace `<x_variable>`

with our x-variable. The mapping of frequency onto the y happens automatically.

#### One categorical variable

Say we wanted to know how many carnivores, herbivores, insectivores, and omnivores in the `msleep`

data set. From the output of the `glimpse()`

function above, we know that vore is a categorical variable, so we want a simple bar plot, which we make with `geom_bar()`

.

```
ggplot(data = msleep, aes(x = vore)) +
geom_bar()
```

We can also pipe data into `ggplot`

argument after doing stuff to the data. For example, the code below remove `NA`

values from our plot.

```
%>%
msleep filter(!is.na(vore)) %>%
ggplot(aes(x = vore)) +
geom_bar()
```

If the same data where presented as one categorical variable for vore (with each vore once) and another, `n`

, for counts.

`count(msleep, vore)`

```
## [38;5;246m# A tibble: 5 x 2[39m
## vore n
## [3m[38;5;246m<chr>[39m[23m [3m[38;5;246m<int>[39m[23m
## [38;5;250m1[39m carni 19
## [38;5;250m2[39m herbi 32
## [38;5;250m3[39m insecti 5
## [38;5;250m4[39m omni 20
## [38;5;250m5[39m [31mNA[39m 7
```

We could recreate figure 6.1 with `geom_col()`

. again mapping vore to the x-aesthetic, and now mapping count to the y aesthetic, by as follows:

```
count(msleep, vore) %>%
ggplot(aes(x = vore, y = n))+
geom_col()
```

#### One continuous variable

We are often interested to know how variable our data is, and to think about the shape of this variability. Revisiting our data on mammal sleep patterns, we might be interested to evaluate the variability in how long mammals sleep.

- Do all species sleep roughly the same amount?

- Is the data bimodal (with two humps)?

- Do some species sleep for an extraordinarily long or short amount of time?

We can look into this with a histogram or a density plot.

**One continuous variable: A histogram**

We use the histogram geom, `geom_histogram()`

, to make a histogram in R.

```
ggplot(msleep, aes(x = log10_brainwt))+
geom_histogram(bins = 10, color = "white") # Bins tells R we want 10 bins, and color = white tells R we want white lines between our bins
```

`## Warning: Removed 27 rows containing non-finite values (stat_bin).`

In a **histogram**, *each value on the x represents some interval of values of our categorical variable* (in this case, we had 10 bins, but we could have, for example, looked at sleeep in one hour with `binwidth = 1`

), while *y-values show how many observations correspond to an interval on the x*.

See this excellent write up if you want to learn more about histograms.

When making a histogram it is worth exploring numerous binwidths to ensure you’re not fooling yourself

**One continuous variable: A density plot**

We use the density geom, `geom_density()`

, to make a histogram in R.

```
ggplot(msleep, aes(x = log10_brainwt))+
geom_density(fill = "blue")
```

Sometimes we prefer a smooth density plot to a histogram, as this can allow us to not get too distracted by a few bumps (on the other hand, we can also miss important variability, so be careful). We again map total_sleep onto the x aesthetic, but now use `geom_density()`

.

### 6.2.2 Two variables

With two variables, we want to highlight the association between them. In the plots below, we show that how this is presented can influence our biological interpretation and take-home messages.

#### Two categorical variables

With two categorical variables, we usually add color to a barplot to identify the second group. We can choose to

- Stack bars (stacked barplot, the default behavior of
`geom_bar()`

) [Fig. 6.3],

- Have them next to one another (grouped barplot, add
`position = position_dodge(preserve = "single")`

to`geom_bar()`

) [Fig. 6.4], or

- Standardize them by proportion (add
`position = "fill"`

to`geom_bar()`

) [Fig. 6.5].

First, we process our data, making use of the tricks we learned in
Handling data in `R`

. To do so, we `filter()`

for not `NA`

diets, `add_count()`

to see how many species we have in each order, and `filter()`

for orders with five or more species with diet data.

```
# Data processing
<- msleep %>%
msleep_data_ordervore filter(!is.na(vore)) %>% # Only cases with data for diet
add_count(order) %>% # Find counts for each order
filter(n >= 5) # Lets only hold on to orders with 5 or more species with data
```

**Two categorical variables: A stacked bar plot**

```
ggplot(data = msleep_data_ordervore, aes(x = order, fill= vore))+
geom_bar()
```

Stacked barplots are best suited for cases when we’re primarily interested in total counts (e.g. how many species do we have data for in each order), and less interested in comparing the categories going into these counts. Rarely is this the best choice, so don’t expect to make too many stacked barplots.

**Two categorical variables: A grouped bar plot**

```
ggplot(data = msleep_data_ordervore, aes(x = order, fill= vore))+
geom_bar(position = position_dodge(preserve = "single"))
```

Grouped barplots are best suited for cases when we’re primarily interested in comparing the categories going into these counts. This is often the best choice, as we get to see counts. However the total number in each group is harder to see in a grouped than a stacked barplot (e.g. it’s easy to see that we have the same number of primates and carnivores in Fig. 6.3, while this is harder to see in Fig. 6.4).

**Two categorical variables: A filled bar plot**

```
ggplot(data = msleep_data_ordervore, aes(x = order, fill= vore))+
geom_bar(position = "fill")
```

Filled barplots are much stacked barplots standardized to the same height. In other words, they are like stacked bar plots without their greatest strength. This is rarely a good idea, except for cases with only two or three options for each of numerous categories.

#### 6.2.2.1 One categorical and one continuous variable.

**One categorical and one continuous variable: Multiple histograms**

A straightforward way to show the continuous values for different categories is to make a separate histogram for each numerous distributions is to make separate histograms for each category using the `geom_histogram()`

and `facet_wrap()`

functions in `ggplot`

.

```
<- ggplot(msleep_data_ordervore, aes(x= log10_bodywt))+
msleep_data_ordervore_hist geom_histogram(bins = 10)
+
msleep_data_ordervore_hist facet_wrap(~order, ncol = 1)
```

When doing this, be sure to aid visual comparisons simple by ensuring there’s only one column. Note how Figure 6.6 makes it much easier to compare distributions than does Figure 6.7.

```
+
msleep_data_ordervore_hist facet_wrap(~order, nrow = 1)
```

**One categorical and one continuous variable: Density plots**

```
ggplot(msleep_data_ordervore, aes(x= bodywt, fill = order))+
geom_density(alpha = .3)+
scale_x_continuous(trans = "log10")
```

While many histograms can be nice, they can also take up a lot of space. Sometime we can more succinctly show distributions for each group with numerous density plots (`geom_density()`

). While this can be succinct, it can also get too crammed, so have a look and see which display is best for your data and question.

**One categorical and one continuous variable: Boxplots, jitterplots etc..**

Histograms and density plots communicate the shapes of distributions, but we often hope to compare means and get a sense of variability.

**Boxplots**(Figure 6.9A) summarize distributions by showing all quartiles – often showing outliers with points. e.g.`ggplot(aes(x = order, y = bodywt)) +`

`geom_boxplot()`

.**Jitterplots**(Figure 6.9B) show all data points, spreading them out over the x-axis. e.g.`ggplot(aes(x = order, y = bodywt)) +`

`geom_jitter()`

.

We can combine both to get the best of both worlds (Figure 6.9C). e.g.

`ggplot(aes(x = order, y = bodywt)) + geom_boxplot() + geom_jitter()`

.

#### 6.2.2.2 Two continuous variables

```
ggplot(msleep_data_ordervore, aes(x = log10_bodywt, y = log10_brainwt))+
geom_point()
```

With two continuous variables, we want a graph that visually display the association between them. A scatterplot displays the explanatory variable n the x-axis, and the response variable on the y-axis. The scatterplot in figure 6.10, shows a clear increase in brain size with body size across mammal species when both are on \(log_{10}\) scales.

### 6.2.3 More dimensions

```
ggplot(msleep_data_ordervore,
aes(x = log10_bodywt, y = log10_brainwt, color = vore, shape = order))+
geom_point()
```

What if we wanted to see even more? Like let’s say we wanted to know if we found a similar relationship between brain weight and body weight across orders and/or if this relationship was mediated by diet. We can pack more info into these plots.

⚠️ Beware, sometimes shapes are hard to differentiate.⚠️ Facetting might make these patterns stand out.

```
ggplot(msleep_data_ordervore, aes(x = log10_bodywt, y = log10_brainwt, color = vore))+
geom_point()+
facet_wrap(~order, nrow = 1)
```

### 6.2.4 Interactive plots with the plotly package

Often when I get a fresh data set I want to know a bit more about the data points (to e.g. identify outliers or make sense of things). The plotly package is super useful for this, as it makes interactive graphs that we can explore.

```
# install.packages("plotly") first install plotly, if it's not installed yet
library(plotly) # now tell R you want to use plotly
# Click on the plot below to explore the data!
<- ggplot(msleep_data_ordervore,
big_plot aes(x = log10_bodywt, y = log10_brainwt,
color = vore, shape = order, label = name))+
geom_point()
ggplotly(big_plot)
```

#### Decoration vs information

```
ggplot(msleep_data_ordervore, aes(x = log10_bodywt, y = log10_brainwt))+
geom_point(color = "firebrick", size = 3, alpha = .5)
```

We have used the `aes()`

argument to provide information. For example, in Figure 5.15 we used color to show a diet by typing aes(…, color = vore). But what if we just want a fun color for data points. We can do this by specifying color outside of the aes argument. Same goes for other attributes, like size etc, or transparency (alpha)…

## 6.3 ggplot Assignment

Read the chapter

Watch the video about getting started with ggplot

Complete RStudio’s primer on data visualization basics.

Complete the glimpse intro (4.3.1) and the quiz.

Make three plots from the `mpg`

data and describe the patterns they highlight.

Fill out the quiz on canvas, which is very simlar to the one below.

### 6.3.1 ggplot2 quiz

## 6.4 ggplot2 review / reference

### 6.4.1 ggplot2: cheat sheet

There is no need to memorize anything, check out this handy cheat sheet!

#### 6.4.1.1 ggplot2: common functions, aesthetics, and geoms

##### The ggplot() function

- Takes arguments
`data =`

and`mapping =`

.

- We usually leave these implied and type e.g.
`ggplot(my.data, aes(...)`

) rather than`ggplot(data = my.data, mapping = aes(...)`

).

- We can pipe data into the
`ggplot()`

function, so my.data %>% ggplot(aes(…)) does the same thing as ggplot(my.data, aes(…)).

##### Arguments for aes() function

The `aes()`

function takes many potential arguments each of which specifies the aesthetic we are mapping onto a variable:

###### x, y, and label:

x: What is shown on the x-axis.

y: What is shown on the y-axis.

label: What is show as text in plot (when using geom_text())

##### Commonly used geoms

See Section 3.1 of the ggplot2 book for more (Grolemund and Wickham 2018).

`geom_histogram()`

: Makes a histogram.

`geom_density()`

: Makes a density plot.

`geom_point()`

: Makes points - ideal for a scatterplot.

`geom_jitter()`

: Maks jittered points - ideal for showing data when x is catgorical or discrete.

`geom_col()`

: or`geom_bar()`

: Makes a barplot from count data`geom_col()`

, or from all observations`geom_bar()`

.

`geom_line()`

: Connect observations with a line.

##### Faceting

Faceting allows us to use the concept of small multiples (Tufte 1983) to highlight patterns.

For one facetted variable: `facet_wrap(~ <var>, nocl = )`

`facet_grid(<var1>~ <var2>)`

, where one is shown by rows, and is shown by columns.
### References

*The Visual Display of Quantitative Information*. pub-gp:adr: pub-gp.

*The Analysis of Biological Data*. Third Edition.