3 Visualizing Your Data

3.1 What is a Visualization?

A data visualization is a method of representing data in a graphical format, useful both in communicating results of analyses and in exploring datasets to determine what analyses might be appropriate. At the end of the last chapter, we used R’s built-in plot() function to make this graphic:

plot(iris$Sepal.Length, iris$Sepal.Width)

This is our first data visualization in R! Unfortunately, it has some obvious challenges - the axis labels are unhelpful, there’s no context as to what the graph is, and to be honest it’s a little bit ugly.

If we wanted to, we could spend some time cleaning up this graphic and making it better:

plot(iris$Sepal.Length,
     iris$Sepal.Width,
     xlab = "Sepal length (cm)",
     ylab = "Sepal width (cm)",
     pch = 19,
     las = 1,
     col = "deepskyblue4")

But doing anything much more than this gets very complicated very quickly. For this reason, I almost never use the base graphing functions in R - they’re just so complicated to use!

3.2 The Tidyverse Package

Thankfully, R has a ton of add-on software - called packages - which makes graphing (and many other tasks) with R significantly easier. Let’s install some of the most common ones now:

install.packages("tidyverse")
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Note the quotes around “tidyverse” when you go to install it, but not when it’s inside of library(). The reason for this is a little complicated - basically, you don’t use quotes for things that are inside of R’s memory, like data, functions, and packages. You use quotes for everything else.

If you get an error saying “no package named tidyverse”, try reinstalling the package. It might take a few minutes to load.

What we just did was install a package called the tidyverse (with install.packages), and load it using library. Most common problems in R have already been solved by someone else, and most of those people have made their work publicly available for others to use in the form of a package. Packages only have to be installed once to be used - but you’ll have to call them using library() each time you restart R.

The tidyverse is a pretty unique example of a package - it’s actually a collection of packages, eight of which are loaded by library(tidyverse) (they’re listed in the startup message above), and most of which are essential to using R like a modern professional. The most important one for us right now is called ggplot2. Don’t worry about having to load it - library(tidyverse) automatically loads this package for you.

3.3 ggplot2

ggplot is an attempt to extend R’s basic graphics abilities to make publication-quality graphics faster and easier than ever before. In fact, we can make a version of our scatterplot above, just by typing:

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point()

Remember, R is case sensitive!

There are five important steps that went into making that graph:

  • First, the ggplot() call tells R that we want to create a ggplot object
  • Second, the data = iris tells ggplot that everything we do should use the iris dataset
  • Third, the aes() specifies the aesthetics of the graph - what goes on the X and Y axes, but also any other data we want represented in our plot
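  • Fourth, the + lets us add additional steps to our plot. Note that the + must always be at the end of a line - if you put it at the start of the next line instead, R treats the line before it as a finished command and then throws an error when it reaches the stray +. Relatedly, if you see a + in the console instead of a > after trying to plot something, R thinks your command is unfinished (for instance, because of a trailing + with nothing after it, or a missing closing parenthesis) - press your escape key to exit the command.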
  • Finally, the geom tells ggplot what sort of graph we want. A geom is just the type of plot (or, well, the geometric object which represents data) - so geom_boxplot() generates a boxplot, while geom_col() makes a column chart. geom_point() generates a scatterplot, but there are plenty of other options to choose from!
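
To make those steps concrete, here is a small sketch that builds the same plot one piece at a time and stores each stage in a variable. The names p_base and p_points are made up for illustration; only ggplot2 needs to be loaded.

```r
library(ggplot2)

# Steps 1-3: create the ggplot object, tell it which data to use,
# and map columns to the x and y axes. No data is drawn yet,
# because no geom has been added.
p_base <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# Steps 4-5: the + (at the END of the line) adds a geom as a new layer.
p_points <- p_base +
  geom_point()

# Typing the object's name draws it.
p_points

# Each geom is stored as one layer on the plot object.
length(p_base$layers)    # no layers yet
length(p_points$layers)  # one layer: the points
```

Storing a plot in a variable like this is optional, but it makes it easy to try different geoms on top of the same starting point.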

3.3.1 Functions in ggplot

The ggplot() and geom_point() calls are known as functions - a type of R object that, when given certain parameters, gives a certain output. Those parameters - in this plot, our data =, x =, and y = calls - are known as arguments.

Each of these arguments can have different values, if we want to change our graph. For instance, if we wanted to color and add a trendline for each species of iris, we could do the following:

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

geom_smooth() adds a trendline to your graphs, with a shadow representing the 95% confidence interval around it. While some people refer to this as a line graph, it’s a separate thing entirely - a line graph connects the points, like this:

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point() + 
  geom_line()

For now, we’re going to stick with our pretty smoothed trendline.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Notice how our points here are a little hard to see, as they’re drawn under the line and shading? That’s because we call geom_smooth() after geom_point(), and ggplot adds things to the graph in the order we call them. If we wanted to make the points slightly easier to see, we could just flip the order we call the functions:
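
In other words, geom_smooth() gets called first and geom_point() second. A minimal sketch of that flipped plot (p_flipped is just a name chosen here so the plot can be inspected afterwards):

```r
library(ggplot2)

# Same plot as before, but the smoothed line and its shading are added first,
# so the points are drawn on top of them.
p_flipped <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_smooth() +
  geom_point()

p_flipped
```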

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Our graph makes a lot more sense now - sepal length and width seem to be correlated, but each species is different.

3.3.2 Changing Aesthetics

If we really wanted to, we could make other aesthetics also change with Species:

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_smooth(aes(linetype = Species)) + 
  geom_point(aes(size = Species, shape = Species)) 
## Warning: Using size for a discrete variable is not advised.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

But that’s pretty ugly. We’ll get into graph best practices a little bit further into the unit - but generally speaking, a graph should contain exactly as much as it takes to get your point across, and no more. One aesthetic per variable is usually enough.

In an important exception to that rule, it’s generally well advised to use different shapes and colors at the same time. Colorblind viewers may not be able to discern the different colors you’re using - so varying the shape of your points or type of your lines helps make your graphics more accessible to the reader.

If you want, you can specify shapes using scale_shape functions, such as scale_shape_manual(). There are 26 shapes available for use in ggplot, each of which is named after a number from 0 through 25 - the number to the left of the shape in the figure below:
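
A chart of the available shapes can also be drawn in R itself. What follows is only a sketch: the shape_chart_data data frame and its grid layout are made up for illustration, and the numbers 0 through 25 are the shape codes R accepts.

```r
library(ggplot2)

# One row per shape code, laid out on a simple 7-column grid.
shape_chart_data <- data.frame(
  shape = 0:25,
  x = 0:25 %% 7,
  y = -(0:25 %/% 7)
)

ggplot(shape_chart_data, aes(x, y)) +
  # scale_shape_identity() tells ggplot to use the numbers as shape codes directly.
  geom_point(aes(shape = shape), size = 5, fill = "red") +
  # Print each shape's number just to its left.
  geom_text(aes(label = shape), nudge_x = -0.3) +
  scale_shape_identity() +
  theme_void()
```

The fill = "red" setting only shows up on shapes 21 through 25, which are the ones that have a separate fill color in addition to their outline.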

So if we wanted, we could specify shapes for each species in our dataset pretty easily! I’ve done so below. I’m also going to control the colors by hand - R has a ton of named colors available (type colors() into the console to list them), and you can go crazy picking the best colors for a graph. You can also specify colors by using hex codes (e.g., "#FFFFFF"), which lets you pick any color exactly - but be warned that the same color can still look different from one screen or printer to the next.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point(aes(shape = Species), size = 3) + 
  scale_shape_manual(values = c(16, 17, 18)) + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) 

I also made the points a little bigger by specifying size = 3 - note that it isn’t in the aesthetics function, because it doesn’t care about any of the data. Things only go inside aes() if they’re responsive to changes in the data you’re graphing.
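
A common slip here is putting a fixed value inside aes(). The sketch below contrasts the two; p_fixed and p_mapped are illustrative names, and layer_data() is a ggplot2 helper that returns the values a layer will actually be drawn with.

```r
library(ggplot2)

# Fixed value outside aes(): every point is drawn in blue.
p_fixed <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(color = "blue", size = 3)

# The same value inside aes(): ggplot treats "blue" as a piece of data
# (a single group called "blue"), picks a color for that group from its
# default palette, and adds a legend for it.
p_mapped <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = "blue"), size = 3)

unique(layer_data(p_fixed)$colour)   # the color we asked for
unique(layer_data(p_mapped)$colour)  # a default palette color instead
```

If a plot comes out in an unexpected color with a puzzling one-item legend, this is usually the first thing to check.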

We can also vary the type of line that gets drawn when we use geom_smooth. This one only has six options, each of which has both a number and a name:

You can manually specify linetypes with scale_linetype functions, similar to what we did with shapes. You can use either the names or the numbers - just make sure that the names go inside of quotes, while the numbers don’t!
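
As a rough reference, the six linetypes and their numbers can be drawn with a short sketch like the one below. The linetype_chart_data data frame is made up for illustration; the number-to-name pairing is R's standard one.

```r
library(ggplot2)

# Linetype numbers 1 through 6 and the names they correspond to.
linetype_chart_data <- data.frame(
  number = 1:6,
  name = c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash"),
  stringsAsFactors = FALSE
)

ggplot(linetype_chart_data) +
  # Draw one horizontal line per linetype, using the name as the linetype itself.
  geom_segment(aes(x = 0, xend = 1, y = number, yend = number, linetype = name)) +
  # Label each line with its number and name.
  geom_text(aes(x = -0.05, y = number, label = paste(number, name)), hjust = 1) +
  scale_linetype_identity() +
  scale_x_continuous(limits = c(-0.6, 1)) +
  scale_y_reverse() +
  theme_void()

# Names and numbers are interchangeable, so these two scales ask for the same thing:
# scale_linetype_manual(values = c("solid", "dashed", "twodash"))
# scale_linetype_manual(values = c(1, 2, 6))
```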

I’m going to make our same graph again, manually controlling the linetypes. I’m also going to get rid of that shadow - it represents the 95% confidence interval around the line (which we’ll discuss more in our statistics section), as identified via standard error. We can turn it off by setting se = FALSE in the geom_smooth() function call.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_smooth(aes(linetype = Species), size = 1, se = FALSE) + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) + 
  scale_linetype_manual(values = c("solid",
                                   "dashed",
                                   "twodash"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can also combine both graphs into one, more useful graphic:

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_smooth(aes(linetype = Species), size = 1, se = FALSE) + 
  geom_point(aes(shape = Species), size = 3) + 
  scale_shape_manual(values = c(16, 17, 18)) + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) + 
  scale_linetype_manual(values = c("solid",
                                   "dashed",
                                   "twodash"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Nifty! Hopefully you’re getting a sense of how ggplot allows you to build a graph by addition, as it draws whatever functions you’ve called in turn.

Note, by the way, that I’ve put aes() calls in both the ggplot() and geom functions. Geoms inherit from the ggplot() call - they’ll use whatever data and aesthetics are specified inside its parentheses. However, if you want an aesthetic to only apply to one geom, you can put it inside that geom’s function call. This is pretty commonly used when an aesthetic only applies to one geom - for instance, our geom_smooth() can’t take a shape =.

You have to be careful with this power, though! Sometimes, defining geom-specific aesthetics will give you misleading or simply wrong visualizations. For instance, what would happen if we draw our lines based on the petal length of each species, rather than the sepal width?

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point(aes(shape = Species), size = 3) + 
  geom_smooth(aes(y = Petal.Length, linetype = Species), size = 1, se = FALSE) + 
  scale_shape_manual(values = c(16, 17, 18)) + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) + 
  scale_linetype_manual(values = c("solid",
                                   "dashed",
                                   "twodash"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Our plot makes no sense!

Lots of beginners are tripped up by this when they’re starting - a common assumption is that ggplot will add a second y-axis to the right-hand side of the plot. In reality, there is no way to graph two y-axes (of different values) on the same ggplot graph - and that’s on purpose, as dual y-axis plots are almost always misleading. It’s almost always better to just have two graphs next to each other, if you need to compare the data - though the linked article contains some other interesting suggestions.

Anyway, thinking back to our other graphic:

This graph is nice, but I think it could be even nicer. Specifically, there’s a lot of overlap between the versicolor and virginica species - it would be nice to see them side by side, rather than on the same plot.

3.3.3 Faceting

Luckily, ggplot makes this easy for us via what’s known as facets. By adding facet_wrap() to our plot, we’re able to split the three species onto their own graphs, while keeping the axes standardized.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point(size = 3) + 
  geom_smooth(size = 1, se = FALSE) + 
  facet_wrap(~ Species)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

That makes seeing the differences much easier! Note that I got rid of the different species aesthetics - now that the species are each on their own plot, each species having a different color and shape doesn’t add any information to the visualization.

facet_wrap() is very useful, in that it will automatically wrap our plots into however many rows and columns are required. If we want to be a little more specific in how our data is arranged, however, we can use facet_grid(). By specifying either rows = or cols =, we can finely control how our data is split:

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point(size = 3) + 
  geom_smooth(size = 1, se = FALSE) + 
  facet_grid(rows = vars(Species))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
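
For a single grouping variable, you can get a similar stacked layout from facet_wrap() by fixing the number of columns with its ncol argument. A sketch for comparison (p_wrapped is just an illustrative name):

```r
library(ggplot2)

# ncol = 1 forces all panels into a single column, one species per row.
p_wrapped <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(size = 3) +
  facet_wrap(~ Species, ncol = 1)

p_wrapped
```

The main visible difference is where the panel labels go: facet_wrap() puts them above each panel, while facet_grid(rows = ...) puts them along the right-hand side.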

Heck, if we have two groups we want to compare, we can use both rows = and cols = at the same time! Unfortunately, iris doesn’t have two grouping variables in it - so I’m going to make another one (color):

iris2  <- iris
iris2$color <- rep(c("purple","red","black"), 50)
head(iris2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  color
## 1          5.1         3.5          1.4         0.2  setosa purple
## 2          4.9         3.0          1.4         0.2  setosa    red
## 3          4.7         3.2          1.3         0.2  setosa  black
## 4          4.6         3.1          1.5         0.2  setosa purple
## 5          5.0         3.6          1.4         0.2  setosa    red
## 6          5.4         3.9          1.7         0.4  setosa  black

As you can see, I’ve told R to replicate (or repeat, as I’ve always thought of it) the vector of purple, red, black 50 times - so about a third of each species will be in each color. Using that as our column grouping gives us:

ggplot(data = iris2, aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point(size = 3) + 
  geom_smooth(size = 1, se = FALSE) + 
  facet_grid(rows = vars(Species), cols = vars(color))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 5.2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
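
To see what rep() did, and how evenly the made-up colors ended up spread across the species, a quick check like the following can help. The names short_example and color_counts are only for illustration.

```r
# rep() repeats the whole vector, in order, the requested number of times.
short_example <- rep(c("purple", "red", "black"), 2)
short_example

# Rebuild the iris2 data frame from above.
iris2 <- iris
iris2$color <- rep(c("purple", "red", "black"), 50)

# Count how many flowers fall into each species/color combination.
color_counts <- table(iris2$Species, iris2$color)
color_counts
```

Each species ends up with 16 or 17 flowers per color, so every panel in the grid above is drawn from a fairly small group, which is probably why loess printed those warnings.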

3.4 Diamonds

3.4.1 Visualizing Large Datasets

For this next exercise, we’re going to be using the diamonds dataset, which contains data about 54,000 different diamond sales. It looks like this:

head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

We can plot the price of each diamond against its weight (or carat) pretty easily, using geom_point() like before:

ggplot(diamonds, aes(carat, price)) + 
  geom_point()

Note that I’ve stopped explicitly writing data =, x =, and y =. Without that specification, R assumes that you’re providing arguments to the function in the order the function normally expects them - which, for ggplot(), is in the form ggplot(data, aes(x, y)). Most code you’ll find in the wild is written in this more compact format.
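
As a quick illustration, the two calls below should set up the same plot. This is a sketch with made-up names (p_named and p_short); mapping is the name ggplot() uses for the argument that receives aes().

```r
library(ggplot2)

# Every argument spelled out by name.
p_named <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point()

# The compact form: arguments are matched by position instead.
p_short <- ggplot(diamonds, aes(carat, price)) +
  geom_point()

# Both plots end up with the same aesthetics mapped.
names(p_named$mapping)
names(p_short$mapping)
```

Named arguments can be given in any order, while unnamed ones have to follow the order the function expects, so it's worth spelling names out whenever you're unsure.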

Anyway, back to the graph. It’s a bit of a mess! It’s hard to discern a pattern when all 54,000 points are plotted in the same area. We can make things a bit better by making the points transparent, by giving them a low alpha = value:

ggplot(diamonds, aes(carat, price)) + 
  geom_point(alpha = 0.05)

This is somewhat better! We can see that there’s a correlation between price and carat - but it’s hard to tell exactly what the trend looks like. Plus, there’s a good amount of empty space on the graph, which we could probably make better use of.

3.4.2 Axis Transformations

We can consider transforming our axes to solve all these problems. For instance, if we plotted both our axes on log10 scales, we’d get the following graph:

ggplot(diamonds, aes(carat, price)) + 
  geom_point(alpha = 0.05) + 
  scale_x_log10() + 
  scale_y_log10()

So we can see that, by log-transforming our variables, we get a linear-looking relationship in our data.
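
One rough way to check that impression is to compare the correlation coefficient (a measure of how closely two variables follow a straight line) before and after the transformation. This is only a sanity check sketched here, not a step the rest of the chapter relies on, and r_raw and r_log are illustrative names.

```r
library(ggplot2)  # provides the diamonds dataset

# Correlation on the original scale.
r_raw <- cor(diamonds$carat, diamonds$price)

# Correlation after taking log10 of both variables.
r_log <- cor(log10(diamonds$carat), log10(diamonds$price))

c(raw = r_raw, log10 = r_log)
```

A value closer to 1 on the log10 scale would back up what the graph suggests: the relationship is closer to a straight line after transforming both axes.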

Now, I’m personally not a fan of log graphs - and you shouldn’t be, either. But you’ll sometimes have data that can’t be properly explained without logarithms - or bosses who won’t listen to reason. As such, it’s worth knowing how to make R plot things exactly the way you want. Usually, however, it makes a lot more sense to plot your data without any transformations, and just use transformed values as needed in your analyses. We’ll discuss that more in chapters 5 and 6, however.

You can perform plenty of other axis transformations by specifying the trans argument inside of your scale function. For instance, if we wanted to use a natural log instead, we could type:

ggplot(diamonds, aes(carat, price)) + 
  geom_point(alpha = 0.05) + 
  scale_y_continuous(trans = "log") + 
  scale_x_continuous(trans = "log")

To learn more about transformations, you can read the documentation by typing ?scale_x_continuous() into the console.

3.6 Designing Good Graphics

These earlier charts are good examples of graphs we might use for data exploration, early in an analysis, to see what patterns we might be able to identify and tease out through further work. They aren’t, however, polished graphs ready for use in presentations. Graphics for communication require a little more attention to the principles of design.

Graphics for communication, at their essence, exist to make arguments and communicate your point. In order to do that, a graphic has to be both visually clean and easily understood, while at the same time containing exactly enough information to get a point across - and nothing more. Learning how to make graphics like this is a skill unto itself, and should be a skill you pay attention to even after finishing this unit. After all, it doesn’t matter how smart you are and how perfect your analysis is if you aren’t able to tell anyone about it afterwards!

The hard part about teaching graphic design is that it’s as much an art as a science - there is no one right way to make compelling graphics. What I’m going to teach in this section is as much my opinion as it is the correct way to do things - other textbooks and other people have their own preferred methods, none of which are inherently better or worse.

For instance, ggplot comes with a number of preinstalled themes which you can add to any given plot. For a complete list, type ?theme_bw into the console. We’ll just demo a few of the most common ones, using our old iris scatterplot:

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) +
  theme_bw()

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) +
  theme_minimal()

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) +
  theme_classic()

Plenty of other packages introduce other ggplot themes for you to use. My personal favorite is cowplot. Written by Claus O. Wilke, it provides some really interesting new extensions to ggplot, and sets the default theme to something that generally looks better than ggplot’s defaults. If we install it now:

install.packages("cowplot")

And then load it:

library(cowplot)
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange"))

This default is pretty similar to theme_classic(), except with different font sizes. However, if we add background_grid() to our plot:

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) + 
  background_grid()

We get what I consider to be the nicest looking default option R will give you.

If we want to override the default axis names, we can control that with labs():

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) + 
  background_grid() + 
  labs(x = "Sepal Length", y = "Sepal Width")

With labs, we can also give our graphs a title and caption. This is generally a bad idea - if you’re going to include a graph in a report or publication, you’ll want to typeset these outside of the image file - but it makes understanding these graphs a little easier.

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) + 
  background_grid() + 
  labs(x = "Sepal Length", 
       y = "Sepal Width",
       title = "Sepal Width as a Function of Sepal Length",
       subtitle = "Data from R. A. Fisher's iris dataset, 1936",
       caption = "Made in R with ggplot2")

If we want to change anything about the theme (for instance, the text size or legend position), we can specify that in theme():

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) + 
  background_grid() + 
  labs(x = "Sepal Length", y = "Sepal Width") + 
  theme(text = element_text(size = 12),
        axis.text = element_text(size = 10),
        legend.position = "top")

And we can keep specifying what we want until we’re satisfied with our graph.

ggplot will also let us focus on specific parts of the data:

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_manual(values = c("purple",
                                "black",
                                "orange")) + 
  background_grid() + 
  labs(x = "Sepal Length", y = "Sepal Width") + 
  scale_x_continuous(limits = c(5, 7)) + 
  scale_y_continuous(limits = c(2.5, 3.0))
## Warning: Removed 93 rows containing missing values (geom_point).
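
As the warning shows, setting limits inside the scale functions throws away every point that falls outside them. If you'd rather zoom in without dropping any data, coord_cartesian() is worth knowing about. The sketch below uses illustrative names (inside and p_zoom) and leaves out the colors and grid from the plot above to keep things short.

```r
library(ggplot2)

# Which flowers fall inside the limits used above?
inside <- iris$Sepal.Length >= 5 & iris$Sepal.Length <= 7 &
  iris$Sepal.Width >= 2.5 & iris$Sepal.Width <= 3.0
sum(!inside)  # rows falling outside the limits

# coord_cartesian() zooms the view but keeps every row in the plot's data.
p_zoom <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  coord_cartesian(xlim = c(5, 7), ylim = c(2.5, 3.0))

p_zoom

# All 150 rows are still part of the points layer.
nrow(layer_data(p_zoom))
```

For a plain scatterplot the two approaches look alike, but the difference matters once you add something like geom_smooth(), which would otherwise be fitted only to the points that survive the limits.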

Of course, if you’re graphing things such as percentages, you should be careful about where you set your axes. Say we had a dataset where every 1-unit increase in some variable x came with a 1 percentage point increase in y, so that y rose by 9 percentage points (from 61% to 70%) over the full range of x values. If you let ggplot set your axis defaults, you’d wind up with a line that climbs from one corner of the plot to the other, which makes the change look dramatic:

df <- data.frame(x = 1:10, 
                 y = 61:70)
ggplot(df, aes(x, y)) + 
  geom_line() + 
  background_grid()

However, it’s probably more truthful to graph percentages on a 0-100 scale - doing so shows us that x has a weaker impact on y than the default would have us believe:

ggplot(df, aes(x, y)) + 
  geom_line() + 
  background_grid() + 
  scale_y_continuous(limits = c(0,100))

If there’s any part of your graph you want to change, try googling “change XX ggplot”. The first link will almost certainly have what you’re looking for. The reference guide is also a good place to go hunting for answers.

3.7 Saving Your Graphics

When you’re satisfied with your graph, simply call the ggsave() function to save it to whatever folder you’re working in (your working directory). The first argument to this function should be your graph’s desired file name, with the extension - ggplot can save graphs as pngs, jpegs, pdfs, and several other formats.

The simplest approach is to call it after you’ve plotted your graph - by default, ggsave() will save whatever image was drawn last. Older code sometimes adds ggsave() to a plot with +, but recent versions of ggplot2 no longer allow this. For more information on specific things ggsave can do, type ?ggsave into R.
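
A minimal sketch of calling ggsave() after building a plot. The file name, the p_iris and out_file variables, and the 6 x 4 size are all arbitrary choices made for this example.

```r
library(ggplot2)

p_iris <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point()

# Saved to a temporary folder here so the example doesn't clutter your
# working directory; a plain file name like "iris_scatter.png" would be
# saved to the working directory instead.
out_file <- file.path(tempdir(), "iris_scatter.png")

# Naming the plot explicitly avoids relying on "whatever was drawn last".
# width and height are in inches by default.
ggsave(out_file, plot = p_iris, width = 6, height = 4)

file.exists(out_file)
```

Changing the extension in the file name (for example, to .pdf) is enough to change the format that gets written.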

3.8 More Resources

In addition to the ggplot documentation, I highly recommend the ggplot book. Additionally, almost any problem can be solved by googling - just put “ggplot” at the end of whatever your question is, and odds are you’ll find the perfect solution.

3.9 Exercises

3.9.1 Graph the following:

  1. A boxplot of the iris data, with species on the x axis and sepal length on the y
  2. A scatterplot of the iris data, plotting sepal length against width, where the points get bigger as sepal width increases
  3. Can you change the color of the boxes in the graph you made for problem 1? Can you change the color of the lines?

3.9.2 Use a new dataset:

Also included in ggplot is the mpg dataset, containing fuel economy data for 38 different models of car.

  1. Use head() to examine the data. You can also type ?mpg to get documentation on what each variable represents.
  2. Is engine displacement (displ) correlated with highway miles per gallon (hwy)? Make a scatterplot to find out.
  3. What variables could we use to group the data? Does coloring points by any of these help explain the scatterplot from problem 2?
  4. What does the scatterplot look like if you plot cty against hwy? Why? What geom could we use to better represent the data?

3.9.3 Looking ahead:

  1. What happens if you type in summary(mpg)? What do you think this output represents?
  2. What happens if you type in mpg[1]? How does this compare to mpg[[1]]?