6 Data Visualization with ggplot

A great resource for data visualization in R is the R Graph Gallery. The examples and information below are a small sample of R data visualization basics.

6.1 ggplot

ggplot2 (referred to here as ggplot) is a powerful graphics package that can be used to make very impressive data visualizations (see contributions to #TidyTuesday on Twitter, for example). The following examples use the Learning R Survey data, which has been partially processed (Chapters 2 and 3), the palmerpenguins data set, and several of the datasets included with R to show the basic principles of using ggplot. Then, we will put these basics together to make several beautiful visualizations.

6.2 Grammar of Graphics

The “gg” of ggplot refers to the “grammar of graphics”. For ggplot, this means a visualization must have specific elements to make a complete graphic, just as an utterance or written line must have specific elements to make a grammatically correct sentence.

A simple plot contains the following elements:

6.3 Data

There are several ways to refer to a data object in ggplot. You can name the data inside ggplot() (e.g. ggplot(rsurvey)), or you can pipe it in from a dplyr chain (e.g. rsurvey %>% ggplot()). The advantage of the second approach is that you can manipulate the data and send it straight into ggplot without saving it as a new data object.
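The two approaches can be sketched as follows. The survey data is not reproduced here, so this sketch assumes the mpg data set that ships with ggplot2 as a stand-in; the object names p_inside and p_piped are illustrative only.

```r
library(ggplot2)
library(dplyr)

# Option 1: name the data inside ggplot()
p_inside <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Option 2: pipe the data in, filtering on the way,
# without saving the filtered data as a separate object
p_piped <- mpg %>%
  filter(class == "compact") %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()

p_inside
p_piped
```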

You can also call data for individual shapes. This would allow you to use different data objects to form your graphic, or to use the same data object but filter it to display different information. The following graphic demonstrates this.

Please note this graphic is for demonstration purposes only and does not represent a useful data visualization.
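A minimal sketch of layer-specific data, assuming the mpg data set from ggplot2 rather than the data behind the original graphic. Each geom receives its own data= argument, and the second layer reuses the same data object after filtering it.

```r
library(ggplot2)
library(dplyr)

p_layers <- ggplot(mapping = aes(x = displ, y = hwy)) +
  # layer 1: every car, drawn in grey
  geom_point(data = mpg, color = "grey70") +
  # layer 2: the same data object, filtered to two-seaters only
  geom_point(data = filter(mpg, class == "2seater"),
             color = "red", size = 3)

p_layers
```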

A note about +

Across the tidyverse, the %>% “pipe operator” is used to chain commands together, and it can be read as “and then”. Although ggplot is part of the tidyverse, its elements are connected with the plus sign + rather than the pipe.
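The difference can be seen in a short sketch (again assuming the mpg data from ggplot2). The pipe carries the data into ggplot(), and + adds layers. Using the pipe between layers is a common slip, and in the ggplot2 versions I am aware of it stops with an error.

```r
library(ggplot2)
library(dplyr)

# %>% passes the data to ggplot(); + adds the layer
p_plus <- mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()

# %>% between layers fails; try() captures the error instead of stopping
wrong <- try(
  mpg %>% ggplot(aes(x = displ, y = hwy)) %>% geom_point(),
  silent = TRUE
)
inherits(wrong, "try-error")
```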

6.4 Aesthetics

Aesthetics are the way you connect data to the elements inside the graphic. Aesthetics tell ggplot what should be on the x-axis, what should be on the y-axis, and what the colors should be.

Different geometries (shapes) may have different aesthetics, but x, y, and color/fill are the most common.

  • color= is used for:
    • geom_point() - dots, circles, scatterplots
    • geom_line() - line charts
  • fill= is used for:
    • geom_col() / geom_bar() - column/bar charts
    • geom_area() - area charts

Using color= or fill= to refer to a categorical variable (called a “discrete” variable in ggplot) allows you to separate the shape by that category. Here is an example with and without specifying a color:
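As an illustration, the sketch below draws the same scatterplot twice, once without color and once with color mapped to a discrete variable. It assumes the mpg data from ggplot2 and its drv (drive type) column.

```r
library(ggplot2)

# Without color: every point looks the same
p_plain <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# With color mapped to a discrete variable: one color per drive type
p_color <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

p_plain
p_color
```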

6.5 Geometries

Geometries are the different shapes one can make using ggplot. Their function names all start with geom_ and they can be stacked by simply using +. The first geom is always the bottom layer, and each additional geom is drawn on top of it. (See Lollipop Charts for an example.)

6.5.1 Bar Charts

You can make bar charts with either geom_bar() or geom_col().

6.5.1.1 geom_bar

geom_bar requires:

  • an x-value; it is useful if you just want a count of the data

geom_bar may also have:

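  • a y-value and a stat= value if you want to specify how the y-value data should be shown
    • stat="identity" - plots the y values as they are; rows that share an x value are stacked, so the bar height is the sum of y
    • stat="summary" - applies a summary function to y; the default, mean_se(), gives the mean of y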

Compare:
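The three variants can be compared in a sketch like the following, which assumes the mpg data from ggplot2. The fun= argument assumes ggplot2 3.3.0 or later; naming the function also avoids the "No summary function supplied" message.

```r
library(ggplot2)

# x only: geom_bar() counts the rows in each class
p_count <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# x and y with stat = "identity": y values are stacked,
# so each bar ends at the sum of hwy for that class
p_identity <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_bar(stat = "identity")

# x and y with stat = "summary": each bar is the mean of hwy
p_mean <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_bar(stat = "summary", fun = "mean")

p_count
p_identity
p_mean
```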

6.5.1.3 Horizontal Bar Chart

To make a horizontal bar chart, add coord_flip():

## Warning: Removed 2 rows containing non-finite values
## (stat_summary).
## No summary function supplied, defaulting to `mean_se()`

Note: coord_flip() can be placed anywhere after ggplot().
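A minimal sketch of a horizontal bar chart, assuming the mpg data from ggplot2 and ggplot2 3.3.0 or later for the fun= argument:

```r
library(ggplot2)

p_horizontal <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_bar(stat = "summary", fun = "mean") +
  coord_flip()  # swaps the axes so the bars run horizontally

p_horizontal
```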

6.5.1.4 Stacked Bar Chart

To make a stacked bar chart, map a variable to fill= inside aes().

## No summary function supplied, defaulting to `mean_se()`
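For illustration, here is a stacked bar chart of counts. This sketch assumes the mpg data from ggplot2 and stacks counts rather than the summary statistic used in the original chart.

```r
library(ggplot2)

# fill = drv splits each bar into one segment per drive type;
# geom_bar() stacks the segments by default (position = "stack")
p_stacked <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

p_stacked
```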

6.5.3 Boxplots

A boxplot uses geom_boxplot(). It requires x-values. Y-values are optional but useful if you want to compare multiple boxplots.

Note: Use coord_flip() to make it easier to read.
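The sketch below shows a single boxplot and a grouped, flipped one, assuming the mpg data from ggplot2. Drawing a boxplot from an x-value alone assumes ggplot2 3.3.0 or later, where the orientation is worked out automatically.

```r
library(ggplot2)

# One continuous variable: a single boxplot
p_box_one <- ggplot(mpg, aes(x = hwy)) +
  geom_boxplot()

# Adding a categorical variable: one boxplot per class,
# flipped so the class labels are easier to read
p_box_groups <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()

p_box_one
p_box_groups
```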

6.5.4 Scatterplots

We can use geom_point() for scatterplots. This requires x and y values, both continuous.

6.5.4.1 Point Size

Point size can be based on a single number or the data itself.

You can control overall point size by setting size= outside aes():

You can make the values in the data also determine point size by using size= inside aes():
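Both cases in one sketch, assuming the mpg data from ggplot2 (cyl, the number of cylinders, is used here only as an example of a numeric column):

```r
library(ggplot2)

# One size for every point: size= goes outside aes()
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 3)

# Size driven by the data: size= goes inside aes()
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point()

p_fixed
p_mapped
```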

6.5.4.3 Shapes

Shapes are controlled like color and size. Inside aes() means that shapes are mapped to data. Outside aes() means there is one shape.
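A short sketch of both cases, assuming the mpg data from ggplot2. Both plots also set alpha = 0.5 so that overlapping points remain visible.

```r
library(ggplot2)

# Shape mapped to data: one shape per drive type
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point(alpha = 0.5, size = 3)

# One shape for all points: shape 17 is a filled triangle
p_one_shape <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 17, alpha = 0.5, size = 3)

p_shape
p_one_shape
```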

A note on alpha=

alpha is set outside aes() to control the transparency of overlapping shapes. An alpha of 0 is completely transparent, while an alpha of 1 is fully opaque. An alpha of .5, set above, is 50% transparency.

6.5.4.4 Scatterplots for Categorical Variables

If you want to make a scatterplot for categorical variables, you will simply get a line of dots for each category unless you use geom_jitter(), which adds a small amount of random noise to the position of each point.

Compare:

You can also use position_jitter(width = NULL, height = NULL, seed = NA) inside of geom_point() to achieve a similar effect.
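The three options can be compared in a sketch like this one, assuming the mpg data from ggplot2. The width of 0.2 is an arbitrary choice for illustration, and height = 0 leaves the y values untouched.

```r
library(ggplot2)

# Plain points: every observation in a category sits on one vertical line
p_points <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_point()

# geom_jitter(): random horizontal noise separates overlapping points
p_jitter <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_jitter(width = 0.2, height = 0)

# The same idea through geom_point(); seed makes the jitter reproducible
p_position <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

p_points
p_jitter
p_position
```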

6.5.5 Barbell Charts

Barbell charts plot two related values as dots and show the distance between them with a line.

You can combine geom_point() with geom_linerange() to make a simple barbell chart. geom_linerange() should be called first, so that it sits below the dots layer and its line ends are hidden by the dots. First, we will summarize the penguin data and then compare.

The following code builds the graphic by combining different data layers and different geometry layers.

## `summarise()` regrouping output by 'species' (override with `.groups` argument)
## `summarise()` regrouping output by 'species' (override with `.groups` argument)
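A minimal sketch of such a barbell chart, assuming the palmerpenguins package is installed and dplyr 1.0.0 or later for the .groups argument. The choice of body mass by sex and the object names (mass_by_sex, mass_range) are illustrative, not necessarily what the original graphic used.

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)

# Mean body mass for each species and sex (the dots)
mass_by_sex <- penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE), .groups = "drop")

# Smallest and largest of those means per species (the ends of the line)
mass_range <- mass_by_sex %>%
  group_by(species) %>%
  summarise(low = min(mean_mass), high = max(mean_mass), .groups = "drop")

p_barbell <- ggplot() +
  # line layer first, so that it sits underneath the dots
  geom_linerange(data = mass_range,
                 aes(x = species, ymin = low, ymax = high),
                 color = "grey60") +
  # dot layer second, drawn from a different data object
  geom_point(data = mass_by_sex,
             aes(x = species, y = mean_mass, color = sex),
             size = 4) +
  coord_flip()

p_barbell
```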

6.5.6 Line Charts

You can use geom_line() for line charts that display values over time. geom_line() often needs an additional group= aesthetic, especially when the x-axis variable is categorical, so that ggplot knows which points to connect. If there should be only one line, use group=1. If you want to split the lines based on another variable, use group=variable_name.

For the example below, we will use the AirPassengers data that comes with R and transform it into a data frame, following an example from Stack Overflow.

A line graph displaying a single line for year

## `summarise()` ungrouping output (override with `.groups` argument)

A line graph displaying 1 line per month

## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
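Both line charts can be sketched as follows. The conversion of AirPassengers to a data frame here is one possible approach, not necessarily the Stack Overflow one the text refers to; the column names (year, month, passengers) are assumptions. Converting the values with as.numeric() drops the ts class, which should avoid the scale message shown above.

```r
library(ggplot2)
library(dplyr)

# AirPassengers is a monthly time series covering 1949 to 1960
air <- data.frame(
  year       = rep(1949:1960, each = 12),
  month      = factor(rep(month.abb, times = 12), levels = month.abb),
  passengers = as.numeric(AirPassengers)
)

# A single line: total passengers per year, so group = 1
yearly <- air %>%
  group_by(year) %>%
  summarise(total = sum(passengers), .groups = "drop")

p_year <- ggplot(yearly, aes(x = year, y = total, group = 1)) +
  geom_line()

# One line per month: group (and color) by month
p_month <- ggplot(air, aes(x = year, y = passengers,
                           group = month, color = month)) +
  geom_line()

p_year
p_month
```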

We can add labels to the ends of the lines using geom_label() (see Labels), but the lines are very close together, so we will use geom_label_repel() from the ggrepel package instead. This gives the labels space and connects them to their lines.

## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
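A sketch of end-of-line labels, assuming the ggrepel package is installed. The data frame is rebuilt here so the block runs on its own; its column names and the nudge_x value are illustrative assumptions.

```r
library(ggplot2)
library(dplyr)
library(ggrepel)

air <- data.frame(
  year       = rep(1949:1960, each = 12),
  month      = factor(rep(month.abb, times = 12), levels = month.abb),
  passengers = as.numeric(AirPassengers)
)

# Keep only the final year so each line gets one label at its right-hand end
line_ends <- filter(air, year == max(year))

p_labeled <- ggplot(air, aes(x = year, y = passengers, group = month)) +
  geom_line(color = "grey50") +
  # labels are pushed apart and linked back to their points
  geom_label_repel(data = line_ends, aes(label = month),
                   nudge_x = 0.5, size = 3)

p_labeled
```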

6.5.7 Colors

For colors related to values in a data set, see Aesthetics.

You can change the color of all the chart elements of a geometry by setting fill= (or color=) outside of aes(). Here, color is not mapped to the data, so it is set inside geom_col() but not in aes(). You can use R color names (e.g. “blue”, “black”, “grey80”), hex values (e.g. “#cccccc” or “#a85001”), or RGB values (e.g. rgb(0, 155, 255, maxColorValue = 255)).

## `summarise()` ungrouping output (override with `.groups` argument)
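The three ways of naming a color can be sketched as below, assuming the mpg data from ggplot2 summarised into a hypothetical class_means table. Note that rgb() expects values between 0 and 1 unless maxColorValue is set.

```r
library(ggplot2)
library(dplyr)

class_means <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy), .groups = "drop")

base <- ggplot(class_means, aes(x = class, y = mean_hwy))

# fill= sits inside geom_col() but outside aes(): one color for all bars
p_named <- base + geom_col(fill = "steelblue")   # R color name
p_hex   <- base + geom_col(fill = "#a85001")     # hex value
p_rgb   <- base + geom_col(fill = rgb(0, 155, 255, maxColorValue = 255))

p_named
p_hex
p_rgb
```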

See here for a list of R color names.

You can also use other color palettes by installing the viridis or hrbrthemes packages.

6.5.8 Labels

You can add labels with geom_label or geom_text. geom_text is just text and geom_label is text inside a rounded white box (this, of course, can be changed). Compare:

Note: Because there is no y value in the data, these graphics use y=..count.. to refer to the computed count, and stat="count" so that the label layer counts the rows in the same way the bars do.
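A sketch comparing the two label geoms on a count bar chart, assuming the mpg data from ggplot2. It uses after_stat(count), which I believe replaces the ..count.. notation from ggplot2 3.3.0 onward; on older versions ..count.. would be needed instead.

```r
library(ggplot2)

# geom_text(): plain text just above each bar
p_text <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5)

# geom_label(): the same text inside a rounded box
p_label <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  geom_label(aes(label = after_stat(count)), stat = "count")

p_text
p_label
```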

6.5.9 Multiple Plots

6.5.9.1 Faceting

You can break a graphic into multiple plots (or facets) using facet_wrap(~variable). Here is an example:

Note: The months are not in order. To put them in order, you would first need to use factor() inside a mutate() command.

You can control the number of rows/columns using nrow= or ncol=:
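A sketch of faceting by month with the AirPassengers data, including the factor() step that puts the months in calendar order. The data frame construction and column names are assumptions made for this illustration.

```r
library(ggplot2)
library(dplyr)

air <- data.frame(
  year       = rep(1949:1960, each = 12),
  month      = rep(month.abb, times = 12),  # character, so alphabetical by default
  passengers = as.numeric(AirPassengers),
  stringsAsFactors = FALSE
)

# factor() inside mutate() fixes the facet order to Jan, Feb, ..., Dec
air_ordered <- air %>%
  mutate(month = factor(month, levels = month.abb))

# One panel per month, arranged in 4 columns (and therefore 3 rows)
p_facets <- ggplot(air_ordered, aes(x = year, y = passengers)) +
  geom_line() +
  facet_wrap(~month, ncol = 4)

p_facets
```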

6.5.10 Themes

Themes control the overall look and feel of a ggplot. A complete, ready-made theme is applied with its own function, named theme_*() (for example, theme_minimal()). To modify individual theme elements, use theme().
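The two kinds of theme call can be combined, as in this sketch (assuming the mpg data from ggplot2; the title text and the specific elements changed are arbitrary examples):

```r
library(ggplot2)

p_themed <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Engine size and highway mileage") +
  theme_minimal() +                     # a complete, ready-made theme
  theme(                                # then adjust individual elements
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

p_themed
```

Calling theme() after the complete theme matters here, because a complete theme added later would overwrite the individual adjustments.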