3 Data Visualization

After laboring yourselves to collect, clean, and analyze the data, you want to share the findings and engage the audience with insights about the significance of the study. But most people are not good at extracting insights from rows and columns of numbers. It is the job of the researchers to get important messages across to the audience so they can understand them with as little effort as possible.

Learning Objectives:

This chapter discusses the properties of various graphs, what they are good at and not so good at, and how to plot them in R using two different graphical systems. After finishing this chapter, you should be able to:

  • Choose the most appropriate graph based on the number of variables and their types.

  • Convey the message you want to convey to the audience embedded in the data through graphs.

  • Decorate graphs with labels, a main title, color, and other graphical attributes using the base R graphical system.

  • Create publication-quality graphs using the ggplot2 package.

3.1 Types of Graphs

The best way to choose the most appropriate graph depends on the data. Specially, the types of variables (numeric or categorical) and how many of them. You will plot graph for:

  1. One numeric variable.

  2. One categorical variable.

  3. One categorical and one numeric variables.

  4. Two or more numeric variables.

  5. A summary table.

3.2 Base R Graphics

This graphical system is built into the base R; no additional package is required. It is simple, ideal for creating graphs promptly for prototyping purposes and get the job done. The flip side is that the graphs may appear a bit primitive. Some features are hard to be implemented. Anyway, if plotting graphs using R is new to you, base R is a great starting point. You will understand the basic characteristics of different graphs without requiring too much technical details.

3.2.1 One Numeric Variable

You can plot a histogram of the numeric variable to see its distribution, such as the spread, the peak, symmetry, and flatness. If your concern is robust statistics, a boxplot will do the job.

3.2.1.1 Histogram

Let’s examine the distribution of iris’s Sepal.Width.


par(mfrow=c(2,2))

# Top left
hist(iris$Sepal.Width)

# Top right
hist(iris$Sepal.Width,
     xlab="Sepal Width",
     ylab = "Count",
     main = "Histogram of Sepal Width")

# Bottom left
hist(iris$Sepal.Width,
     col = 'skyblue', 
     border = 'white', 
     xlab="Sepal Width", 
     ylab = "Count", 
     main = "Color options")

# Bottom right
hist(iris$Sepal.Width, 
     breaks = seq(2,4.5,by=0.1), 
     col = 'skyblue', 
     border = 'white', 
     xlab="Sepal Width", 
     ylab = "Count", 
     main = "breaks option")
Base R Histogram. Top left: default. Top right: Customized X-Y labels and the main title. Bottom left: change color of bars and border. Bottom right: change the bin width  (`breaks=`)

Figure 3.1: Base R Histogram. Top left: default. Top right: Customized X-Y labels and the main title. Bottom left: change color of bars and border. Bottom right: change the bin width (breaks=)

For the bottom-right graph, the seq() function (1.3.1) is used to generate a list of numbers for the breaks parameter that represents x-ticks.

Note: par(mfrow=c(2,2)) splits the plotting area into a 2x2 grid such that multiple graphs can be grouped

3.2.1.2 Boxplot

If the message is robust statistics, such as median and quantiles, or the presence of outliers, the best graph is a boxplot. The y-axis represents the numeric variable, and the x-axis has no meaning.

par(mfrow=c(2,2))

# Top left
boxplot(iris$Sepal.Width)

# Top right
boxplot(iris$Sepal.Width,
        ylab="Sepal Width",
        main="Boxplot of Sepal Width")

# Bottom left
boxplot(iris$Sepal.Width,
            col = "skyblue",
            ylab="Sepal Width",
            outcol = 'red',
            outpch = 19,
            main="Color options")

# Bottom right
boxplot(iris$Sepal.Width, 
        horizontal = TRUE, 
        xlab="Sepal Width", 
        main="Horizontal Boxplot")
Boxplot. Top left: Default, Top right: labels and main title. Bottom left: color options. Bottom right: `horizontal = TRUE`

Figure 3.2: Boxplot. Top left: Default, Top right: labels and main title. Bottom left: color options. Bottom right: horizontal = TRUE

Anatomy of a Boxplot

Figure 3.3: Anatomy of a Boxplot

The upper and lower whiskers are defined as:

\[Upper\ Whisker = Q3 + 1.5 \times IQR\] \[Lower\ Whisker = Q1 - 1.5 \times IQR\] , where \(IQR=(Q3-Q1)\).

In practice, values that are greater or less than the upper or lower whiskers are treated as outliers.

Histogram offers a glimpse of how the data is distributed, displaying the peak(s), and long tail, if any, that is less obvious in a boxplot. However, pinpointing the median and outliers on a histogram is non-trivial. Therefore, it makes sense to combine the strengths of both plots in a single graph. To be sure the two plots are comparable, make sure to set the axes in the same range by the xlim and/or ylim= parameters. Here’s an example below:

par(mfrow=c(2,1))

hist(iris$Sepal.Width, 
     breaks = seq(2,4.5,by=0.1), 
     col = 'skyblue', 
     border = 'white', 
     xlim = c(2,4.5),
     xlab = "",
     ylab = "Count", 
     main = "Stacking Up Graphs")

boxplot(iris$Sepal.Width, 
        horizontal = TRUE, 
        ylim = c(2,4.5),
        xlab="Sepal Width", 
        main="")
Stacking up two graphs.

Figure 3.4: Stacking up two graphs.

Figure (3.4)) illustrated the added advantage of combining graphs. It is clear that the peak in the histogram is almost perfectly aligned with the median shown in the boxplot, indicating symmetric data distribution. Additionally, the outliers hardly visible from the histogram can easily be spotted in the boxplot. This example illustrates the synergy and additional perspectives created by combining different graphs.

3.2.2 One Categorical Variable

When it comes to categorical variables, we are usually interested in the number of samples fall in each category of a categorical variable. As discussed in Chapter 2, the tallying can be done using the table() function (2.1.8) or group_by() followed by summarize(n=n()) (2.2.10).

3.2.2.1 Barplot

Barplot is the standard choice for visualizing tallied data. The only input to the R’s barplot() graph function is a tabulated count of the categorical variable. Here’s an example:

par(mfrow=c(2,2))

# Top left
barplot(with(iris, table(Species)))

# Top right
barplot(with(iris, table(Species)), 
        xlab="Species", 
        ylab="Count", 
        main="A Barplot of Species")

# Bottom left
barplot(with(iris, table(Species)), 
        col='skyblue', 
        border='blue', 
        xlab="Species", 
        ylab="Count", 
        main="Color Options")

# Bottom right
barplot(with(iris, table(Species)), 
        horiz = TRUE, 
        col='skyblue', 
        border='blue', 
        xlab="Count", 
        ylab="Species", 
        main="A Horizontal Barplot")
A simple bar plot. Top left: Default. Top right: with tables

Figure 3.5: A simple bar plot. Top left: Default. Top right: with tables

Sometime you might want to reorder the categories along the x- or y-axis. By default, barplot() places the bars from left to right according to the current order of the levels. E.g., the levels of iris$Species is:

levels(iris$Species)
#> [1] "setosa"     "versicolor" "virginica"

You can change the current order by updating the factor variable by the factor() function (1.2.5). Suppose, the desired order is virginica, setosa, and veriscolor.

iris$Species <- factor(iris$Species, 
                       levels = c("virginica", "setosa", "versicolor"), 
                       ordered = TRUE)
levels(iris$Species)
#> [1] "virginica"  "setosa"     "versicolor"
barplot(with(iris, table(Species)), 
        col='skyblue', 
        border='blue', 
        xlab="Species", 
        ylab="Count", 
        main="Reordered Species")
Reorder the Bars of a Barplot

Figure 3.6: Reorder the Bars of a Barplot

3.2.2.2 Pie Chart

Besides a barplot, a pie chart is also a good fit for displaying tallied data.

par(mfrow=c(2,2))

# Top left
pie(with(iris, table(Species)))

# Top left
pie(with(iris, table(Species)), 
    col = c('skyblue', 'orange', 'white'), 
    main="Species")

# Bottom right
counts <- with(iris, table(Species))
my_labs <- paste0(names(counts),"(",counts,")")
pie(counts, 
    labels = my_labs, 
    col = c('skyblue', 'orange', 'white'), 
    main = "Species Counts")


# Bottom right
counts <- with(iris, table(Species))
fractions <- round(counts/sum(counts),2)
my_labs <- paste0(names(counts),"(",fractions,")")
pie(counts, 
    labels = my_labs, 
    col = c('skyblue', 'orange', 'white'), 
    main = "Species Fractions")
A simple pie chart. Top left: Default. Top right: with tables

Figure 3.7: A simple pie chart. Top left: Default. Top right: with tables

3.2.3 One Categorical and One Numeric Variables

The categorical is usually served as a group identifier that splits the data into groups so the numerical variable can be compared between groups. This kind of graph is generally named side-by-side plot. You can choose a side-by-side boxplot or a side-by-side barplot. For example, visually compare the robust statistics between different species of iris.

3.2.3.1 Side-by-side Boxplot

par(mfrow=c(2,1))

boxplot(Sepal.Length ~ Species,
        data=iris, 
        main="A Side-by-Side Boxplot")

boxplot(Sepal.Length ~ Species,
        data=iris, 
        horizontal = TRUE,
        main="A Horizontal Side-by-Side Boxplot")
Side-by-side plot

Figure 3.8: Side-by-side plot

Sepal.Length ~ Species is read as “Sepal length by species”, the tilde symbol “~” means “by”.

However, there is glitch in the horizontal boxplot. The tick y-labels collided. It can be fixed by adding las=2 parameter to the boxplot() function.

boxplot(Sepal.Length ~ Species,
        data=iris, 
        horizontal = TRUE,
        las = 2,
        main="Y-tick Labels Reorientated")
Side-by-side plot

Figure 3.9: Side-by-side plot

3.2.3.2 Side-by-side Barplot

Suppose you want to visually compare the average sepal length among different iris species. You will begin to feel the cumbersome of base R, and appreciate the power of tidyverse. Anyway, you will see how to make such a graph with a new function, tapply(), in two steps.

First, make a table of average sepal length by species. Apparently, the table() function may help. But it does not work for this case, as it only tallies the number of samples by species. Hence, you need to use a new function tapply(). Here’s an example:

mean_data <- tapply(iris$Sepal.Length, iris$Species, FUN = mean)
mean_data
#>  virginica     setosa versicolor 
#>      6.588      5.006      5.936

How does tapply() work? The “t” represents table. The function applies the mean function to the input data iris$Sepal.Length grouped by iris$Species. In other words, the mean() is applied to the sepal lengths for each species separately.

And then, pass the mean_data object to barplot().

barplot(mean_data,
        xlab="Species", 
        ylab="Average Sepal.Length", 
        main="Side-by-side Barplot")
Side-by-side Barplot

Figure 3.10: Side-by-side Barplot

3.2.3.3 Side-by-side Barplot with Error Bar^*

*This topic is slightly advanced. It is much easier to use ggplot2. So, feel free to skip this and jump to 3.3.4.3.

The error bar represents the standard error of the mean (SEM). Here is the formula:

\[\huge{\bar{x}\pm\frac{s}{\sqrt{n}}}\] , where \(s\) and \(n\) are the standard deviation and sample size, respectively. In the bar plot you are going to plot, the height of the bar is \(\bar{x}\), and the error bars span above and below the bar’s top by \(\frac{s}{\sqrt{n}}\) units.

First, calculate the mean and standard of errors using the tapply() function.

mean_data <- tapply(iris$Sepal.Length, iris$Species, FUN = mean)
mean_data
#>  virginica     setosa versicolor 
#>      6.588      5.006      5.936
sem <- function(x) {sd(x)/sqrt(length(x))}
sem_data <- tapply(iris$Sepal.Length, iris$Species, FUN = sem)
sem_data
#>  virginica     setosa versicolor 
#> 0.08992695 0.04984957 0.07299762

First, create the bar plot and store the height of bars in the object bar_midpoints. Use the arrows() function create the error bars, where each error bar consists of the upper whisker \((x_{1}, y_{1})\) and lower whisker \((x_{0}, y_{0})\).


bar_midpoints <- barplot(mean_data,
                  xlab="Species", 
                  ylab="Average Sepal.Length", 
                  main="Side-by-side Barplot")

arrows(x0 = bar_midpoints,
       y0 = mean_data - sem_data, # Start of the bar
       x1 = bar_midpoints,
       y1 = mean_data + sem_data, # End of the bar
       angle = 90,                # Makes the ends of the bars flat
       code = 3,                  # Draws ticks at both ends
       length = 0.5, col='red')   # Sets the length of the ticks
Side-by-side Barplot with Error Bars

Figure 3.11: Side-by-side Barplot with Error Bars

Once again, if you found this confusing, see the ggplot version of error bars (Section 3.3.4.3).

3.2.3.4 Line Graph with Error Bar

A line graph is an alternative to a bar plot that is good at showing a trend. Here you see how to create a line graph overlaid with solid dots.

# Default line plot
plot(mean_data,
      type='b',
      pch=19,
      xlab="Species",
      ylab="Average Sepal.Length",
      main="Line Plot")
Line plot

Figure 3.12: Line plot

  • Use pch=19 to specify the shape of the data points. The value 19 denotes a solid dot (\(\bullet\)).

  • To join the dots together in a line plot, specify type='b', the b means to plot both dots and line.

But, there is a glitch in the x-tick.The x-tick should show the actual species names instead of number. It can be fixed in two steps: 1. Suppress the default x-tick labeling feature of the plot() function by setting the parameter xaxt="n". And then, paste the customized x-tick back by the axis() function. The species names can be obtained by names(mean_data).

# Fix x-tick labels
plot(mean_data,
      type='b',
      pch=19,
      xlab="Species",
      ylab="Average Sepal.Length",
      main="Change x-tick Labels",
      xaxt="n")
axis(side=1, at=1:length(mean_data), labels=names(mean_data))
Line plot with proper x-tick labels

Figure 3.13: Line plot with proper x-tick labels

Here’s how to add error bars to the line plot. It is similar to the barplot above, except that x0 and x1 should take the values 1,2,3.

# Add Error Bar
plot(mean_data,
      type='b',
      pch=19,
      xlab="Species", 
      ylab="Average Sepal.Length", 
      main="Change x-tick Labels",
      xaxt="n")
axis(side=1, at=1:length(mean_data), labels=names(mean_data))

arrows(x0 = seq(length(mean_data)),
       y0 = mean_data - sem_data, # Start of the bar
       x1 = seq(length(mean_data)),
       y1 = mean_data + sem_data, # End of the bar
       angle = 90,                # Makes the ends of the bars flat
       code = 3,                  # Draws ticks at both ends
       length = 0.05, col='red')   # Sets the length of the ticks
Line plot with error bars

Figure 3.14: Line plot with error bars

3.2.3.5 Overlapping Histogram^*

/* The code for creating this plot is a bit convoluted. It is easier to make it using ggplot2. Feel free to skip to 3.3.4.5.

Suppose you are interested in comparing the distribution of sepal lengths among the three iris species. Here are the steps:

  1. Split the data by individual species.

  2. Calculate the range for the x- and y-axis so as to ensure the histograms are in the same scale.

  3. Create a histogram for a data subset from step 1, which becomes the template of the final graph. Additionally, color is used to distinguish one species from the other.

  4. Overlay the histograms for other data subsets on top of the template.

Let’s do it step by step below:

# Step 1
setosa <- subset(iris, Species == "setosa")
versicolor <- subset(iris, Species == "versicolor")
virginica <- subset(iris, Species == "virginica")
# Step 2
x_range <- range(iris$Sepal.Length)
y_range <- c(0, 25) # 25 is arbitrary.

Below is the code for plotting the template histogram, and how additional histograms are added to it. To distinguish species, different color is attributed to different species. It is done by col = rgb(...). The rgb() function stands for red, green, and blue. It takes four parameters: the intensity of red, green, blue, and transparency. All are in the range of 0 and 1. E.g., rgb(1, 0, 0, 0.5) means 100% red, no green and blue at all, and transparency is 50%.

# Step 3
hist(setosa$Sepal.Length,
     main = "Overlapping Histogram of Sepal Length",
     xlab = "Sepal Length",
     col = rgb(1, 0, 0, 0.5), # Red with 50% transparency
     breaks = seq(4,9,by=0.25),
     xlim = x_range,
     ylim = y_range)

# Step 4
hist(versicolor$Sepal.Length, col=rgb(0, 1, 0, 0.5), add = TRUE) # green with 50% transparency
hist(virginica$Sepal.Length, col=rgb(0, 0, 1, 0.5), add = TRUE)  # blue with 50% transparency
Overlapping histogram

Figure 3.15: Overlapping histogram

3.2.4 Contingency Table

If a dataset comprises two categorical variables. A two-dimensional contingency table, or a 2D table, depicts the number of samples found for different categories of the two categorical variables. Their distribution provides the dependency of the two categorical variables. Let’s consider the example from the built-in dataset penguins, which has two contains two categorical variables: species and island.

Create a 2D table as below:

mytab <- with(penguins, table(species, island))
mytab
#>            island
#> species     Biscoe Dream Torgersen
#>   Adelie        44    56        52
#>   Chinstrap      0    68         0
#>   Gentoo       124     0         0

Obviously, species and island are not independent, as you can see not all three species can be found in each island. If species and island are independent, the counts in the contingency table should be below:

indep_tab
#>            island
#> species     Biscoe Dream Torgersen
#>   Adelie        74    55        23
#>   Chinstrap     33    25        10
#>   Gentoo        61    45        19

Two popular graphs are used to visualize counts from a contingency table: a side-by-side barplot, and mosaic plot.

barplot(mytab,
        beside=T,
        xlab = "Species",
        ylab = "Count",
        main = "A 2D Table",
        col = c("skyblue", "lightgreen", "lightcoral"),
        legend.text = rownames(mytab))
Visualization of a 2D Contingency Table. A. A side-by-side barplot. B. Mosaic plot.

Figure 3.16: Visualization of a 2D Contingency Table. A. A side-by-side barplot. B. Mosaic plot.


mosaicplot(mytab, main = "Mosaic Plot of Two Dependent Variables")
Visualization of a 2D Contingency Table. A. A side-by-side barplot. B. Mosaic plot.

Figure 3.17: Visualization of a 2D Contingency Table. A. A side-by-side barplot. B. Mosaic plot.


mosaicplot(indep_tab, main = "Mosaic Plot of Two Independent Variables")
Visualization of a 2D Contingency Table. A. A side-by-side barplot. B. Mosaic plot.

Figure 3.18: Visualization of a 2D Contingency Table. A. A side-by-side barplot. B. Mosaic plot.

3.2.5 Two Numeric Variables

If you want to gain a gut feeling about the relationship between two numeric variables, a scatter plot is the best choice. Note that the relationship can be manifold, ranging from linear to cyclical. So, visualizing how the data points are distributed in a 2D plan is a quick and easy way in gaining intuition of the two variables. Importantly, it should be done before performing any statistical analysis, e.g., linear regression. Here is how to create a scatter plot that shows the relationship between sepal width and sepal length. Note that each point \((x_{i},y_{i})\) is a pair of values from one sample.


par(mfrow=c(1,2))

# Top left
plot(x=iris$Sepal.Width,
     y=iris$Sepal.Length,
     xlab = "Sepal Width",
     ylab = "Sepal Length",
     main = "Default")

# Top right
plot(x=iris$Sepal.Width,
     y=iris$Sepal.Length,
     pch=3,
     xlab = "Sepal Width",
     ylab = "Sepal Length",
     main = "Change the Point Style")
Visualizing the relationship between two numeric variables.

Figure 3.19: Visualizing the relationship between two numeric variables.


plot(x=iris$Sepal.Width,
     y=iris$Sepal.Length,
     pch=3,
     xlab = "Sepal Width",
     ylab = "Sepal Length",
     main = "Segregate Data Points by Color (Species)",
     col = c("skyblue", "lightgreen", "lightcoral")[as.numeric(iris$Species)])

legend("topright",
       legend = levels(iris$Species),
       col = c("skyblue", "lightgreen", "lightcoral"),
       pch = 3,
       title = "Species")
Color data points by species.

Figure 3.20: Color data points by species.

If your aim is to explore linear correlation between the two variables, let’s calculate the regression line between th two variables and add it to the scatter plot.

lm_model <- lm(Sepal.Length ~ Sepal.Width, data=iris)

# intercept
intercept <- lm_model$coefficients[1]

# slope
slope = lm_model$coefficients[2]

print(paste("Slope =", slope, "Intercept =", intercept))
#> [1] "Slope = -0.2233610611299 Intercept = 6.52622255089448"
plot(x=iris$Sepal.Width,
     y=iris$Sepal.Length,
     pch=3,
     xlab = "Sepal Width",
     ylab = "Sepal Length",
     main = "Add a Regression Line",
     col = c("skyblue", "lightgreen", "lightcoral")[as.numeric(iris$Species)])

legend("topright",
       legend = levels(iris$Species),
       col = c("skyblue", "lightgreen", "lightcoral"),
       pch = 3,
       title = "Species")

# Add a line
abline(a=intercept, b=slope, col='red')
Add a regression line

Figure 3.21: Add a regression line

3.2.6 Multi-panel Scatter Plot

A multi-panel scatter plot offers a glimpse of the relationship between all possible pairs of numeric variables. Suppose, you are interested in the pairwise relationship between all numeric variables in iris. It can easily be done using the pairs() plot function.

pairs(iris[,1:4], main="A Multi-Panel Scatter Plot")
A Multi-panel Scatter Plot

Figure 3.22: A Multi-panel Scatter Plot

Here, it concludes our journey of data visualization using base R functions. As you can see, some plots can be made easily using base R (e.g., multi-panel scatter plot), while others are complicated (e.g., barplot with error bars). To compare the pros and cons of the two graphical systems, let’s continue our journey to ggplot.

3.3 ggplot2

ggplot2 is bundled in tidyverse. So, you have to activate tidyverse before using ggplot2. Here, the focus is on using ggplot2to plot the graphs that were created using base R above. We will not repeat the discussion of strengths and weaknesses of each graph in this section. Readers should refer to the corresponding section above for the characteristics and functions of the graphs.

To use ggplot2, load it into R.

library(tidyverse)
#> ── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
#> ✔ purrr     1.1.0     
#> ── Conflicts ────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

3.3.1 Grammer of ggplot2

The two ’g’s of ggplot2 stand for Grammar of Graphics. The graphical widgets are arranged in layers that provides the flexibility for decorating a graph by adding layers to it. The data layer forms the foundation of a graph. The Aesthetics defines the association between data variable(s) and the axes. Geometries are the types of graphs, e.g., line plot, histogram, etc. Facets split a graph into multi-panels. Statistics defines how to aggregate data, e.g., in proportion or take the face value in a barplot. Coordinates supports an X-Y Cartesian coordinate system, and a polar coordinate system. (See the Pie Chart section below) Themes offers different color schemes, font type, font size, etc.

Anatomy of `ggplot2`

Figure 3.23: Anatomy of ggplot2

3.3.2 One Numeric Variable

You can plot a histogram of the numeric variable to see its distribution, such as the spread, the peak, symmetry, and flatness. If your concern is robust statistics, a boxplot will do the job.

3.3.2.1 Histogram

Let’s redraw the histogram in Figure (3.1)) using ggplot2. The first line is the data layer that specifies the data source, and it must be a data frame (or tibble), e.g., iris. The aesthetic option designates the x-axis for Sepal.Width. The second line is the geometry of the graph. For a histogram, the function name is geom_histogram(). Here is the bare minimum of a ggplot2 histogram:

# Default
ggplot(data=iris, aes(x=Sepal.Width)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with
#> `binwidth`.
`ggplot2` Histogram.

Figure 3.24: ggplot2 Histogram.

It is always a good idea to give the graph a title at the top so people what the key message is. Equally important is the x- and y-labels. This task can be done by the labs() function. By default, the title is left-justified. To center it, see the last line and the comment above it.

# Overwrite default x-/y-label and title
ggplot(data=iris, aes(x=Sepal.Width)) +
  geom_histogram() +
  labs(x="Sepal Width", y="Count", title="Histogram of Sepal Width") +
  # hjust = 0.5 means the horizontal adjustment is the midpoint.
  theme(plot.title = element_text(hjust = 0.5))
#> `stat_bin()` using `bins = 30`. Pick better value with
#> `binwidth`.
`ggplot2` Histogram - specify your own labels and title.

Figure 3.25: ggplot2 Histogram - specify your own labels and title.

Change the fill color and border color. It can be done by providing additional parameters to the geom_histogram().

# fill = filled color
# color = border color
ggplot(data=iris, aes(x=Sepal.Width)) +
  geom_histogram(fill="skyblue", color = "white") +
  labs(x="Sepal Width", y="Count", title="Color Options") +
  # hjust = 0.5 means the horizontal adjustment is the midpoint.
  theme(plot.title = element_text(hjust = 0.5))
#> `stat_bin()` using `bins = 30`. Pick better value with
#> `binwidth`.
`ggplot2` Histogram - choose your own color

Figure 3.26: ggplot2 Histogram - choose your own color

The bin width of the histogram is calculated automatically by ggplot2 according to the range of x. To overwrite it by the breaks parameter.

# breaks
ggplot(data=iris, aes(x=Sepal.Width)) +
  geom_histogram(fill="skyblue", color = "white", breaks=seq(2,4.5,by=0.1)) +
  labs(x="Sepal Width", y="Count", title="`breaks` option") +
  # hjust = 0.5 means the horizontal adjustment is the midpoint.
  theme(plot.title = element_text(hjust = 0.5))
`ggplot2` Histogram - choose your own color

Figure 3.27: ggplot2 Histogram - choose your own color

Optional topic

As you might notice that each of the four histograms occupies the entire plotting area instead of displaying in a 2x2 grid as Figure (3.1)). It is because ggolot2 does not work with par(mfrow(2,2)). To divide the plotting area into a grid that is compatible with ggplot2, use the patchwork package. Use the menu option Tools \(\rightarrow\) Install Packages to install it the very first time.

Another new concept is that a graph can be stored in a variable so it can be displayed later in the grid. In the example below, the four histograms above are initially stored in variables p1, p2, p3, and p4. See the example below:


library(patchwork)

# Default
p1 <- ggplot(data=iris, aes(x=Sepal.Width)) +
        geom_histogram()

# With labels and title
p2 <- ggplot(data=iris, aes(x=Sepal.Width)) +
        geom_histogram() +
        labs(x="Sepal Width", y="Count", title="Histogram of Sepal Width") +
        # hjust = 0.5 means the horizontal adjustment is the midpoint.
        theme(plot.title = element_text(hjust = 0.5))

# fill = filled color
# colour = border color
p3 <- ggplot(data=iris, aes(x=Sepal.Width)) +
        geom_histogram(fill="skyblue", color = "white") +
        labs(x="Sepal Width", y="Count", title="Color Options") +
        # hjust = 0.5 means the horizontal adjustment is the midpoint.
        theme(plot.title = element_text(hjust = 0.5))

# Change the number of bins
p4 <- ggplot(data=iris, aes(x=Sepal.Width)) +
        geom_histogram(fill="skyblue", color = "white", breaks=seq(2,4.5,by=0.1)) +
        labs(x="Sepal Width", y="Count", title="`breaks` option") +
        # hjust = 0.5 means the horizontal adjustment is the midpoint.
        theme(plot.title = element_text(hjust = 0.5))

# Arrange them in a 2x2 grid
# '+' places plots next to each other, '/' stacks them
(p1 + p2) / (p3 + p4)
#> `stat_bin()` using `bins = 30`. Pick better value with
#> `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with
#> `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with
#> `binwidth`.
`ggplot2` Grid System by `patchwork`. Top left: Default (bare minimum). Top right: with labels and title. Bottom left: change the color. Bottom right: change the number of bins.

Figure 3.28: ggplot2 Grid System by patchwork. Top left: Default (bare minimum). Top right: with labels and title. Bottom left: change the color. Bottom right: change the number of bins.

3.3.2.2 Boxplot

For an anatomy of a boxplot, reference the Figure (3.3)). The geometry function of a boxplot is geom_boxplot(). Unlike a histogram, the numeric variable is mapped to the y-axis for a vertical boxplot. If you want a horizontal boxplot, map the numeric variable to the x-axis, and don’t forget to specify the label for the x-axis.


library(patchwork)

# Default
p1 <- ggplot(data=iris, aes(y=Sepal.Width)) +
        geom_boxplot()

# Add y-label and title
p2 <- ggplot(data=iris, aes(y=Sepal.Width)) +
        geom_boxplot() +
        labs(y="Sepal Width", title = "Boxplot of Sepal Width") +
        theme(plot.title = element_text(hjust = 0.5))

# Change color
p3 <- ggplot(data=iris, aes(y=Sepal.Width)) +
        geom_boxplot(fill="skyblue", outlier.color = "red") +
        labs(y="Sepal Width", title = "Color options") +
        theme(plot.title = element_text(hjust = 0.5))

# Horizontal boxplot
p4 <- ggplot(data=iris, aes(x=Sepal.Width)) +
        geom_boxplot() +
        labs(x="Sepal Width", title = "Horizontal Boxplot") +
        theme(plot.title = element_text(hjust = 0.5))

(p1 + p2)/(p3 + p4)
Boxplot using `ggplot2`. Top left: default (bare minimum). Top right: with y-label and title. Bottom left: change color. Bottom right: horizontal boxplot.

Figure 3.29: Boxplot using ggplot2. Top left: default (bare minimum). Top right: with y-label and title. Bottom left: change color. Bottom right: horizontal boxplot.

3.3.3 One Categorical Variable

3.3.3.1 Barplot

Here we will show how to make a barplot and a piechart using ggplot2.


library(patchwork)

# Default
p1 <- ggplot(data=iris, aes(x=Species)) +
        geom_bar()

# With labels and title
p2 <- ggplot(data=iris, aes(x=Species)) +
        geom_bar() +
        labs(x="Species", y="Count", title = "A Barplot of Species") +
        theme(plot.title = element_text(hjust = 0.5))

# Color option
p3 <- ggplot(data=iris, aes(x=Species)) +
        geom_bar(fill="skyblue", color="blue") +
        labs(x="Species", y="Count", title = "A Barplot of Species") +
        theme(plot.title = element_text(hjust = 0.5))

# Horizontal barplot
p4 <- ggplot(data=iris, aes(y=Species)) +
        geom_bar(fill="skyblue", color="blue") +
        labs(y="Species", x="Count", title = "A Horizontal Barplot of Species") +
        theme(plot.title = element_text(hjust = 0.5))

(p1 + p2)/(p3 + p4)
Barplot using `ggplot2`. Top left: default (bare minimum).

Figure 3.30: Barplot using ggplot2. Top left: default (bare minimum).

3.3.3.2 Pie Chart

Although ggplots is portrayed as an improved method to compose graphs based upon its grammatical architecture, it does not always make plotting as simple as you might think, like the histograms and barplots illustrated above. A pie chart is one of those graphs in which geom_pie() does not exist! Plotting it is not as straightforward as the base R pie() function, as shown in Figure (3.7)). Despite that, it is worth to see how a pie chart can be plotted using foundational building-block functions. To give you a heads-up, ggplot2 views a pie chart as a twisted barplot. There is an internal logic to justify such a view but we will not discuss it in detail.

Suppose we want to create a pie chart like the one below:

Pie chart or a donut chart

Figure 3.31: Pie chart or a donut chart

Here we will walk through each of the intermediate steps until getting the final pie chart.

Let’s begin with a stacked barplot, where the bars are stacked up into a single column, and the fill-color is an attribute to distinguish different iris species (fill=Species). The stacked bar is placed at the position 2 units on the right from the origin (aes(x=2)), but the exact position does not matter.

ggplot(data=iris, aes(x=2, fill=Species)) +
  geom_bar()
A stacked barplot

Figure 3.32: A stacked barplot

Now, use your imagination to picture that the stacked bar is bent into a ring. Does it look like the pie chart in Figure (3.32))? To way to “bend” it in ggplot2 is to change the x-y Cartesian coordinate system to the polar (circular) coordinate system in which a point is characterized by two parameters: the distance from the origin (\(r\)), and the angle (\(\theta\)) from the horizontal polar axis in an counter-clockwise direction.

Polar Coordinate. (generated by Gemini)

Figure 3.33: Polar Coordinate. (generated by Gemini)

By adding this line coord_polar(theta="y", start = 0) in the code below, it transforms the height of the bar for each species into an angle (\(\theta\)) in the polar coordinate, where the zero position is the 12 o’clock (start = 0). In this example, one third of the samples is setosa. As such, the angle \(\theta\) is \(360^{\circ}\times\frac{1}{3}=120^{\circ}\). If the proportion is a quarter, the angle \(\theta=360^{\circ}\times\frac{1}{4}=90^{\circ}\).

ggplot(data=iris, aes(x="", fill=Species)) +
  geom_bar() + 
  coord_polar(theta="y", start = 0)
A Piechart by changing to the polar coordinate system.

Figure 3.34: A Piechart by changing to the polar coordinate system.

The final step is to open a hole in the pie chart if you like, for aesthetic reasons. But this modification is optional.

ggplot(iris, aes(x = 2, fill = Species)) +
  geom_bar(width = 1, stat = "count") +
  coord_polar(theta = "y", start = 0) +
  # This is the key part that creates the hole
  xlim(0.5, 2.5) +
  labs(title = "Donut Chart of Iris Species")
Piechart using `ggplot2`

Figure 3.35: Piechart using ggplot2

If you prefer to have a clean background, add the line theme_void(). Finally, here’s the desired pie chart. What a journey!

ggplot(iris, aes(x = 2, fill = Species)) +
  geom_bar(width = 1, stat = "count") +
  coord_polar(theta = "y", start = 0) +
  # This is the key part that creates the hole
  xlim(0.5, 2.5) +
  theme_void() +
  labs(title = "Donut Chart of Iris Species")
Piechart using `ggplot2` without no background.

Figure 3.36: Piechart using ggplot2 without no background.

3.3.4 One Categorical and One Numeric Variables

3.3.4.1 Side-by-side Boxplot

Here you want to compare the robust statistics among different species.

library(patchwork)

p1 <- ggplot(data=iris, aes(x=Species, y=Sepal.Length)) +
        geom_boxplot() +
        labs(title = "A Side-by-Side Boxplot") +
        theme(plot.title = element_text(hjust = 0.5))

p2 <- ggplot(data=iris, aes(x=Sepal.Length, y=Species)) +
        geom_boxplot() +
        labs(title = "A Horizontal Side-by-Side Boxplot") +
        theme(plot.title = element_text(hjust = 0.5))

p1/p2
Side-by-side boxplot

Figure 3.37: Side-by-side boxplot

3.3.4.2 Side-by-side Barplot

Here, you want to compare the average Sepal.Length among iris species. As discussed above, you calculate the averages and store them in a separate table.

iris_summary1 <- iris %>%
                  group_by(Species) %>%
                  summarize(mean_sepal_length = mean(Sepal.Length))
iris_summary1
#> # A tibble: 3 × 2
#>   Species    mean_sepal_length
#>   <ord>                  <dbl>
#> 1 virginica               6.59
#> 2 setosa                  5.01
#> 3 versicolor              5.94

And then, you can plot the summary table. The geometry function is geom_bar(stat = "identity"). The stat = "identity" parameter tells the geom_bar() function to set up the height of the bars according to the face value of mean_sepal_length.

ggplot(data=iris_summary1, aes(x=Species, y=mean_sepal_length)) +
  geom_bar(stat = "identity") +
  labs(x="Species", y="Average Sepal Length", title = "Side-by-side Barplot") +
  theme(plot.title = element_text(hjust = 0.5))
Side-by-side barplot

Figure 3.38: Side-by-side barplot

3.3.4.3 Side-by-side Barplot with Error Bar

The calculation of the error bar (i.e., the standard error se) requires standard deviation and sample size. Once se is calculated, it can be used to calculate the minimum (lower) and maximum (upper) of the error bar.

iris_summary2 <- iris %>%
                  group_by(Species) %>%
                  summarize(mean_sepal_length = mean(Sepal.Length),
                            N=n(),
                            se=sd(Sepal.Length)/sqrt(N)) %>%
                  # add the y minimum and y maximum columns
                  mutate(ymin = mean_sepal_length - se,
                         ymax = mean_sepal_length + se)
iris_summary2
#> # A tibble: 3 × 6
#>   Species    mean_sepal_length     N     se  ymin  ymax
#>   <ord>                  <dbl> <int>  <dbl> <dbl> <dbl>
#> 1 virginica               6.59    50 0.0899  6.50  6.68
#> 2 setosa                  5.01    50 0.0498  4.96  5.06
#> 3 versicolor              5.94    50 0.0730  5.86  6.01
ggplot(data=iris_summary2, aes(x=Species, y=mean_sepal_length)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2, color='red') +
  labs(x="Species", y="Average Sepal Length", title = "Side-by-side Barplot") +
  theme(plot.title = element_text(hjust = 0.5))
Side-by-side barplot with error bar

Figure 3.39: Side-by-side barplot with error bar

3.3.4.4 Line Graph with Error Bar

iris_summary3 <- iris %>%
                  group_by(Species) %>%
                  summarize(mean_sepal_length = mean(Sepal.Length),
                            N=n(),
                            se=sd(Sepal.Length)/sqrt(N)) %>%
                  mutate(ymin=mean_sepal_length-se,
                         ymax=mean_sepal_length+se)
iris_summary3
#> # A tibble: 3 × 6
#>   Species    mean_sepal_length     N     se  ymin  ymax
#>   <ord>                  <dbl> <int>  <dbl> <dbl> <dbl>
#> 1 virginica               6.59    50 0.0899  6.50  6.68
#> 2 setosa                  5.01    50 0.0498  4.96  5.06
#> 3 versicolor              5.94    50 0.0730  5.86  6.01
ggplot(iris_summary3, aes(x=Species, y=mean_sepal_length, group = 1)) +
  geom_line(color = 'skyblue') +
  geom_point(size = 3, color = 'skyblue') +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.1, linewidth = 1) +
  labs(x="Species", y="Sepal Length", title = "Line Plot with Erorr Bar") +
  theme(plot.title = element_text(hjust = 0.5))
Line plot with error bar

Figure 3.40: Line plot with error bar

3.3.4.5 Overlapping Histogram

Like the base R side-by-side histogram, you will set the transparency to 0.5 and bin width to 0.25. Note that there is no need to care about the x- and y-limit in ggplot2.

ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
  geom_histogram(alpha=0.5, binwidth = 0.25) +
  labs(x="Sepal Length", title = "Overlapping Histogram") +
  theme(plot.title = element_text(hjust = 0.5))
Side-by-side Histogram

Figure 3.41: Side-by-side Histogram

3.3.5 Contingency Table

3.3.6 Two Numeric Variables

Here, you are going to visualize the relationship between two numeric variables using a scatter plot. As said above, the revealed relationship can be linear and non-linear. Visualizing how data points are distributed on a scatter plot is the most intuitive way to understand their true relationship.

ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length)) +
  geom_point() +
  labs(x="Sepal Width", y="Sepal Length", title = "Scatter Plot") +
  theme(plot.title = element_text(hjust = 0.5))
Visualizing the relationship between two numeric variables using a scatter plot.

Figure 3.42: Visualizing the relationship between two numeric variables using a scatter plot.

If you wanted to emphasize their linear relationship, it can be done convincingly by adding a regression line on top of the scatter plot. In ggplot2, you can spare the effect to determine what the slope and intercept are by performing a linear regression. Instead, it can be done by adding geom_smooth(). See the code below:

ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x="Sepal Width", y="Sepal Length", title = "Scatter Plot with the Regression Line") +
  theme(plot.title = element_text(hjust = 0.5))
#> `geom_smooth()` using formula = 'y ~ x'
A scatter plot overlaid with a regression line.

Figure 3.43: A scatter plot overlaid with a regression line.

If you see this message “[38;5;232mgeom_smooth() using formula = ‘y ~ x’.[39m”, it is harmless, informational. ggplot2 is just to let you know what the response and predictor are.

3.3.7 Multi-panel Scatter Plot

3.4 Summary