2 Week2: Data Visualization I

library(gapminder)
library(tidyverse)
library(hrbrthemes)
library(nlme)

2.0.1 What is ggplot2?

  • The ggplot2 package is an R package for data visualization. With ggplot2, you can create polished plots with a few lines of code, as in the example below. Although the code looks complicated for now, you will be able to create plots like this one very soon.
# Code chunk 1: an example plot using ggplot2
# This plot comes from 
# https://www.r-graph-gallery.com/320-the-basis-of-bubble-plot.html

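# create data: keep 2007 only, express population in millions (to match the
# "Population (M)" legend title), and sort so the largest bubbles are drawn first
gdp <- gapminder %>% 
  filter(year == 2007) %>% 
  dplyr::select(-year) %>%
  mutate(pop = pop / 1e6) %>%
  arrange(desc(pop)) %>%
  mutate(country = factor(country, country))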

# create a bubble plot
ggplot(data = gdp, mapping = aes(x = gdpPercap, y = lifeExp, size = pop, fill = continent)) +
  geom_point(alpha = 0.5, shape = 21, color = "black") +
  scale_size(range = c(.1, 24), name = "Population (M)") +
  theme_minimal() +
  theme(legend.position = "right") +
  labs(
     subtitle = "Life Expectancy vs. GDP per Capita", 
     y = "Life Expectancy", 
     x = "GDP per Capita", 
     title = "Scatter Plot", 
     caption = "Source: gapminder"
  )

  • What can you tell from the plot?

  • ggplot2 is part of the tidyverse, a collection of R packages designed for data science.

  • You can find the official R documentation for ggplot2 (or any R package)

2.0.2 Graphical components

  • ggplot2 (Hadley Wickham 2016) was developed to create a graphic by combining a few graphical components (e.g., data, coordinate systems, geometric objects, aesthetics, facets, themes) based on the grammar of graphics (the gg in ggplot2 stands for grammar of graphics). Hadley Wickham (2016) explains the grammar of graphics as follows:

“Wilkinson (2005) created the grammar of graphics to describe the deep features that underlie all statistical graphics. The grammar of graphics is an answer to a question: what is a statistical graphic? The layered grammar of graphics (Wickham, 2009) builds on Wilkinson’s grammar, focussing on the primacy of layers and adapting it for embedding within R. In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic.”

Hadley Wickham (2016)

  • The graphical components of the grammar of graphics include
    • the data you want to visualize
    • the geometric objects (geoms for short; e.g., circles, lines, polygons etc.) that appear on the plot
    • a set of aesthetic mappings that describe how variables in the data are mapped to aesthetic properties (e.g., color, shape, linetype) of the geometric objects
    • a statistical transformation (stats for short) used to calculate the data values used in the plot
    • a position adjustment for arranging each geometric object on the plot
    • a scale (e.g., range of values) for each aesthetic mapping used
    • a coordinate system (coords for short) used to organize the geometric objects
    • the facets or groups of data shown in different plots

2.0.3 ggplot2 syntax

  • The basic idea of the ggplot2 package is to build a statistical graph by adding layers representing geometric objects (e.g., points, lines), and the aesthetic properties of the geometric objects (e.g., colors, line types) can be controlled by mapping (or relating) data to the aesthetic properties.

A statistical graph can be created by adding layers

  • You can create a plot by combining the graphical components of the grammar of graphics using the following syntax. Essentially, you add layers representing graphical components to a plot with the + operator:
    • ggplot(data = <DATA>) +
      <GEOM FUNCTION>(mapping = aes(<AESTHETIC MAPPING>), stat = <STAT>, position = <POSITION>) +
      <COORDINATE FUNCTION> +
      <FACET FUNCTION> +
      <SCALE FUNCTION> + <THEME FUNCTION>

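As a rough illustration, the template above might be filled in as follows with the built-in mpg data. The particular choices here (jittered points, faceting by class, the Dark2 palette, theme_minimal()) are assumptions made for this sketch, not requirements of the syntax. In practice, most plots simply rely on the defaults for the stat, the position, and the coordinate system.

```r
library(ggplot2)

filled_template <- ggplot(data = mpg) +
  # <GEOM FUNCTION> with its aesthetic mapping, stat, and position
  geom_point(
    mapping = aes(x = displ, y = hwy, color = drv),
    stat = "identity",      # plot the values as they are
    position = "jitter"     # add a little noise to reduce overlap
  ) +
  coord_cartesian() +                        # <COORDINATE FUNCTION>
  facet_wrap(~ class) +                      # <FACET FUNCTION>
  scale_color_brewer(palette = "Dark2") +    # <SCALE FUNCTION>
  theme_minimal()                            # <THEME FUNCTION>

filled_template
```
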
2.0.4 Learning objectives of this lecture

  • Students can explain the meaning of each graphical component
  • Students understand the basic ggplot2 syntax for combining graphical components
  • Students can create common plots using ggplot2

2.0.5 Installing and loading ggplot2

  • You can install the ggplot2 package (or any package)
    • using install.packages() function
    • using the Packages pane
# Code chunk 2: install packages
# install.packages() will download and install packages 
# from CRAN-like repositories or from local files.
install.packages("ggplot2")
  • You only need to install the ggplot2 package once, but you need to load it into memory in every new R session in which you use it.
# Code chunk 3: load packages
# library() loads a package into memory
library(ggplot2)
  • The ggplot2 package is a part of tidyverse. You can install all the packages in the tidyverse and load the core tidyverse packages by running the following code.
# Code chunk 4
install.packages("tidyverse")
library(tidyverse) # ggplot2 will also be available
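
If you rerun a script often, you may prefer not to reinstall a package every time. The following is a minimal sketch of one common pattern, assuming a CRAN mirror is reachable whenever an installation is actually needed.

```r
# Install ggplot2 only if it is not installed yet, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)

# Confirm that the package is attached and check which version is in use.
"package:ggplot2" %in% search()
packageVersion("ggplot2")
```
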

2.0.6 Useful resources for learning ggplot2

2.1 Graphical components of ggplot2

  • Data
  • Geometric objects (geoms for short)
  • Aesthetic mappings
  • Statistical transformations (stats for short)
  • Scales
  • A coordinate system (coord for short)
  • Faceting

2.1.1 Data

  • “The data are what you want to visualize.” (Hadley Wickham 2016)

  • In order to create plots from data, you first need to import (or read) your own data into R. We will talk about data importing using tidyverse’s readr package later. For now, let’s just assume that you already imported your data into R.

  • After importing your data, you may need to get it into a form that is useful for visualization and modeling. This process is often called data wrangling (H. Wickham and Grolemund 2016). For example, data wrangling includes selecting variables and rows, creating new variables, and merging data sets. The tidyverse provides a set of R packages that help you wrangle your data. In later lectures of this course, you will learn the dplyr, tidyr, purrr, and stringr packages in the tidyverse for data wrangling. For now, let’s assume that we already have our data in the desired format and just focus on learning ggplot2.
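
To give a first impression of what data wrangling looks like, here is a small sketch using dplyr on the diamonds data that ships with ggplot2. The cut-off of 1 carat and the new variable name price_per_carat are arbitrary choices for illustration; the functions themselves are covered in later lectures.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

small_diamonds <- diamonds %>%
  filter(carat < 1) %>%                      # select rows
  select(carat, cut, price) %>%              # select variables
  mutate(price_per_carat = price / carat)    # create a new variable

small_diamonds
```
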

# Code chunk 5
# diamonds is a built-in dataset in ggplot2,
# meaning that you can use it as soon as you load the ggplot2 package.
# ?diamonds displays the help document for the dataset.
# diamonds is a tibble, which is a data structure used in the tidyverse.
# We will talk about data structures later.
diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows
  • The interactive code chunk below was created by the learnr package. You can directly execute R code in this code chunk.
# Please type `diamonds` below and run it.
# There are 53,940 rows and 10 columns. Can you find these numbers in the output when you type `diamonds`?
# "carat cut color ... " are the names of the variables.
# "<dbl> <ord> <ord> ... " are the data types, which we will talk about more later.
# <dbl> indicates `double` (a real number such as 1.5)
# <int> indicates `integer` (a whole number such as 326)
# <ord> indicates `ordered factor` (e.g., small < medium < large)
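
Besides printing the tibble, you can query the same facts directly. The sketch below uses only base R functions on the diamonds data; it is one possible way to check the dimensions, names, and types, not the only one.

```r
library(ggplot2)

dim(diamonds)          # number of rows and number of columns
names(diamonds)        # variable names
class(diamonds$carat)  # a <dbl> column has class "numeric"
class(diamonds$cut)    # an <ord> column is an ordered factor
levels(diamonds$cut)   # its levels, from lowest to highest
```
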

2.1.2 Geometric objects (geoms)

  • “Geometric objects, geoms for short, represent what you actually see on the plot” (Hadley Wickham 2016). For example, geometric objects include points, lines, bars, etc in a plot.

  • Geometric objects can be added to a plot as a new layer using a geom function (i.e., geom_*()). For example, data points, which are geometric objects, can be added to a plot as a new layer using the geom_point() function.

  • In the cheatsheet, you will find many geom functions and their corresponding geometric objects. For example, we often use the following geom functions:

    • geom_point() for scatter plot
    • geom_line() for line plot
    • geom_histogram() for histogram
  • In ggplot2, you can create plots by

    1. calling the ggplot() function to begin a plot (or to initialize a ggplot object or to create a blank canvas) that you can add layers to
      • "Object" is a technical term. In fact, everything in R (e.g., vectors, matrices, data frames, lists, functions) is an object, that is, a data structure holding some values and functions. In computer science, “a data structure is a way to store and organize data in order to facilitate access and modifications” (Cormen et al. 2009). The official definition of an object in R can be found here. We will talk more about data structures in R later.
    2. specifying aesthetic mappings to specify how you want to map variables to visual aspects of geometric objects
      • Each geom has its own aesthetic properties. For example, the aesthetic properties of points in a plot include points’ color, size, and shape. You can change the color, size, and shape of points in a plot using aesthetic mapping.
    3. adding a new layer representing a geometric object to a plot using a geom function
  • In short, you create a ggplot object using the ggplot() function and add additional layers using the geom functions as follows (check the “Basics” section in the cheatsheet):

# Code chunk 6: initialize a ggplot object
# ggplot() initializes a ggplot object.
# In ggplot(), you need to specify 1) a dataset, and 2) aesthetic mapping. 
# In the example below, the x-position and y-position are mapped to the carat and price variables. 
# No geometric object yet. 
ggplot(data = diamonds, aes(x = carat, y = price))

# Code chunk 7: add a geometric object
# geom_point() adds a new layer to a plot by drawing points to produce a scatter plot
ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point()

# Code chunk 8: add another geometric object
# geom_smooth() adds an additional layer to the plot by drawing a smoothed line to capture the trend in the scatterplot
ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
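
One way to see that each geom function really adds a layer is to count the layers stored in the ggplot object after each step. This sketch relies on the `layers` element of a ggplot object, which is an implementation detail rather than something you need in everyday plotting; the object names are made up for this example.

```r
library(ggplot2)

canvas      <- ggplot(data = diamonds, aes(x = carat, y = price))  # no layers yet
with_points <- canvas + geom_point()                               # adds one layer
with_smooth <- with_points + geom_smooth()                         # adds another layer

# Count the layers stored in each ggplot object.
n_layers <- c(
  canvas      = length(canvas$layers),
  with_points = length(with_points$layers),
  with_smooth = length(with_smooth$layers)
)
n_layers
```

Because each step returns a new ggplot object, you can keep an intermediate plot and try different layers on top of it without retyping the whole call.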

Exercise 2-1. Your first ggplot2 plot

  • Let’s create your first plot using the mtcars data, which is a built-in dataset in R.

  • First, let’s check the variables in the mtcars by just typing mtcars in the interactive R code chunk below.

# type `mtcars` to display the dataset
# `?mtcars` will display a help document for `mtcars` dataset
  • Second, create a scatter plot of disp (x-axis) against mpg (y-axis) from the mtcars dataset, with a smoothed line like the one in Code chunk 8. If you type ?mtcars, you will find that mtcars is a dataset from Motor Trend car road tests, mpg is miles per gallon (how many miles a car can travel on one gallon of fuel), and disp is the engine displacement (engine size) of a car.
# check the Code chunk 8
  • What can you tell from the plot about the relationship between fuel efficiency (i.e., mpg) and engine size (i.e., disp)?

2.1.3 Aesthetic mappings

  • Once you add geometric objects (e.g., points as in the above example) to a plot, you can change the visual (or aesthetic) properties of the geometric objects (e.g., the color of the points). You can find some aesthetic properties in R here (e.g., linetypes of a line, shapes of a point, color of a point).

  • Often, you may want to change the visual (or aesthetic) properties of the geometric objects by mapping variables in the data to aesthetic properties of geoms using the following syntax: <aesthetic property> = <variable name>. For example, shape = cyl maps the shape aesthetic of points (e.g., circle, triangle, cross) to the cyl variable (number of cylinders in the mpg dataset), so that you can encode additional information in a plot using different point shapes. Because cyl is stored as a number and shape needs a discrete variable, in practice you would write shape = factor(cyl).

  • “A set of aesthetic mappings describe how variables in the data are mapped to aesthetic properties of the layer” (Hadley Wickham 2016)

  • “To describe the way that variables in the data are mapped to things that we can perceive on the plot (the "aesthetics"), we use the aes function. The aes function takes a list of aesthetic-variable pairs like these: aes(x = weight, y = height, colour = age). Here we are mapping x-position to weight, y-position to height and colour to age. The first two arguments can be left without names, in which case they correspond to the x and y variables.” (Hadley Wickham 2016)

  • Aesthetic properties in ggplot2 include

    • position (e.g., x and y coordinates)
    • color (outside color)
    • fill (inside color)
    • shape (of points; e.g., circle, triangle)
    • linetype (e.g., solid line, dotted line)
    • size
    • alpha (transparency)
# Code chunk 9: aesthetic mapping
# `color = color` maps the variable `color` in the dataset to the `color` aesthetics of points to encode further information in the graphic. 
ggplot(data = diamonds, aes(x = carat, y = price, color = color)) + geom_point()

# Code chunk 10: aesthetic mapping
# `shape = cut` maps the `shape` aesthetic of points to the variable `cut` in the dataset to encode further information in the graphic. 
# Note that the graphic is not very informative because the points are overplotted. Sometimes, faceting can help with overplotting.
# We will talk about faceting later in this lecture. 
ggplot(data = diamonds, aes(x = carat, y = price, shape = cut)) + geom_point()
## Warning: Using shapes for an ordinal variable is not advised

# Code chunk 11: set aesthetic properties to a constant 
# We can set aesthetic properties to a constant outside aes() function. 
ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point(color = "blue")
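
A common slip is to put a constant inside aes(). The sketch below contrasts the two cases using the small mpg dataset from ggplot2 and layer_data(), which returns the values that are actually drawn. The object names p_set and p_map are made up for this example.

```r
library(ggplot2)

# Setting: the color is a constant, given outside aes().
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Mapping: "blue" inside aes() is treated as data (a variable with a single
# value), so the color scale picks the color and a legend is added.
p_map <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))

# Compare the colors that each plot will actually draw.
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

In the second plot the points are not blue: the string "blue" is just a category label, and the default color scale assigns it a color of its own.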

Exercise 2-2. Aesthetic mapping

  • mpg is similar to mtcars but is a built-in tibble in ggplot2. Plot hwy (highway miles per gallon; y-axis) against displ (engine displacement in liters; x-axis).
    • Recall ggplot(data = <dataset name>, aes(<aesthetic mapping>)) + geom_*() ... is a basic format for creating a plot in ggplot2 (check the cheatsheet).
    • x and y positions are also aesthetic properties in ggplot2, meaning that you need x = <variable name for x axis> and y = <variable name for y axis>.
# check Exercise 2-1
  • Based on the plot from above, map the cyl variable (number of cylinders) to color aesthetics.
# check Code chunk 9 and 10
  • Explain what happens. Can you see that additional information is now encoded in the graph?

  • This is what happens when you map the x, y, and color aesthetics to the displ, hwy, and cyl variables: R creates a new dataset that contains all the data to be displayed on the plot.

x y color
1.8 29 4
1.8 29 4
2.0 31 4
2.0 30 4
2.8 26 6
2.8 26 6
3.1 27 6
1.8 26 4
1.8 25 4
2.0 28 4
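
The table above can be reproduced by hand, which may make the idea of a mapping more concrete. This sketch only mimics the result with dplyr; it is not how ggplot2 builds the dataset internally.

```r
library(ggplot2)  # provides the mpg data
library(dplyr)

# Rename the mapped variables to the aesthetics they are mapped to.
mapped <- mpg %>%
  select(x = displ, y = hwy, color = cyl) %>%
  head(10)

mapped
```
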

2.1.4 Scales

  • In the previous table, the values 4, 6, … in the color column mean nothing to a computer as colors. Computers need a color specification such as the hexadecimal code #FF6C91. The mapping from the data values to the final values that computers can use to display aesthetics is called a scale. In this sense, a scale controls the aesthetic mapping from data to aesthetics.
x y color
1.8 29 #FF6C91
1.8 29 #FF6C91
2.0 31 #FF6C91
2.0 30 #FF6C91
2.8 26 #00C1A9
2.8 26 #00C1A9
3.1 27 #00C1A9
1.8 26 #FF6C91
1.8 25 #FF6C91
2.0 28 #FF6C91
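
You can inspect the result of this conversion yourself with layer_data(), which returns a layer's data after the scales have been applied. In the sketch below, cyl is wrapped in factor() so that a discrete color scale is used; that choice is an assumption of this example. The exact color codes depend on the scale and on the number of categories, so they need not match the codes in the table above.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()

# The colour column holds the color codes that will be drawn.
built <- layer_data(p)
head(built[, c("x", "y", "colour")], 10)

# One color code per distinct value of cyl.
unique(built$colour)
```
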
  • To control the aesthetic mapping, you can use a scale function (i.e., scale_*()). Please check the Scales section in the cheatsheet. For example, scale_x_continuous() and scale_y_continuous() allow us to change the default scales for the continuous x and y aesthetics. If you check the reference guide, you will find that these two functions allow you to change the name, break points, limits, etc. of the continuous x and y axes.
# Code chunk 12: create a scatter plot
# you assign a name `p1` to a ggplot object. 
# By assigning a ggplot object to the variable `p1`, you can easily add different layers to `p1`. 
p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point()
p1

# Code chunk 13: change the axis labels
# add an additional layer to p1 object
p1 + 
  scale_x_continuous("Engine displacement (L)") + 
  scale_y_continuous("Highway MPG")

# Code chunk 14: change the axis labels
# use the short-cut labs()
p1 + labs(x = "Engine displacement (L)", y = "Highway MPG")

# Code chunk 15: modify the axis limits
p1 + scale_x_continuous(limits = c(2, 6))
## Warning: Removed 27 rows containing missing values (geom_point).

# Code chunk 16: modify the axis limits
# use the short hand functions `xlim()` and `ylim()`
p1 + xlim(2, 6)
## Warning: Removed 27 rows containing missing values (geom_point).
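
The warnings above appear because limits set on a scale discard the points that fall outside them. If you only want to zoom in without dropping data, coord_cartesian() is an alternative. The following sketch compares the two; the object names are made up for this example.

```r
library(ggplot2)

scatter <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Number of points outside the interval [2, 6]; these are the rows
# reported as removed in the warnings above.
n_outside <- sum(mpg$displ < 2 | mpg$displ > 6)
n_outside

# Scale limits: out-of-range points are removed before drawing.
scatter + xlim(2, 6)

# Coordinate limits: the view is zoomed, but no data are removed.
scatter + coord_cartesian(xlim = c(2, 6))
```

The difference matters most when a layer computes something from the data, such as geom_smooth(): with scale limits the fit uses only the remaining points, while with coord_cartesian() it uses all of them.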

# Code chunk 17: modify the axis breaks
# choose where the tick marks appear
p1 + scale_x_continuous(breaks = c(2, 4, 6))

# Code chunk 18: choose your own labels
p1 + scale_x_continuous(breaks = c(2, 4, 6), labels = c("two", "four", "six"))

# Code chunk 19: add title and axis labels
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color=drv)) + 
  geom_point() + geom_smooth(method="lm") + 
  labs(title ="MPG vs Engine size", x = "Engine size", y = "MPG")
## `geom_smooth()` using formula 'y ~ x'

References

Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms. MIT Press.

Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media. https://books.google.co.kr/books?id=vfi3DQAAQBAJ.