Chapter 3 Data visualization

Tien Thuy Bui, Msc

(Edited by: Nhan Thi Ho, MD, PhD)

3.1 Part I: Data visualization with ggplot2

3.1.1 7 layers of graphics

ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a data set, a set of geoms-visual marks that represent data points, and a coordinate system (Wickham et al. 2024).

Layer Description
Data The dataframe being plotted
Aesthetic The scales onto which we map our data.
Geometry The visual elements (type of plot) used for our data.
Facet Plotting small multiples.
Statistics Representations of our data to aid understanding.
Coordinates The space on which the data will be plotted.
Themes All non-data ink.

3.1.2 Parameters for each element

Layer Parameter
Data [The variable of interest]
Aesthetic x, y, colour, fill, size, labels, alpha, shape, line width, line type
Geometry point, line, histogram, bar, boxplot, …
Facet columns, rows
Statistics binning, smoothing, …
Coordinates cartesian, polar, fixed, limits
Themes non-data ink.

3.1.3 GGplot2 layers - Data

  • The Iris flower data set
  • 3 iris species: setosa, versicolor, virginica

3.1.3.1 The Iris dataset

library(tidyverse)
library(gridExtra)
data(iris)
unique(iris$Species)
## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica
glimpse(iris)   # view the data frame
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7,…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4,…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5,…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4,…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa,…

3.1.4 GGplot2 layers - Aesthetics

Mapping Sepal length to X axis and Sepal Width to Y axis:

g <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width))
g

No plot produced yet !

3.1.5 GGplot2 layers - Geometry

g <- ggplot(data = iris, mapping = aes (x = Sepal.Length, y = Sepal.Width)) +
      geom_jitter()
g

3.1.6 GGplot2 layers - Facet

g <- ggplot(data = iris, mapping = aes (x = Sepal.Length, y = Sepal.Width)) +
      geom_jitter(alpha = 0.6) + 
      facet_grid(. ~ Species)
g

3.1.7 GGplot2 layers - Statistics

g <- ggplot(data = iris, mapping = aes (x = Sepal.Length, y = Sepal.Width)) +
      geom_jitter(alpha = 0.6) + 
      facet_grid(. ~ Species) + 
      stat_smooth(method = "lm", se = F, col = "red")
g

3.1.8 GGplot2 layers - Coordinates

levels(iris$Species) <- c("Setosa", "Versicolor", "Virginica")
g <- ggplot(data = iris, mapping = aes (x = Sepal.Length, y = Sepal.Width)) +
      geom_jitter(alpha = 0.6) + 
      facet_grid(. ~ Species) + 
      stat_smooth(method = "lm", se = F, col = "red") +
      scale_x_continuous("Sepal Length (cm)", limits = c(4, 8) ) +
      scale_y_continuous("Sepal Width (cm)", limits = c(2, 5)) +
      coord_equal() # aspect ratio
g

3.1.9 GGplot2 layers - Themes

g <- ggplot(data = iris, mapping = aes (x = Sepal.Length, y = Sepal.Width)) +
      geom_jitter(alpha = 0.6) + 
      facet_grid(. ~ Species) + 
      stat_smooth(method = "lm", se = F, col = "red") +
      scale_x_continuous("Sepal Length (cm)", limits = c(4, 8) ) +
      scale_y_continuous("Sepal Width (cm)", limits = c(2, 5)) +
      coord_equal() + # aspect ratio
      theme_classic()
g

3.1.10 7 layers of graphics

Function Returns
Data The dataframe being plotted
Aesthetic The scales onto which we map our data.
Geometry The visual elements (type of plot) used for our data.
Facet Plotting small multiples.
Statistics Representations of our data to aid understanding.
Coordinates The space on which the data will be plotted.
Themes All non-data ink.

3.1.11 Visible aesthetics

3.1.11.1 Mapping onto the X and Y axes

Scatter plot between sepal width and length.

g <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width))+
        geom_point()
g

They are discrete variables.

3.1.11.2 Mapping onto colors

unique(iris$Species)
## [1] Setosa     Versicolor Virginica 
## Levels: Setosa Versicolor Virginica

Species - a dataframe column, is mapped onto color - a visible aesthetic.

g1 <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width,
                                       color = Species ))+
        geom_point() +
        ggtitle("aes mapping inside ggplot")

g2 <- ggplot(data = iris)+
        geom_point(mapping = aes(x = Sepal.Length, 
                                 y = Sepal.Width,
                                 color = Species )) +
        ggtitle("aes mapping inside geom_point")

grid.arrange(g1, g2, nrow = 1)

Mapping inside geom_ is only necessary if:
- All layers should not inherit the same aesthetics
- Mixing different data sources

3.1.11.3 Typical visible aesthetics

Aesthetic Description
x X axis position
y Y axis position
fill Fill color
color Color of points, outlines of other geoms
size Area or radius of points, thickness of lines
alpha Transparency
linetype line dash pattern
labels Text on a plot or axes
shape Shape

3.1.11.4 Using attributes - example

g <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width  )) +
        geom_point(color = "red", shape = 17)
g

The color attribute is set to “red” Attribute is set inside geom_*() .

Incorrect use:

g1 <- g + ggtitle("Attribute set outside aes")
g2 <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                        y = Sepal.Width  )) +
        geom_point(aes(color = "red")) +
        ggtitle("Incorrect! red is not a variable")

grid.arrange(g1, g2, nrow = 1)

3.1.11.5 What are Aesthetics? or Attributes?

Aesthetics are defined inside aes() in ggplot syntax and attributes are outside the aes().

g <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width,
                                       color = Species)) +  # color mapped to Species variable
        geom_point(color = "red", shape = 17)   # color and shape set with constant value

We typically understand aesthetics as how something looks, color, size etc. But in ggplot’s world how things look is just an attribute.

Aesthetics do not refer how something looks, but to which variable, or dataframe column is mapped onto it. Attributes refer to the property of the plot, and is set at a constant value (e.g color = "red", or shape = 17 )

3.1.11.6 Appendix - scatter point shapes reference

R scatter dot symbol:

3.1.11.7 Modifying aesthetics

  • Aesthetic mapping (i.e., with aes()) only says that a variable should be mapped to an aesthetic. - It doesn’t say how that should happen.
  • For example, when mapping a variable to shape with aes(shape = x) we do not say what shapes should be used. Similarly, aes(color = z) doesn’t say what colors should be used. Describing what colors/shapes/sizes etc. to use is done by modifying the corresponding scale.

3.1.11.8 In ggplot2 scales include:

  • position
  • color and fill
  • size
  • shape
  • line type
3.1.11.8.1 Positions: Adjustment for overlapping:
  • identity
  • dodge
  • stack
  • fill
  • jitter
  • jitterdodge
  • nudge
3.1.11.8.2 position = “identity” and position = “jitter”

Let’s look at the dataframe again.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  Setosa
## 2          4.9         3.0          1.4         0.2  Setosa
## 3          4.7         3.2          1.3         0.2  Setosa
## 4          4.6         3.1          1.5         0.2  Setosa
## 5          5.0         3.6          1.4         0.2  Setosa
## 6          5.4         3.9          1.7         0.4  Setosa

Sepal.Length and Sepal.Width are rounded to 1 decimal place. position = "jitter" adds noise to the scatter plot.

g1 <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width,
                                       color = Species ))+
        geom_point() +
        labs(title = "Default position = identity",
             subtitle = "150 points but overlapped")

g2 <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width,
                                       color = Species ))+
        geom_point(position = "jitter") +
        ggtitle("position = jitter")
grid.arrange(g1, g2, nrow=1)

3.1.11.8.3 Control level of noise added by position_jitter()
posn_j <- position_jitter(width=0.1, seed = 136)   # set the seed for consistency between different runs
g <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width,
                                       color = Species ))+
        geom_point(position = posn_j)
g

3.1.12 Scale functions

  • Each of the aesthetic, which we map the data onto, has associated scale_ functions. We can access (and adjust) the scales via scale_* functions. For examples:
  • scale_x_*()
  • scale_y_*()
  • scale_color_*()
  • scale_fill_*()
  • scale_shape_*()
  • scale_linetype_*()
  • scale_size_*()

3.1.12.1 Scale functions examples

  • Change axis names and legend title
  • The third part of scale function must match the data we are using. In this example, the data on x-axis is continuous, and the Species (mapped to color) is discrete >> we have to use scale_x_continuous (instead of scale_x_discrete)
  • There are many arguments for the scale function
  • The first arguments are always the name of the scale. In this example, we name x-axis as “Sepal Length”, and we name the legend title as “Sepcies”
  • After that, most common are limit, breaks, expand, and label
  • scale_color_manual() defines properties of the color scale (i.e. axis). The first argument sets the legend title. Values is a named vector of colors to use.
palette <- c(setosa = "#377EB8", versicolor = "#E41A1C", virginica="#99FF99")
g <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width,
                                       color = Species )) +
        geom_point(position = "jitter") +
        scale_x_continuous(name="Sepal Length") +
        scale_color_manual(name="Species", values = palette)
g

3.1.12.2 The limit argument

Describe the scale range. In this example, x-axis is extended from 2 to 8

g <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width,
                                       color = Species )) +
        geom_point(position = "jitter") +
        scale_x_continuous("Sepal Length", limits = c(2, 8)) +
        scale_color_discrete("Species")
g

3.1.12.3 The breaks argument

Controls the tick mark position

g <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width,
                                       color = Species )) +
        geom_point(position = "jitter") +
        scale_x_continuous("Sepal Length", limits = c(2, 8),
                           breaks = seq(2, 8, 3)) +
        scale_color_discrete("Species")
g

3.1.12.4 The expand argument

2-element vector is passes into expand argument. They are the additive elements to the 2 axes, so that there is a small between the data and the axes.

g <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width,
                                       color = Species )) +
        geom_point(position = "jitter") +
        scale_x_continuous("Sepal Length", limits = c(2, 8),
                           breaks = seq(2, 8, 3), expand=c(0,0) ) +
        scale_color_discrete("Species")
g

3.1.12.5 The labels argument

Adjust the category names. In this example, the legends are re-labelled.

g <- ggplot(data = iris, mapping = aes(x = Sepal.Length, 
                                       y = Sepal.Width,
                                       color = Species )) +
        geom_point(position = "jitter") +
        scale_x_continuous("Sepal Length", limits = c(2, 8),
                           breaks = seq(2, 8, 3), expand=c(0,0) ) +
        scale_color_discrete("Species",
                             labels = c("Setosa", "Versicolour", "Virginica") )
g

3.1.13 Geom layer

37 geometries geom_*

abline contour dotplot jitter pointrange ribbon spoke
area count errorbar label polygon rug step
bar crossbar errorbarh line qq segment text
bin2d curve freqpoly linerange qq_line sf tile
blank density hex map quantile sf_label violin
boxplot density2d histogram path raster sf_text vline
col density_2d hline point rect smooth

3.1.14 Common plot types

Plot type Possible geom_
Scatter plot points, jitter, abline, smooth, count
Bar plot histogram, bar, col, errorbar
Line plot line, path

3.1.14.1 Scatter plots

  • Each geom can accept specific mappings.
  • For geom_point:
    • Essential mappings: x, y
    • Optional mapping: alpha, color, fill, shape, size, stroke (width of the border)

3.1.14.2 Geom-specific aesthetic mappings

# These result in the same plot!
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()

Control aesthetic mappings of each layer independently:

# These result in the same plot!
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species))

3.1.14.3 Inherited aes

head(iris, 3) # Raw data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  Setosa
## 2          4.9         3.0          1.4         0.2  Setosa
## 3          4.7         3.2          1.3         0.2  Setosa
iris %>%
  group_by(Species) %>%
  summarise_all(mean) -> iris.summary

iris.summary # Summary statistics
## # A tibble: 3 × 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 Setosa             5.01        3.43         1.46       0.246
## 2 Versicolor         5.94        2.77         4.26       1.33 
## 3 Virginica          6.59        2.97         5.55       2.03

With inherited aes:

  • Both data and aes are inherited from ggplot() in geom_* layer
  • When data mapping is updated, the remaining aes mapping remains
g <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
      # Inherits both data and aes from ggplot()
      geom_point(position = "jitter") +
      # Different data, but inherited aes
      geom_point(data = iris.summary, shape = 15, size = 5, alpha = 0.6) # alpha = transparency
g

3.1.14.4 adding geom_text() for annotation

  • The geom_text layer needs label aesthetic mapping
  • Again, color is inherited from the ggplot function
iris.lab <- data.frame(Species = levels(iris$Species),
                       Petal.Length = iris.summary$Petal.Length,
                       Petal.Width  = c(0.7, 0.9, 1.3) )

g + geom_text(data = iris.lab, 
              mapping = aes(label = Species),
              show.legend = F)

3.1.15 Bar plot

Plot type Possible geom_
Scatter plot points, jitter, abline, smooth, count
Bar plot histogram, bar, col, errorbar
Line plot line, path

3.1.15.1 Histogram - default values

Default number of bins is 30. We can pick better bin value by setting binwidth or bins

g1 <- ggplot(iris, aes(x = Sepal.Width)) +
        geom_histogram() +
        ggtitle("Default number of bins (30)")

g2 <- ggplot(iris, aes(x = Sepal.Width)) +
        geom_histogram(binwidth = 0.1, center = 0.05) +
        ggtitle("binwidth = 0.1")

grid.arrange(g1, g2, nrow = 1)

3.1.15.2 Multiple histograms - example with different species

Default position is “stack”

3.1.15.2.1 Problems?
g1 <- ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
      geom_histogram(binwidth = 0.1, center = 0.05) + 
      ggtitle("Default position setting")

g2 <- ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
      geom_histogram(binwidth = 0.1, center = 0.05, position = "stack") +
      ggtitle("position = stack")

grid.arrange(g1, g2, nrow = 1)

We cant say whether the histogram bars are stacked or overlapped onto each other

3.1.15.2.2 Solution: position = “dodge” and position = “fill”

How about = “identity”?

g1 <- ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
        geom_histogram(binwidth = 0.1, center = 0.05, position = "dodge") + 
        ggtitle("position = dodge")
g2 <- ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
        geom_histogram(binwidth = 0.1, center = 0.05, position = "fill") + 
        ggtitle("position = fill")
grid.arrange(g1, g2, nrow = 1)

To avoid: position = “identity”

g1 <- ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
        geom_histogram(binwidth = 0.1, center = 0.05, position = "dodge") + 
        ggtitle("position = dodge")
g2 <- ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
        geom_histogram(binwidth = 0.1, center = 0.05, position = "identity") + 
        ggtitle("position = identity")
grid.arrange(g1, g2, nrow = 1)

3.1.15.3 Bar plot, with a categorical X-axis

  • Use geom_bar or geom_col
  • All positions from before are available
  • Two types:
    • Absolute counts
    • Distribution
Geom Stat Action
geom_bar() “count” Counts the number of cases at each x position
geom_col() “identity” Plot actual values

3.1.15.4 Barplot - absolute count

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  Setosa
## 2          4.9         3.0          1.4         0.2  Setosa
## 3          4.7         3.2          1.3         0.2  Setosa
## 4          4.6         3.1          1.5         0.2  Setosa
## 5          5.0         3.6          1.4         0.2  Setosa
## 6          5.4         3.9          1.7         0.4  Setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   Setosa    :50  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   Versicolor:50  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300   Virginica :50  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
ggplot(iris, aes(x=Species)) +
  geom_bar()

3.1.15.5 Plotting distributions - adding decriptive statistics

# Calculate Descriptive Statistics:
sepal_width_summary <- iris %>%
                        select(Species, Sepal.Width) %>%
                        gather(key, value, -Species) %>%
                        group_by(Species) %>%
                        summarise(avg = mean(value),
                        stdev = sd(value))
sepal_width_summary
## # A tibble: 3 × 3
##   Species      avg stdev
##   <fct>      <dbl> <dbl>
## 1 Setosa      3.43 0.379
## 2 Versicolor  2.77 0.314
## 3 Virginica   2.97 0.322
3.1.15.5.1 Adding error bar

To be continued at stat_* layer

# Calculate Descriptive Statistics:
ggplot(sepal_width_summary, aes(x = Species, y = avg)) +
  geom_col() +
  geom_errorbar(aes(ymin = avg - stdev, ymax = avg + stdev),
  width = 0.1)

Note: histogram is basically bar plot.

g1 <- ggplot(iris, aes(x = Sepal.Width)) +
        geom_histogram(binwidth = 0.1) +
        ggtitle("geom_histogram")

g2 <- ggplot(iris, aes(x = Sepal.Width, y = ..count..)) + # ..count.. is a hidden variable
        geom_bar() +
        ggtitle("geom_bar")

grid.arrange(g1, g2, nrow = 1)

3.1.16 Line plot

Plot type Possible geom_
Scatter plot points, jitter, abline, smooth, count
Bar plot histogram, bar, col, errorbar**
Line plot line, path

3.1.16.1 Examples with economics data set

unemployment and population statistics from the Federal Reserve Bank of St. Louis in the US. The data is contained in the ggplot2 package.

data(economics)
head(economics)
## # A tibble: 6 × 6
##   date         pce    pop psavert uempmed unemploy
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 1967-07-01  507. 198712    12.6     4.5     2944
## 2 1967-08-01  510. 198911    12.6     4.7     2945
## 3 1967-09-01  516. 199113    11.9     4.6     2958
## 4 1967-10-01  512. 199311    12.9     4.9     3143
## 5 1967-11-01  517. 199498    12.8     4.7     3066
## 6 1967-12-01  525. 199657    11.8     4.8     3018
str(economics)
## spc_tbl_ [574 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date    : Date[1:574], format: "1967-07-01" "1967-08-01" "1967-09-01" ...
##  $ pce     : num [1:574] 507 510 516 512 517 ...
##  $ pop     : num [1:574] 198712 198911 199113 199311 199498 ...
##  $ psavert : num [1:574] 12.6 12.6 11.9 12.9 12.8 11.8 11.7 12.3 11.7 12.3 ...
##  $ uempmed : num [1:574] 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
##  $ unemploy: num [1:574] 2944 2945 2958 3143 3066 ...

Note: The date column is formatted as Date. This data type becomes handy when dealing with date, or date/time data. For converting data to Date formate, check out as.Date

Line plot showing the fraction of total population that is unemployed:

ggplot(economics, aes(x = date, y = unemploy/pop)) + # get fraction of total population that is unemployed
  geom_line()

3.1.16.2 Periods of recession

In addition to the economics dataset from before, we’ll also use the recess dataset for the periods of recession. The recess data frame contains 2 variables: the begin period of the recession and the end.

recess <- data.frame(begin = as.Date(c("1969-12-01","1973-11-01","1980-01-01","1981-07-01","1990-07-01")), 
                     end   = as.Date(c("1970-11-01","1975-03-01","1980-07-01","1982-11-01","1991-03-01")))


ggplot(economics, aes(x = date, y = unemploy/pop)) +
  geom_rect(data = recess,
            aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf),
            inherit.aes = FALSE, fill = "red", alpha = 0.2) +
  geom_line()

3.1.17 The annotate() function

  • The annotate function adds geoms to a plot, but unlike typical a geom function, the properties of the geoms are not mapped from variables of a data frame, but are instead passed in as vectors. No aesthetic mapping
  • This is useful for adding small annotations (such as text labels) or if you have your data in vectors, and for some reason don’t want to put them in a data frame.
  • The first argument is the type of geom to add, the other arguments correspond to the added geom
annotate(
  geom,
  
  # positioning aesthetics - you must specify at least one of these.
  x = NULL,   y = NULL,  
  xmin = NULL,  xmax = NULL,
  ymin = NULL,  ymax = NULL,
  xend = NULL,  yend = NULL,
  
  # Other arguments passed on to layer(). These are often aesthetics, like colour = "red"
  ...
)
  • Remember this plot with geom_text() ? A line that separates versicolor and virginica on this plane was added. Support vector machine was used to “create” this line. The plot object is passed to variable g
  • We want to add an annotation to indicate this is an SVM hyperplane, and a curved arrow that points to the line.

  • We use the annotate function to add a text geom, and a curve geom
x_start = 5; y_start = 0.75
x_end = 5.5; y_end = intercept_1 + x_end * slope_1
g + geom_abline(slope = slope_1, intercept = intercept_1, linetype = 2) + # add a SVM hyper plane
    annotate("text",
              x = x_start, y = y_start,
              label = "Support Vector\nMachine Hyperplane",
              vjust = 1, size = 3, color = "grey40") +
    annotate("curve", curvature = -0.1,
              x = x_start, y = y_start,
              xend = x_end, yend = y_end,
              # arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
              color = "grey40" )

3.1.18 Statistics layler

  • Two categories of functions
    • Called from within a geom
    • Called independently
  • Can be accessed via stat_*

3.1.18.1 Statistics with Geoms - histogram count

Another way to do histogram plot ! All methods count the time a specific sepal width value appears

g <- ggplot(iris, aes(x = Sepal.Width))
g1 <- g +
        geom_histogram(binwidth = 0.1) +
        ggtitle("geom_histogram")

g2 <- g +
        geom_bar(stat = "count") +
        ggtitle("geom_bar")

g3 <- g +
        stat_bin(binwidth = 0.1) +
        ggtitle("stat_bin")

grid.arrange(g1, g2, g3, nrow = 1)

3.1.18.2 Statistics with geom - smoothing

  • stat_smooth can be accessed with geom_smooth
  • Standard error is showed as the grey ribbon behind each smoothing line. It is 5% confident interval by default
g <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
        geom_point()

g1 <- g + geom_smooth() + ggtitle("geom_smooth")

g2 <- g + stat_smooth() + ggtitle("stat_smooth")

grid.arrange(g1, g2, nrow = 1)

Note: By default, loess regression is used. It is a non-parametric methods where least squares regression is performed in localized subsets, and used when n < 1000. We can change smoothing method with the method argument

3.1.18.2.1 More arguments with smoothing!
  • se = FALSE turns off the ribbon (5% confident interval by default)
  • span argument controls the degree of smoothing (size of sliding window in loess)
  • method argument controls smoothing method
g <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
        geom_point()

g1 <- g + stat_smooth(se = FALSE,
                      span = 0.4)

g2 <- g + stat_smooth(se = FALSE,
                      method = "glm",
                      fullrange = TRUE )

grid.arrange(g1, g2, nrow = 1)

3.1.18.2.2 Adding an smoothing line for all data

Use another stat_smooth layer, set a dummy variable “All” to color.

g <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
        geom_point() +
        stat_smooth(method = "lm", se = F, aes(color="All")) +  # add a dummy variable mapping
        stat_smooth(method = "loess", se = FALSE)
g

3.1.19 The geom_ / stat_ connection

stat_ geom_
stat_bin() geom_histogram()
stat_bin() geom_bar()
stat_bin() geom_freqpoly()
stat_smooth() geom_smooth()
stat_boxplot() geom_boxplot()
stat_bindot() geom_dotplot()
stat_bin2d() geom_bin2d()
stat_binhex() geom_hex()
stat_contour() geom_contour()
stat_quantile() geom_quantile()
stat_sum() geom_count()

3.1.19.1 Statistics outside geom

3.1.19.1.1 Basic plot
g <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
      geom_point(position = position_jitter(0.2))
g

3.1.19.1.2 Calulating statistics
# install.packages("Hmisc")
library(Hmisc)
set.seed(123)
xx <- rnorm(100)

# Hmisc
smean.sdl(xx, mult = 1)
##        Mean       Lower       Upper 
##  0.09040591 -0.82240997  1.00322179
## ggplot2
mean_sdl(xx, mult = 1)
##            y     ymin     ymax
## 1 0.09040591 -0.82241 1.003222
3.1.19.1.3 Calulating statistics with stat_summary
  • fun.data
  • fun.args
g <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + 
      stat_summary(fun = mean, geom = "point") +
      stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)
g

3.1.19.1.4 95% confidence interval
#ggplot2
mean_cl_normal(xx)
##            y        ymin      ymax
## 1 0.09040591 -0.09071657 0.2715284
g <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + 
      stat_summary(fun.y = mean, geom = "point") +
      stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.1)
g

3.1.19.2 Other stat_ functions

Function Returns
stat_summary() Summarise y values at distinct x values
stat_function() Compute y values from a function of x values
stat_qq() Perform calculations for a quantile-quantile plot
3.1.19.2.1 Example with stat_function
library(MASS)
head(mammals)
##                    body brain
## Arctic fox        3.385  44.5
## Owl monkey        0.480  15.5
## Mountain beaver   1.350   8.1
## Cow             465.000 423.0
## Grey wolf        36.330 119.5
## Goat             27.660 115.0
mam.new <- data.frame(body = log10(mammals$body))
ggplot(mam.new, aes(x = body)) +
  geom_histogram(aes( y = ..density..)) +
  geom_rug() +
  stat_function(fun = dnorm, colour = "red",
                args = list(mean = mean(mam.new$body),
                sd = sd(mam.new$body)))

3.1.19.2.2 QQ plot with stat_qq and stat_qq_line

By default, stat_qq_line fits with normal distribution.

mam.new <- data.frame(body = log10(mammals$body))
ggplot(mam.new, aes(sample = body)) +
  stat_qq(distribution = stats::qnorm) +
  stat_qq_line(distribution = stats::qnorm, color = "red", fullrange=T) 

3.1.20 Coordinates layer

  • plot dimensions: zoom-in, aspect ratio, polar coord (pie chart)

3.1.20.1 Zooming in using scale_ and coord_

  • What is the difference between the 2 zoomed-in plots?
g <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(alpha = 0.7) +
  geom_smooth()

g0 <- g + geom_vline(xintercept=4.5, linetype = 2) + 
          geom_vline(xintercept=5.5, linetype = 2) + 
          ggtitle("original plot")

g1 <- g + scale_x_continuous(limits = c(4.5, 5.5)) +
          ggtitle("Zoom in with scale_x_continuous()")

g2 <- g + coord_cartesian(xlim = c(4.5, 5.5)) +
          ggtitle("Zoom in with coord_()")

grid.arrange(g0, g1, g2, nrow=1)

  • The scale functions change the underlying dataset before applying regression, which affects calculations made by computed geoms
  • Coordinate functions make no changes to the dataset. In other words, regression is applied independently on the plot axis limit.
  • Which one gives more accurate information?

3.1.20.2 Data transformation with coord_ and scale_

  • We use the built-in msleep dataset for this example
  • Body weight is highly skewed to the right side
  • Log-transformation will give better visualization!
data(msleep)
head(msleep)
## # A tibble: 6 × 11
##   name         genus vore  order conservation sleep_total sleep_rem sleep_cycle awake  brainwt  bodywt
##   <chr>        <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>    <dbl>   <dbl>
## 1 Cheetah      Acin… carni Carn… lc                  12.1      NA        NA      11.9 NA        50    
## 2 Owl monkey   Aotus omni  Prim… <NA>                17         1.8      NA       7    0.0155    0.48 
## 3 Mountain be… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6 NA         1.35 
## 4 Greater sho… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1  0.00029   0.019
## 5 Cow          Bos   herbi Arti… domesticated         4         0.7       0.667  20    0.423   600    
## 6 Three-toed … Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6 NA         3.85
ggplot(msleep, aes(bodywt, y = 1)) +
  geom_jitter() +
  scale_x_continuous(limits = c(0, 7000),
  breaks = seq(0, 7000, 1000))

  • Apply a log-10 transformation to the plot on x-axis
  • Giving 2 similar plots.
  • HOWEVER, there are still fundamental differences between coord_ and scale_
    • scale transforms the data before plotting.
    • coord does not make any change to the dataset.
g <- ggplot(msleep, aes(bodywt, y = 1)) +
      geom_jitter()
g1 <- g + scale_x_log10(limits = c(1e-03, 1e+04)) + 
          labs(title = "using scale_x_log10")
g2 <- g + coord_trans(x = "log10") +
          labs(title = "using coord_trans")

grid.arrange(g1, g2, nrow=1)

3.1.20.3 Polar coordinate and pie chart

  • Transforms stacked bar chart into pie chart
g <- ggplot(iris, aes(x=1, fill = Species)) +
      geom_bar()

g1 <- g + coord_polar(theta = "y") # transforms on y axis

grid.arrange(g, g1, nrow=1)

3.1.21 The facet layer

# using tidyr package
iris2 <- iris %>%
  gather(key, Value, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.")
head(iris2)
##   Species  Part Measure Value
## 1  Setosa Sepal  Length   5.1
## 2  Setosa Sepal  Length   4.9
## 3  Setosa Sepal  Length   4.7
## 4  Setosa Sepal  Length   4.6
## 5  Setosa Sepal  Length   5.0
## 6  Setosa Sepal  Length   5.4
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  Setosa
## 2          4.9         3.0          1.4         0.2  Setosa
## 3          4.7         3.2          1.3         0.2  Setosa
## 4          4.6         3.1          1.5         0.2  Setosa
## 5          5.0         3.6          1.4         0.2  Setosa
## 6          5.4         3.9          1.7         0.4  Setosa
# ggplot
g1 <- ggplot(iris2[iris2$Species=="setosa", ], aes(x=Measure, y= Value, color = Part)) +
        geom_jitter() 
g2 <- ggplot(iris2[iris2$Species=="versicolor", ], aes(x=Measure, y= Value, color = Part)) +
        geom_jitter() 
g3 <- ggplot(iris2[iris2$Species=="virginica", ], aes(x=Measure, y= Value, color = Part)) +
        geom_jitter() 
grid.arrange(g1,g2,g3, nrow=1)

# ggplot
ggplot(iris2, aes(x = Measure, y = Value, color = Part)) +
  geom_jitter() +
  facet_grid(cols = vars(Species))

Alternatively, using the facet layer:

  • Facets divide a plot into subplots based on the values of one or more discrete variables
  • Given categorical variables A and B, the code pattern is
plot +
  facet_grid(rows = vars(A), cols = vars(B))
  • Common legends for 3 subplots! The facet layer split the data into 3 subplots
# ggplot
ggplot(iris2, aes(x = Measure, y = Value, color = Part)) +
  geom_jitter() +
  facet_grid(cols = vars(Species))

3.1.21.1 Adding labels and orders with facets

  • Using msleep data
  • “Vore” labels are not formal
data(msleep)
head(msleep)
## # A tibble: 6 × 11
##   name         genus vore  order conservation sleep_total sleep_rem sleep_cycle awake  brainwt  bodywt
##   <chr>        <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>    <dbl>   <dbl>
## 1 Cheetah      Acin… carni Carn… lc                  12.1      NA        NA      11.9 NA        50    
## 2 Owl monkey   Aotus omni  Prim… <NA>                17         1.8      NA       7    0.0155    0.48 
## 3 Mountain be… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6 NA         1.35 
## 4 Greater sho… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1  0.00029   0.019
## 5 Cow          Bos   herbi Arti… domesticated         4         0.7       0.667  20    0.423   600    
## 6 Three-toed … Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6 NA         3.85
head(msleep$vore, n = 20)
##  [1] "carni" "omni"  "herbi" "omni"  "herbi" "herbi" "carni" NA      "carni" "herbi" "herbi" "herbi"
## [13] "omni"  "herbi" "omni"  "omni"  "omni"  "carni" "herbi" "omni"
g <- ggplot(msleep, aes(bodywt, brainwt)) +
      geom_point(alpha = 0.6)+
      scale_x_log10() + scale_y_log10() +
      coord_fixed()
g

  • Easy: Add labels in ggplot. However, we need to alter the dataset, or create a new copy of it
  • Alternative: Relabel and rearrange factor variables in your dataframe
  • fct_recode
library(forcats)
msleep$vore = fct_recode(msleep$vore,
                          Carnivore = "carni",
                          Herbivore = "herbi",
                          Insectivore = "insecti",
                          Omnivore = "omni")

g <- ggplot(msleep, aes(bodywt, brainwt)) +
      geom_point(alpha = 0.6)+
      scale_x_log10() + scale_y_log10() +
      coord_fixed() +
      facet_grid(cols = vars(vore))
g

3.1.22 Theme layer

Non-data ink: visual elements that are not part of the data:

Visual element Modified using
Text element_text()
Line element_line()
Rectangle element_rect()

3.1.23 The text element

  • Modified via element_text()
  • A plot is composed of a combination of data, as well as non-data embellishments
  • These are text element in the plot. Each element has its own unique name
  • We can access all these text elements as arguments of the theme function
  • Hiercarchical naming indicates inheritance rules
theme(
text,
  axis.title,
    axis.title.x,
      axis.title.x.top,
      axis.title.x.bottom,
    axis.title.y,
        axis.title.y.left,
        axis.title.y.right,
  title,
      legend.title,
    plot.title,
    plot.subtitle,
    plot.caption,
    plot.tag,
  axis.text,
    axis.text.x,
      axis.text.x.top,
      axis.text.x.bottom,
    axis.text.y,
      axis.text.y.left,
      axis.text.y.right,
  legend.text,
  strip.text,
      strip.text.x,
      strip.text.y)

3.1.23.1 Adjusting theme elements

  • Call the appropriate argument inside the theme function and use the appropriate element_ function to specify the value to change
  • In this case, we use element_text() call
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter(alpha = 0.6) +
  theme(axis.title = element_text(color = "blue"))

3.1.23.2 The line element

  • Modified via element_line()
  • The line elements include the tick marks, the axis line, and all grid line
  • These are line elements in the theme function
  • Hiercarchical naming indicates inheritance rules
theme(
line,
  axis.ticks,
    axis.ticks.x,
      axis.ticks.x.top,
      axis.ticks.x.bottom,
    axis.ticks.y,
      axis.ticks.y.left,
      axis.ticks.y.right,
  axis.line,
    axis.line.x,
      axis.line.x.top,
      axis.line.x.bottom,
    axis.line.y,
      axis.line.y.left,
      axis.line.y.right,
  panel.grid,
    panel.grid.major,
      panel.grid.major.x,
      panel.grid.major.y,
  panel.grid.minor,
      panel.grid.minor.x,
      panel.grid.minor.y)

3.1.23.3 The rectangle element

  • Modified via element_rect()
  • The rectangle elements include the borders and backgrounds
  • These are line elements in the theme function
  • Hiercarchical naming indicates inheritance rules
theme(
rect,
  legend.background,
  legend.key,
  legend.box.background,
  panel.background,
  panel.border,
  plot.background,
  strip.background,
    strip.background.x,
    strip.background.y)

3.1.23.4 Inheritance

  • For example, all text elements inherit from text.
  • If we change the text argument, all the downstream elements will be affected
theme(
text,
  axis.title,
    axis.title.x,
      axis.title.x.top,
      axis.title.x.bottom,
 ...)

3.1.23.5 element_blank()

  • Remove the item. The (removed) item will not be drawned at all
  • In this example, we remove all text, rect and line items. The plot is left with just the data !
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter(alpha = 0.6) +
  theme(line = element_blank(),
        rect = element_blank(),
        text = element_blank())

3.1.23.6 Reusing theme by defining theme object

In this example, we change the font family, size, and title color. The theme is saved as an object.

theme_exp <- theme(text = element_text(family = "serif", size = 14),
              rect = element_blank(),
              panel.grid = element_blank(),
              title = element_text(color = "#8b0000"),
              axis.line = element_line(color = "black"))

g <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
      geom_jitter(alpha = 0.6) +
      scale_x_continuous("Sepal Length (cm)", limits = c(4,8), expand = c(0,0)) +
      scale_y_continuous("Sepal Width (cm)", limits = c(1.5,5), expand = c(0,0)) +
      scale_color_brewer("Species", palette = "Dark2", labels = c("Setosa", "Versicolor", "Virginica"))

grid.arrange(g+ggtitle("Default theme"), g + ggtitle("g + theme_exp") + theme_exp, nrow=1)

3.1.23.7 Reuse theme_exp in another plot

m <- ggplot(iris, aes(x = Sepal.Width)) +
      geom_histogram(binwidth = 0.1,
      center = 0.05)
m + theme_exp

3.1.23.8 Resuse themes with built-in themes

  • There are numerous built-in themes in ggplot2 package and the other packages (such as ggtheme)
  • Use the theme_*() functions to access the built-in themes in ggplot2
  • We can still modify any specific element.
g1 <- g + theme_classic()
g2 <- g + theme_classic() + theme(text = element_text(family = "serif"))

grid.arrange(g1+ggtitle("theme_classic()"), g2 + ggtitle("modified theme_classic()") + theme_exp, nrow=1)

3.1.24 Manage multiple plotting objects

  • We already see example with grid.arrange function in grid.Extra package
  • Here is a more structured introduction

3.1.24.1 A list of ggplot2 objects is passed to grid.arrange

plotIris <- function(species){
  df <- iris %>% subset(Species == species)
  g <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
        geom_point(position = "jitter") + 
        scale_x_continuous("Petal Length") + 
        scale_y_continuous("Petal Width") +
        ggtitle(species) + 
        theme_classic()
  return(g)
}

# create a list of ggplot2 objects
my_plots <- lapply(levels(iris$Species), plotIris)

length(my_plots)
## [1] 3
my_plots[[1]]

We can do the same thing by using facet_ layer, but this is for illustration purpose.

3.1.24.2 Combine plots with grid.arrange

library(gridExtra)
grid.arrange(my_plots[[1]], my_plots[[2]], my_plots[[3]], 
             nrow=1, ncol=3 )

3.1.24.3 Combine plots, a more efficient way with do.call

  • do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
  • The first argument to pass to do.call is the name of the function we want to apply. The arguments to the function is passed to a vector
  • We can pass the whole list (of ggplot2 object) at once in the the `args argument
library(gridExtra)
do.call(what = grid.arrange, 
        args = c(my_plots, nrow=1, ncol=3))

3.1.24.4 Combine plots with ggpubr

  • An very (simple and user-friendly) alternative to ggplot2 is ggpubr
  • In this example, we only explore the ggarrange function to combine multiple plots
  • One advantage of ggpubr::ggarrange is the panel label, and the option to set common legend. The gridExtra::grid.arrange can do this, but more complicated.
  • We can also pass the function and arguments to do.call
p <- ggpubr::ggarrange(my_plots[[1]], my_plots[[2]], my_plots[[3]], 
                        nrow=1, ncol=3,
                        common.legend = TRUE, legend = "bottom",
                        labels=c("A","B","C"))
p

3.1.24.5 Saving your plot (in high-resolution image)

ggsave(
  filename,
  plot = last_plot(),
  device = NULL,
  path = NULL,
  scale = 1,
  width = NA,
  height = NA,
  units = c("in", "cm", "mm"),
  dpi = 300,
  limitsize = TRUE,
  ...
)

Example:

p

# save to file
ggsave(filename = "iris_combine.png", 
       plot=p, 
       dpi=800, 
       width=9, height=3, units=("in"))

3.2 Part II: High dimensional data visualization

3.2.1 Main content

  • High-dimensional data

    • Feature projection / Manifold learning
    • 4 popular feature projection techniques: PCA, MDS, t-SNE, UMAP
  • Distribution plot

    • Within 1 variable:
      • Weighted box/ violin
      • Density
    • 2 Separate variables:
      • 2D density
      • Marginal histogram/ box plot
    • Population pyramid
  • Deal with large number of observations

    • Binned scatter
  • Deal with multi-dimensional data

    • Feature projection/ Manifold learning >> high-dimensional
    • Correlogram
    • SPLOM - Scatter PLOt Matrix
  • Map

    • Chorophleth
  • Animation

  • Honorable mentions

  • plotly

  • Feature projection/ Manifold learning

  • Principal Component Analysis

  • Multidimensional scaling

  • t-distributed stochastic neighbor embedding (t-SNE)

  • Uniform manifold approximation and projection (UMAP)

3.2.2 Example gene expression data

  • Expression levels of ~50,000 genes in 24 tissue samples
    • Collected from brain, kidney, liver, lung and thymus
    • 4 - 5 biological replicates
  • Classic example of high-dimensional data
    • Number of features >> number of samples
  • Special dataframe format:
    • Expression level of each gene is placed in 1 row
    • Each gene ~ one feature / variable
    • features(variables) placed in ROWS (generally, features are placed on columns)

  • Meta data file contains information of each sample
library(dplyr)
# restart R session
# .rs.restartR()
rm(list=ls())

# code starts from here
print(getwd())
## [1] "C:/Users/ADMIN/Desktop/biostat/bookbiodatar"
exp_df <- read.table("./GeneExpressionData/GeneExpression.txt", header = T)
rownames(exp_df) <- exp_df$GeneID
meta_df <- read.table("./GeneExpressionData/MetaData.txt", header = T)

# filter values that are too low
exp_df <- exp_df[apply(exp_df[,-1], 1, function(x) !all(x < 1)),]

# view the first few genes
head(exp_df)
##                          GeneID Brain_1 Brain_2 Brain_3 Brain_4 Brain_5 Kidney_1 Kidney_2 Kidney_3
## ENSG00000230424 ENSG00000230424    0.24    3.39    3.39    0.23    3.38     1.82     0.97     1.19
## ENSG00000138074 ENSG00000138074    2.06    0.00    8.92    2.40    8.87    19.32     0.00    19.69
## ENSG00000108797 ENSG00000108797    0.77    0.00    9.67    0.76    9.63     4.87     0.00     3.66
## ENSG00000125492 ENSG00000125492    0.00    1.75    6.95    0.00    6.92     0.00     0.00     0.00
## ENSG00000224298 ENSG00000224298    0.00    0.00    0.00    0.00    0.00     0.00     0.00     0.00
## ENSG00000172201 ENSG00000172201  187.40   48.27   48.27  185.93   48.07    51.53     0.00    55.75
##                 Kidney_4 Kidney_5 Liver_1 Liver_2 Liver_3 Liver_4 Lung_1 Lung_2 Lung_3 Lung_4 Lung_5
## ENSG00000230424     1.52     0.87    0.32    0.37    0.24    0.32   0.63   1.05   0.63   3.56   0.96
## ENSG00000138074    22.15     2.18   24.05   30.11  162.19   24.09  24.29   3.34   0.27   5.03   9.63
## ENSG00000108797     3.98     0.10    0.59    0.05    0.37    0.59   3.84   0.13   0.25   0.71   1.82
## ENSG00000125492     0.00     0.00    0.00    0.00    0.00    0.00   0.00   0.00   0.00   0.00   0.00
## ENSG00000224298     0.00     0.00    0.00    0.00    0.00    0.00   0.00   0.00   0.00   1.17   0.31
## ENSG00000172201    46.35    13.54    0.58    0.25    0.71    0.59  17.20  17.87   0.51   6.58  22.92
##                 Thymus_1 Thymus_2 Thymus_3 Thymus_4
## ENSG00000230424     1.08     1.47     1.23     1.02
## ENSG00000138074    20.10    15.46    21.80    18.11
## ENSG00000108797     2.91     1.80     2.63     3.91
## ENSG00000125492     0.00     0.00     0.03     0.03
## ENSG00000224298     0.24     0.05     0.00     0.00
## ENSG00000172201     1.25     2.04     1.07     3.19
dim(exp_df)
## [1] 26946    24
meta_df
##    SampleID Tissue Library       Lab Sequencer
## 1   Brain_1  Brain  Paired EastCoast     MySeq
## 2   Brain_2  Brain  Paired EastCoast     MySeq
## 3   Brain_3  Brain  Paired EastCoast     MySeq
## 4   Brain_4  Brain  Paired EastCoast     MySeq
## 5   Brain_5  Brain  Paired EastCoast     MySeq
## 6  Kidney_1 Kidney  Paired WestCoast     HiSeq
## 7  Kidney_2 Kidney  Paired EastCoast     MySeq
## 8  Kidney_3 Kidney  Paired WestCoast     HiSeq
## 9  Kidney_4 Kidney  Paired WestCoast     HiSeq
## 10 Kidney_5 Kidney  Paired EastCoast     MySeq
## 11   Lung_1   Lung  Single WestCoast     HiSeq
## 12   Lung_2   Lung  Paired WestCoast     HiSeq
## 13   Lung_3   Lung  Single WestCoast     HiSeq
## 14   Lung_4   Lung  Single WestCoast     HiSeq
## 15   Lung_5   Lung  Paired WestCoast     HiSeq
## 16  Liver_1  Liver  Single EastCoast     HiSeq
## 17  Liver_2  Liver  Single EastCoast     HiSeq
## 18  Liver_3  Liver  Single EastCoast     HiSeq
## 19  Liver_4  Liver  Single EastCoast     HiSeq
## 20 Thymus_1 Thymus  Paired WestCoast     HiSeq
## 21 Thymus_2 Thymus  Paired WestCoast     HiSeq
## 22 Thymus_3 Thymus  Paired WestCoast     HiSeq
## 23 Thymus_4 Thymus  Paired WestCoast     HiSeq

3.2.3 Basic exploratory plots

  • Let’s look at 2 samples: Brain_1 and Kidney_1 (it does not matter which two biological samples are selected, we just want to have a glimpse of our data)
  • Heavily skewed towards high value region
library(ggplot2)
library(gridExtra)

exp_df2 <- exp_df[ , c("Brain_1" , "Kidney_1") ]
rownames(exp_df2) <- exp_df$GeneID

# remove all zeros values for ease of plotting
exp_df2[exp_df2 == 0] <- NA
exp_df2 <- exp_df2 %>% na.omit()

# prepare data into ggplot2 format
exp_df3 <- exp_df2 %>% reshape2::melt()
colnames(exp_df3) <- c("Tissue", "Expression")
head(exp_df3)
##    Tissue Expression
## 1 Brain_1       0.24
## 2 Brain_1       2.06
## 3 Brain_1       0.77
## 4 Brain_1     187.40
## 5 Brain_1      10.46
## 6 Brain_1      41.97
# normal scale
p_scatter <- ggplot(exp_df2, aes(x = Brain_1, y = Kidney_1) ) +
              geom_point(alpha = 0.5, col = "grey60")
              

p_density <- ggplot(exp_df3, aes(x = Expression, col = Tissue) ) +
              geom_density() 

grid.arrange(p_scatter + ggtitle("scatter - normal scale"), 
             p_density + ggtitle("density - normal scale"), 
             nrow=1)

# add log scale
p_scatter2 <- p_scatter + scale_x_log10() + scale_y_log10()
p_density2 <- p_density + scale_x_log10()

grid.arrange(p_scatter2 + ggtitle("scatter - log scale"), 
             p_density2 + ggtitle("density - log scale"), 
             nrow=1)

3.2.4 Principal Component Analysis

3.2.4.1 Principal Component Analysis - recap

PCA is a technique used to ease data interpretation by performing dimensonality reduction while minimizing information loss.

PCA generates new uncorrelated variables (they are orthogonal to each other) that successively maximize variance. In other words, PCs can be seen as linear combinations of the original variables. PCA does not require prior knowledge on the data at hand.

Note that PCA is defined by the variance in the data. This depends on the original units of measurement. This implies that PCs will change if the units of measurement on one or more of the variables change. To avoid this issue, it is common practice to standardize variables. Each data value xij is both centred (subtract average x¯ ) and divided by the standard deviation sdj of the n observations of variable j . To obtain the standardized data matrix Z calculate:

Zij = (xij−mean(xj)) / sd(xj)

From a mathematical point of view, PCA solves a eigenvalue/eigenvector problem. Alternatively, it can be addressed by Singular Value Decomposition (SVD). Further details on PCA can be found, for instance, in this review by [Ian T. Jolliffe and Jorge Cadima] (https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202) .

Strength
- can be efficiently applied to large data sets
- has many extensions that deal with special situation such as sparse PCA when data has many missing values, kernel PCA can provide more robust solutions
- preserves global structures

Weaknesses
- may suffer from scale problems, i.e. when one variable dominates the variance simply because it is in a higher scale (for example if one variable is measured in mm instead of km); some of these scale problems can simply be dealt with by centering and scaling
- suffers from crowding in the presence of large numbers of observations, i.e. points do not form clusters but instead occupy the entire plotting space - is susceptible to big outliers

3.2.4.2 Principal Component Analysis - the prcomp object

  • we need to first obtain the projected features by prcomp() from stats package
  • Note that the prcomp() functions takes in data the format: features in columns, and entries in rows, so we need to first transpose our gene expression data matrix
  • Because the data is heavily skewed towards high value region, we begin by scaling the (transposed) dataframe
    • Scaling, a.k.a standardizing = subtract mean value, then divide by standard deviation
  • The prcomp object:
    • PCs are stored in $x
    • standard deviation of each principal component stored in $sdev
exp_transpose <- scale(t( log10(exp_df[ , -1] + 1) ))
colnames(exp_transpose) <- exp_df$GeneID
# all numeric
exp_transpose[1:6, 1:3]
##          ENSG00000230424 ENSG00000138074 ENSG00000108797
## Brain_1       -1.2326191     -0.94985951      -0.4515456
## Brain_2        1.7364848     -1.83695552      -1.2113332
## Brain_3        1.7364848     -0.01697913       1.9389523
## Brain_4       -1.2516360     -0.86629044      -0.4590849
## Brain_5        1.7311289     -0.02098708       1.9339544
## Kidney_1       0.6970258      0.55176740       1.1437662
pca <-prcomp(exp_transpose) 
glimpse(pca) # from dplyr package
## List of 5
##  $ sdev    : num [1:23] 94.1 57.7 54.6 48.2 47 ...
##  $ rotation: num [1:26946, 1:23] 0.00129 0.00637 0.00705 -0.0017 0.00127 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:26946] "ENSG00000230424" "ENSG00000138074" "ENSG00000108797" "ENSG00000125492" ...
##   .. ..$ : chr [1:23] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:26946] 9.93e-17 -8.79e-17 -8.45e-17 -4.83e-18 2.56e-17 ...
##   ..- attr(*, "names")= chr [1:26946] "ENSG00000230424" "ENSG00000138074" "ENSG00000108797" "ENSG00000125492" ...
##  $ scale   : Named num [1:26946] 0.1849 0.5475 0.3264 0.2687 0.0745 ...
##   ..- attr(*, "names")= chr [1:26946] "ENSG00000230424" "ENSG00000138074" "ENSG00000108797" "ENSG00000125492" ...
##  $ x       : num [1:23, 1:23] -21.28 -186.15 -5.62 -20.98 -5.57 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:23] "Brain_1" "Brain_2" "Brain_3" "Brain_4" ...
##   .. ..$ : chr [1:23] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"
# the PCs
pca$x[1:6, 1:6]
##                  PC1         PC2      PC3         PC4         PC5        PC6
## Brain_1   -21.275899 123.2796160 51.38731   26.705569   18.515151  33.225701
## Brain_2  -186.154547 -43.0187101 86.95863    9.014743  -20.986444 -12.310376
## Brain_3    -5.619394  53.5835257 24.24167   20.763851 -118.054113   1.456039
## Brain_4   -20.983103 123.1629638 52.49617   19.921082   20.049763  33.509444
## Brain_5    -5.568717  52.5908305 24.12261   12.142667 -116.322185   2.508204
## Kidney_1  108.426660   0.4414464 32.24354 -103.340186    1.597505 -37.936119
# std by each PC
pca$sdev
##  [1] 9.412867e+01 5.769344e+01 5.456812e+01 4.822942e+01 4.696455e+01 4.282472e+01 3.546199e+01
##  [8] 2.473690e+01 2.427698e+01 2.125968e+01 2.067507e+01 1.856252e+01 1.693902e+01 1.596383e+01
## [15] 1.585947e+01 1.456333e+01 1.354064e+01 1.299947e+01 1.298516e+01 1.055195e+01 7.361181e+00
## [22] 6.258752e+00 1.927080e-13

3.2.4.3 Principal Component Analysis - variance explained by each PC

  • We investigate the variance explained by each PCs using a bar-plot.
# get variance of each PC
pr_var  <- (pca$sdev)^2  # variance = std ^ 2

# compute variance explained by each PC
prop_varex <- pr_var/sum(pr_var)
prop_varex
##  [1] 3.288134e-01 1.235260e-01 1.105054e-01 8.632364e-02 8.185517e-02 6.806045e-02 4.666937e-02
##  [8] 2.270891e-02 2.187234e-02 1.677333e-02 1.586352e-02 1.278732e-02 1.064834e-02 9.457578e-03
## [15] 9.334332e-03 7.870944e-03 6.804313e-03 6.271288e-03 6.257494e-03 4.132107e-03 2.010948e-03
## [22] 1.453721e-03 1.378177e-30
# Bring a bar plot - quick plot for vector object
barplot(prop_varex, names.arg = colnames(pca$x),
        xlab = "Principal Component",
        ylab = "Proportion of Variance Explained")

- In addition to the individual variance explained plots, also the cumulative variance explained is frequently looked at.

# For the cumulative variance explained:
plot(cumsum(prop_varex), 
     xlab = "Principal Component",
      ylab = "Cumulative Proportion of Variance Explained",
      type = "b")

  • The first two components explain about 50% of the data. This is not the best scenario, but I am taking this bet

3.2.4.4 PCA - the PC space

  • To this end we plot the reduction of data to two first components
  • As we are interested in the seperation of the tissues (which are the columns in our dataset), we need to map it to the meta data matrix.
# lets view the meta data gain
head(meta_df)
##   SampleID Tissue Library       Lab Sequencer
## 1  Brain_1  Brain  Paired EastCoast     MySeq
## 2  Brain_2  Brain  Paired EastCoast     MySeq
## 3  Brain_3  Brain  Paired EastCoast     MySeq
## 4  Brain_4  Brain  Paired EastCoast     MySeq
## 5  Brain_5  Brain  Paired EastCoast     MySeq
## 6 Kidney_1 Kidney  Paired WestCoast     HiSeq
  • We need a dataframe that is suitable for ggplot, and the “ggrepel” package may help (Slowikowski 2024).
# install.packages("ggrepel")

pca_df <- data.frame(
  PC1 = pca$x[, 3],
  PC2 = pca$x[, 2],
  Sample = rownames(pca$x),
  Tissue = meta_df[match(rownames(pca$x), meta_df$SampleID) , "Tissue"]
)

head(pca_df)
##               PC1         PC2   Sample Tissue
## Brain_1  51.38731 123.2796160  Brain_1  Brain
## Brain_2  86.95863 -43.0187101  Brain_2  Brain
## Brain_3  24.24167  53.5835257  Brain_3  Brain
## Brain_4  52.49617 123.1629638  Brain_4  Brain
## Brain_5  24.12261  52.5908305  Brain_5  Brain
## Kidney_1 32.24354   0.4414464 Kidney_1 Kidney
ggplot(pca_df, aes(x = PC1, y = PC2, label = Sample, col = Tissue)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5) +
  theme_bw()

  • Feel free to explore other pairs of PCs!

3.2.5 Multidimensional scaling (MDS)

3.2.5.1 Multidimensional scaling (MDS) - recap

  • Multidimensional scaling (MDS) is an algorithm that given a distance matrix with the distances between each pair of objects in a set, and a chosen number of dimensions, N, places each object into N-dimensional space such that the between-object distances are preserved as well as possible.
  • MDS is most often used as a visualization tool. It is best suited to the problem that involve real distances, i.e. city distances or geometry.
  • It yields the same result as PCA when Euclidean distances are used.
  • There are two general classes of MDS:
    • metric, for quantitative similarities.
    • non-metric (ordinal), uses qualitative similarities. Distances in ordinal MDS are on an ordinal scale
  • The two types refer two whether the input similarity matrix is based on a metric or not.

3.2.5.2 MDS - the MDS object

  • A classical, metric MDS implementation
# distance matrix
# exp_transpose <- scale(t( log10(exp_df[ , -1] + 1) ))

dist_exp <- dist(exp_transpose , method = "minkowski")

# get MDS
# k is the number of dim
mds <- cmdscale(dist_exp, eig = TRUE, k = 2)

glimpse(mds)
## List of 5
##  $ points: num [1:23, 1:2] 21.28 186.15 5.62 20.98 5.57 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:23] "Brain_1" "Brain_2" "Brain_3" "Brain_4" ...
##   .. ..$ : NULL
##  $ eig   : num [1:23] 194925 73228 65509 51174 48525 ...
##  $ x     : NULL
##  $ ac    : num 0
##  $ GOF   : num [1:2] 0.452 0.452
  • Projected features are in $points
##                 [,1]         [,2]
## Brain_1    21.275899 -123.2796160
## Brain_2   186.154547   43.0187101
## Brain_3     5.619394  -53.5835257
## Brain_4    20.983103 -123.1629638
## Brain_5     5.568717  -52.5908305
## Kidney_1 -108.426660   -0.4414464

3.2.5.3 MDS - visualize the projected features

# create the data frame for plotting
mds_df <- data.frame(
  MDS1 = mds$points[, 1],
  MDS2 = mds$points[, 2],
  Sample = rownames(mds$points),
  Tissue = meta_df[match(rownames(mds$points), meta_df$SampleID) , "Tissue"]
)

ggplot(mds_df, aes(
  x = MDS1, y = MDS2,
  label = Sample, col = Tissue
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

3.2.6 T-distributed Stochastic Neighbor Embedding (t-SNE)

3.2.6.1 T-distributed Stochastic Neighbor Embedding (t-SNE) - recap

  • T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimension reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. The main application of t-SNE is for visualizations of large amount of data, as it can recover well-separated clusters.

  • This interactive article on “How to Use t-SNE Effectively”, explores some of the subtleties of interpreting t-SNE plots that are worth keeping in mind as you apply t-SNE to your own datasets. We’ll explore some of these using the MNIST digit dataset below.

3.2.6.2 The t-SNE object

  • Projected features are in $Y
  • Feel free to play around with the parameters
# install.packages("Rtsne")
library(Rtsne)

# exp_transpose <- scale(t( log10(exp_df[, -1] + 1) ))
set.seed(1)
tsne_p5 <- Rtsne(exp_transpose, dims = 2,
                  pca = TRUE, perplexity = 5,
                  theta = 0.0   )

glimpse(tsne_p5) 
## List of 14
##  $ N                  : int 23
##  $ Y                  : num [1:23, 1:2] 18.1 -100.5 59.8 27.9 56.1 ...
##  $ costs              : num [1:23] 0.0018 0.00146 0.006 0.0004 0.0051 ...
##  $ itercosts          : num [1:20] 60.9 58.7 57.9 59.2 60 ...
##  $ origD              : int 23
##  $ perplexity         : num 5
##  $ theta              : num 0
##  $ max_iter           : num 1000
##  $ stop_lying_iter    : int 250
##  $ mom_switch_iter    : int 250
##  $ momentum           : num 0.5
##  $ final_momentum     : num 0.8
##  $ eta                : num 200
##  $ exaggeration_factor: num 12
##  - attr(*, "class")= chr [1:2] "Rtsne" "list"
head(tsne_p5$Y)
##            [,1]      [,2]
## [1,]   18.09579 -72.09924
## [2,] -100.48971 -97.13867
## [3,]   59.81600 -39.61370
## [4,]   27.91713 -64.49774
## [5,]   56.11734 -28.05055
## [6,]   73.85981  73.43283

3.2.6.3 tSNE - visualize the projected features

# create the data frame for plotting
tsne_df <- data.frame(
  TSNE1 = tsne_p5$Y[, 1],
  TSNE2 = tsne_p5$Y[, 2],
  Sample = rownames(exp_transpose),
  Tissue = meta_df[match(rownames(exp_transpose), meta_df$SampleID) , "Tissue"]
)

ggplot(tsne_df, aes(
  x = TSNE1, y = TSNE2,
  label = Sample, col = Tissue
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

3.2.7 Uniform manifold approximation and projection (UMAP)

3.2.7.1 Uniform manifold approximation and projection (UMAP) - recap

  • Uniform Manifold Approximation and Projection (UMAP) produces a low dimensional representation of high dimensional data that preserves relevant structure. It searches for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure to the original high dimensional data.
  • UMAP is similar to t-SNE but preserves more global structure in the data. Its main use is visualization of a large amount of observations as it recovers well-separated clusters.

3.2.7.2 UMAP - the umap object

  • Feel free to play around with the parameters
  • Set seed before running to ensure consistant result with the same set of parameters
# install.packages("uwot")
library(uwot)

set.seed(1)
# exp_transpose <- scale(t( log10(exp_df[, -1] + 1) ))
umap <- umap(t(exp_df[, -1]),
              n_neighbors = 15,
              min_dist = 1, spread = 5)

head(umap)
##               [,1]      [,2]
## Brain_1   6.559289 -3.631291
## Brain_2   6.639607  1.873414
## Brain_3   5.734664  5.217261
## Brain_4   4.397399 -2.105661
## Brain_5   4.171923  2.813546
## Kidney_1 -1.427080 -1.526843

3.2.7.3 Visualizing UMAP features

# create the data frame for plotting
umap_df <- data.frame(
  UMAP1 = umap[, 1],
  UMAP2 = umap[, 2],
  Sample = rownames(exp_transpose),
  Tissue = meta_df[match(rownames(exp_transpose), meta_df$SampleID) , "Tissue"]
)

ggplot(umap_df, aes(
  x = UMAP1, y = UMAP2,
  label = Sample, col = Tissue
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

3.2.8 Distribution plot

3.2.8.1 Box plot: recap

Box plot is an excellent tool to study the distribution. It can also show the distributions within multiple groups, along with the median, range and outliers if any.

The dark line inside the box represents the median. The top of box is 75%ile and bottom of box is 25%ile. The end points of the lines (aka whiskers) is at a distance of 1.5*IQR, where IQR or Inter Quartile Range is the distance between 25th and 75th percentiles. The points outside the whiskers are marked as dots and are normally considered as extreme points.

# restart R session
# .rs.restartR()
# rm(list=ls())
# code
library(tidyverse)
library(dplyr)
data(msleep)
head(msleep) # from dplyr/ tidyverse package
## # A tibble: 6 × 11
##   name         genus vore  order conservation sleep_total sleep_rem sleep_cycle awake  brainwt  bodywt
##   <chr>        <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>    <dbl>   <dbl>
## 1 Cheetah      Acin… carni Carn… lc                  12.1      NA        NA      11.9 NA        50    
## 2 Owl monkey   Aotus omni  Prim… <NA>                17         1.8      NA       7    0.0155    0.48 
## 3 Mountain be… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6 NA         1.35 
## 4 Greater sho… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1  0.00029   0.019
## 5 Cow          Bos   herbi Arti… domesticated         4         0.7       0.667  20    0.423   600    
## 6 Three-toed … Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6 NA         3.85
ggplot(data = msleep[!is.na(msleep$vore), ], mapping = aes(x = vore, y = sleep_total)) +
      geom_boxplot()

3.2.8.2 Problem with boxplot?

ggplot(data = msleep[!is.na(msleep$vore), ], mapping = aes(x = vore, y = sleep_total) ) +
      geom_boxplot() +
      geom_point(position = position_jitter(0.2) )

  • There are only 5 observations in insectivore !
  • Box plots dont show information about the number of observations

3.2.8.3 Weighted box plot

  • Set the width of each box relative to the N value of each group
  • varwidth = TRUE
ggplot(data = msleep[!is.na(msleep$vore), ], mapping = aes(x = vore, y = sleep_total) ) +
      geom_boxplot(varwidth = TRUE) +  # set width of the box
      geom_point(position = position_jitter(0.2) )

3.2.8.4 Violin plot

  • A violin plot is similar to box plot but it shows the density within groups.
  • It can be drawn using geom_violin().
  • Add weight: weight = ..n.. inside aes()
ggplot(data = msleep[!is.na(msleep$vore), ],
       mapping = aes(x = vore, y = sleep_total, fill = vore) ) +
      geom_violin( aes(weight = ..n..) )  # set weight for each violin

3.2.8.5 Density plot

  • Distribution of univariate data
  • Uses a kernel density estimate to show the probability density function of the variable
  • Kernel density esimate (KDE) (Everitt and Hothorn)
    • A sum of “bumps” placed at the observations.
    • The kernel functions determines the shape of the bumps
    • while the window width, h, determine the width
3.2.8.5.1 Kernel density esimate (KDE) - example
  • we consider a set of 8 values
  • blue verticle line by the rug_geom
x <- c(0, 1, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)
x_df <- data.frame("x" = x)

ggplot(x_df, mapping = aes(x = x)) + 
  geom_rug(col = "blue", size = 3) +
  scale_x_continuous(breaks = seq(-2, 5.5, by = 0.5) ) +
  theme(rect = element_blank())

3.2.8.5.2 Kernel density esimate (KDE) - bumps
  • KDE calculates a normal distribution for each value, these are the bumps

3.2.8.5.3 Kernel density esimate (KDE) - Sum of bumps
  • Density plot = sum the y values for each bumps along the x- axis
  • Overlapping lines >> higher density

3.2.8.5.4 Kernel density esimate (KDE) - Bandwidth
  • Decrease bandwidth >> more subtle features in the data

3.2.8.5.5 Adjusting density plot

There are three parameters that we may be tempted to adjust in a density plot:

  • bw - the smoothing bandwidth to be used
  • adjust - apply a multiplier to the optimal bandwidth that ggplot calculates
  • kernel - kernel used for density estimation, defined as
    • “g” = gaussian
    • “r” = rectangular
    • “t” = triangular
    • “e” = epanechnikov
    • “b” = biweight
    • “c” = cosine
    • “o” = optcosine
set.seed(1)
small_data <- data.frame("x" = rnorm(100) )
# Get the bandwith
get_bw <- density(small_data$x)$bw

# Basic plotting object
p <- ggplot(small_data, aes(x = x)) +
  geom_rug() +
  coord_cartesian(ylim = c(0,0.5))

# Change bandwidth
p11 <- p + geom_density() 
p12 <- p + geom_density(adjust=0.25) + ggtitle("adjust=0.25")
p13 <- p + geom_density(bw=0.4*get_bw) + ggtitle("bw=0.4*get_bw")

grid.arrange(p11+ggtitle("Default bandwidth"), p12, p13, nrow=1)

# Change kernel function
p14 <- p + geom_density(kernel="r") + ggtitle("Rectangular kernel")
p15 <- p + geom_density(kernel="e") + ggtitle("Epanechnikov kernel")
grid.arrange(p11+ggtitle("Default kernel"), p14, p15, nrow=1)

3.2.8.6 Multiple density

  • Remember the msleep data? Insectivore only has 5 observations
rm(list = ls() )

library(tidyverse)
data(msleep)# from dplyr/ tidyverse package

ggplot(data = msleep[!is.na(msleep$vore), ],
       mapping = aes(x = vore, y = sleep_total, fill = vore) ) +
      geom_violin( aes(weight = ..n..) ) +  # set weight for each violin 
      geom_point(position = position_jitter(0.2) )

  • Density plot cannot show the information about group size
  • The insectivore blue curve is very abundant!
ggplot(data = msleep[!is.na(msleep$vore), ],
       mapping = aes(x = sleep_total, fill = vore) ) +
      geom_density(alpha = 0.35, col=NA)

  • Make density plots fancier with geom_density_ridges
library(ggridges)
ggplot(data = msleep[!is.na(msleep$vore), ],
       mapping = aes(x = sleep_total, fill = vore, y=vore) ) +
        geom_density_ridges(col=NA ) 

3.2.8.7 2D density

  • Data: the old faithful greyser Yellowstone national park.
  • First look
rm(list = ls() )
data("faithful")

ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_jitter()

  • The data is bimodal on both axis !
# add 2D density
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_jitter() +
  stat_density2d ()

Another example with iris data:

# another example
data(iris)
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species ) ) + 
  geom_jitter() +
  stat_density2d ()

  • To adjust bandwidth: h = c(... , ...)
  • We can represent density level with colors
ggplot(faithful, aes(x = waiting, y = eruptions)) +
      geom_jitter() +
      stat_density2d (aes(col = ..level..), h = c(5, 0.5) ) +
      scale_color_gradient(low="blue", high="red")

3.2.8.8 Marginal histogram

The “ggExtra” package will help with this type of plot (Attali and Baker 2023).

# install.packages("ggExtra")
library(ggExtra)

g <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
      geom_jitter() +
      stat_density2d (aes(col = ..level..), h = c(5, 0.5) ) +
      scale_color_gradient(low="blue", high="red") +
      theme_bw()

ggMarginal(g, type = "histogram", fill="transparent") # from library(ggExtra)

3.2.8.9 Marginal boxplot

ggMarginal(g, type = "boxplot", fill="transparent")

3.2.8.10 Population pyramid

Population pyramids offer a unique way of visualizing how much population or what percentage of population fall under a certain category. The below pyramid is an excellent example of how many users are retained at each stage of a email marketing campaign funnel.

email_campaign_funnel <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")
head(email_campaign_funnel)
##                                    Stage Gender     Users
## 1                     Stage 01: Browsers   Male -14927619
## 2              Stage 02: Unbounced Users   Male -12862663
## 3                Stage 03: Email Signups   Male -11361896
## 4              Stage 04: Email Confirmed   Male  -9411708
## 5         Stage 05: Campaign-Email Opens   Male  -8074317
## 6 Stage 06: Campaign-Email Clickthroughs   Male  -6958512
email_campaign_funnel[22:27,]
##                                     Stage Gender    Users
## 22                     Stage 01: Browsers Female 14226434
## 23              Stage 02: Unbounced Users Female 12276043
## 24                Stage 03: Email Signups Female 10850386
## 25              Stage 04: Email Confirmed Female  8999932
## 26         Stage 05: Campaign-Email Opens Female  7732693
## 27 Stage 06: Campaign-Email Clickthroughs Female  6666394
# X Axis Breaks and Labels 
brks <- seq(-15000000, 15000000, 5000000)
lbls = paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "m")

# Plot
ggplot(email_campaign_funnel, aes(x = Stage, y = Users, fill = Gender)) +   # Fill column
                              geom_bar(stat = "identity", width = .6) +   # draw the bars
                              scale_y_continuous(breaks = brks,   # Breaks
                                                 labels = lbls) + # Labels
                              coord_flip() +  # Flip axes
                              labs(title="Email Campaign Funnel")

3.2.9 Deal with large number of observations

Remember this ugly scatter plot?

  • Over 50,000 scatter dots overlaping onto each other
  • Overplotting

One way to avoid over plotting: - Add color scaled by 2D density - The trick is to get the 2D kernel density first!

exp_df2$density <- densCols(log10(exp_df2[,1]), log10(exp_df2[,2]), 
                            colramp = colorRampPalette(rev(rainbow(10, end = 4/6)))) 

ggplot(exp_df2) +
    geom_point(aes(x = Brain_1, y = Kidney_1, col = density) ) +
    scale_x_log10() + scale_y_log10() +
    scale_color_identity() +
    theme_bw()

One way to avoid over plotting: - Alternatively, we can try geom_hexbin - Increase number of bins if we want mreo subtle details in the 2D density

ggplot(exp_df2) +
    geom_hex(aes(x = Brain_1, y = Kidney_1), bins = 100 ) +
    scale_x_log10() + scale_y_log10() +
    scale_fill_gradientn("", colours = rev(rainbow(10, end = 4/6))) +
    theme_bw()

3.2.9.1 Deal with multi-dimensional data

  • Feature projection/ Manifold learning >> high-dimensional
  • Correlogram
  • SPLOM - Scatter PLOt Matrix

3.2.9.2 Correlation matrix

  • Let’s examine the corellation of multiple continuous variables present in the same dataframe. This is conveniently implemented using the ggcorrplot package.
  • Let’s first explore the correlation matrix between the 4 variables of the Setosa species in the iris data set
# install.packages("ggcorrplot")
library(ggcorrplot)
rm(list = ls())

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
corr <- cor(iris[iris$Species == "setosa" ,-5], method = "pearson") # the last 
corr
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.7425467    0.2671758   0.2780984
## Sepal.Width     0.7425467   1.0000000    0.1777000   0.2327520
## Petal.Length    0.2671758   0.1777000    1.0000000   0.3316300
## Petal.Width     0.2780984   0.2327520    0.3316300   1.0000000

3.2.9.3 Correlogram

  • Let’s make the first simple correlogram with ggcorrplot
  • Problem?
    • Every dot is red, which might be difficult to distinguish
    • Our correlation matrix range from 0.18 to 1
    • The color scale legend is defaulted from -1 to 1
ggcorrplot(corr, method = "circle" , lab = TRUE, lab_size = 3) +
  ggtitle("Correlogram of Setosa")

range(corr)
## [1] 0.1777 1.0000
  • ggcorrplot is a customized from ggplot2
  • To change the color scale, we need to overwrite the existing color scale gradient coded in ggcorplot:
  • ggcorplot uses scale_fill_gradient2, so we need to use exactly this function for overwriting (Kassambara 2023).
# adding colors
p <- ggplot2::ggplot(
      data = corr,
      mapping = ggplot2::aes_string(x = "Var1", y = "Var2", fill = "value") )
      
p <- p + ggplot2::scale_fill_gradient2(   # change color scale here
    low = colors[1],
    high = colors[3],
    mid = colors[2],
    midpoint = 0,
    limit = c(-1, 1), # limit hard-wired from -1 to 1
    space = "Lab",
    name = legend.title
  )
# install.packages("ggcorrplot")
ggcorrplot(corr, method = "circle" , lab = TRUE, lab_size = 3) +
  scale_fill_gradient2(breaks = c(0, 0.5, 1) ,limit = c(0, 1), high = "red", low= "blue", mid = "white", midpoint = 0.5)  +
  ggtitle("Correlogram of Setosa")

3.2.9.3.1 Correlogram with ggplot2
# install.packages("reshape2")
corr2 <- reshape2::melt(corr, na.rm = TRUE)
head(corr2)
##           Var1         Var2     value
## 1 Sepal.Length Sepal.Length 1.0000000
## 2  Sepal.Width Sepal.Length 0.7425467
## 3 Petal.Length Sepal.Length 0.2671758
## 4  Petal.Width Sepal.Length 0.2780984
## 5 Sepal.Length  Sepal.Width 0.7425467
## 6  Sepal.Width  Sepal.Width 1.0000000
label <- round(x = corr2[, "value"], digits = 2)

ggplot(data = corr2,
       mapping = aes(x = Var1, y = Var2, fill = value) ) +
  geom_point(aes(size = value), shape = 21) +
  scale_size(range = c(4, 10)) +
  guides(size = FALSE) +  # remove size = value legend
  # add correlation values onto circles
  geom_text(aes_string(x = "Var1", y = "Var2"), size = 4, label=label) +
  # add color gradient
  scale_fill_gradient2(
      low = "blue", high = "red", mid = "white",
      midpoint = 0.5, limit = c(0, 1),
      space = "Lab",
      name = "Correlation value"
    ) +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank() ) 

3.2.9.3.2 Correlogram shortcuts with GGally
  • Feed the numeric dataframe directly into ggcorr
  • We dont have to compute the correlation matrix in the first place
  • Strictly reserved for correlation heatmap
library(GGally)

ggcorr(iris[iris$Species == "setosa" ,-5], 
       geom = "circle", label = TRUE, 
       limits = c(0, 1), midpoint = 0.5,
       min_size = 2, max_size = 10) 

3.2.9.4 SPLOM - Scatter PLOt Matrix

  • To better visualize the correlation relationship among multiple variables, we can check other types of plots
  • GGally::ggpairs is one of the user-friendliest ways to perform this type of analysis (Schloerke et al. 2024).
# library(GGally)

graph_corr <- ggpairs(iris, mapping = aes(color = Species), 
                      columns = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'), 
                      columnLabels = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species')) 
graph_corr <- graph_corr + theme_minimal()

graph_corr

3.2.10 Choropleths Map

  • Choropleths
    • A type of thematic map in which areas are shaded or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density or per-capita income McIlroy et al. (2023).
  • Let’s see the data
# install.packages("maps")
# install.packages("ggmap")
# install.packages("mapproj")

library(maps)
library(ggmap)

# maps, ggplot2, and ggmap are pre-loaded
# Use map_data() to create usa and inspect
usa <- map_data("usa")
head(usa)
##        long      lat group order region subregion
## 1 -101.4078 29.74224     1     1   main      <NA>
## 2 -101.3906 29.74224     1     2   main      <NA>
## 3 -101.3620 29.65056     1     3   main      <NA>
## 4 -101.3505 29.63911     1     4   main      <NA>
## 5 -101.3219 29.63338     1     5   main      <NA>
## 6 -101.3047 29.64484     1     6   main      <NA>

3.2.10.1 Our basic USA polygon

  • Choropleths are basically polygon
  • As long as we have the coordiantes (eg. longitude and lattitude), we can project onto Choropleth map
  • To draw the map, we need to use geom_polygon() which will connect the points of latitude and longitude for you.
  • coord_map() projects a portion of the earth, which is approximately spherical, onto a flat 2D plane
p_map1 <- ggplot(usa, aes(x = long, y = lat, group = group, fill = region)) +
            geom_polygon(col = "grey") +
            coord_map() +
            theme_nothing()
p_map1

3.2.10.2 Map by states

  • The easiest way to obtain map polygons is through the maps package. Unfortunately there are only a few locations available, but if your region of interest is included they are extremely convenient.

  • The available maps of political boundaries are:

    • Global: world, world2
    • Country: france, italy, nz, usa
    • USA: county, state
  • The maps can be accessed via ggplot2::map_data(), which converts the map into a data frame containing the variables long and lat.

# Use map_data() to create state
state <- map_data("state")

# Map of states
ggplot(state, aes(x = long, y = lat, fill = region, group = group)) +
  geom_polygon(col = "white") +
  coord_map() +
  theme_nothing()

3.2.10.3 Population by states

  • Once the map information is converted into a data frame, we can merge this with another data frame containing some quantitative information, like the estimated population, and use that variable in our aesthetic mappings.
arrests <- USArrests 
arrests$region <- tolower(rownames(USArrests))
head(arrests)
##            Murder Assault UrbanPop Rape     region
## Alabama      13.2     236       58 21.2    alabama
## Alaska       10.0     263       48 44.5     alaska
## Arizona       8.1     294       80 31.0    arizona
## Arkansas      8.8     190       50 19.5   arkansas
## California    9.0     276       91 40.6 california
## Colorado      7.9     204       78 38.7   colorado
arrests_map <- dplyr::left_join(state, arrests, by = "region")

# Create the map
ggplot(arrests_map, aes(long, lat, group = group))+
  geom_polygon(aes(fill = Assault), color = "white")+
  coord_map() +
  scale_fill_distiller(palette = "PuBuGn") +
  ggthemes::theme_map()

3.2.11 Animation

  • gganimation

3.2.11.1 Static plot for gapminder data

  • Gapminder data
    • An excerpt of the data available at Gapminder.org.
    • For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.
  • Static plot
    • Problem?
    • Too many observations!
# install.packages("gapminder")
# install.packages("gganimate")
# install.packages("gifski") #  for rendering gif output

library(gapminder)
data(gapminder)
head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
theme_set(theme_bw())
p <- ggplot(gapminder, aes(x = gdpPercap, y=lifeExp, size = pop, colour = continent)) +
  geom_point(show.legend = TRUE, alpha = 0.7) +
  scale_size(range = c(2, 12)) +
  coord_trans(x = "log10") +
  scale_x_continuous(breaks = c(1000, 5000, 10000, 100000), labels = c("1k", "5k", "10k", "100k") )
  labs(x = "GDP per capita", y = "Life expectancy")
## $x
## [1] "GDP per capita"
## 
## $y
## [1] "Life expectancy"
## 
## attr(,"class")
## [1] "labels"
p

  • Even when we facet by continet, there are still a lot of overlapped dots
p + facet_wrap(~continent)

3.2.11.2 Motion chart

Motion charts may be made with the packages below Ooms, Kornel Lesiński, and Authors of the dependency Rust crates (2024).

  • Here we make a basic form of a motion chart, which is an animation that cycles through the years
  • The resulting plot allows us to see how both GDP and life expectncy have both increase across the 5 continents
  • transition_time() is the key to create gganiamte object!
# install.packages("gapminder")
# install.packages("gganimate")
# install.packages("gifski") #  for rendering gif output

library(gganimate)
library(gifski)

# Here comes the gganimate specific bits
p1 <- p + facet_wrap(~continent) +
  transition_time(year) +
  labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
  ease_aes('linear')

# render animation
# fps = frame rate
#p_animation1 <- animate(p1, duration = 5, fps = 20, width = 400, height = 400, renderer = gifski_renderer())
#anim_save("gapminder1.gif", p_animation1)
#p_animation1

The plot from render animation:

  • Let the view follow the data in each frame
  • view_follow()
# Here comes the gganimate specific bits
p2 <- p1 + view_follow()

# render animation
# fps = frame rate
#p_animation2 <- animate(p2, duration = 5, fps = 20, width = 400, height = 400, renderer = gifski_renderer())
#anim_save("gapminder2.gif", p_animation2)
#p_animation2

The plot from render animation:

3.2.12 Interactive plot with Plotly

3.2.12.1 Let’s see the static plot again

library(gapminder)
data(gapminder)
head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
plt <- ggplot(data=gapminder,
              aes(x=gdpPercap, y=lifeExp)) +
        geom_point(aes(color=country), show.legend = FALSE) +
        scale_x_log10() +
        scale_color_viridis_d() +
        facet_wrap(~continent)
plt

  • Plotly has its own grammatical structure (Sievert et al. 2024).
  • Here, we use a short cut to convert ggplot object into plotly object
# install.packages("plotly")
library(plotly)

ply <- ggplotly(plt )
ply
4060801e+031e+041e+054060801e+031e+041e+051e+031e+041e+05
AfghanistanAlbaniaAlgeriaAngolaArgentinaAustraliaAustriaBahrainBangladeshBelgiumBeninBoliviaBosnia and HerzegovinaBotswanaBrazilBulgariaBurkina FasoBurundiCambodiaCameroonCanadaCentral African RepublicChadChileChinaColombiaComorosCongo, Dem. Rep.Congo, Rep.Costa RicaCote d'IvoireCroatiaCubaCzech RepublicDenmarkDjiboutiDominican RepublicEcuadorEgyptEl SalvadorEquatorial GuineaEritreaEthiopiaFinlandFranceGabonGambiaGermanyGhanaGreeceGuatemalaGuineaGuinea-BissauHaitiHondurasHong Kong, ChinaHungaryIcelandIndiaIndonesiaIranIraqIrelandIsraelItalyJamaicaJapanJordanKenyaKorea, Dem. Rep.Korea, Rep.KuwaitLebanonLesothoLiberiaLibyaMadagascarMalawiMalaysiaMaliMauritaniaMauritiusMexicoMongoliaMontenegroMoroccoMozambiqueMyanmarNamibiaNepalNetherlandsNew ZealandNicaraguaNigerNigeriaNorwayOmanPakistanPanamaParaguayPeruPhilippinesPolandPortugalPuerto RicoReunionRomaniaRwandaSao Tome and PrincipeSaudi ArabiaSenegalSerbiaSierra LeoneSingaporeSlovak RepublicSloveniaSomaliaSouth AfricaSpainSri LankaSudanSwazilandSwedenSwitzerlandSyriaTaiwanTanzaniaThailandTogoTrinidad and TobagoTunisiaTurkeyUgandaUnited KingdomUnited StatesUruguayVenezuelaVietnamWest Bank and GazaYemen, Rep.ZambiaZimbabwegdpPercaplifeExpAfricaAmericasAsiaEuropeOceania

3.2.12.2 Saving your plotly object

# install.packages("htmlwidgets")
htmlwidgets::saveWidget(ply, "gapminder_plotly.html")

3.3 Reference materials

The following resource has been adapted for this chapter:

References

Attali, Dean, and Christopher Baker. 2023. ggExtra: Add Marginal Histograms to Ggplot2, and More Ggplot2 Enhancements. https://github.com/daattali/ggExtra.
Kassambara, Alboukadel. 2023. Ggcorrplot: Visualization of a Correlation Matrix Using Ggplot2. http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2.
McIlroy, Doug, Ray Brownrigg, Thomas P Minka, and Roger Bivand. 2023. Mapproj: Map Projections. https://CRAN.R-project.org/package=mapproj.
Ooms, Jeroen, Kornel Lesiński, and Authors of the dependency Rust crates. 2024. Gifski: Highest Quality GIF Encoder. https://r-rust.r-universe.dev/gifski.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to Ggplot2. https://ggobi.github.io/ggally/.
Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, and Pedro Despouy. 2024. Plotly: Create Interactive Web Graphics via Plotly.js. https://plotly-r.com.
Slowikowski, Kamil. 2024. Ggrepel: Automatically Position Non-Overlapping Text Labels with Ggplot2. https://ggrepel.slowkow.com/.
Vaidyanathan, Ramnath, Yihui Xie, JJ Allaire, Joe Cheng, Carson Sievert, and Kenton Russell. 2023. Htmlwidgets: HTML Widgets for r. https://github.com/ramnathv/htmlwidgets.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and Teun van den Brand. 2024. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.