Chapter 3 Correlation


Heatmap Normalization

Figure 3.1: Heatmap Normalization

3.1 Scatterplot


A Scatterplot displays the relationship between 2 numeric variables. Each dot represents an observation. Their position on the X (horizontal) and Y (vertical) axis represents the values of the 2 variables.

3.1.0.1 Using the ggplot2 Package

Scatterplots are built with ggplot2 thanks to the geom_point() function. Discover a basic use case in graph #272, and learn how to custom it with next examples below. Basic scatterplot with R and ggplot2. A scatterplot displays the values of two variables along two axes. It shows the relationship between them, eventually revealing a correlation.

A scatterplot displays the values of two variables along two axes. It shows the relationship between them, eventually revealing a correlation.

Here the relationship between Sepal width and Sepal length of several plants is shown.

It illustrates the basic utilization of ggplot2 for scatterplots:

  1. Provide a dataframe.
  2. Tell which variable to show on x and y axis.
  3. Add a geom_point() to show points.
# library
library(ggplot2)
 
# The iris dataset is provided natively by R
#head(iris)
 
# basic scatterplot
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + 
    geom_point()

3.1.1 Custom ggplot2 Scatterplot

This post is dedicated to customization you can apply to a scatterplot built with ggplot2.

This post follows the previous basic scatterplot with ggplot2. It shows the kind of customization you can apply to circles thanks to the geom_point() options:

  • color: the stroke color, the circle outline
  • stroke: the stroke width fill: color of the circle inner part
  • shape: shape of the marker. See list in the ggplot2 section
  • alpha: circle transparency, [0->1], 0 is fully transparent color: the stroke color, the circle outline
  • size: circle size

Note: These options will be uniform among markers if you put it in the geom_point() call. You can also map them to a variable if put inside the aes() part of the code.

# library
library(ggplot2)
 
# Iris dataset is natively provided by R
#head(iris)
 
# use options!
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + 
    geom_point(
        color="orange",
        fill="#69b3a2",
        shape=21,
        alpha=0.5,
        size=6,
        stroke = 2
        )

3.1.2 Using theme_ipsum

Note that applying the theme_ipsum of the hrbrthemes package is always a good option.

# library
library(ggplot2)
library(hrbrthemes)

# Iris dataset is natively provided by R
#head(iris)
 
# use options!
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + 
    geom_point(
        color="black",
        fill="#69b3a2",
        shape=22,
        alpha=0.5,
        size=6,
        stroke = 1
        ) +
    theme_ipsum()

3.1.3 Map a Variable to Marker Feature in ggplot2 Scatterplot

ggplot2 allows to easily map a variable to marker features of a scatterplot. This section explaines how it works through several examples, with explanation and code.

3.1.3.1 Basic Example

Here is the magic of ggplot2: the ability to map a variable to marker features. Here, the marker color depends on its value in the field called Species in the input data frame.

Note that the legend is built automatically.

# load ggplot2
library(ggplot2)
library(hrbrthemes)

# mtcars dataset is natively available in R
# head(mtcars)
 
# A basic scatterplot with color depending on Species
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + 
    geom_point(size=6) +
    theme_ipsum()

3.1.4 Works with any Aesthetics

Map variables to any marker features. For instance, specie is represente below using transparency (left), shape (middle) and size (right).

# load ggplot2
library(ggplot2)
library(hrbrthemes)

# Transparency
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, alpha=Species)) + 
    geom_point(size=6, color="#69b3a2") +
    theme_ipsum()

# Shape
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, shape=Species)) + 
    geom_point(size=6) +
    theme_ipsum()

# Size
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, shape=Species)) + 
    geom_point(size=6) +
    theme_ipsum()

3.1.5 Mapping to Several Features

Last but not least, note that you can map one or several variables to one or several features. Here, shape, transparency, size and color all depends on the marker Species value.

# load ggplot2
library(ggplot2)
library(hrbrthemes)

# A basic scatterplot with color depending on Species
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, shape=Species, alpha=Species, size=Species, color=Species)) + 
    geom_point() +
    theme_ipsum()

3.1.6 Add Text Labels with ggplot2

This document is dedicated to text annotation with ggplot2. It provides several examples with reproducible code showing how to use function like geom_label and geom_text.

3.1.6.1 Adding Text with geom_text()

This example demonstrates how to use geom_text() to add text as markers. It works pretty much the same as geom_point(), but add text instead of circles. A few arguments must be provided:

  • label: What text you want to display.
  • nudge_x and nudge_y: Shifts the text along X and Y axis.
  • check_overlap: Tries to avoid text overlap. Note that a package called ggrepel extends this concept further.
# library
library(ggplot2)
 
# Keep 30 first rows in the mtcars natively available dataset
data=head(mtcars, 30)
 
# 1/ add text with geom_text, use nudge to nudge the text
ggplot(data, aes(x=wt, y=mpg)) +
  geom_point() + # Show dots
  geom_text(
    label=rownames(data), 
    nudge_x = 0.25, nudge_y = 0.25, 
    check_overlap = T
  )

3.1.7 Add Labels with geom_label()

geom_label() works pretty much the same way as geom_text(). However, text is wrapped in a rectangle that you can customize (see next example).

# library
library(ggplot2)
 
# Keep 30 first rows in the mtcars natively available dataset
data=head(mtcars, 30)
 
# 1/ add text with geom_text, use nudge to nudge the text
ggplot(data, aes(x=wt, y=mpg)) +
  geom_point() + # Show dots
  geom_label(
    label=rownames(data), 
    nudge_x = 0.25, nudge_y = 0.25, 
    check_overlap = T
  )

3.1.8 Add One Text Label Only

Of course, you don’t have to label all dots on the chart. You can also add a piece of text on a specific position. Since we’re here, note that you can custom the annotation of geom_label with label.padding, label.size, color and fill as described below:

# library
library(ggplot2)
 
# Keep 30 first rows in the mtcars natively available dataset
data=head(mtcars, 30)
 
# Add one annotation
ggplot(data, aes(x=wt, y=mpg)) +
  geom_point() + # Show dots
  geom_label(
    label="Look at this!", 
    x=4.1,
    y=20,
    label.padding = unit(0.55, "lines"), # Rectangle size around label
    label.size = 0.35,
    color = "black",
    fill="#69b3a2"
  )

3.1.9 Add Labels for a Selection of Marker

Last but not least, you can also select a group of marker and annotate them only. Here, only car with mpg > 20 and wt > 3 are annotated thanks to a data filtering in the geom_label() call.

# library
library(ggplot2)
library(dplyr)
library(tibble)

# Keep 30 first rows in the mtcars natively available dataset
data=head(mtcars, 30)

# Change data rownames as a real column called 'carName'
data <- data %>%
  rownames_to_column(var="carName")
  
# Plot
ggplot(data, aes(x=wt, y=mpg)) +
  geom_point() + 
  geom_label( 
    data=data %>% filter(mpg>20 & wt>3), # Filter data first
    aes(label=carName)
  )

3.1.10 Ggplot2 Scatterplot with Rug

This section demonstrates how to build a scatterplot with rug with R and ggplot2. Adding rug gives insight about variable distribution and is especially helpful when markers overlap.

3.1.10.1 Adding Rug with geom_rug()

A scatterplot displays the relationship between 2 numeric variables. You can easily add rug on X and Y axis thanks to the geom_rug() function to illustrate the distribution of dots.

Note you can as well add marginal plots to show these distributions.

# library
library(ggplot2)

# Iris dataset
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# plot
ggplot(data=iris, aes(x=Sepal.Length, Petal.Length)) +
  geom_point() +
  geom_rug(col="steelblue",alpha=0.1, size=1.5)

3.1.11 Marginal Distribution with ggplot2 and ggExtra

This section explains how to add marginal distributions to the X and Y axis of a ggplot2 scatterplot. It can be done using histogram, boxplot or density plot using the ggExtra library.

3.1.11.1 Basic use of ggMarginal()

Here are 3 examples of marginal distribution added on X and Y axis of a scatterplot. The ggExtra library makes it a breeze thanks to the ggMarginal() function. Three main types of distribution are available: histogram, density and boxplot.

Three additional examples to show possible customization:

  • Change marginal plot size with size.
  • Custom marginal plot appearance with all usual parameters. show only one marginal plot with margins = 'x' or margins = 'y'.
# library
library(ggplot2)
library(ggExtra)
 
# The mtcars dataset is proposed in R
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# classic plot :
p <- ggplot(mtcars, aes(x=wt, y=mpg, color=cyl, size=cyl)) +
      geom_point() +
      theme(legend.position="none")
 
# with marginal histogram
p1 <- ggMarginal(p, type="histogram")
 
# marginal density
p2 <- ggMarginal(p, type="density")
 
# marginal boxplot
p3 <- ggMarginal(p, type="boxplot")

3.1.12 More Customization

Three additional examples to show possible customization:

  • Change marginal plot size with size
  • Custom marginal plot appearance with all usual parameters
  • Show only one marginal plot with margins = ‘x’ or margins = ‘y’
# library
library(ggplot2)
library(ggExtra)
 
# The mtcars dataset is proposed in R
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# classic plot :
p <- ggplot(mtcars, aes(x=wt, y=mpg, color=cyl, size=cyl)) +
      geom_point() +
      theme(legend.position="none")
 
# Set relative size of marginal plots (main plot 10x bigger than marginals)
p1 <- ggMarginal(p, type="histogram", size=10)
 
# Custom marginal plots:
p2 <- ggMarginal(p, type="histogram", fill = "slateblue", xparams = list(  bins=10))
 
# Show only marginal plot for x axis
p3 <- ggMarginal(p, margins = 'x', color="purple", size=4)

3.1.13 Linear Model and Confidence Interval in ggplot2

Display the result of a linear model and its confidence interval on top of a scatterplot. A ggplot2 implementation with reproducible code.

3.1.13.1 Linear Trend

Adding a linear trend to a scatterplot helps the reader in seeing patterns. ggplot2 provides the geom_smooth() function that allows to add the linear trend and the confidence interval around it if needed (option se=TRUE).

Note: The method argument allows to apply different smoothing method like glm, loess and more. See the doc for more.

# Library
library(ggplot2)
library(hrbrthemes)

# Create dummy data
data <- data.frame(
  cond = rep(c("condition_1", "condition_2"), each=10), 
  my_x = 1:100 + rnorm(100,sd=9), 
  my_y = 1:100 + rnorm(100,sd=16) 
)

# Basic scatter plot.
p1 <- ggplot(data, aes(x=my_x, y=my_y)) + 
  geom_point( color="#69b3a2") +
  theme_ipsum()
 
# with linear trend
p2 <- ggplot(data, aes(x=my_x, y=my_y)) +
  geom_point() +
  geom_smooth(method=lm , color="red", se=FALSE) +
  theme_ipsum()

# linear trend + confidence interval
p3 <- ggplot(data, aes(x=my_x, y=my_y)) +
  geom_point() +
  geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum()
p1

p2

p3

3.1.14 Using Base R

Base R is also a good option to build a scatterplot, using the plot() function. The chart #13 below will guide you through its basic usage. Following examples allow a greater level of customization.

3.1.14.1 Basic Scatterplot in Base R

A very basic scatterplot built with base R and the plot() function. Explanation and code provided.

3.1.14.2 Most Basic Scatterplot

The plot() function of R allows to build a scatterplot. Both numeric variables of the input dataframe must be specified in the x and y argument.

# Create data
data = data.frame(
  x=seq(1:100) + 0.1*seq(1:100)*sample(c(1:10) , 100 , replace=T),
  y=seq(1:100) + 0.2*seq(1:100)*sample(c(1:10) , 100 , replace=T)
)

# Basic scatterplot
plot(x=data$x, y=data$y)

3.1.14.3 Customizations

Here is a description of the most common customization:

  • cex: circle size
  • xlim and ylim: limits of the X and Y axis
  • pch: shape of markers. See all here.
  • xlab and ylab: X and Y axis labels
  • col: marker color
  • main: chart title
# Create data
data = data.frame(
  x=seq(1:100) + 0.1*seq(1:100)*sample(c(1:10) , 100 , replace=T),
  y=seq(1:100) + 0.2*seq(1:100)*sample(c(1:10) , 100 , replace=T)
)

# Basic scatterplot
plot(data$x, data$y,
     xlim=c(0,250) , ylim=c(0,250), 
     pch=18, 
     cex=2, 
     col="#69b3a2",
     xlab="value of X", ylab="value of Y",
     main="A simple scatterplot"
     )

3.1.15 Map the Marker Color to a Categorical Variable

# the iris dataset is provided by R natively

# Create a color palette
library(paletteer)
colors <- paletteer_c(package = "ggthemes", palette = "Green-Blue-White", n = 3)

# Scatterplot with categoric color scale
plot(
  x = iris$Petal.Length, 
  y = iris$Petal.Width,
  bg = colors[ unclass(iris$Species) ],
  cex = 3,
  pch=21
)

3.1.16 Map the Marker Color to a Numeric Variable

# the iris dataset is provided by R natively

# Create a color palette
library(paletteer)
nColor <- 20
colors <- paletteer_c(package = "viridis", palette = "inferno", n = nColor)

# Transform the numeric variable in bins
rank <- as.factor( as.numeric( cut(iris$Petal.Width, nColor)))

# Scatterplot with color gradient
plot(
  x = iris$Petal.Length, 
  y = iris$Petal.Width,
  bg = colors[ rank ],
  cex = 3,
  pch=21
)

3.1.17 Scatterplot with Polynomial Curve Fitting

This example describes how to build a scatterplot with a polynomial curve drawn on top of it. First of all, a scatterplot is built using the native R plot() function. Then, a polynomial model is fit thanks to the lm() function.

First of all, a scatterplot is built using the native R plot() function. Then, a polynomial model is fit thanks to the lm() function. It is possible to have the estimated Y value for each step of the X axis using the predict() function, and plot it with line().

It is a good practice to add the equation of the model with text().

Note: You can also add a confidence interval around the model as described in chart #45.

x <- runif(300,  min=-10, max=10) 
y <- 0.1*x^3 - 0.5 * x^2 - x + 10 + rnorm(length(x),0,8) 
 
# plot of x and y :
plot(x,y,col=rgb(0.4,0.4,0.8,0.6),pch=16 , cex=1.3) 
 
# Can we find a polynome that fit this function ?
model <- lm(y ~ x + I(x^2) + I(x^3))
 
#Features of this model :
#summary(model)
#model$coefficients
#summary(model)$adj.r.squared
 
# For each value of x, I can get the value of y estimated by the model, and add it to the current plot !
myPredict <- predict( model ) 
ix <- sort(x,index.return=T)$ix
lines(x[ix], myPredict[ix], col=2, lwd=2 )  

# Add the features of the model to the plot
coeff <- round(model$coefficients , 2)
text(3, -70 , paste("Model : ",coeff[1] , " + " , coeff[2] , "*x"  , "+" , coeff[3] , "*x^2" , "+" , coeff[4] , "*x^3" , "\n\n" , "P-value adjusted = ",round(summary(model)$adj.r.squared,2)))

3.1.18 Polynomial Curve Fitting and Confidence Interval

This example follows the previous scatterplot with polynomial curve. It extends this example, adding a confidence interval.This example follows the previous chart #44 that explained how to add polynomial curve on top of a scatterplot in base R.

Here, a confidence interval is added using the polygon() function.

# We create 2 vectors x and y. It is a polynomial function.
x <- runif(300, min=-30, max=30) 
y <- -1.2*x^3 + 1.1 * x^2 - x + 10 + rnorm(length(x),0,100*abs(x)) 

# Basic plot of x and y :
plot(x,y,col=rgb(0.4,0.4,0.8,0.6), pch=16 , cex=1.3 , xlab="" , ylab="") 

# Can we find a polynome that fit this function ?
model <- lm(y ~ x + I(x^2) + I(x^3))

# I can get the features of this model :
#summary(model)
#model$coefficients
#summary(model)$adj.r.squared

#For each value of x, I can get the value of y estimated by the model, and the confidence interval around this value.
myPredict <- predict( model , interval="predict" )

#Finally, I can add it to the plot using the line and the polygon function with transparency.
ix <- sort(x,index.return=T)$ix
lines(x[ix], myPredict[ix , 1], col=2, lwd=2 )
polygon(c(rev(x[ix]), x[ix]), c(rev(myPredict[ ix,3]), myPredict[ ix,2]), col = rgb(0.7,0.7,0.7,0.4) , border = NA)

3.1.19 Lattice XY Plot Function

The xyplot() function of the lattice package allows to build a scatterplot for several categories automatically. The lattice library offers the xyplot() function. It builds a scatterplot for each levels of a factor automatically.

The lattice library offers the xyplot() function. It builds a scatterplot for each levels of a factor automatically.

It is actually the ancestor of the geom_wrap() function of ggplot2 than you can see in action here.

# Library
library(lattice)

# create data :
sample <- paste(rep("sample_",40) , seq(1,40) , sep="")
specie <- c(rep("carot" , 10) , rep("cumcumber" , 10) , rep("wheat" , 10) , rep("Potatoe" , 10) )
gene1 <- c( seq(5,14)+rnorm(10 , 4 , 1) , seq(5,14)+rnorm(10 , 4 , 1) , seq(5,14)+rnorm(10 , 4 , 1) , seq(5,14)+rnorm(10 , 4 , 1) )
gene2 <- c( seq(5,14)+rnorm(10 , 4 , 1) , seq(5,14)+rnorm(10 , 2 , 0.2) , seq(5,14)+rnorm(10 , 4 , 4) , seq(5,14)+rnorm(10 , 4 , 3) )
data <- data.frame(sample,specie,gene1,gene2)
 
# Make the graph
xyplot(gene1 ~ gene2 | specie , data=data , pch=20 , cex=3 , col=rgb(0.2,0.4,0.8,0.5) )

3.1.20 Correlation between Discrete Variable

Studying the relationship between 2 discrete variables is complicated since an usual scatterplot suffers overplotting. Here is a workaround using base R.

3.1.20.1 Scatterplot with Variable Size

An usual scatterplot would suffer over plotting when used for discrete variables: dots would be drawn on top of each other, making the chart unreadable. The workaround suggested here makes dot size proportional to the number of data points behind it. On top of that, the exact number can be represented in the bubble thanks to the text() function.

#Let's create 2 discrete variables 
a <- c(1,1,3,4,5,5,1,1,2,3,4,1,3,2,1,1,5,1,4,3,2,3,1,0,2)
b <- c(1,2,3,5,5,5,2,1,1,3,4,3,3,4,1,1,4,1,4,2,2,3,0,0,1)
 
#I count the occurence of each couple of values. Eg : number of time a=1 and b=1, number of time a=1 and b=2 etc...
AA <- xyTable(a,b)
 
#Now I can plot this ! I represent the dots as big as the couple occurs often
coeff_bigger <- 2
plot(AA$x , AA$y , cex=AA$number*coeff_bigger  , pch=16 , col=rgb(0,0,1,0.5) , xlab= "value of a" , ylab="value of b" , xlim=c(0,6) , ylim=c(0,6) )
text(AA$x , AA$y , AA$number )

#Note : It's easy to make a function that will compute this kind of plot automaticaly :
represent_discrete_variable <- function(var1, var2 , coeff_bigger){
  AA=xyTable(var1,var2)
  plot(AA$x , AA$y , cex=AA$number*coeff_bigger  , pch=16 , col="chocolate1" , xlab= "value of a" , ylab="value of b" )
  text (AA$x , AA$y , AA$number )
}

3.1.20.2 Other Workarounds

Other workarounds could be considered in this situation:

3.1.21 Use mtext() to Write Text in Margin

This document describes how to use the mtext() function to add text in the plot margin. Usefull to add title on a multi chart.

The mtext() function allows to write text in one of the four margins of the current figure region or one of the outer margins of the device region.

Here, the figure is first split thanks to par(mfrow()). Then, only one title is added and centered using mtext().

#Dummy data 
Ixos <- rnorm(4000,100,30)
Primadur <- Ixos+rnorm(4000 , 0 , 30)
 
#Divide the screen in 1 line and 2 columns
par(
  mfrow=c(1,2), 
  oma = c(0, 0, 2, 0)
) 
 
#Make the margin around each graph a bit smaller
par(mar=c(4,2,2,2))
 
# Histogram and Scatterplot
hist(Ixos,  main="" , breaks=30 , col=rgb(0.3,0.5,1,0.4) , xlab="height" , ylab="nbr of plants")
plot(Ixos , Primadur,  main="" , pch=20 , cex=0.4 , col=rgb(0.3,0.5,1,0.4)  , xlab="primadur" , ylab="Ixos" )
 
#And I add only ONE title :
mtext("Primadur : Distribution and correlation with Ixos", outer = TRUE, cex = 1.5, font=4, col=rgb(0.1,0.3,0.5,0.5) )

3.1.22 Customizations

Here is a description of the most common customization:

  • cex: circle size
  • xlim and ylim: limits of the X and Y axis
  • pch: shape of markers. See all here.
  • xlab and ylab: X and Y axis labels
  • col: marker color
  • main: chart title
# Create data
data = data.frame(
  x=seq(1:100) + 0.1*seq(1:100)*sample(c(1:10) , 100 , replace=T),
  y=seq(1:100) + 0.2*seq(1:100)*sample(c(1:10) , 100 , replace=T)
)

# Basic scatterplot
plot(data$x, data$y,
     xlim=c(0,250) , ylim=c(0,250), 
     pch=18, 
     cex=2, 
     col="#69b3a2",
     xlab="value of X", ylab="value of Y",
     main="A simple scatterplot"
     )

3.1.23 The split_screen() Function of R

This document explains how to use the split_screen() function of R to divide your device in several parts, one for each chart.

The split_screen() function allows to divide the window in several chart sections.

However,

#Create data
a <- seq(1,29)+4*runif(29,0.4)
b <- seq(1,29)^2+runif(29,0.98)
 
# I divide the screen in 2 line and 1 column only
my_screen_step1 <- split.screen(c(2, 1))
 
# I add one graph on the screen number 1 which is on top :
screen(my_screen_step1[1])
plot( a,b , pch=20 , xlab="value of a" , cex=3 , col=rgb(0.4,0.9,0.8,0.5) )
 
 
# I divide the second screen in 2 columns :
my_screen_step2 <- split.screen(c(1, 2), screen = my_screen_step1[2])
screen(my_screen_step2[1])
hist(a, border=F , col=rgb(0.2,0.2,0.8,0.7) , main="" , xlab="distribution of a")
screen(my_screen_step2[2])
hist(b, border=F , col=rgb(0.8,0.2,0.8,0.7) , main="" ,  xlab="distribution of b")

3.1.24 Use par mfrow to Split Screen

The par() function allows to set the mfrow() parameters to cut the charting window in several section.

3.1.24.1 Most Basic Scatterplot

The par() function allows to set parameters to the plot. The mfrow() parameter allows to split the screen in several panels. Subsequent charts will be drawn in panels.

You have to provide a vector of length 2 to mfrow(): number of rows and number of columns.

Note: mfcol() does the same job but draws figure by columns instead of by row.

Alternative: See the layout() function for more complex layout creation.

#Create data
a <- seq(1,29)+4*runif(29,0.4)
b <- seq(1,29)^2+runif(29,0.98)
 
#Divide the screen in 2 columns and 2 lines
par(mfrow=c(2,2))
 
#Add a plot in each sub-screen !
plot( a,b , pch=20)
plot(a-b , pch=18)
hist(a, border=F , col=rgb(0.2,0.2,0.8,0.7) , main="")
boxplot(a , col="grey" , xlab="a")

3.1.25 Base R Graph Parameters: A Cheatsheet.

This section aims to remind the options offered to customize a graph in base R. Understand in a sec how to use lwd, pch, type, lty, cex, and more. Base R offers many option to customize the chart appearance. Basically everthing is doable with those few options:

Base R offers many option to customize the chart appearance. Basically everthing is doable with those few options:

  • cex: shape size
  • lwd: line width
  • col: control colors
  • lty: line type
  • pch: marker shape
  • type: link between dots

Note: visit the cheatsheet section for more.

par(mar=c(3,3,3,3))
num <- 0 ; 
num1 <- 0
plot(0,0 , xlim=c(0,21) , ylim=c(0.5,6.5), col="white" , yaxt="n" , ylab="" , xlab="")
 
#fill the graph
for (i in seq(1,20)){
  points(i,1 , pch=i , cex=3)
  points(i,2 , col=i , pch=16 , cex=3)
  points(i,3 , col="black" , pch=16 , cex=i*0.25)
  
  #lty
  if(i %in% c(seq(1,18,3))){
        num=num+1
    points(c(i,i+2), c(4,4) , col="black" , lty=num , type="l" , lwd=2)
        text(i+1.1 , 4.15 , num)
        }
  
  #type and lwd 
  if(i %in% c(seq(1,20,5))){
    num1=num1+1
    points(c(i,i+1,i+2,i+3), c(5,5,5,5) , col="black"  , type=c("p","l","b","o")[num1] , lwd=2)
    text(i+1.1 , 5.2 , c("p","l","b","o")[num1] )
    points(c(i,i+1,i+2,i+3), c(6,6,6,6) , col="black"  , type="l",  lwd=num1)
    text(i+1.1 , 6.2 , num1 )
 
    }
  }
 
#add axis
axis(2, at = c(1,2,3,4,5,6), labels = c("pch" , "col" , "cex" , "lty", "type" , "lwd" ), 
     tick = TRUE, col = "black", las = 1, cex.axis = 0.8)

3.1.26 Special Use Case: Manhattan Plots

A Manhattan plot is a particular type of scatterplot used in genomics. The X axis displays the position of a genetic variant on the genome. Each chromosome is usually represented using a different color. The Y axis shows p-value of the association test with a phenotypic trait.

A Manhattan plot is a specific type of scatter plot widely used in genomics to study GWAS results (Genome Wide Association Study). Each point represents a genetic variant. The X axis shows its position on a chromosome, the Y axis tells how much it is associated with a trait. This page reviews how to make a Manhattan plot with R, and displays a couple of variations.

Basic The manhattan function is straightforward: it just needs to have 4 columns identified properly, and does a proper job.

# Load the library
library(qqman)

# Make the Manhattan plot on the gwasResults dataset
manhattan(gwasResults, chr="CHR", bp="BP", snp="SNP", p="P" )

3.1.26.1 SNP of Interest

A common task is to highlight a group of SNP on the Manhattan plot. For example it is handy to show which SNP are part of the clumping result. This is an easy task with qqman once you have identified the SNPs of interest.

# A list of SNP of interest is provided with the library:
snpsOfInterest
##   [1] "rs3001" "rs3002" "rs3003" "rs3004" "rs3005" "rs3006" "rs3007" "rs3008"
##   [9] "rs3009" "rs3010" "rs3011" "rs3012" "rs3013" "rs3014" "rs3015" "rs3016"
##  [17] "rs3017" "rs3018" "rs3019" "rs3020" "rs3021" "rs3022" "rs3023" "rs3024"
##  [25] "rs3025" "rs3026" "rs3027" "rs3028" "rs3029" "rs3030" "rs3031" "rs3032"
##  [33] "rs3033" "rs3034" "rs3035" "rs3036" "rs3037" "rs3038" "rs3039" "rs3040"
##  [41] "rs3041" "rs3042" "rs3043" "rs3044" "rs3045" "rs3046" "rs3047" "rs3048"
##  [49] "rs3049" "rs3050" "rs3051" "rs3052" "rs3053" "rs3054" "rs3055" "rs3056"
##  [57] "rs3057" "rs3058" "rs3059" "rs3060" "rs3061" "rs3062" "rs3063" "rs3064"
##  [65] "rs3065" "rs3066" "rs3067" "rs3068" "rs3069" "rs3070" "rs3071" "rs3072"
##  [73] "rs3073" "rs3074" "rs3075" "rs3076" "rs3077" "rs3078" "rs3079" "rs3080"
##  [81] "rs3081" "rs3082" "rs3083" "rs3084" "rs3085" "rs3086" "rs3087" "rs3088"
##  [89] "rs3089" "rs3090" "rs3091" "rs3092" "rs3093" "rs3094" "rs3095" "rs3096"
##  [97] "rs3097" "rs3098" "rs3099" "rs3100"
# Let's highlight them, with a bit of customization on the plot
manhattan(gwasResults, highlight = snpsOfInterest)

3.1.26.2 Annotate

You probably want to know the name of the SNP of interest: the ones with a high pvalue. You can automatically annotate them using the annotatePval argument:

manhattan(gwasResults, annotatePval = 0.01)

3.1.26.3 Qqplot

It is a good practice to draw a qqplot from the output of a GWAS. It allows to compare the distribution of the pvalue with an expected distribution by chance. Its realisation is straightforward thanks to the qq function:

qq(gwasResults$P)

3.1.27 Highly Customizable with ggplot2

If you want to access a maximum level of customization it is sometimes good to build your plot from scratch. Here is an example using dplyr and ggplot2.

3.1.27.1 Basic

First of all, we need to compute the cumulative position of SNP.

don <- gwasResults %>% 
  
  # Compute chromosome size
  group_by(CHR) %>% 
  summarise(chr_len=max(BP)) %>% 
  
  # Calculate cumulative position of each chromosome
  mutate(tot=cumsum(chr_len)-chr_len) %>%
  dplyr::select(-chr_len) %>%
  
  # Add this info to the initial dataset
  left_join(gwasResults, ., by=c("CHR"="CHR")) %>%
  
  # Add a cumulative position of each SNP
  arrange(CHR, BP) %>%
  mutate( BPcum=BP+tot)

Then we need to prepare the X axis. Indeed we do not want to display the cumulative position of SNP in bp, but just show the chromosome name instead.

axisdf = don %>% group_by(CHR) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )

Ready to make the plot using ggplot2:

ggplot(don, aes(x=BPcum, y=-log10(P))) +
    
    # Show all points
    geom_point( aes(color=as.factor(CHR)), alpha=0.8, size=1.3) +
    scale_color_manual(values = rep(c("grey", "skyblue"), 22 )) +
    
    # custom X axis:
    scale_x_continuous( label = axisdf$CHR, breaks= axisdf$center ) +
    scale_y_continuous(expand = c(0, 0) ) +     # remove space between plot area and x axis
  
    # Custom the theme:
    theme_bw() +
    theme( 
      legend.position="none",
      panel.border = element_blank(),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank()
    )

3.1.27.2 Highlight SNPs

Let’s suppose the you have a group of SNP that you want to highlight on the plot. This can be done following almost the same procedure. We just need to add them a flag in the dataframe, and use the flag for the color:

# List of SNPs to highlight are in the snpsOfInterest object
# We will use ggrepel for the annotation
library(ggrepel)

# Prepare the dataset
don <- gwasResults %>% 
  
  # Compute chromosome size
  group_by(CHR) %>% 
  summarise(chr_len=max(BP)) %>% 
  
  # Calculate cumulative position of each chromosome
  mutate(tot=cumsum(chr_len)-chr_len) %>%
  dplyr::select(-chr_len) %>%
  
  # Add this info to the initial dataset
  left_join(gwasResults, ., by=c("CHR"="CHR")) %>%
  
  # Add a cumulative position of each SNP
  arrange(CHR, BP) %>%
  mutate( BPcum=BP+tot) %>%

  # Add highlight and annotation information
  mutate( is_highlight=ifelse(SNP %in% snpsOfInterest, "yes", "no")) %>%
  mutate( is_annotate=ifelse(-log10(P)>4, "yes", "no")) 

# Prepare X axis
axisdf <- don %>% group_by(CHR) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )

# Make the plot
ggplot(don, aes(x=BPcum, y=-log10(P))) +
    
    # Show all points
    geom_point( aes(color=as.factor(CHR)), alpha=0.8, size=1.3) +
    scale_color_manual(values = rep(c("grey", "skyblue"), 22 )) +
    
    # custom X axis:
    scale_x_continuous( label = axisdf$CHR, breaks= axisdf$center ) +
    scale_y_continuous(expand = c(0, 0) ) +     # remove space between plot area and x axis

    # Add highlighted points
    geom_point(data=subset(don, is_highlight=="yes"), color="orange", size=2) +
  
    # Add label using ggrepel to avoid overlapping
    geom_label_repel( data=subset(don, is_annotate=="yes"), aes(label=SNP), size=2) +

    # Custom the theme:
    theme_bw() +
    theme( 
      legend.position="none",
      panel.border = element_blank(),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank()
    )

3.1.27.3 A Note about Speed

A common problem in genomics is the high number of data points. It is not unusual to make a GWAS with millions of SNPs, which makes the plot very long to build. However, it is important to notice that the vast majority of these SNPs have a high p-value and thus do not interest us for the analysis.

A quick work around is thus to plot only SNP with a p-value below a given threshold (~0.05). The graphic will be as informative, but made in seconds. The filtering is straightforward. For example with dplyr:

gwasResults %>% 
  filter(-log10(P)>1)

Decreasing the number of data points has another interest: it allows to switch to an interactive version.

3.1.28 Switch to an Interactive Version with plotly

plotly is an HTML widget: an R library that allows to call javascript under the hood to create interactive visualizations. The good thing with plotly is that it can transform a ggplot2 graphic in an interactive version using the ggplotly function. Let’s apply it to our manhattan plot.

  • Note 1: You probably want to filter your data before doing an interactive version. Having thousands of points will slow down the graphic, and you surely don’t care about SNP with a high p-value.

  • Note 2: the Manhattanly library is another good way to make an interactive manhattan plot. It wraps the plotly library, so you will have less code to type than the example below, but less customization available.

  • Note 3: Interactivity allows to: zoom on a specific region of the graphic, hover a SNP, move axis, export figure as png.

library(plotly)

# Prepare the dataset
don <- gwasResults %>% 
  
  # Compute chromosome size
  group_by(CHR) %>% 
  summarise(chr_len=max(BP)) %>% 
  
  # Calculate cumulative position of each chromosome
  mutate(tot=cumsum(chr_len)-chr_len) %>%
  dplyr::select(-chr_len) %>%
  
  # Add this info to the initial dataset
  left_join(gwasResults, ., by=c("CHR"="CHR")) %>%
  
  # Add a cumulative position of each SNP
  arrange(CHR, BP) %>%
  mutate( BPcum=BP+tot) %>%

  # Add highlight and annotation information
  mutate( is_highlight=ifelse(SNP %in% snpsOfInterest, "yes", "no")) %>%

  # Filter SNP to make the plot lighter
  filter(-log10(P)>0.5)
  
# Prepare X axis
axisdf <- don %>% group_by(CHR) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )

# Prepare text description for each SNP:
don$text <- paste("SNP: ", don$SNP, "\nPosition: ", don$BP, "\nChromosome: ", don$CHR, "\nLOD score:", -log10(don$P) %>% round(2), "\nWhat else do you wanna know", sep="")

# Make the plot
p <- ggplot(don, aes(x=BPcum, y=-log10(P), text=text)) +
    
    # Show all points
    geom_point( aes(color=as.factor(CHR)), alpha=0.8, size=1.3) +
    scale_color_manual(values = rep(c("grey", "skyblue"), 22 )) +
    
    # custom X axis:
    scale_x_continuous( label = axisdf$CHR, breaks= axisdf$center ) +
    scale_y_continuous(expand = c(0, 0) ) +     # remove space between plot area and x axis
    ylim(0,9) +

    # Add highlighted points
    geom_point(data=subset(don, is_highlight=="yes"), color="orange", size=2) +
  
    # Custom the theme:
    theme_bw() +
    theme( 
      legend.position="none",
      panel.border = element_blank(),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank()
    )
ggplotly(p, tooltip="text")

3.1.29 Circular Version with CMplot

The CMplot library by Lilin Yin is a good choice if you want to make a circular version of your manhattanplot. I believe than doing a circular version makes sense: it gives less space to all the non significant SNPs that do not interest us, and gives more space for the significant association. Moreover, the CMplot makes their realization straightforward.

library(CMplot)      #install.packages("CMplot")
CMplot(gwasResults, plot.type="c", r=1.6, cir.legend=TRUE,
        outward=TRUE, cir.legend.col="black", cir.chr.h=.1 ,chr.den.col="orange", file="jpg",
        memo="", dpi=300, chr.labels=seq(1,22))
CMplot(pig60K,type="p",plot.type="d",bin.size=1e6,chr.den.col=c("darkgreen", "yellow", "red"),file="jpg",memo="",dpi=300,
    file.output=TRUE,verbose=TRUE,width=9,height=6)

This section can be located at YinLiLin Github.

3.1.29.1 A high-quality drawing tool designed for Manhattan plot of genomic analysis

3.1.29.2 Installation

CMplot is available on CRAN, so it can be installed with the following R code:

#install.packages("CMplot")
library("CMplot")

# if you want to use the latest version on GitHub:
source("https://raw.githubusercontent.com/YinLiLin/CMplot/master/R/CMplot.r")

There are two example datasets attached in CMplot, users can export and view the details by following R code:

data(pig60K)   #calculated p-values by MLM
data(cattle50K)   #calculated SNP effects by rrblup
head(pig60K)

          SNP Chromosome Position    trait1     trait2     trait3
1 ALGA0000009          1    52297 0.7738187 0.51194318 0.51194318
2 ALGA0000014          1    79763 0.7738187 0.51194318 0.51194318
3 ALGA0000021          1   209568 0.7583016 0.98405289 0.98405289
4 ALGA0000022          1   292758 0.7200305 0.48887140 0.48887140
5 ALGA0000046          1   747831 0.9736840 0.22096836 0.22096836
6 ALGA0000047          1   761957 0.9174565 0.05753712 0.05753712

head(cattle50K)

   SNP chr    pos Somatic cell score  Milk yield Fat percentage
1 SNP1   1  59082        0.000244361 0.000484255    0.001379210
2 SNP2   1 118164        0.000532272 0.000039800    0.000598951
3 SNP3   1 177246        0.001633058 0.000311645    0.000279427
4 SNP4   1 236328        0.001412865 0.000909370    0.001040161
5 SNP5   1 295410        0.000090700 0.002202973    0.000351394
6 SNP6   1 354493        0.000110681 0.000342628    0.000105792

As the example datasets, the first three columns are names, chromosome, position of SNPs respectively, the rest of columns are the pvalues of GWAS or effects of GS/GP for traits, the number of traits is unlimited. Note: if plotting SNP_Density, only the first three columns are needed.

Now CMplot could handle not only Genome-wide association study results, but also SNP effects, Fst, tajima’s D and so on.


Total 50~ parameters are available in CMplot, typing ?CMplot can get the detail function of all parameters.


3.1.30 SNP-Density Plot

CMplot(pig60K,type="p",plot.type="d",bin.size=1e6,chr.den.col=c("darkgreen", "yellow", "red"),file="jpg",memo="",dpi=300,
    file.output=TRUE,verbose=TRUE,width=9,height=6)
# users can personally set the windowsize and the min/max of legend by:
# bin.size=1e6
# bin.range=c(min, max)
# memo: add a character to the output file name
# chr.labels: change the chromosome names


3.1.31 Circular-Manhattan Plot

3.1.31.1 (1) Genome-Wide Association Study(GWAS)

CMplot(pig60K,type="p",plot.type="c",chr.labels=paste("Chr",c(1:18,"X","Y"),sep=""),r=0.4,cir.legend=TRUE,
        outward=FALSE,cir.legend.col="black",cir.chr.h=1.3,chr.den.col="black",file="jpg",
        memo="",dpi=300,file.output=TRUE,verbose=TRUE,width=10,height=10)

CMplot(pig60K,type="p",plot.type="c",r=0.4,col=c("grey30","grey60"),chr.labels=paste("Chr",c(1:18,"X","Y"),sep=""),
      threshold=c(1e-6,1e-4),cir.chr.h=1.5,amplify=TRUE,threshold.lty=c(1,2),threshold.col=c("red",
      "blue"),signal.line=1,signal.col=c("red","green"),chr.den.col=c("darkgreen","yellow","red"),
      bin.size=1e6,outward=FALSE,file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE,width=10,height=10)

#Note:
1. if signal.line=NULL, the lines that crosse circles won't be added.
2. if the length of parameter 'chr.den.col' is not equal to 1, SNP density that counts 
   the number of SNP within given size('bin.size') will be plotted around the circle.

3.1.31.2 (2) Genomic Selection/Prediction(GS/GP)

CMplot(cattle50K,type="p",plot.type="c",LOG10=FALSE,outward=TRUE,col=matrix(c("#4DAF4A",NA,NA,"dodgerblue4",
         "deepskyblue",NA,"dodgerblue1", "olivedrab3", "darkgoldenrod1"), nrow=3, byrow=TRUE),
         chr.labels=paste("Chr",c(1:29),sep=""),threshold=NULL,r=1.2,cir.chr.h=1.5,cir.legend.cex=0.5,
         cir.band=1,file="jpg", memo="",dpi=300,chr.den.col="black",file.output=TRUE,verbose=TRUE,
         width=10,height=10)
        
#Note: parameter 'col' can be either vector or matrix, if a matrix, each trait can be plotted in different colors.


3.1.32 Single_track Rectangular-Manhattan Plot

3.1.32.1 Genome-Wide Association Study(GWAS)

CMplot(pig60K,type="p",plot.type="m",LOG10=TRUE,threshold=NULL,file="jpg",memo="",dpi=300,
    file.output=TRUE,verbose=TRUE,width=14,height=6,chr.labels.angle=45)
# 'chr.labels.angle': adjust the angle of labels of x-axis (-90 < chr.labels.angle < 90).

3.1.32.2 Amplify Signals on pch, cex and col

CMplot(pig60K, plot.type="m", col=c("grey30","grey60"), LOG10=TRUE, ylim=c(2,12), threshold=c(1e-6,1e-4),
        threshold.lty=c(1,2), threshold.lwd=c(1,1), threshold.col=c("black","grey"), amplify=TRUE,
        chr.den.col=NULL, signal.col=c("red","green"), signal.cex=c(1.5,1.5),signal.pch=c(19,19),
        file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE,width=14,height=6)

#Note: if the ylim is setted, then CMplot will only plot the points among this interval,
#       ylim can be vector or list, if it is a list, different traits can be assigned with
#       different range at y-axis.

3.1.32.3 Attach Chromosome Density on the bottom of Manhattan Plot

CMplot(pig60K, plot.type="m", LOG10=TRUE, ylim=NULL, threshold=c(1e-6,1e-4),threshold.lty=c(1,2),
        threshold.lwd=c(1,1), threshold.col=c("black","grey"), amplify=TRUE,bin.size=1e6,
        chr.den.col=c("darkgreen", "yellow", "red"),signal.col=c("red","green"),signal.cex=c(1.5,1.5),
        signal.pch=c(19,19),file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE,
        width=14,height=6)
        
#Note: if the length of parameter 'chr.den.col' is bigger than 1, SNP density that counts 
   the number of SNP within given size('bin.size') will be plotted.

3.1.32.4 Highlight a Group of SNPs on pch, cex, type, and col

signal <- pig60K$Position[which.min(pig60K$trait2)]
SNPs <- pig60K$SNP[pig60K$Chromosome==13 & 
        pig60K$Position<(signal+1000000)&pig60K$Position>(signal-1000000)]
CMplot(pig60K, plot.type="m",LOG10=TRUE,col=c("grey30","grey60"),highlight=SNPs,
        highlight.col="green",highlight.cex=1,highlight.pch=19,file="jpg",memo="",
        chr.border=TRUE,dpi=300,file.output=TRUE,verbose=TRUE,width=14,height=6)
#Note:
'highlight' could be vector or list, if it is a vector, all traits will use the same highlighted SNPs index, 
if it is a list, the length of the list should equal to the number of traits.
highlight.col, highlight.cex, highlight.pch can be value or vector, if its length equals to the length of highlighted SNPs,
each SNPs have its special colour, size and shape.

SNPs <-  pig60K[pig60K$trait2 < 1e-4, 1]
CMplot(pig60K,type="h",plot.type="m",LOG10=TRUE,highlight=SNPs,highlight.type="p",
        highlight.col=NULL,highlight.cex=1.2,highlight.pch=19,file="jpg",memo="",
        dpi=300,file.output=TRUE,verbose=TRUE,width=14,height=6,band=0.6)

SNPs <-  pig60K[pig60K$trait2 < 1e-4, 1]
CMplot(pig60K,type="p",plot.type="m",LOG10=TRUE,highlight=SNPs,highlight.type="h",
        col=c("grey30","grey60"),highlight.col="darkgreen",highlight.cex=1.2,highlight.pch=19,
        file="jpg",dpi=300,file.output=TRUE,verbose=TRUE,width=14,height=6)

3.1.32.5 Visualize only oOne Chromosome

CMplot(pig60K[pig60K$Chromosome==13, ], plot.type="m",LOG10=TRUE,col=c("grey60"),highlight=SNPs,
        highlight.col="green",highlight.cex=1,highlight.pch=19,file="jpg",memo="", 
        threshold=c(1e-6,1e-4),threshold.lty=c(1,2),threshold.lwd=c(1,2), width=9,height=6,
        threshold.col=c("red","blue"),amplify=FALSE,dpi=300,file.output=TRUE,verbose=TRUE)

3.1.32.6 Add Genes or SNP Names around the Highlighted SNPs

SNPs <- pig60K[pig60K[,5] < (0.05 / nrow(pig60K)), 1]
genes <- paste("GENE", 1:length(SNPs), sep="_")
set.seed(666666)
CMplot(pig60K[,c(1:3,5)], plot.type="m",LOG10=TRUE,col=c("grey30","grey60"),highlight=SNPs,
        highlight.col=c("red","blue","green"),highlight.cex=1,highlight.pch=c(15:17), highlight.text=genes,      
        highlight.text.col=c("red","blue","green"),threshold=0.05/nrow(pig60K),threshold.lty=2,   
        amplify=FALSE,file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE,width=14,height=6)
#Note:
'highlight', 'highlight.text', 'highlight.text.xadj', 'highlight.text.yadj' could be vector or list, if it is a vector, 
all traits will use the same highlighted SNPs index and text, if it is a list, the length of the list should equal to the number of traits.
the order of 'highlight.text' must be consistent with 'highlight'
highlight.text.cex: value or vecter, control the size of added text
highlight.text.font: value or vecter, control the font of added text
highlight.text.xadj: value or vecter or list for multiple traits, -1, 0, 1 limited, control the position of text around the highlighted SNPs,
                     -1(left), 0(center), 1(right)
highlight.text.yadj: value or vector or list for multiple traits, same as above, -1(down), 0(center), 1(up)

3.1.32.7 Genomic

Selection/Prediction(GS/GP) or other none p-values

CMplot(cattle50K, plot.type="m", band=0.5, LOG10=FALSE, ylab="SNP effect",threshold=0.015,
        threshold.lty=2, threshold.lwd=1, threshold.col="red", amplify=TRUE, width=14,height=6,
        signal.col=NULL, chr.den.col=NULL, file="jpg",memo="",dpi=300,file.output=TRUE,
        verbose=TRUE,cex=0.8)
#Note: if signal.col=NULL, the significant SNPs will be plotted with original colors.

cattle50K[,4:ncol(cattle50K)] <- apply(cattle50K[,4:ncol(cattle50K)], 2, 
         function(x) x*sample(c(1,-1), length(x), rep=TRUE))
CMplot(cattle50K, type="h",plot.type="m", band=0.5, LOG10=FALSE, ylab="SNP effect",ylim=c(-0.02,0.02),
        threshold.lty=2, threshold.lwd=1, threshold.col="red", amplify=FALSE,cex=0.6,
        chr.den.col=NULL, file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE)

#Note: Positive and negative values are acceptable.

3.1.33 Multi_tracks Rectangular-Manhattan Plot

SNPs <- list(
	pig60K$SNP[pig60K$trait1<1e-6],
	pig60K$SNP[pig60K$trait2<1e-6],
	pig60K$SNP[pig60K$trait3<1e-6]
)
CMplot(pig60K, plot.type="m",multracks=TRUE,threshold=c(1e-6,1e-4),threshold.lty=c(1,2), 
        threshold.lwd=c(1,1), threshold.col=c("black","grey"), amplify=TRUE,bin.size=1e6,
        chr.den.col=c("darkgreen", "yellow", "red"), signal.col=c("red","green","blue"),
        signal.cex=1, file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE,
        highlight=SNPs, highlight.text=SNPs, highlight.text.cex=1.4)
#Note: if you are not supposed to change the color of signal, 
          please set signal.col=NULL and highlight.col=NULL.

3.1.33.1 a. All Traits in One Axes

3.1.33.2 b. All Traits in Separated Axes:


3.1.34 Single_track Q-Q Plot

CMplot(pig60K,plot.type="q",box=FALSE,file="jpg",memo="",dpi=300,
    conf.int=TRUE,conf.int.col=NULL,threshold.col="red",threshold.lty=2,
    file.output=TRUE,verbose=TRUE,width=5,height=5)

3.1.35 Multi_tracks Q-Q plot

pig60K$trait1[sample(1:nrow(pig60K), round(nrow(pig60K)*0.80))] <- NA
pig60K$trait2[sample(1:nrow(pig60K), round(nrow(pig60K)*0.25))] <- NA
CMplot(pig60K,plot.type="q",col=c("dodgerblue1", "olivedrab3", "darkgoldenrod1"),threshold=1e-6,
        ylab.pos=2,signal.pch=c(19,6,4),signal.cex=1.2,signal.col="red",conf.int=TRUE,box=FALSE,multracks=
        TRUE,cex.axis=2,file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE,ylim=c(0,8),width=5,height=5)

3.1.35.1 a. All Traits in a Axes

3.1.35.2 b. all traits in separated axes:


3.2 Heatmap


A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. This page displays many examples built with R, both static and interactive.

3.2.0.1 Using THE heatmap() Function

The heatmap() function is natively provided in R. It produces high quality matrix and offers statistical tools to normalize input data, run clustering algorithm and visualize the result with dendrograms. It is one of the very rare case where I prefer base R to ggplot2.

3.2.0.2 Most Basic Heatmap

How to do it: below is the most basic heatmap you can build in base R, using the heatmap() function with no parameters. Note that it takes as input a matrix. If you have a data frame, you can convert it to a matrix with as.matrix(), but you need numeric variables only.

How to read it: each column is a variable. Each observation is a row. Each square is a value, the closer to yellow the higher. You can transpose the matrix with t(data) to swap X and Y axis.

Note: as you can see this heatmap is not very insightful: all the variation is absorbed by the hp and disp variables that have very high values compared to the others.

# The mtcars dataset:
data <- as.matrix(mtcars)

# Default Heatmap
heatmap(data)

3.2.1 Normalization

Normalizing the matrix is done using the scale argument of the heatmap() function. It can be applied to row or to column. Here the column option is chosen, since we need to absorb the variation between column.

# Use 'scale' to normalize
heatmap(data, scale="column")

3.2.2 Dendrogram and Reordering

You may have noticed that order of both rows and columns is different compare to the native mtcar matrix. This is because heatmap() reorders both variables and observations using a clustering algorithm: it computes the distance between each pair of rows and columns and try to order them by similarity.

Moreover, the corresponding dendrograms are provided beside the heatmap. We can avoid it and just visualize the raw matrix: use the Rowv and Colv arguments as follow.

# No dendrogram nor reordering for neither column or row
heatmap(data, Colv = NA, Rowv = NA, scale="column")

3.2.3 Color Palette

There are several ways to custom the color palette:

  • Use the native palettes of R: terrain.color(), rainbow(), heat.colors(), topo.colors() or cm.colors().
  • Use the palettes proposed by RColorBrewer. See list of available palettes here.
# 1: native palette from R
heatmap(data, scale="column", col = cm.colors(256))

heatmap(data, scale="column", col = terrain.colors(256))

# 2: Rcolorbrewer palette
library(RColorBrewer)
coul <- colorRampPalette(brewer.pal(8, "PiYG"))(25)
heatmap(data, scale="column", col = coul)

3.2.4 Custom Layout

You can custom title & axis titles with the usual main and xlab/ylab arguments (left).

You can also change labels with labRow/colRow and their size with cexRow/cexCol

# Add classic arguments like main title and axis title
heatmap(data, Colv = NA, Rowv = NA, scale="column", col = coul, xlab="variable", ylab="car", main="heatmap")

# Custom x and y labels with cexRow and labRow (col respectively)
heatmap(data, scale="column", cexRow=1.5, labRow=paste("new_", rownames(data),sep=""), col= colorRampPalette(brewer.pal(8, "Blues"))(25))

3.2.5 Add Color beside Heatmap

Often, heatmap intends to compare the observed structure with an expected one.

You can add a vector of color beside the heatmap to represents the expected structure using the RowSideColors argument.

# Example: grouping from the first letter:
my_group <- as.numeric(as.factor(substr(rownames(data), 1 , 1)))
colSide <- brewer.pal(9, "Set1")[my_group]
colMain <- colorRampPalette(brewer.pal(8, "Blues"))(25)
heatmap(data, Colv = NA, Rowv = NA, scale="column" , RowSideColors=colSide, col=colMain   )

3.2.6 Using geom_title() from ggplot2

ggplot2 also allows to build heatmaps thanks to geom_tile(). However, I personally prefer the heatmap() function above since only it offers option for normalization, clustering and Dendrogram.

3.2.7 ggplot2 Heatmap

This document provides several examples of heatmaps built with R and ggplot2. It describes the main customization you can apply, with explanation and reproducible code.

Note: The native heatmap() function provides more options for data normalization and clustering. Consider it as a valuable option.

3.2.7.1 Most Basic Heatmap with ggplot2

This is the most basic heatmap you can build with R and ggplot2, using the geom_tile() function. Input data must be a long format where each row provides an observation. At least 3 variables are needed per observation:

  • x: position on the X axis.
  • y: position on the Y axis.
  • fill: the numeric value that will be translated in a color.
# Library
library(ggplot2)

# Dummy data
x <- LETTERS[1:20]
y <- paste0("var", seq(1,20))
data <- expand.grid(X=x, Y=y)
data$Z <- runif(400, 0, 5)
 
# Heatmap 
ggplot(data, aes(X, Y, fill= Z)) + 
  geom_tile()

3.2.8 Control Color Palette

Color palette can be changed like in any ggplot2 chart. Above are 3 examples using different methods:

  • scale_fill_gradient() to provide extreme colors of the palette.
  • scale_fill_distiller() to provide a ColorBrewer palette.
  • scale_fill_viridis() to use Viridis. Do not forget discrete=FALSE for a continuous variable.
# Library
library(ggplot2)
library(hrbrthemes)

# Dummy data
x <- LETTERS[1:20]
y <- paste0("var", seq(1,20))
data <- expand.grid(X=x, Y=y)
data$Z <- runif(400, 0, 5)
 
# Give extreme colors:
ggplot(data, aes(X, Y, fill= Z)) + 
  geom_tile() +
  scale_fill_gradient(low="white", high="blue") +
  theme_ipsum()

# Color Brewer palette
ggplot(data, aes(X, Y, fill= Z)) + 
  geom_tile() +
  scale_fill_distiller(palette = "RdPu") +
  theme_ipsum()

# Color Brewer palette
library(viridis)
ggplot(data, aes(X, Y, fill= Z)) + 
  geom_tile() +
  scale_fill_viridis(discrete=FALSE) +
  theme_ipsum()

3.2.9 From Wide Input Format

It is a common issue to have a wide matrix as input, as for the volcano dataset. In this case, you need to tidy it with the gather() function of the tidyr package to visualize it with ggplot.

# Library
library(ggplot2)
library(tidyr)
library(tibble)
library(hrbrthemes)
library(dplyr)

# Volcano dataset
#volcano

# Heatmap 
volcano %>%
  
  # Data wrangling
  as_tibble() %>%
  rowid_to_column(var="X") %>%
  gather(key="Y", value="Z", -1) %>%
  
  # Change Y to numeric
  mutate(Y=as.numeric(gsub("V","",Y))) %>%

  # Viz
  ggplot(aes(X, Y, fill= Z)) + 
    geom_tile() +
    theme_ipsum() +
    theme(legend.position="none")

3.2.10 Turn it Interactive with plotly

One of the nice feature of ggplot2 is that charts can be turned interactive in seconds thanks to plotly. You just need to wrap your chart in an object and call it in the ggplotly() function.

Often, it is a good practice to custom the text available in the tooltip.

Note: try to hover cells to see the tooltip, select an area to zoom in.

# Library
library(ggplot2)
library(hrbrthemes)
library(plotly)

# Dummy data
x <- LETTERS[1:20]
y <- paste0("var", seq(1,20))
data <- expand.grid(X=x, Y=y)
data$Z <- runif(400, 0, 5)
 
# new column: text for tooltip:
data <- data %>%
  mutate(text = paste0("x: ", x, "\n", "y: ", y, "\n", "Value: ",round(Z,2), "\n", "What else?"))

# classic ggplot, with text in aes
p <- ggplot(data, aes(X, Y, fill= Z, text=text)) + 
  geom_tile() +
  theme_ipsum()

ggplotly(p, tooltip="text")
# save the widget
# library(htmlwidgets)
# saveWidget(pp, file=paste0( getwd(), "/HtmlWidget/ggplotlyHeatmap.html"))

3.2.11 Interactive Heatmaps from R

Three options exist to build an interactive heatmap from R:

  • plotly: As described above, plotly allows to turn any heatmap made with ggplot2 interactive.

  • d3heatmap: A package that uses the same syntax as the base R heatmap() function to make interactive version.

  • heatmaply: The most flexible option, allowing many different kind of customization. See the code of the chart beside here.

3.2.11.1 Definition

A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. It is a bit like looking a data table from above.

Here is an example showing 8 general features like population or life expectancy for about 30 countries in 2015. Data come from the French National Institute of Demographic Studies.

# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(plotly)
library(d3heatmap)

# Load data 
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/multivariate.csv", header=T, sep=";")
colnames(data) <- gsub("\\.", " ", colnames(data))

# Select a few country
data <- data %>% 
  filter(Country %in% c("France", "Sweden", "Italy", "Spain", "England", "Portugal", "Greece", "Peru", "Chile", "Brazil", "Argentina", "Bolivia", "Venezuela", "Australia", "New Zealand", "Fiji", "China", "India", "Thailand", "Afghanistan", "Bangladesh", "United States of America", "Canada", "Burundi", "Angola", "Kenya", "Togo")) %>%
  arrange(Country) %>%
  mutate(Country = factor(Country, Country))

# Matrix format
mat <- data
rownames(mat) <- mat[,1]
mat <- mat %>% dplyr::select(-Country, -Group, -Continent)
mat <- as.matrix(mat)

# Heatmap
#d3heatmap(mat, scale="column", dendrogram = "none", width="800px", height="80Opx", colors = "Blues")

library(heatmaply)
p <- heatmaply(mat, 
        dendrogram = "none",
        xlab = "", ylab = "", 
        main = "",
        scale = "column",
        margins = c(60,100,40,20),
        grid_color = "white",
        grid_width = 0.00001,
        titleX = FALSE,
        hide_colorbar = TRUE,
        branches_lwd = 0.1,
        label_names = c("Country", "Feature:", "Value"),
        fontsize_row = 5, fontsize_col = 5,
        labCol = colnames(mat),
        labRow = rownames(mat),
        heatmap_layers = theme(axis.line=element_blank())
        )
p

3.2.11.2 What for

Heatmap is really useful to display a general view of numerical data, not to extract specific data point. In the graphic above, the huge population size of China and India pops out for example.

Heatmap is also useful to display the result of hierarchical clustering. Basically, clustering checks what countries tend to have the same features on their numeric variables, what countries are similar. The usual way to represent the result is to use dendrogram. This type of chart can be drawn on top of the heatmap:

p <- heatmaply(mat, 
        #dendrogram = "row",
        xlab = "", ylab = "", 
        main = "",
        scale = "column",
        margins = c(60,100,40,20),
        grid_color = "white",
        grid_width = 0.00001,
        titleX = FALSE,
        hide_colorbar = TRUE,
        branches_lwd = 0.1,
        label_names = c("Country", "Feature:", "Value"),
        fontsize_row = 5, fontsize_col = 5,
        labCol = colnames(mat),
        labRow = rownames(mat),
        heatmap_layers = theme(axis.line=element_blank())
        )
p
# save the widget
# library(htmlwidgets)
# saveWidget(p, file= "~/Desktop/R-graph-gallery/HtmlWidget/heatmapInter.html")

3.2.12 Heatmap for Time Series

Heatmaps can be a very good alternative to visualize time series, especially when the time frame you study is repeating, like weeks. Here is a customized example, but visit the time series section for more.

A submission by John MacKintosh who visualized meteorological data using a heatmap built with ggplot2. Initial code is stored on github and displayed below:

3.2.13 The Hourly Heatmap

A heatmap used to display time series with R and ggplot2. A submission by John MacKintosh with reproducible code.

A submission by John MacKintosh who visualized meteorological data using a heatmap built with ggplot2. Initial code is stored on github and displayed below:

library(ggplot2)
library(dplyr) # easier data wrangling 
library(viridis) # colour blind friendly palette, works in B&W also
library(Interpol.T) #  will generate a large dataset on initial load
library(lubridate) # for easy date manipulation
library(ggExtra) # because remembering ggplot theme options is beyond me
library(tidyr) 
 
 
data <- data(Trentino_hourly_T,package = "Interpol.T")
 
names(h_d_t)[1:5]<- c("stationid","date","hour","temp","flag")
df <- tbl_df(h_d_t) %>%
  filter(stationid =="T0001")
 
df <- df %>% mutate(year = year(date),
                  month = month(date, label=TRUE),
                  day = day(date))
  
df$date<-ymd(df$date) # not necessary for plot but 
#useful if you want to do further work with the data
 
#cleanup
rm(list=c("h_d_t","mo_bias","Tn","Tx",
          "Th_int_list","calibration_l",
          "calibration_shape","Tm_list"))
 
 
#create plotting df
df <-df %>% dplyr::select(stationid,day,hour,month,year,temp)%>%
        fill(temp) #optional - see note below
 
# Re: use of fill
# This code is for demonstrating a visualisation technique
# There are 5 missing hourly values in the dataframe.
 
# see the original plot here (from my ggplot demo earlier this year) to see the white spaces where the missing values occcur:
# https://github.com/johnmackintosh/ggplotdemo/blob/master/temp8.png 
 
# I used 'fill' from  tidyr to take the prior value for each missing value and replace the NA
# This is a quick fix for the blog section only - _do not_ do this with your real world data
 
# Should really use either use replace_NA or complete(with fill)in tidyr 
# OR 
# Look into more specialist way of replacing these missing values -e.g. imputation.
 
 
 
statno <-unique(df$stationid)
 
 
 
######## Plotting starts here#####################
p <-ggplot(df,aes(day,hour,fill=temp))+
  geom_tile(color= "white",size=0.1) + 
  scale_fill_viridis(name="Hrly Temps C",option ="C")
p <-p + facet_grid(year~month)
p <-p + scale_y_continuous(trans = "reverse", breaks = unique(df$hour))
p <-p + scale_x_continuous(breaks =c(1,10,20,31))
p <-p + theme_minimal(base_size = 8)
p <-p + labs(title= paste("Hourly Temps - Station",statno), x="Day", y="Hour Commencing")
p <-p + theme(legend.position = "bottom")+
  theme(plot.title=element_text(size = 14))+
  theme(axis.text.y=element_text(size=6)) +
  theme(strip.background = element_rect(colour="white"))+
  theme(plot.title=element_text(hjust=0))+
  theme(axis.ticks=element_blank())+
  theme(axis.text=element_text(size=7))+
  theme(legend.title=element_text(size=8))+
  theme(legend.text=element_text(size=6))+
  removeGrid()#ggExtra
 
# you will want to expand your plot screen before this bit!
p #awesomeness

3.2.14 Using the levelplot() Function of Lattice

This document explains how to use the levelplot() function of the lattice R package to build heatmaps.

3.2.14.1 Basis use of levelplot()

The lattice package allows to build heatmaps thanks to the levelplot() function.

Input data: here input is a data frame with 3 columns prividing the X and Y coordinate of the cell and its value. (Long format).

# Load the lattice package
library("lattice")
 
# Dummy data
x <- seq(1,10, length.out=20)
y <- seq(1,10, length.out=20)
data <- expand.grid(X=x, Y=y)
data$Z <- runif(400, 0, 5)

## Try it out
levelplot(Z ~ X*Y, data=data  ,xlab="X",
          main="")

3.2.15 From Wide Input Matrix

Previous example of this document was based on a data frame at the long format. Here, a square matrix is used instead. It is the second format understood by the levelplot() function.

Note: here row and column order isn’t respected in the heatmap.

# Load the library
library("lattice")
 
# Dummy data
data <- matrix(runif(100, 0, 5) , 10 , 10)
colnames(data) <- letters[c(1:10)]
rownames(data) <- paste( rep("row",10) , c(1:10) , sep=" ")
 
# plot it flipping the axis
levelplot(data)

3.2.16 Flip and Reorder Axis

The t() function of R allows to transpose the input matrix, and thus to flip X and Y coordinates.

Moreover, you can reverse matrix order as shown below to reverse order in the heatmap as well. Now the heatmap is organized exactly as the input matrix.

# Load the library
library("lattice")
 
# Dummy data
data <- matrix(runif(100, 0, 5) , 10 , 10)
colnames(data) <- letters[c(1:10)]
rownames(data) <- paste( rep("row",10) , c(1:10) , sep=" ")
 
# plot it flipping the axis
levelplot( t(data[c(nrow(data):1) , ]),
           col.regions=heat.colors(100))

3.2.17 Custom Colors

There are several ways to custom the color palette:

  • Native palettes of R: terrain.color(), rainbow(), heat.colors(), topo.colors() or cm.colors()
  • Palettes of RColorBrewer. See list of available palettes here.
  • Palettes of Viridis: viridis, magma, inferno, plasma.
# Lattice package
require(lattice)

# The volcano dataset is provided, it looks like that:
#head(volcano)

# 1: native palette from R
levelplot(volcano, col.regions = terrain.colors(100)) # try cm.colors() or terrain.colors()

# 2: Rcolorbrewer palette
library(RColorBrewer)
coul <- colorRampPalette(brewer.pal(8, "PiYG"))(25)
levelplot(volcano, col.regions = coul) # try cm.colors() or terrain.colors()

# 3: Viridis
library(viridisLite)
coul <- viridis(100)
levelplot(volcano, col.regions = coul) 

#levelplot(volcano, col.regions = magma(100)) 

3.2.18 Heatmap with Smoothing - LatticeExtra

The latticeExtra package allows to build a heatmap with smoothing. Here is how with reproducible code.

The levelplot function of lattice allows to deal with 3 numeric variables as shown in the heatmap section of the gallery.

Here, the idea is to plot data points as circles using panel.levelplot.points to get a scatterplot. Then, a layer is added using panel.2dsmoother to show the general distribution of the third numeric variable.

# library
library(latticeExtra) 
 
# create data
set.seed(1) 
data <- data.frame(x = rnorm(100), y = rnorm(100)) 
data$z <- with(data, x * y + rnorm(100, sd = 1)) 
 
# showing data points on the same color scale 
levelplot(z ~ x * y, data, 
          panel = panel.levelplot.points, cex = 1.2
    ) + 
    layer_(panel.2dsmoother(..., n = 200))

3.3 Correlogram


A correlogram or correlation matrix allows to analyse the relationship between each pair of numeric variables in a dataset. It gives a quick overview of the whole dataset. It is more used for exploratory purpose than explanatory.

3.3.0.1 Using the ggally Package

The GGally package offers great options to build correlograms. The ggpairs() function build a classic correlogram with scatterplot, correlation coefficient and variable distribution. On top of that, it is possible to inject ggplot2 code, for instance to color categories.

3.3.0.2 Correlation Matrix with Ggally

This section explains how to build a correlogram with the ggally R package. It provides several reproducible examples with explanation and R code.

3.3.0.3 Scatterplot Matrix with Ggpairs()

The ggpairs() function of the GGally package allows to build a great scatterplot matrix.

Scatterplots of each pair of numeric variable are drawn on the left part of the figure. Pearson correlation is displayed on the right. Variable distribution is available on the diagonal.

# Quick display of two cabapilities of GGally, to assess the distribution and correlation of variables 
library(GGally)
 
# Create data 
data <- data.frame( var1 = 1:100 + rnorm(100,sd=20), v2 = 1:100 + rnorm(100,sd=27), v3 = rep(1, 100) + rnorm(100, sd = 1)) 
data$v4 = data$var1 ** 2 
data$v5 = -(data$var1 ** 2) 
 
# Check correlations (as scatterplots), distribution and print corrleation coefficient 
ggpairs(data, title="correlogram with ggpairs()") 

3.3.1 Visualize Correlation with ggcorr()

The ggcorr() function allows to visualize the correlation of each pair of variable as a square. Note that the method argument allows to pick the correlation type you desire.

# Quick display of two cabapilities of GGally, to assess the distribution and correlation of variables 
library(GGally)
 
# Create data 
data <- data.frame( var1 = 1:100 + rnorm(100,sd=20), v2 = 1:100 + rnorm(100,sd=27), v3 = rep(1, 100) + rnorm(100, sd = 1)) 
data$v4 = data$var1 ** 2 
data$v5 = -(data$var1 ** 2) 
 
# Check correlation between variables
#cor(data) 
 
# Nice visualization of correlations
ggcorr(data, method = c("everything", "pearson")) 

3.3.2 Change Plot Types

Change the type of plot used on each part of the correlogram. This is done with the upper and lower argument.

3.3.3 Split by Group

It is possible to use ggplot2 aesthetics on the chart, for instance to color each category.

# Quick display of two cabapilities of GGally, to assess the distribution and correlation of variables 
library(GGally)
 
# From the help page:
data(flea)
ggpairs(flea, columns = 2:4, ggplot2::aes(colour=species)) 

3.3.4 Using the corrgram Package

The corrgram is another great alternative to build correlograms. You can choose what to display in the upper, lower and diagonal part of the figure: scatterplot, pie chart, text, ellipse and more.

3.3.4.1 Correlogram with the corrgram Library

This section explains how to build a correlogram with the corrgram R package. It provides several reproducible examples with explanation and R code.

3.3.4.2 Scatterplot Matrix with ggpairs()

The corrgram package allows to build correlogram. The output allows to check the relationship between each pair of a set of numeric variable.

Relationship can be visualized with different methods:

  • panel.ellipse to display ellipses
  • panel.shade for colored squares
  • panel.pie for pie charts
  • panel.pts for scatterplots
# Corrgram library
library(corrgram)

# mtcars dataset is natively available in R
# head(mtcars)

# First
corrgram(mtcars, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main="Car Milage Data in PC2/PC1 Order") 

# Second
corrgram(mtcars, order=TRUE, lower.panel=panel.ellipse, upper.panel=panel.pts, text.panel=panel.txt, diag.panel=panel.minmax, main="Car Milage Data in PC2/PC1 Order") 

# Third
corrgram(mtcars, order=NULL, lower.panel=panel.shade, upper.panel=NULL, text.panel=panel.txt, main="Car Milage Data (unsorted)")

3.3.5 Visualize Correlation with ggcorr()

The gcorr() function allows to visualize the correlation of each pair of variable as a square. Note that the method argument allows to pick the correlation type you desire.

# Quick display of two cabapilities of GGally, to assess the distribution and correlation of variables
library(GGally)

# Create data
data <- data.frame( var1 = 1:100 + rnorm(100,sd=20), v2 = 1:100 + rnorm(100,sd=27), v3 = rep(1, 100) + rnorm(100, sd = 1))
data$v4 = data$var1 ** 2
data$v5 = -(data$var1 ** 2)

# Check correlation between variables
#cor(data)

# Nice visualization of correlations
ggcorr(data, method = c("everything", "pearson"))

3.3.6 Split by Group

It is possible to use ggplot2 aesthetics on the chart, for instance to color each category.

# Quick display of two cabapilities of GGally, to assess the distribution and correlation of variables
library(GGally)

# From the help page:
data(flea)
ggpairs(flea, columns = 2:4, ggplot2::aes(colour=species))

3.3.7 Change plot types

Change the type of plot used on each part of the correlogram. This is done with the upper and lower argument.

# Quick display of two cabapilities of GGally, to assess the distribution and correlation of variables
library(GGally)

# From the help page:
data(tips, package = "reshape")
ggpairs(
  tips[, c(1, 3, 4, 2)],
  upper = list(continuous = "density", combo = "box_no_facet"),
  lower = list(continuous = "points", combo = "dot_no_facet")
)

3.3.8 Other Methods

Lesser known ways to build correlogram with R, like the ellipse package, the plot() function and the car package.

3.3.8.1 Correlogram with the Ellipse Package

This section explains how to build a correlogram based on ellipses with the ellipse R package. It provides several reproducible examples with explanation and R code.

3.3.8.2 Scatterplot Matrix with ggpairs()

The ellipse package allows to build a correlogram thanks to the plotcorr() function.

First of all, you have to compute the correlation matrix of your dataset using the cor() function of R. Each correlation will be represented as an ellipse by the plotcorr() function. Color, shape and orientation depend on the correlation value.

# Libraries
library(ellipse)
library(RColorBrewer)
 
# Use of the mtcars data proposed by R
data <- cor(mtcars)
 
# Build a Pannel of 100 colors with Rcolor Brewer
my_colors <- brewer.pal(5, "Spectral")
my_colors <- colorRampPalette(my_colors)(100)
 
# Order the correlation matrix
ord <- order(data[1, ])
data_ord <- data[ord, ord]
plotcorr(data_ord , col=my_colors[data_ord*50+50] , mar=c(1,1,1,1)  )

3.3.9 Basic Scatterplot Matrix

This section explains how to build a scatterplot matrix with base R, without any packages. It provides several reproducible examples with explanation and R code

3.3.9.1 Scatterplot Matrix with the Native plot() Function

For a set of data variables (dimensions) \(X_1, X_2, ... , X_k\), the scatter plot matrix shows all the pairwise scatterplots of the variables on a single view with multiple scatterplots in a matrix format.

The native plot() function does the job pretty well as long as you just need to display scatterplots. For more option, check the correlogram section.

# Data: numeric variables of the native mtcars dataset
data <- mtcars[ , c(1,3:6)]
 
# Plot
plot(data , pch=20 , cex=1.5 , col="#69b3a2")

3.3.10 Correlogram with the car Package

This section explains how to build a scatterplot matrix with the car package. It provides several reproducible examples with explanation and R code.

3.3.10.1 Scatterplot Matrix with the Native plot() Function

This is a scatterplot matrix built with the scatterplotMatrix() function of the car package.

See more correlogram examples in the dedicated section.

Note the |cyl syntax: it means that categories available in the cyl variable must be represented distinctly (color, shape, size..).

# Packages
library(car)
library(RColorBrewer) # for the color palette

# Let's use the car dataset natively available in R
data <- mtcars

# Make the plot
my_colors <- brewer.pal(nlevels(as.factor(data$cyl)), "Set2")
scatterplotMatrix(~mpg+disp+drat|cyl, data=data , 
      reg.line="" , smoother="", col=my_colors , 
      smoother.args=list(col="grey") , cex=1.5 , 
      pch=c(15,16,17) , 
      main="Scatter plot with Three Cylinder Options"
      )

3.4 Bubble Plot


A bubble plot is a scatter plot with a third numeric variable mapped to circle size. This page describes several methods to build one with R.

3.4.0.1 A Bubble Chart is a Scatterplot

A bubble chart is basically a scatterplot with a third numeric variable used for circle size. Thus, remember all the tips described in the scatterplot section also apply here.

3.4.0.2 Step by Step with ggplot2

ggplot2 allows to create bubble chart thanks to the geom_point() function. Next examples will lead you through the process step by step:

3.4.1 Most Basic bubble Chart with geom_point()

A bubble plot is a scatterplot where a third dimension is added: the value of an additional numeric variable is represented through the size of the dots. (source: data-to-viz).

With ggplot2, bubble chart are built thanks to the geom_point() function. At least three variable must be provided to aes(): x, y and size. The legend will automatically be built by ggplot2.

Here, the relationship between life expectancy (y) and gdp per capita (x) of world countries is represented. The population of each country is represented through circle size.

# Libraries
library(ggplot2)
library(dplyr)

# The dataset is provided in the gapminder library
library(gapminder)
data <- gapminder %>% filter(year=="2007") %>% dplyr::select(-year)

# Most basic bubble plot
ggplot(data, aes(x=gdpPercap, y=lifeExp, size = pop)) +
    geom_point(alpha=0.7)

3.4.2 Control Circle Size with scale_size()

The first thing we need to improve on the previous chart is the bubble size. scale_size() allows to set the size of the smallest and the biggest circles using the range argument. Note that you can customize the legend name with name.

Note: circles often overlap. To avoid having big circles on top of the chart you have to reorder your dataset first, as in the code below.

ToDo: give more details about how to map a numeric variable to circle size. Use of scale_radius, scale_size and scale_size_area. See this post.

# Libraries
library(ggplot2)
library(dplyr)

# The dataset is provided in the gapminder library
library(gapminder)
data <- gapminder %>% filter(year=="2007") %>% dplyr::select(-year)

# Most basic bubble plot
data %>%
  arrange(desc(pop)) %>%
  mutate(country = factor(country, country)) %>%
  ggplot(aes(x=gdpPercap, y=lifeExp, size = pop)) +
    geom_point(alpha=0.5) +
    scale_size(range = c(.1, 24), name="Population (M)")

3.4.3 Add a Fourth Dimension: Color

If you have one more variable in your dataset, why not showing it using circle color? Here, the continent of each country is used to control circle color:

# Libraries
library(ggplot2)
library(dplyr)

# The dataset is provided in the gapminder library
library(gapminder)
data <- gapminder %>% filter(year=="2007") %>% dplyr::select(-year)

# Most basic bubble plot
data %>%
  arrange(desc(pop)) %>%
  mutate(country = factor(country, country)) %>%
  ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, color=continent)) +
    geom_point(alpha=0.5) +
    scale_size(range = c(.1, 24), name="Population (M)")

3.4.4 Make it Pretty

A few classic improvement:

  • Use of the viridis package for nice color palette.
  • Use of theme_ipsum() of the hrbrthemes package.
  • Custom axis titles with xlab and ylab.
  • Add stroke to circle: change shape to 21 and specify color (stroke) and fill.
# Libraries
library(ggplot2)
library(dplyr)
library(hrbrthemes)
library(viridis)

# The dataset is provided in the gapminder library
library(gapminder)
data <- gapminder %>% filter(year=="2007") %>% dplyr::select(-year)

# Most basic bubble plot
data %>%
  arrange(desc(pop)) %>%
  mutate(country = factor(country, country)) %>%
  ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, fill=continent)) +
    geom_point(alpha=0.5, shape=21, color="black") +
    scale_size(range = c(.1, 24), name="Population (M)") +
    scale_fill_viridis(discrete=TRUE, guide=FALSE, option="A") +
    theme_ipsum() +
    theme(legend.position="bottom") +
    ylab("Life Expectancy") +
    xlab("Gdp per Capita") +
    theme(legend.position = "none")

3.4.5 Interactive Version

Here is an interactive bubble chart built in R, thanks to the ggplotly() function of the plotly library. Try to hover circles to get a tooltip, or select an area of interest for zooming. Double click to reinitialize.

3.4.5.1 Interactive Bubble Chart

This section explains how to build an interactive bubble chart with R, using ggplot2 and the ggplotly() function of the plotly package.

3.4.5.2 Most Basic Bubble Chart with geom_point()

This section follows the previous step by step description of building bubble chart with ggplot2.

The idea is to turn the chart interactive:

  • You can zoom by selecting an area of interest
  • Hover a circle to get information about it
  • Export to png
  • Slide axis
  • Double click to re-initialize.

This is done thanks to the ggplotly() function of the plotly package that turn any ggplot2 chart object interactive. Note the little trick to custom the tooltip content.

# Libraries
library(ggplot2)
library(dplyr)
library(plotly)
library(viridis)
library(hrbrthemes)

# The dataset is provided in the gapminder library
library(gapminder)
data <- gapminder %>% filter(year=="2007") %>% dplyr::select(-year)

# Interactive version
p <- data %>%
  mutate(gdpPercap=round(gdpPercap,0)) %>%
  mutate(pop=round(pop/1000000,2)) %>%
  mutate(lifeExp=round(lifeExp,1)) %>%
  
  # Reorder countries to having big bubbles on top
  arrange(desc(pop)) %>%
  mutate(country = factor(country, country)) %>%
  
  # prepare text for tooltip
  mutate(text = paste("Country: ", country, "\nPopulation (M): ", pop, "\nLife Expectancy: ", lifeExp, "\nGdp per capita: ", gdpPercap, sep="")) %>%
  
  # Classic ggplot
  ggplot( aes(x=gdpPercap, y=lifeExp, size = pop, color = continent, text=text)) +
    geom_point(alpha=0.7) +
    scale_size(range = c(1.4, 19), name="Population (M)") +
    scale_color_viridis(discrete=TRUE, guide=FALSE) +
    theme_ipsum() +
    theme(legend.position="none")

# turn ggplot interactive with plotly
pp <- ggplotly(p, tooltip="text")
pp
# save the widget
# library(htmlwidgets)
# saveWidget(pp, file=paste0( getwd(), "/HtmlWidget/ggplotlyBubblechart.html"))

3.5 Connected Scatterplot


Welcome to the connected scatterplot section of the gallery. If you want to know more about this kind of chart, visit data-to-viz.com. If you’re looking for a simple way to implement it in R and ggplot2, pick an example below.

3.5.1 Connected Scatterplot with R and Ggplot2

This section explains how to build a basic connected scatterplot with R and ggplot2. It provides several reproducible examples with explanation and R code.

3.5.1.1 Most Basic Connected Scatterplot: geom_point() and geom_line()

A connected scatterplot is basically a hybrid between a scatterplot and a line plot. Thus, you just have to add a geom_point() on top of the geom_line() to build it.

# Libraries
library(ggplot2)
library(dplyr)

# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/3_TwoNumOrdered.csv", header=T)
data$date <- as.Date(data$date)

# Plot
data %>%
  tail(10) %>%
  ggplot( aes(x=date, y=value)) +
    geom_line() +
    geom_point()

3.5.2 Customize the Connected Scatterplot

Custom the general theme with the theme_ipsum() function of the hrbrthemes package. Add a title with ggtitle(). Custom circle and line with arguments like shape, size, color and more.

# Libraries
library(ggplot2)
library(dplyr)
library(hrbrthemes)

# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/3_TwoNumOrdered.csv", header=T)
data$date <- as.Date(data$date)

# Plot
data %>%
  tail(10) %>%
  ggplot( aes(x=date, y=value)) +
    geom_line( color="grey") +
    geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
    theme_ipsum() +
    ggtitle("Evolution of bitcoin price")

3.5.3 Connected Scatterplot to show an Evolution

The connected scatterplot can also be a powerfull technique to tell a story about the evolution of 2 variables. Let???s consider a dataset composed of 3 columns:

  • Year
  • Number of baby born called Amanda this year
  • Number of baby born called Ashley

The scatterplot beside allows to understand the evolution of these 2 names. Note that the code is pretty different in this case. geom_segment() is used of geom_line(). This is because geom_line() automatically sort data points depending on their X position to link them.

# Libraries
library(ggplot2)
library(dplyr)
library(babynames)
library(ggrepel)
library(tidyr)

# data
data <- babynames %>% 
  filter(name %in% c("Ashley", "Amanda")) %>%
  filter(sex=="F") %>%
  filter(year>1970) %>%
  dplyr::select(year, name, n) %>%
  spread(key = name, value=n, -1)

# plot
data %>% 
  ggplot(aes(x=Amanda, y=Ashley, label=year)) +
     geom_point() +
     geom_segment(aes(
                    xend=c(tail(Amanda, n=-1), NA), 
                    yend=c(tail(Ashley, n=-1), NA)
                  )
      ) 

It makes sense to add arrows and labels to guide the reader in the chart:

# data
data <- babynames %>% 
  filter(name %in% c("Ashley", "Amanda")) %>%
  filter(sex=="F") %>%
  filter(year>1970) %>%
  dplyr::select(year, name, n) %>%
  spread(key = name, value=n, -1)

# Select a few date to label the chart
tmp_date <- data %>% sample_frac(0.3)

# plot 
data %>% 
  ggplot(aes(x=Amanda, y=Ashley, label=year)) +
     geom_point(color="#69b3a2") +
     geom_text_repel(data=tmp_date) +
     geom_segment(color="#69b3a2", 
                  aes(
                    xend=c(tail(Amanda, n=-1), NA), 
                    yend=c(tail(Ashley, n=-1), NA)
                  ),
                  arrow=arrow(length=unit(0.3,"cm"))
      ) +
      theme_ipsum()

3.5.4 Connected Scatterplot for Time Series

Connected scatterplots are often used for time series. Remember the R graph gallery offers a dedicated section, with heaps of examples. For instance, here is an interactive chart made with the dygraphs library.

# Library
library(dygraphs)
library(xts)          # To make the convertion data-frame / xts format
library(tidyverse)
library(lubridate)
 
# Read the data (hosted on the gallery website)
data <- read.table("https://python-graph-gallery.com/wp-content/uploads/bike.csv", header=T, sep=",") %>% head(300)

# Check type of variable
# str(data)
 
# Since my time is currently a factor, I have to convert it to a date-time format!
data$datetime <- ymd_hms(data$datetime)
 
# Then you can create the xts necessary to use dygraph
don <- xts(x = data$count, order.by = data$datetime)

# Finally the plot
p <- dygraph(don) %>%
  dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
  dyRangeSelector() %>%
  dyCrosshair(direction = "vertical") %>%
  dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE)  %>%
  dyRoller(rollPeriod = 1)

# save the widget
# library(htmlwidgets)
# saveWidget(p, file=paste0( getwd(), "/HtmlWidget/dygraphs318.html"))

3.5.5 Connected Scatterplot using Base R

Basic R also allows to build connected scatterplot thanks to the line() function. You just need to use the b option of the type argument. See examples below.

3.5.5.1 Add a Legend to a Base R Chart

This section explains how to add a legend to a chart made with base R, using the legend() function. It provides several reproducible examples with explanation and R code. It is done using the legend() function. The main arguments are:

This page aims to explain how to add a legend to a plot made in base R. It is done using the legend() function. The main arguments are:

  • legend: names to display
  • bty: type of box around the legend. See graph #73
  • horiz: legend in column or in row
  • col: symbol color
  • pch: symbol type. See graph #6
  • pt.cex: symbol size
  • cex: text size
  • text.col: text color
  • topright: legend position: bottomright, bottom, bottomleft, left, topleft, top, topright, right, center
  • inset: % (from 0 to 1) to draw the legend away from x and y axis

You can also give the X and Y coordinate of the legend: legend(3, 5, ...)

Note that an equivalent page exist concerning legends with ggplot2.

# Create data:
a=c(1:5)
b=c(5,3,4,5,5)
c=c(4,5,4,3,1)
 
# Make a basic graph
plot( b~a , type="b" , bty="l" , xlab="value of a" , ylab="value of b" , col=rgb(0.2,0.4,0.1,0.7) , lwd=3 , pch=17 , ylim=c(1,5) )
lines(c ~a , col=rgb(0.8,0.4,0.1,0.7) , lwd=3 , pch=19 , type="b" )
 
# Add a legend
legend("bottomleft", 
  legend = c("Group 1", "Group 2"), 
  col = c(rgb(0.2,0.4,0.1,0.7), 
  rgb(0.8,0.4,0.1,0.7)), 
  pch = c(17,19), 
  bty = "n", 
  pt.cex = 2, 
  cex = 1.2, 
  text.col = "black", 
  horiz = F , 
  inset = c(0.1, 0.1))

3.5.6 Manage Dates Data with Base R

This section explains how to deal with date data in base R. It takes a connected scatterplot as an example and display several options to deal with dates.

3.5.6.1 Important note about the lubridate() library.

I strongly advise to have a look to the lubridate() library. It allows to easily manipulate the date format, and is very powerful in conjunction with ggplot2. Have a look to the time series section of the gallery.

3.5.6.1.1 Is your date recognized as a date?

R offers a special data type for dates. It is important to use it since it will make the creation of charts lot easier. The str() function allows to check the type of each column. In the example beside, the date column is recognized as a factor.

# Create data
set.seed(124)
date <- paste(   "2015/03/" , sample(seq(1,31),6) , sep="")
value <- sample(seq(1,100) , 6)
data <- data.frame(date,value)

# Date and time are recognized as factor:
str(data)
## 'data.frame':	6 obs. of  2 variables:
##  $ date : chr  "2015/03/1" "2015/03/7" "2015/03/10" "2015/03/27" ...
##  $ value: int  6 45 74 33 15 91

3.5.6.2 Why it Matters

The issue is that your plot is gonna be very disappointing if the date is not recognized properly, as shown beside

# Create data
set.seed(124)
date <- paste("2015/03/" , sample(seq(1,31),6) , sep="")
value <- sample(seq(1,100) , 6)
data <- data.frame(date,value)

# Date and time are recognized as factor:
str(data)
## 'data.frame':	6 obs. of  2 variables:
##  $ date : chr  "2015/03/1" "2015/03/7" "2015/03/10" "2015/03/27" ...
##  $ value: int  6 45 74 33 15 91

3.5.7 Switch to Date Format

You can use the as.Date() function to specify that a column is at the date format. Now, with a bit of customization, we can get a nice connected scatterplot from our data:

# Create data
set.seed(124)
date <- paste(   "2015/03/" , sample(seq(1,31),6) , sep="")
value <- sample(seq(1,100) , 6)
data <- data.frame(date,value)

# Let's change the date to the "date" format:
data$date <- as.Date(data$date)
 
# So we can sort the table:
data <- data[order(data$date) , ]
 
# Easy to make it better now:
plot(data$value~data$date , type="b" , lwd=3 , col=rgb(0.1,0.7,0.1,0.8) , ylab="value of ..." , xlab="date" , bty="l" , pch=20 , cex=4)
abline(h=seq(0,100,10) , col="grey", lwd=0.8)

3.5.8 Base R Graph Parameters: Cheatsheet

This section aims to remind the options offered to customize a graph in base R. Understand in a sec how to use lwd, pch, type, lty, cex, and more. Base R offers many option to customize the chart appearance.

Basically everthing is double with those few options:

  • cex: shape size
  • lwd: line width
  • col: control colors
  • lty: line type
  • pch: marker shape
  • type: link between dots

Note: visit the cheatsheet section for more.

# initialization
par(mar=c(3,3,3,3))
num <- 0 ; 
num1 <- 0
plot(0,0 , xlim=c(0,21) , ylim=c(0.5,6.5), col="white" , yaxt="n" , ylab="" , xlab="")
 
#fill the graph
for (i in seq(1,20)){
  points(i,1 , pch=i , cex=3)
  points(i,2 , col=i , pch=16 , cex=3)
  points(i,3 , col="black" , pch=16 , cex=i*0.25)
  
  #lty
  if(i %in% c(seq(1,18,3))){
        num=num+1
    points(c(i,i+2), c(4,4) , col="black" , lty=num , type="l" , lwd=2)
        text(i+1.1 , 4.15 , num)
        }
  
  #type and lwd 
  if(i %in% c(seq(1,20,5))){
    num1=num1+1
    points(c(i,i+1,i+2,i+3), c(5,5,5,5) , col="black"  , type=c("p","l","b","o")[num1] , lwd=2)
    text(i+1.1 , 5.2 , c("p","l","b","o")[num1] )
    points(c(i,i+1,i+2,i+3), c(6,6,6,6) , col="black"  , type="l",  lwd=num1)
    text(i+1.1 , 6.2 , num1 )
 
    }
  }
 
#add axis
axis(2, at = c(1,2,3,4,5,6), labels = c("pch" , "col" , "cex" , "lty", "type" , "lwd" ), 
     tick = TRUE, col = "black", las = 1, cex.axis = 0.8)

3.6 Density 2D


A 2D density chart displays the relationship between 2 numeric variables. One is represented on the X axis, the other on the Y axis, like for a scatterplot. Then, the number of observations within a particular area of the 2D space is counted and represented by a color gradient. Several types of 2d density chart exist:

3.6.0.1 2d Histogram with geom_bin2d()

This is the two dimension version of the classic histogram. The plot area is split in a multitude of small squares, the number of points in each square is represented by its color.

3.6.1 The Issue with geom_point()

A 2d density plot is useful to study the relationship between 2 numeric variables if you have a huge number of points. To avoid overlapping (as in the scatterplot beside), it divides the plot area in a multitude of small fragment and represents the number of points in this fragment. There are several types of 2d density plots. Each has its proper ggplot2 function. This section describes all of them.

# Library
library(tidyverse)
 
# Data
a <- data.frame( x=rnorm(20000, 10, 1.9), y=rnorm(20000, 10, 1.2) )
b <- data.frame( x=rnorm(20000, 14.5, 1.9), y=rnorm(20000, 14.5, 1.9) )
c <- data.frame( x=rnorm(20000, 9.5, 1.9), y=rnorm(20000, 15.5, 1.9) )
data <- rbind(a,b,c)
 
 
# Basic scatterplot
ggplot(data, aes(x=x, y=y) ) +
  geom_point()# 2d histogram with default option

ggplot(data, aes(x=x, y=y) ) +
  geom_bin2d() +
  theme_bw()

3.6.2 2d Histogram with geom_bin2d()

This is the two dimension version of the classic histogram. The plot area is split in a multitude of small squares, the number of points in each square is represented by its color.

For 2d histogram, the plot area is divided in a multitude of squares. (It is a 2d version of the classic histogram). It is called using the geom_bin_2d() function. This function offers a bins argument that controls the number of bins you want to display.

Note: If you’re not convinced about the importance of the bins option, read this.

# 2d histogram with default option
ggplot(data, aes(x=x, y=y) ) +
  geom_bin2d() +
  theme_bw()

# Bin size control + color palette
ggplot(data, aes(x=x, y=y) ) +
  geom_bin2d(bins = 70) +
  scale_fill_continuous(type = "viridis") +
  theme_bw()

3.6.3 Hexbin Chart with geom_hex()

Another alternative is to divide the plot area in a multitude of hexagons: it is thus called a hexbin chart, and is made using the geom_hex() function.

This function provides the bins argument as well, to control the number of division per axis.

# Hexbin chart with default option
ggplot(data, aes(x=x, y=y) ) +
  geom_hex() +
  theme_bw()

# Bin size control + color palette
ggplot(data, aes(x=x, y=y) ) +
  geom_hex(bins = 70) +
  scale_fill_continuous(type = "viridis") +
  theme_bw()

3.6.4 2d Distribution with geom_density_2d or stat_density_2d

As you can plot a density chart instead of a histogram, it is possible to compute a 2d density and represent it. Several possibilities are offered by ggplot2: you can show the contour of the distribution, or the area, or use the raster function:

# Show the contour only
ggplot(data, aes(x=x, y=y) ) +
  geom_density_2d()

# Show the area only
ggplot(data, aes(x=x, y=y) ) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon")

# Area + contour
ggplot(data, aes(x=x, y=y) ) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", colour="white")

# Using raster
ggplot(data, aes(x=x, y=y) ) +
  stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme(
    legend.position='none'
  )

3.6.5 Customize the Color Palette

Whatever you use a 2d histogram, a hexbin chart or a 2d distribution, you can and should custom the colour of your chart. Here is a suggestion using the scale_fill_distiller() function. You can see other methods in the ggplot2 section of the gallery.

# Call the palette with a number
ggplot(data, aes(x=x, y=y) ) +
  stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
  scale_fill_distiller(palette=4, direction=-1) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme(
    legend.position='none'
  )

# The direction argument allows to reverse the palette
ggplot(data, aes(x=x, y=y) ) +
  stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
  scale_fill_distiller(palette=4, direction=1) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme(
    legend.position='none'
  )

# You can also call the palette using a name.
ggplot(data, aes(x=x, y=y) ) +
  stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
  scale_fill_distiller(palette= "Spectral", direction=1) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme(
    legend.position='none'
  )

3.6.6 Hexbin Chart with the Hexbin Package

This section explains how to build a hexbin chart with R using the hexbin package. Hexbin chart is a 2d density chart, allowing to visualize the relationship between 2 numeric variables.

Scatterplots can get very hard to interpret when displaying large datasets, as points inevitably overplot and can’t be individually discerned.

Binning can be though of as a two-dimensional histogram, where shades of the bins take the place of the heights of the bars. This technique is computed in the hexbin package.

This example has been published by Myles Harrison on R-bloggers.

# Packages
library(hexbin)
library(RColorBrewer)
 
# Create data
x <- rnorm(mean=1.5, 5000)
y <- rnorm(mean=1.6, 5000)
 
# Make the plot
bin<-hexbin(x, y, xbins=40)
my_colors=colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(bin, main="" , colramp=my_colors , legend=F ) 

3.6.7 Hexbin Chart and Scatterplot with Ggplot2

This section explains how to build a hexbin chart with a scatterplot on top using R and ggplot2. It is an addition to the page about 2d density plot with ggplot2.

This plot extends the concepts described in the 2d density chart with ggplot2 document. It simply illustrates that a scatterplot can be added on top of the 2d density chart.

Thanks Christian Jacob for this submission.

# library
library(ggplot2) 

# data
sample_data <- data.frame(x_values = 1:100 + rnorm(100,sd=20), y_values = 1:100 + rnorm(100,sd=27)) 

#plot
ggplot(sample_data, aes(x_values, y_values)) +
  stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE) + 
  geom_point(colour = "white")