Chapter 6 Visualizing RNA-seq results

During this lesson, we will get you started with some basic and more advanced visualization techniques to explore and interpret the results of your differential expression analysis.

Let’s start by loading a few libraries:

# load libraries
library(tidyverse)
library(ggplot2)
library(ggrepel)
library(RColorBrewer)
library(DESeq2)
library(pheatmap)
library(dplyr)

We will be working with three different data objects we have already created in earlier lessons:

  • Metadata for our samples (a dataframe): meta
  • Normalized expression data for every gene in each of our samples (a matrix): normalized_counts
  • Tibble versions of the DESeq2 results we generated in the last lesson: res_tableOE_tb and res_tableKD_tb

Let’s create tibble objects from the meta and normalized_counts data frames before we start plotting. This will enable us to use the tidyverse functionality more easily.

Check to make sure the column names of the normalized_counts matrix are the same as the row names of the meta data frame. If they are not, you will need to reorder the columns of normalized_counts to match the row names of meta.

# Read in the metadata
meta <- read.table("data/Mov10_full_meta.txt", 
                   header=T, row.names=1)

# Create a tibble for meta data
mov10_meta <- meta %>% 
  rownames_to_column(var="samplename") %>% 
  as_tibble()
        
# read in the normalized counts
normalized_counts <- read.delim("data/normalized_counts.txt", row.names=1)

all(mov10_meta$samplename == colnames(normalized_counts))
## [1] TRUE
# Create a tibble for normalized_counts
normalized_counts <- normalized_counts %>% 
  data.frame() %>%
  rownames_to_column(var="gene") %>% 
  as_tibble()

6.0.1 Plotting signicant DE genes

One way to visualize results would be to simply plot the expression data for a handful of genes. We could do that by picking out specific genes of interest or selecting a range of genes.

6.0.1.1 Using DESeq2 plotCounts() to plot expression of a single gene

To pick out a specific gene of interest to plot, for example Mov10, we can use the plotCounts() from DESeq2. plotCounts() is a function that allows us to plot the normalized counts for a single gene across all samples.

plotCounts(dds, 
           gene="MOV10", 
           intgroup="sampletype") 

# We can also color the points by sample type
sampletype = as.factor(mov10_meta$sampletype)

library(RColorBrewer)

display.brewer.all()

col = brewer.pal(8,"Dark2")
palette(col)

plotCounts(dds, gene="MOV10", 
           intgroup="sampletype",
           col = as.numeric(sampletype),
           pch = 19) 

This function only allows for plotting the counts of a single gene at a time.

6.0.1.2 Using ggplot2 to plot expression of a single gene

We can also use ggplot2 to plot the MOV10 counts. We can save the output of plotCounts() to a variable specifying the returnData=TRUE argument. This will save the normalized counts for the gene MOV10 to a data frame object. We can then use ggplot2 to plot the normalized counts for MOV10 across all samples.

# Save `plotCounts()` to a data frame object
d <- plotCounts(dds, gene = "MOV10", intgroup = "sampletype", 
                returnData=TRUE)

# Plot using ggplot2
ggplot(d, aes(x = sampletype, y = count, color = sampletype)) + 
  geom_point(position = position_jitter(w = 0.1,h = 0)) +
  geom_text_repel(aes(label = rownames(d))) + 
  theme_bw() +
  ggtitle("MOV10") +
  theme(plot.title = element_text(hjust = 0.5))

Note that in the plot below (code above), we are using geom_text_repel() from the ggrepel package to label our individual points on the plot.

6.0.1.3 Using ggplot2 to plot multiple genes (e.g. top 20)

Often it is helpful to check the expression of multiple genes of interest at the same time. This often first requires some data wrangling.

We are going to plot the normalized count values for the top 20 differentially expressed genes (by padj values).

To do this, we first need to determine the gene names of our top 20 genes by ordering our results and extracting the top 20 genes (by padj values):

res_tableOE <- readRDS("data/res_tableOE.RDS")

res_tableOE_tb <- res_tableOE %>%
  data.frame() %>%
  rownames_to_column(var="gene") %>% 
  as_tibble()

## Order results by padj values
top20_sigOE_genes <- res_tableOE_tb %>% 
        arrange(padj) %>%   
#Arrange rows by padj values
        pull(gene) %>%      
#Extract character vector of ordered genes
        head(n=20)      
#Extract the first 20 genes

Then, we can extract the normalized count values for these top 20 genes:

## Normalized counts for top 20 significant genes
top20_sigOE_norm <- normalized_counts %>%
        filter(gene %in% top20_sigOE_genes)

Now that we have the normalized counts for each of the top 20 genes for all 8 samples, to plot using ggplot(), we need to pivot_longer top20_sigOE_norm from a wide format to a long format so the counts for all samples will be in a single column to allow us to give ggplot the one column with the values we want it to plot.

The pivot_longer() function in the tidyr package will perform this operation and will output the normalized counts for all genes for Mov10_oe_1 listed in the first 20 rows, followed by the normalized counts for Mov10_oe_2 in the next 20 rows, so on and so forth.

Alt text

# Pivot the data frame
pivoted_top20_sigOE <- top20_sigOE_norm %>%
  pivot_longer(colnames(top20_sigOE_norm)[2:9], 
               names_to = "samplename", 
               values_to = "normalized_counts")

## Check the column header in the "pivoted" data frame
head(pivoted_top20_sigOE)
gene samplename normalized_counts
ADAMTS1 Irrel_kd_1 6717.735
ADAMTS1 Irrel_kd_2 6454.641
ADAMTS1 Irrel_kd_3 6801.543
ADAMTS1 Mov10_kd_2 6869.168
ADAMTS1 Mov10_kd_3 8730.977
ADAMTS1 Mov10_oe_1 13252.239

Now, if we want our counts colored by sample group, then we need to combine the metadata information with the melted normalized counts data into the same data frame for input to ggplot():

inner_join(x,y) will merge 2 data frames by the colname in x that matches a column name in y in this case samplename column.

pivoted_top20_sigOE <- inner_join(
  mov10_meta, 
  pivoted_top20_sigOE)
## Joining with `by = join_by(samplename)`

Now that we have a data frame in a format that can be utilised by ggplot easily, let’s plot!

## plot using ggplot2
ggplot(pivoted_top20_sigOE) +
        geom_point(aes(x = gene, y = normalized_counts, 
          color = sampletype)) +
        scale_y_log10() +
        xlab("Genes") +
        ylab("log10 Normalized Counts") +
        ggtitle("Top 20 Significant DE Genes") +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 45, 
                                         hjust = 1)) +
        theme(plot.title = element_text(hjust = 0.5))

6.0.2 Heatmap

We will plot a heatmap of the sigOE genes to visualize their expression across all samples using pheatmap().

sigOE = readRDS("data/sigOE.RDS")

norm_OEsig <- normalized_counts[,c(1,2:4,7:9)] %>% 
              filter(gene %in% sigOE$gene) %>% 
              data.frame() %>%
              column_to_rownames(var = "gene") 

Now let’s draw the heatmap using pheatmap. We can also add annotations to the heatmap to indicate the sample type. The annotation_col argument in pheatmap()requires a data frame with the same row names as the column names of the matrix used to generate the heatmap.

### Annotate our heatmap (optional)
annotation <- mov10_meta %>% 
  filter(samplename %in% colnames(norm_OEsig)) %>%
    dplyr::select(samplename, sampletype) %>% 
    data.frame(row.names = "samplename")

annotation$sampletype = factor(annotation$sampletype)

### Set a color palette
heat_colors <- brewer.pal(6, "YlOrRd")

### Set annotation colors
ann_colors = list(
  sampletype = c(control="navy",
  MOV10_overexpression= "green"))

### Run pheatmap
pheatmap(norm_OEsig, 
         annotation_colors = ann_colors, 
         color = heat_colors, 
         cluster_rows = T, 
         show_rownames = F,
         annotation_col = annotation, 
         border_color = NA, 
         fontsize = 12, 
         scale = "row", 
         fontsize_col = 12)

NOTE: There are several additional arguments we have included in the function for aesthetics. One important one is scale="row", in which Z-scores are plotted, rather than the actual normalized count value.

6.0.3 Volcano plots

A commonly used plot in RNA-seq analysis is the volcano plot. in which you have the log transformed adjusted p-values plotted on the y-axis and log2 fold change values on the x-axis.

To generate a volcano plot, we first need to have a column in our results data indicating whether or not the gene is considered differentially expressed based on p-adjusted values.

## create a column for thresholding in the results table
res_tableOE_tb <- res_tableOE_tb %>% 
                  mutate(threshold_OE = padj < 0.05 & 
                  abs(log2FoldChange) >= 0.58)

Now we can start plotting. The geom_point object is most applicable, as this is essentially a scatter plot:

ggplot(res_tableOE_tb) +
        geom_point(aes(x = log2FoldChange, 
        y = -log10(padj), colour = threshold_OE)) +
        ggtitle("Mov10 overexpression") +
        xlab("log2 fold change") + 
        ylab("-log10 adjusted p-value") +
        theme_bw() +      
        theme(legend.position = "none",
        plot.title = element_text(size = rel(1.5), 
                                  hjust = 0.5),
        axis.title = element_text(size = rel(1.25)))  

What if we also wanted to know where the top 10 genes (lowest padj) in our DE list are located on this plot? We could label those dots with the gene name on the volcano plot using geom_text_repel().

First, we need to order the res_tableOE tibble by padj, and add an additional column to it, to include on those gene names we want to use to label the plot.

## Create a column to indicate which genes to label
res_tableOE_tb <- res_tableOE_tb %>% 
  arrange(padj) %>% 
  mutate(genelabels = "")

res_tableOE_tb$genelabels[1:10] <- res_tableOE_tb$gene[1:10]

head(res_tableOE_tb)
gene baseMean log2FoldChange lfcSE pvalue padj threshold_OE genelabels
MOV10 21681.800 5.0833015 0.1100637 0 0 TRUE MOV10
H1F0 7881.081 1.5488442 0.0565866 0 0 TRUE H1F0
HIST1H1C 1741.383 1.5234178 0.0706100 0 0 TRUE HIST1H1C
TXNIP 5133.749 1.4237003 0.0697011 0 0 TRUE TXNIP
NEAT1 21973.706 0.9164119 0.0466768 0 0 TRUE NEAT1
KLF10 1694.211 1.2330697 0.0651906 0 0 TRUE KLF10

Next, we plot it as before with an additiona layer for geom_text_repel() wherein we can specify the column of gene labels we just created.

ggplot(res_tableOE_tb, aes(x = log2FoldChange, 
        y = -log10(padj))) +
        geom_point(aes(colour = threshold_OE)) +
        geom_text_repel(aes(label = genelabels)) +
        ggtitle("Mov10 overexpression") +
        xlab("log2 fold change") + 
        ylab("-log10 adjusted p-value") +
        theme_bw() +
        theme(legend.position = "none",
        plot.title = element_text(size = rel(1.5), 
                                  hjust = 0.5),
        axis.title = element_text(size = rel(1.25)))