Chapter 7 Week 3 - Comparing bioinformatic pipelines

7.1 Learning Objectives

At the conclusion of today’s workshop students are expected to be able to:

Compare and contrast different bioinformatic pipelines and their effects on downstream results
Visualise RNAseq results
Explain why different algorithms may yield different results in an analysis

7.2 Workshop setup

Please execute the following commands immediately after logging into the terminal. You will learn what these commands do in this and future workshops.

cd ~
mkdir week_3
cd week_3
ln -s /apps/data/bms5021/week_3/ resources

7.3 Assessing the influence of bioinformatic tools on results

Each bioinformatic tool has its own underlying statistical model and assumptions, so tools that serve the same purpose will not necessarily produce identical results. In this workshop, we will compare the results of a STAR-RSEM-DESeq2 pipeline with those of a kallisto-DESeq2 pipeline.

Exercise 1: To prepare for this comparison, please review the following publication and make a summary table and brief description of your findings: Sarantopoulou, D., Brooks, T.G., Nayak, S. et al. Comparative evaluation of full-length isoform quantification from RNA-Seq. BMC Bioinformatics 22, 266 (2021). https://doi.org/10.1186/s12859-021-04198-1

7.4 Comparing isoform expression quantification

Last week we quantified isoform expression using Kallisto, a fast quantification method that uses pseudoalignment to find the transcripts that each read is compatible with. Rather than aligning each read directly, this method then computes the probability that a read originated from each of those transcripts. Another popular approach is to align reads with STAR, a splice-aware aligner, and then estimate transcript expression using RSEM (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323).
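The idea of a read being "compatible with" a set of transcripts can be illustrated with a toy sketch. This is a deliberate simplification and not how Kallisto is implemented internally: it indexes every k-mer of two made-up isoform sequences and intersects the transcript sets hit by each k-mer of a read. The sequences, the names isoform_A and isoform_B, and the choice of k = 5 are all hypothetical.

```python
from collections import defaultdict

# Two made-up isoforms that share their first 12 bases and differ afterwards.
TRANSCRIPTS = {
    "isoform_A": "ACGTACGGTTCA" + "GGCATT",
    "isoform_B": "ACGTACGGTTCA" + "TTAGCC",
}
K = 5  # toy k-mer length, chosen only to suit these short sequences


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # some k-mer matched no shared transcript
    return compatible or set()  # empty set if the read is shorter than k


index = build_kmer_index(TRANSCRIPTS, K)
for read in ("GTACGGTT", "GTTCAGGCA", "TTTTTTTT"):
    print(read, sorted(compatible_transcripts(read, index, K)))
```

A read taken from the shared region is compatible with both isoforms, whereas a read that crosses into the isoform-specific region is compatible with only one. Reads of the first kind are the reason a probabilistic assignment step is needed.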

Kallisto is a lightweight tool: it requires neither significant computational resources nor excessive wall-time. STAR, by contrast, stores its index as a suffix array and is therefore very memory intensive (up to 36 GB of RAM for a human genome). We have aligned the reads you trimmed last week to the WBCel235 reference genome using STAR, and quantified expression using RSEM. You will find these quantifications in week_3/resources/rsem_quantifications.

Exercise 2: Compare the quantification of isoform expression by Kallisto to RSEM.

Copy (using cp) the Kallisto quantifications that you generated last week, and the provided RSEM quantifications, to your week_3 directory.

Use the RStudio interface to download these files to your local computer.

Using a programming language or software of your choice, merge the quantifications for each sample. Note that you should perform an “inner join”, keeping only the rows that are present in both datasets.

Generate scatterplots to evaluate the consistency of the expression estimates generated by Kallisto and RSEM.

Perform a linear regression analysis and comment on the correlation between these estimates.
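One possible way to approach the merge and regression steps of Exercise 2 is sketched below using only the Python standard library. The column names follow the usual Kallisto abundance.tsv layout (target_id, tpm) and RSEM isoforms.results layout (transcript_id, TPM), but the values are made up, and it is an assumption that the transcript identifiers match exactly between the two files. Comparing log2(TPM + 1) rather than raw TPM is also a choice made for this sketch, not a requirement of the exercise.

```python
import csv
import io
import math

# Tiny made-up tables in the two output layouts; NOT workshop data.
KALLISTO_TSV = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1000\t800\t100\t3.0\n"
    "tx2\t2000\t1800\t400\t15.0\n"
    "tx3\t1500\t1300\t0\t0.0\n"
    "tx4\t900\t700\t50\t63.0\n"
)
RSEM_TSV = (
    "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
    "tx1\tg1\t1000\t800\t120\t7.0\t5.0\t100\n"
    "tx2\tg2\t2000\t1800\t450\t31.0\t20.0\t100\n"
    "tx3\tg3\t1500\t1300\t2\t1.0\t0.5\t100\n"
    "tx5\tg4\t1200\t1000\t10\t2.0\t1.0\t100\n"
)


def read_tpm(handle, id_col, tpm_col):
    """Return {transcript ID: TPM} from a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[tpm_col]) for row in reader}


def inner_join(a, b):
    """Keep only the IDs present in both tables (an inner join)."""
    return [(tx, a[tx], b[tx]) for tx in sorted(a.keys() & b.keys())]


def linear_fit(x, y):
    """Least-squares fit y = slope * x + intercept, plus Pearson's r.

    Assumes x and y each contain at least two distinct values.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / math.sqrt(sxx * syy)


# With real files, replace io.StringIO(...) with open(path, newline="").
kallisto = read_tpm(io.StringIO(KALLISTO_TSV), "target_id", "tpm")
rsem = read_tpm(io.StringIO(RSEM_TSV), "transcript_id", "TPM")
merged = inner_join(kallisto, rsem)

x = [math.log2(k_tpm + 1) for _, k_tpm, _ in merged]  # Kallisto
y = [math.log2(r_tpm + 1) for _, _, r_tpm in merged]  # RSEM
slope, intercept, pearson_r = linear_fit(x, y)
print(f"{len(merged)} shared transcripts, slope={slope:.2f}, "
      f"intercept={intercept:.2f}, r={pearson_r:.2f}")
```

The x and y lists are also the coordinates needed for the scatterplot. Transcripts tx4 and tx5 are dropped because each appears in only one table, which is the behaviour the inner join is meant to give.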

Another analysis that we often perform in RNAseq is PCA. Principal component analysis is a dimension-reduction method that allows us to visualise high-dimensional data, such as RNAseq data, in 2-3 dimensions. Let’s perform PCA on the Kallisto and RSEM abundance estimates to see if we get similar results.

Exercise 3: Compare the Kallisto and RSEM abundance estimates using PCA.

Copy (using cp) the files in week_3/resources/kallisto_rsem_quants to your week_3 directory.

Download these files to your computer.

Using R or Python, perform PCA and visualise PC1 and PC2.

If you are unsure how to do this analysis in either language, we will demonstrate it in R.

Are the results consistent?
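For those working in Python, the PCA step of Exercise 3 could look roughly like the sketch below, which uses NumPy's singular value decomposition. The matrix is made up (four hypothetical samples by five transcripts, already on a log2(TPM + 1) scale), and the layout of the files in kallisto_rsem_quants is not assumed here, so the loading step is left out.

```python
import numpy as np

# Made-up log2(TPM + 1) values: rows are samples, columns are transcripts.
SAMPLES = ["control_1", "control_2", "treated_1", "treated_2"]
ABUNDANCE = np.array([
    [5.0, 5.0, 1.0, 1.0, 3.0],
    [5.2, 4.8, 1.1, 0.9, 3.0],
    [1.0, 1.0, 5.0, 5.0, 3.0],
    [0.9, 1.1, 5.2, 4.8, 3.1],
])


def pca(matrix, n_components=2):
    """PCA by SVD of the column-centred matrix.

    Returns the sample coordinates on the first components and the
    fraction of total variance that each of those components explains.
    """
    centred = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]


scores, explained = pca(ABUNDANCE)
for name, (pc1, pc2) in zip(SAMPLES, scores):
    print(f"{name}\tPC1={pc1:.2f}\tPC2={pc2:.2f}")
print("variance explained:", np.round(explained, 3))
```

Running the same function on the Kallisto matrix and on the RSEM matrix gives two sets of PC1 and PC2 coordinates to plot side by side. Note that the sign of a principal component is arbitrary, so two plots that are mirror images of each other still show the same sample structure.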

7.5 Differential gene expression analysis

Differential gene expression analysis allows us to identify genes and isoforms whose expression differs between experimental conditions (i.e. what goes up and what goes down). We have performed differential gene expression analysis using DESeq2 on both the Kallisto and STAR-RSEM quantifications. You will find these results in week_3/resources/dge.

Exercise 4: Differential gene expression comparisons

Copy (using cp) the files in week_3/resources/dge to your week_3 directory.

Download these files to your computer.

How many isoforms were identified as differentially expressed in each pipeline?

What was the overlap (i.e. which isoforms were identified by both kallisto-DESeq2 and STAR-RSEM-DESeq2)?

Create a Venn diagram in a language of your choice to visualise your results.
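The counts behind the Venn diagram in Exercise 4 can be obtained with set operations, as in the minimal sketch below. It assumes each results table is a CSV with a transcript_id column and the standard DESeq2 padj column, and it treats padj < 0.05 as "differentially expressed". Both the file layout and the threshold are assumptions, and the values are made up. DESeq2 reports padj as NA for some rows, so those rows are skipped.

```python
import csv
import io

# Made-up DESeq2-style tables; NOT workshop data.
KALLISTO_CSV = (
    "transcript_id,log2FoldChange,pvalue,padj\n"
    "tx1,2.1,0.0001,0.001\n"
    "tx2,0.4,0.1,0.2\n"
    "tx3,0.1,0.9,NA\n"
    "tx4,-1.8,0.002,0.01\n"
)
RSEM_CSV = (
    "transcript_id,log2FoldChange,pvalue,padj\n"
    "tx1,1.9,0.0005,0.003\n"
    "tx2,1.2,0.01,0.04\n"
    "tx4,-0.6,0.3,0.5\n"
    "tx5,-2.2,0.004,0.02\n"
)


def significant_ids(handle, id_col="transcript_id", padj_col="padj", alpha=0.05):
    """Return the set of IDs whose adjusted p-value is below alpha."""
    significant = set()
    for row in csv.DictReader(handle):
        value = row[padj_col]
        if value in ("NA", ""):
            continue  # no adjusted p-value was reported for this row
        if float(value) < alpha:
            significant.add(row[id_col])
    return significant


# With real files, replace io.StringIO(...) with open(path, newline="").
kallisto_sig = significant_ids(io.StringIO(KALLISTO_CSV))
rsem_sig = significant_ids(io.StringIO(RSEM_CSV))

counts = {
    "kallisto_only": len(kallisto_sig - rsem_sig),
    "both": len(kallisto_sig & rsem_sig),
    "rsem_only": len(rsem_sig - kallisto_sig),
}
print(counts)
```

The three counts correspond to the three regions of a two-set Venn diagram, and the set kallisto_sig & rsem_sig lists the isoforms that both pipelines agree on.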

Exercise 5: Visualising results

Using a programming language of your choice, generate volcano plots to visualise the differential gene expression results. Volcano plots should depict the logFC on the x-axis and the -log10(p) on the y-axis.

Do these plots look similar?

Have a go at generating other visualisations from the lecture.
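As a starting point for the volcano plots in Exercise 5, the sketch below computes the plot coordinates from DESeq2-style rows and, optionally, draws them with matplotlib. The column names log2FoldChange, pvalue and padj are the standard DESeq2 ones, but the transcript_id column, the example rows, and the cut-offs (padj < 0.05 and an absolute log2 fold change of at least 1) are assumptions made for illustration.

```python
import math

# Made-up rows, shaped like the dictionaries produced by csv.DictReader.
ROWS = [
    {"transcript_id": "tx1", "log2FoldChange": "2.0", "pvalue": "0.001", "padj": "0.01"},
    {"transcript_id": "tx2", "log2FoldChange": "-1.5", "pvalue": "0.0001", "padj": "0.002"},
    {"transcript_id": "tx3", "log2FoldChange": "0.2", "pvalue": "0.5", "padj": "0.8"},
    {"transcript_id": "tx4", "log2FoldChange": "NA", "pvalue": "NA", "padj": "NA"},
]


def volcano_points(rows, alpha=0.05, min_abs_lfc=1.0):
    """Return (id, log2FC, -log10(p), label) for every usable row."""
    points = []
    for row in rows:
        if "NA" in (row["log2FoldChange"], row["pvalue"], row["padj"]):
            continue  # skip rows with missing statistics
        lfc = float(row["log2FoldChange"])
        pvalue = float(row["pvalue"])
        padj = float(row["padj"])
        if pvalue <= 0:
            continue  # -log10(p) is undefined for p = 0
        if padj < alpha and lfc >= min_abs_lfc:
            label = "up"
        elif padj < alpha and lfc <= -min_abs_lfc:
            label = "down"
        else:
            label = "ns"  # not significant under these cut-offs
        points.append((row["transcript_id"], lfc, -math.log10(pvalue), label))
    return points


def plot_volcano(points, path):
    """Save a volcano plot; needs matplotlib, imported only when called."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without opening a window
    import matplotlib.pyplot as plt

    colours = {"up": "red", "down": "blue", "ns": "grey"}
    fig, ax = plt.subplots()
    ax.scatter([p[1] for p in points], [p[2] for p in points],
               c=[colours[p[3]] for p in points], s=10)
    ax.set_xlabel("log2 fold change")
    ax.set_ylabel("-log10(p)")
    fig.savefig(path)
    plt.close(fig)


points = volcano_points(ROWS)
for point in points:
    print(point)
```

Calling plot_volcano(points, "volcano_kallisto.png") (a hypothetical file name) once per pipeline produces two images that can be compared directly, ideally with the same axis limits.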