Chapter 6 Quantification of transcripts

After trimming and checking the quality of the trimmed data, we quantify the transcripts. In this step we do not align the reads; instead we use mapping with Salmon or pseudo-alignment with Kallisto. The idea of this pipeline is to pre-process RNA-Seq data, up to the gene expression matrix, on a common desktop or laptop. This becomes unfeasible if we use full alignment, and Salmon and Kallisto were developed to address this problem.

Salmon corrects for GC-content bias, which improves the accuracy of abundance estimates; left uncorrected, this bias can result in high false-positive rates. Kallisto, on the other hand, has proven faster thanks to pseudo-alignment, which does not determine where each read aligns but only which transcripts it is compatible with. Each tool has advantages and disadvantages that follow from its method, and depending on the experimental design one may be more suitable than the other. In this pipeline, we cover both mapping and pseudo-alignment.

6.1 Kallisto

Kallisto is not only a mapper but also a quantifier. Its pseudo-alignment method preserves the information needed for quantification while running fast compared with other tools that have the same focus. As an alignment-free RNA-Seq quantifier, it uses little memory, which allows it to run on common laptops. It works at the transcript level rather than the gene level. It can be run in a single step, or with bootstrap resampling spread over parallel threads in order to obtain uncertainty estimates for the expression levels.

For quantification with Kallisto, it is necessary to build an index of the reference transcriptome. The index is built only once; the same index can be reused to quantify other samples against the same reference. First let’s download the reference transcriptome. In this case, we will use release 40 of the human transcriptome provided by GENCODE:

# -- R environment
setwd("~/PreProcSEQ-main/")
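# download.file() needs a URL scheme; mode = "wb" keeps the gzip file intact on every OS
download.file("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gz",
              "gencode.v40.transcripts.fa.gz", mode = "wb")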

Next we build the index for Kallisto, as follows:

# -- terminal-ubuntu
kallisto index --index=4-quantification/kallisto/gencode_v40_index gencode.v40.transcripts.fa.gz
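
The index command above writes to 4-quantification/kallisto, and the scripts later in this chapter write to quant_kallisto and quant_salmon. A minimal sketch that creates these folders beforehand, assuming they may not exist yet in your copy of the project; the function name make_quant_dirs is hypothetical and not part of the pipeline:

```shell
#!/bin/bash
# Hypothetical helper: create the output folders used in this chapter.
# Assumes it is called with the project root (e.g. ~/PreProcSEQ-main).
make_quant_dirs() {
    local root="$1"
    mkdir -p "$root/4-quantification/kallisto/quant_kallisto" \
             "$root/4-quantification/salmon/quant_salmon"
}

# Demonstration in a throwaway directory so nothing real is touched
demo_root="$(mktemp -d)"
make_quant_dirs "$demo_root"
ls "$demo_root/4-quantification"
rm -rf "$demo_root"
```

In the project itself the call would be make_quant_dirs ~/PreProcSEQ-main; mkdir -p leaves existing folders untouched.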

With the index built, let’s write a script to quantify the transcripts with Kallisto and name it kallisto_quantification.sh.

Script for paired-end samples:

#!/bin/bash
INPUT="2-trimming/trimmomatic/paired"
QUANT="4-quantification"

# Read one sample name per line from samplesNames.txt
time while read -r SAMP
        do
            echo "Processing sample ${SAMP}"
            kallisto quant --index="$QUANT/kallisto/gencode_v40_index" \
            --output="$QUANT/kallisto/quant_kallisto/${SAMP}_quant" --threads=4 \
            "$INPUT/${SAMP}_R1_trimmomatic_paired.fastq" \
            "$INPUT/${SAMP}_R2_trimmomatic_paired.fastq"
        done < samplesNames.txt

Script for single-end samples:

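#!/bin/bash
INPUT="2-trimming/trimmomatic/paired"
QUANT="4-quantification"

# Single-end mode requires the estimated fragment length mean and standard
# deviation; 200 and 20 are placeholders, replace them with your library's values
time while read -r SAMP
        do
            echo "Processing sample ${SAMP}"
            kallisto quant --index="$QUANT/kallisto/gencode_v40_index" \
            --output="$QUANT/kallisto/quant_kallisto/${SAMP}_quant" --threads=4 \
            --single --fragment-length=200 --sd=20 \
            "$INPUT/${SAMP}_R1_trimmomatic_paired.fastq"
        done < samplesNames.txt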

Now let’s make the script executable and run it:

# -- terminal-ubuntu
chmod +x kallisto_quantification.sh
./kallisto_quantification.sh
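
The loop in kallisto_quantification.sh takes every line of samplesNames.txt as a sample name, so a blank line or a Windows-style line ending ends up inside ${SAMP} and therefore in the output directory name (a blank line gives a directory called just _quant). A minimal sketch of a check to run on the list, assuming one sample name per line; the function name clean_sample_list is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: print the sample names from a list file,
# dropping carriage returns and blank (or whitespace-only) lines.
clean_sample_list() {
    tr -d '\r' < "$1" | awk 'NF > 0 { print $1 }'
}

# Demonstration with a throwaway list containing a blank line and a CRLF ending
demo_list="$(mktemp)"
printf 'sample_01\n\nsample_02\r\n' > "$demo_list"
clean_sample_list "$demo_list"
rm -f "$demo_list"
```

One way to use it is clean_sample_list samplesNames.txt > samplesNames.clean.txt, and then to point the loop at the cleaned file.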

If you want to know more about the parameters of the tool, see the Kallisto documentation or type kallisto --help in the terminal.

Quantifications done by Kallisto are saved in 4-quantification/kallisto/quant_kallisto:

# -- terminal-ubuntu
ls ~/PreProcSEQ-main/4-quantification/kallisto/quant_kallisto
## _quant
## sample_01_quant
## sample_02_quant
## sample_03_quant
## sample_04_quant
## sample_05_quant
## sample_06_quant
## sample_07_quant
## sample_08_quant
## sample_09_quant
## sample_10_quant
## sample_11_quant
## sample_12_quant
## sample_13_quant
## sample_14_quant
## sample_15_quant
## sample_16_quant
## sample_17_quant
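
Each of these directories contains an abundance.tsv file with the columns target_id, length, eff_length, est_counts and tpm. As a first look at an expression matrix, the tpm column of every sample can be merged into one table. The following is a minimal sketch under the assumption that all samples were quantified against the same index; the function name kallisto_tpm_matrix is hypothetical and the demonstration uses made-up toy values:

```shell
#!/bin/bash
# Hypothetical helper: merge the tpm column (5th column of abundance.tsv)
# of every *_quant directory into one tab-separated matrix on stdout.
kallisto_tpm_matrix() {
    local quant_dir="$1"
    awk -F'\t' '
        FNR == 1 {                        # header of each file: start a new sample column
            n = split(FILENAME, parts, "/")
            sample = parts[n - 1]         # directory name, e.g. sample_01_quant
            sub(/_quant$/, "", sample)
            samples[++ns] = sample
            next
        }
        !($1 in seen) { seen[$1] = 1; ids[++nt] = $1 }   # remember transcript order
        { tpm[$1, ns] = $5 }
        END {
            header = "target_id"
            for (s = 1; s <= ns; s++) header = header "\t" samples[s]
            print header
            for (t = 1; t <= nt; t++) {
                row = ids[t]
                for (s = 1; s <= ns; s++) row = row "\t" tpm[ids[t], s]
                print row
            }
        }
    ' "$quant_dir"/*_quant/abundance.tsv
}

# Demonstration on toy data (values are invented for illustration only)
demo="$(mktemp -d)"
mkdir -p "$demo/sample_01_quant" "$demo/sample_02_quant"
printf 'target_id\tlength\teff_length\test_counts\ttpm\ntxA\t1000\t800\t10\t400000\ntxB\t2000\t1800\t30\t600000\n' \
    > "$demo/sample_01_quant/abundance.tsv"
printf 'target_id\tlength\teff_length\test_counts\ttpm\ntxA\t1000\t800\t25\t700000\ntxB\t2000\t1800\t15\t300000\n' \
    > "$demo/sample_02_quant/abundance.tsv"
kallisto_tpm_matrix "$demo"
rm -rf "$demo"
```

On the real output the call would be kallisto_tpm_matrix 4-quantification/kallisto/quant_kallisto > kallisto_tpm.tsv. Changing $5 to $4 in the awk program collects est_counts instead of tpm.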

6.2 Salmon

Salmon offers an accurate and fast method for quantifying transcripts. According to its authors, it is the first quantifier to correct for GC-content bias in the fragments, which improves the accuracy of the abundance estimates and the reliability of the resulting transcript expression values. Unlike pseudo-alignment, the method used by Salmon considers both the position and the orientation of all mapped reads.

To run Salmon we also need an index of the reference transcriptome. If you have not yet downloaded the reference transcriptome, run the following command:

# -- R environment
setwd("~/PreProcSEQ-main/")
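# download.file() needs a URL scheme; mode = "wb" keeps the gzip file intact on every OS
download.file("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gz",
              "gencode.v40.transcripts.fa.gz", mode = "wb")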

With the transcriptome downloaded, let’s build the index with Salmon:

# -- terminal-ubuntu
salmon index -t gencode.v40.transcripts.fa.gz -i 4-quantification/salmon/gencode_v40_index

With the index ready, let’s write a script to quantify transcripts with Salmon and name it salmon_quant.sh.

Script for paired-end samples:

#!/bin/bash
# Read one sample name per line from samplesNames.txt
time while read -r SAMP
        do
            echo "Processing sample ${SAMP}"
            salmon quant -i 4-quantification/salmon/gencode_v40_index -l A \
            -1 "2-trimming/trimmomatic/paired/${SAMP}_R1_trimmomatic_paired.fastq" \
            -2 "2-trimming/trimmomatic/paired/${SAMP}_R2_trimmomatic_paired.fastq" \
            -p 4 --gcBias --validateMappings -o "4-quantification/salmon/quant_salmon/${SAMP}_quant"
        done < samplesNames.txt

Script for single-end samples:

#!/bin/bash
# Read one sample name per line from samplesNames.txt
time while read -r SAMP
        do
            echo "Processing sample ${SAMP}"
            salmon quant -i 4-quantification/salmon/gencode_v40_index -l A \
            -r "2-trimming/trimmomatic/paired/${SAMP}_R1_trimmomatic_paired.fastq" \
            -p 4 --gcBias --validateMappings -o "4-quantification/salmon/quant_salmon/${SAMP}_quant"
        done < samplesNames.txt

Let’s then make the script executable and run it:

# -- terminal-ubuntu
chmod +x salmon_quant.sh
./salmon_quant.sh

The results of the quantification done by Salmon are in 4-quantification/salmon/quant_salmon:

# -- terminal-ubuntu
ls ~/PreProcSEQ-main/4-quantification/salmon/quant_salmon
## _quant
## sample_01_quant
## sample_02_quant
## sample_03_quant
## sample_04_quant
## sample_05_quant
## sample_06_quant
## sample_07_quant
## sample_08_quant
## sample_09_quant
## sample_10_quant
## sample_11_quant
## sample_12_quant
## sample_13_quant
## sample_14_quant
## sample_15_quant
## sample_16_quant
## sample_17_quant
## SRR7009362_quant
## SRR7009363_quant
## SRR7009364_quant
## SRR7009365_quant
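
Each of these directories contains a quant.sf file with the columns Name, Length, EffectiveLength, TPM and NumReads. Because TPM values within a sample are scaled to sum to one million, a quick per-sample summary can reveal truncated or empty outputs. The following is a minimal sketch of such a check; the function name salmon_summary is hypothetical and the demonstration uses made-up toy values:

```shell
#!/bin/bash
# Hypothetical helper: one summary line per sample from Salmon's quant.sf
# (columns: Name, Length, EffectiveLength, TPM, NumReads).
salmon_summary() {
    local quant_dir="$1" f sample
    printf 'sample\ttranscripts\texpressed\ttpm_sum\n'
    for f in "$quant_dir"/*_quant/quant.sf; do
        [ -e "$f" ] || continue           # skip if the glob matched nothing
        sample="$(basename "$(dirname "$f")")"
        awk -F'\t' -v s="${sample%_quant}" '
            NR > 1 { n++; if ($5 > 0) expressed++; total += $4 }
            END { printf "%s\t%d\t%d\t%.0f\n", s, n, expressed, total }
        ' "$f"
    done
}

# Demonstration on toy data (values are invented for illustration only)
demo="$(mktemp -d)"
mkdir -p "$demo/sample_01_quant"
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\ntxA\t1000\t800\t400000\t12\ntxB\t2000\t1800\t600000\t18\ntxC\t1500\t1300\t0\t0\n' \
    > "$demo/sample_01_quant/quant.sf"
salmon_summary "$demo"
rm -rf "$demo"
```

On the real output the call would be salmon_summary 4-quantification/salmon/quant_salmon. A tpm_sum far from 1000000, or a transcript count that differs between samples, suggests that the run for that sample did not finish cleanly.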