Chapter 5 Trimming

Depending on the QC result, it is necessary to remove adapter sequences, unidentified bases, bases with very low quality scores, and these runs are done in the data trimming step. When this step is not included in the pipeline, sequencing errors, for example, can lead to incorrect alignment of the readings. Currently, sequencers generate short and long reads ranging from 25 to hundreds of bp, but errors occur more commonly on long reads. The effect of these errors is observed through the probability that a base has been called incorrectly. The most used metric is the Phred quality index (Q) used in Illumina platforms, ranging from 0 to 40, indicating that the higher the score, the better the sequencing quality.

If we choose to carry out the trimming step, we need to be careful with the parameters used by the tool. Knowing well how the tool performs the procedure and what it does also becomes essential knowledge, in order to avoid false positive results. There are many tools that do data trimming. Each with a different metric and focus. However, some are more used and very well documented. In this work we chose to work with Trimmomatic. The Trimmomatic is one of the oldest and most used tools for adapter cutting and filtering.

To run Trimmomatic, we build a txt file in R, containing the names of the files to be analyzed (samplesNames.txt):

setwd("~/PreProcSEQ-main/")
dir_files <- "~/PreProcSEQ-main/0-samples/"
files <- list.files(dir_files)
samples_names <- unique(gsub("_R..fastq","",files))
write.table(samples_names, "samplesNames.txt", quote = F, row.names = F, col.names = F)
samples_names
##  [1] "sample_01" "sample_02" "sample_03" "sample_04" "sample_05" "sample_06"
##  [7] "sample_07" "sample_08" "sample_09" "sample_10" "sample_11" "sample_12"
## [13] "sample_13" "sample_14" "sample_15" "sample_16" "sample_17"

Then we build the script in BASH to run Trimmomatic on each sample, remembering that the data is paired (files R1 and R2 belong to the same sample):

#!/bin/bash
INPUT="0-samples"
OUTPUTP="2-trimming/trimmomatic/paired"
OUTPUTUNP="2-trimming/trimmomatic/unpaired"

time while read SAMP \n
            do
            TrimmomaticPE $INPUT/${SAMP}_R1.fastq $INPUT/${SAMP}_R2.fastq \
            $OUTPUTP/${SAMP}_R1_trimmomatic_paired.fastq \
            $OUTPUTUNP/${SAMP}_R1_trimmomatic_unpaired.fastq \
            $OUTPUTP/${SAMP}_R2_trimmomatic_paired.fastq \
            $OUTPUTUNP/${SAMP}_R2_trimmomatic_unpaired.fastq \
            ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 \
            LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
        done < samplesNames.txt

If the data were not paired, the script would look like this:

#!/bin/bash
INPUT="0-samples"
OUTPUTP="2-trimming/trimmomatic/paired"
OUTPUTUNP="2-trimming/trimmomatic/unpaired"

time while read SAMP \n
            do
            TrimmomaticSE $INPUT/${SAMP}.fastq \
            $OUTPUTP/${SAMP}_trimmomatic_paired.fastq \
            $OUTPUTUNP/${SAMP}_trimmomatic_unpaired.fastq \
            ILLUMINACLIP:TruSeq3-SE-2.fa:2:30:10 \
            LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
        done < samplesNames.txt

Among the parameters that the tool uses, we use LEADING:3 to remove the main bases of low quality or N; TRAILING:3 to remove the last bases with low quality or N; SLIDINGWINDOW:4:15 evaluates sequence using a sliding window of 4 bases and cutting when the average quality per base is below 15; MINLEN:36 cut readings below 36bp; ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 for removing adapters. For more information about the parameters that Trimmomatic offers click here.

We save the script to run on paired samples with the name: trimming_trimmomatic.sh. We convert it into an executable and then run it:

chmod +x trimming_trimmomatic.sh
./trimming_trimmomatic.sh

The filtered files were stored in 2-trimming/trimmomatic/paired, the readings that were removed were stored in 2-trimming/trimmomatic/unpaired.

ls ~/PreProcSEQ-main/2-trimming/trimmomatic/paired
## sample_01_R1_trimmomatic_paired.fastq
## sample_01_R2_trimmomatic_paired.fastq
## sample_02_R1_trimmomatic_paired.fastq
## sample_02_R2_trimmomatic_paired.fastq
## sample_03_R1_trimmomatic_paired.fastq
## sample_03_R2_trimmomatic_paired.fastq
## sample_04_R1_trimmomatic_paired.fastq
## sample_04_R2_trimmomatic_paired.fastq
## sample_05_R1_trimmomatic_paired.fastq
## sample_05_R2_trimmomatic_paired.fastq
## sample_06_R1_trimmomatic_paired.fastq
## sample_06_R2_trimmomatic_paired.fastq
## sample_07_R1_trimmomatic_paired.fastq
## sample_07_R2_trimmomatic_paired.fastq
## sample_08_R1_trimmomatic_paired.fastq
## sample_08_R2_trimmomatic_paired.fastq
## sample_09_R1_trimmomatic_paired.fastq
## sample_09_R2_trimmomatic_paired.fastq
## sample_10_R1_trimmomatic_paired.fastq
## sample_10_R2_trimmomatic_paired.fastq
## sample_11_R1_trimmomatic_paired.fastq
## sample_11_R2_trimmomatic_paired.fastq
## sample_12_R1_trimmomatic_paired.fastq
## sample_12_R2_trimmomatic_paired.fastq
## sample_13_R1_trimmomatic_paired.fastq
## sample_13_R2_trimmomatic_paired.fastq
## sample_14_R1_trimmomatic_paired.fastq
## sample_14_R2_trimmomatic_paired.fastq
## sample_15_R1_trimmomatic_paired.fastq
## sample_15_R2_trimmomatic_paired.fastq
## sample_16_R1_trimmomatic_paired.fastq
## sample_16_R2_trimmomatic_paired.fastq
## sample_17_R1_trimmomatic_paired.fastq
## sample_17_R2_trimmomatic_paired.fastq
ls ~/PreProcSEQ-main/2-trimming/trimmomatic/unpaired
## sample_01_R1_trimmomatic_unpaired.fastq
## sample_01_R2_trimmomatic_unpaired.fastq
## sample_02_R1_trimmomatic_unpaired.fastq
## sample_02_R2_trimmomatic_unpaired.fastq
## sample_03_R1_trimmomatic_unpaired.fastq
## sample_03_R2_trimmomatic_unpaired.fastq
## sample_04_R1_trimmomatic_unpaired.fastq
## sample_04_R2_trimmomatic_unpaired.fastq
## sample_05_R1_trimmomatic_unpaired.fastq
## sample_05_R2_trimmomatic_unpaired.fastq
## sample_06_R1_trimmomatic_unpaired.fastq
## sample_06_R2_trimmomatic_unpaired.fastq
## sample_07_R1_trimmomatic_unpaired.fastq
## sample_07_R2_trimmomatic_unpaired.fastq
## sample_08_R1_trimmomatic_unpaired.fastq
## sample_08_R2_trimmomatic_unpaired.fastq
## sample_09_R1_trimmomatic_unpaired.fastq
## sample_09_R2_trimmomatic_unpaired.fastq
## sample_10_R1_trimmomatic_unpaired.fastq
## sample_10_R2_trimmomatic_unpaired.fastq
## sample_11_R1_trimmomatic_unpaired.fastq
## sample_11_R2_trimmomatic_unpaired.fastq
## sample_12_R1_trimmomatic_unpaired.fastq
## sample_12_R2_trimmomatic_unpaired.fastq
## sample_13_R1_trimmomatic_unpaired.fastq
## sample_13_R2_trimmomatic_unpaired.fastq
## sample_14_R1_trimmomatic_unpaired.fastq
## sample_14_R2_trimmomatic_unpaired.fastq
## sample_15_R1_trimmomatic_unpaired.fastq
## sample_15_R2_trimmomatic_unpaired.fastq
## sample_16_R1_trimmomatic_unpaired.fastq
## sample_16_R2_trimmomatic_unpaired.fastq
## sample_17_R1_trimmomatic_unpaired.fastq
## sample_17_R2_trimmomatic_unpaired.fastq

Let’s verify that the trimming removed the low quality score readings, Ns, and adapter sequences by re-evaluating the QC on this trimmed data:

fastqc 2-trimming/trimmomatic/paired/*.fastq -o 3-qualityControl_afterTrimming/outputFastqc
multiqc 3-qualityControl_afterTrimming/outputFastqc/. -o 3-qualityControl_afterTrimming/outputMultiqc

FastQC results are in 3-qualityControl_afterTrimming/outputFastqc and MultiQC results are in 3-qualityControl_afterTrimming/outputMultiqc.

Figure 2: Quality control after trimming. The background color indicates whether the region is bad (red), acceptable (yellow), and great (green). In (A), (B) and (C) each line represents a FASTQ file of the project. (A) Average of quality scores. The X axis represents the position of the nucleotides and the Y axis indicates the quality score on the phred scale. (B) Quality score by sequence. The X axis represents the average phred score and the Y axis indicates the number of readings. (C) N content. The X axis represents the position of the base and the Y axis indicates the amount of N in percentage. (D) Content of adapters.