4 Trimming and Filtering
4.1 Cleaning reads
In the previous episode, we took a high-level look at the quality of each of our samples using FastQC. We visualized per-base quality graphs showing the distribution of read quality at each base across all reads in a sample and extracted information about which samples fail which quality checks. Some of our samples failed quite a few quality metrics used by FastQC. This does not mean, though, that our samples should be thrown out! It is very common to have some quality metrics fail, and this may or may not be a problem for your downstream application. For our variant calling workflow, we will be removing some of the low quality sequences to reduce our false positive rate due to sequencing error.
We will use a program called Trimmomatic to filter poor quality reads and trim poor quality bases from our samples.
4.2 Trimmomatic options
Trimmomatic has a variety of options to trim your reads. If we run the following command, we can see some of our options.
cd ~/dc_workshop/data/untrimmed_fastq
source $HOME/miniconda3/bin/activate
conda activate variant
trimmomatic
Which will give you the following output:
Usage: PE [-version] [-threads <threads] [-phred33|-phred64] [-trimlog <trimLogFile] [-summary <statsSummaryFile] [-quiet] [-validatePairs] [-basein <inputBase| <inputFile1<inputFile2] [-baseout <outputBase| <outputFile1P<outputFile1U<outputFile2P<outputFile2U] <trimmer1… or: SE [-version] [-threads <threads] [-phred33|-phred64] [-trimlog <trimLogFile] [-summary <statsSummaryFile] [-quiet] <inputFile<outputFile<trimmer1… or: -version
This output shows us that we must first specify whether we have paired end (PE
) or single end (SE
) reads.
Next, we specify what flag we would like to run. For example, you can specify threads
to indicate the number of
processors on your computer that you want Trimmomatic to use. In most cases using multiple threads (processors) can help to run the trimming faster. These flags are not necessary, but they can give you more control over the command. The flags are followed by positional arguments, meaning the order in which you specify them is important.
In paired end mode, Trimmomatic expects the two input files, and then the names of the output files. These files are described below. While, in single end mode, Trimmomatic will expect 1 file as input, after which you can enter the optional settings and lastly the name of the output file.
option | meaning |
---|---|
<inputFile1 | Input reads to be trimmed. Typically the file name will contain an _1 or _R1 in the name. |
<inputFile2 | Input reads to be trimmed. Typically the file name will contain an _2 or _R2 in the name. |
<outputFile1P | Output file that contains surviving pairs from the _1 file. |
<outputFile1U | Output file that contains orphaned reads from the _1 file. |
<outputFile2P | Output file that contains surviving pairs from the _2 file. |
<outputFile2U | Output file that contains orphaned reads from the _2 file. |
The last thing trimmomatic expects to see is the trimming parameters:
step | meaning |
---|---|
ILLUMINACLIP |
Perform adapter removal. |
SLIDINGWINDOW |
Perform sliding window trimming, cutting once the average quality within the window falls below a threshold. |
LEADING |
Cut bases off the start of a read, if below a threshold quality. |
TRAILING |
Cut bases off the end of a read, if below a threshold quality. |
CROP |
Cut the read to a specified length. |
HEADCROP |
Cut the specified number of bases from the start of the read. |
MINLEN |
Drop an entire read if it is below a specified length. |
TOPHRED33 |
Convert quality scores to Phred-33. |
TOPHRED64 |
Convert quality scores to Phred-64. |
We will use only a few of these options and trimming steps in our analysis. It is important to understand the steps you are using to clean your data. For more information about the Trimmomatic arguments and options, see the Trimmomatic manual.
However, a complete command for Trimmomatic will look something like the command below. This command is an example and will not work, as we do not have the files it refers to:
trimmomatic PE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq
SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq
SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq
ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20
In this example, we have told Trimmomatic:
code | meaning |
---|---|
PE |
that it will be taking a paired end file as input |
-threads 4 |
to use four computing threads to run (this will speed up our run) |
SRR_1056_1.fastq |
the first input file name |
SRR_1056_2.fastq |
the second input file name |
SRR_1056_1.trimmed.fastq |
the output file for surviving pairs from the _1 file |
SRR_1056_1un.trimmed.fastq |
the output file for orphaned reads from the _1 file |
SRR_1056_2.trimmed.fastq |
the output file for surviving pairs from the _2 file |
SRR_1056_2un.trimmed.fastq |
the output file for orphaned reads from the _2 file |
ILLUMINACLIP:SRR_adapters.fa |
to clip the Illumina adapters from the input file using the adapter sequences listed in SRR_adapters.fa |
SLIDINGWINDOW:4:20 |
to use a sliding window of size 4 that will remove bases if their phred score is below 20 |
4.3 Multi-line commands
Some of the commands we ran in this lesson are long! When typing a long
command into your terminal, you can use the \
character
to separate code chunks onto separate lines. This can make your code more readable.
4.4 Running Trimmomatic
Now we will run Trimmomatic on our data. To begin, navigate to your untrimmed_fastq
data directory:
cd ~/dc_workshop/data/untrimmed_fastq
We are going to run Trimmomatic on one of our paired-end samples. While using FastQC we saw that Nextera adapters were present in our samples. The adapter sequences came with the installation of trimmomatic, so we will first copy these sequences into our current directory.
Note You might have to change your path here, check your version and folders for this file:
cd ~/dc_workshop/data/untrimmed_fastq/
cp -R ~/miniconda3/pkgs/trimmomatic-0.39-hdfd78af_2/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa ~/dc_workshop/data/untrimmed_fastq/
We will also use a sliding window of size 4 that will remove bases if their phred score is below 20 (like in our example above). We will also discard any reads that do not have at least 25 bases remaining after this trimming step. Three additional pieces of code are also added to the end of the ILLUMINACLIP step. These three additional numbers (2:40:15) tell Trimmimatic how to handle sequence matches to the Nextera adapters. A detailed explanation of how they work is advanced for this particular lesson. For now we will use these numbers as a default and recognize they are needed to for Trimmomatic to run properly. This command will take a few minutes to run.
cd ~/dc_workshop/data/untrimmed_fastq
source $HOME/miniconda3/bin/activate
conda activate variant
trimmomatic PE SRR2584863_1.fastq.gz SRR2584863_2.fastq.gz \
\
SRR2584863_1.trim.fastq.gz SRR2584863_1un.trim.fastq.gz \
SRR2584863_2.trim.fastq.gz SRR2584863_2un.trim.fastq.gz
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
## TrimmomaticPE: Started with arguments:
## SRR2584863_1.fastq.gz SRR2584863_2.fastq.gz SRR2584863_1.trim.fastq.gz SRR2584863_1un.trim.fastq.gz SRR2584863_2.trim.fastq.gz SRR2584863_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
## Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
## Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
## ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
## Quality encoding detected as phred33
## Input Read Pairs: 1553259 Both Surviving: 1271924 (81.89%) Forward Only Surviving: 273488 (17.61%) Reverse Only Surviving: 3861 (0.25%) Dropped: 3986 (0.26%)
## TrimmomaticPE: Completed successfully
4.5 Exercise
Use the output from your Trimmomatic command to answer the following questions.
- What percent of reads did we discard from our sample?
- What percent of reads did we keep both pairs?
4.6 Solution
You may have noticed that Trimmomatic automatically detected the quality encoding of our sample. It is always a good idea to double-check this or to enter the quality encoding manually.
We can confirm that we have our output files:
cd ~/dc_workshop/data/untrimmed_fastq/
ls SRR2584863*
## SRR2584863_1.fastq
## SRR2584863_1.fastq.gz
## SRR2584863_1.trim.fastq.gz
## SRR2584863_1un.trim.fastq.gz
## SRR2584863_2.fastq.gz
## SRR2584863_2.trim.fastq.gz
## SRR2584863_2un.trim.fastq.gz
The output files are also FASTQ files. It should be smaller than our input file, because we have removed reads. We can confirm this:
cd ~/dc_workshop/data/untrimmed_fastq/
ls -lh SRR2584863*
gunzip -f -dk SRR2584863_1.trim.fastq.gz
gunzip -f -dk SRR2584863_2.trim.fastq.gz
ls -lh SRR2584863*
## -rw-r--r-- 1 ggiaever staff 545M Nov 24 22:38 SRR2584863_1.fastq
## -rw-r--r-- 1 ggiaever staff 175M Nov 24 22:38 SRR2584863_1.fastq.gz
## -rw-r--r-- 1 ggiaever staff 134M Nov 27 21:33 SRR2584863_1.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 23M Nov 27 21:33 SRR2584863_1un.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 182M Nov 24 22:38 SRR2584863_2.fastq.gz
## -rw-r--r-- 1 ggiaever staff 131M Nov 27 21:33 SRR2584863_2.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 377K Nov 27 21:33 SRR2584863_2un.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 545M Nov 24 22:38 SRR2584863_1.fastq
## -rw-r--r-- 1 ggiaever staff 175M Nov 24 22:38 SRR2584863_1.fastq.gz
## -rw-r--r-- 1 ggiaever staff 426M Nov 27 21:33 SRR2584863_1.trim.fastq
## -rw-r--r-- 1 ggiaever staff 134M Nov 27 21:33 SRR2584863_1.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 23M Nov 27 21:33 SRR2584863_1un.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 182M Nov 24 22:38 SRR2584863_2.fastq.gz
## -rw-r--r-- 1 ggiaever staff 383M Nov 27 21:33 SRR2584863_2.trim.fastq
## -rw-r--r-- 1 ggiaever staff 131M Nov 27 21:33 SRR2584863_2.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 377K Nov 27 21:33 SRR2584863_2un.trim.fastq.gz
Let’s look at how much trimming occured on our SRR2584863_1.trim.fastq & SRR2584863_2.trim.fastq files. 426M and 383M, 119M & 162M smaller than their original size of 545M. So it looks like more trimming occurred on the SRR2584863_2.trim.fastq.gz file. Remember SRR2584863_2 failed the per base sequence quality report, so this makes sense.
We have just successfully run Trimmomatic on one of our FASTQ files!
However, there is some bad news. Trimmomatic can only operate on
one sample at a time and we have more than one sample. The good news
is that we can use a for
loop to iterate through our sample files
quickly!
We unzipped one of our files before to work with it, let’s compress it again before we run our for loop.
cd ~/dc_workshop/data/untrimmed_fastq/
gzip -f SRR2584863_1.fastq
cd ~/dc_workshop/data/untrimmed_fastq/
source $HOME/miniconda3/bin/activate
conda activate variant
for infile in *_1.fastq.gz
do
base=$(basename ${infile} _1.fastq.gz)
trimmomatic PE ${infile} ${base}_2.fastq.gz \
${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz \
${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 done
## TrimmomaticPE: Started with arguments:
## SRR2584863_1.fastq.gz SRR2584863_2.fastq.gz SRR2584863_1.trim.fastq.gz SRR2584863_1un.trim.fastq.gz SRR2584863_2.trim.fastq.gz SRR2584863_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
## Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
## Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
## ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
## Quality encoding detected as phred33
## Input Read Pairs: 1553259 Both Surviving: 1271924 (81.89%) Forward Only Surviving: 273488 (17.61%) Reverse Only Surviving: 3861 (0.25%) Dropped: 3986 (0.26%)
## TrimmomaticPE: Completed successfully
## TrimmomaticPE: Started with arguments:
## SRR2584866_1.fastq.gz SRR2584866_2.fastq.gz SRR2584866_1.trim.fastq.gz SRR2584866_1un.trim.fastq.gz SRR2584866_2.trim.fastq.gz SRR2584866_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
## Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
## Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
## ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
## Quality encoding detected as phred33
## Input Read Pairs: 2768398 Both Surviving: 2053874 (74.19%) Forward Only Surviving: 578725 (20.90%) Reverse Only Surviving: 131060 (4.73%) Dropped: 4739 (0.17%)
## TrimmomaticPE: Completed successfully
## TrimmomaticPE: Started with arguments:
## SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz SRR2589044_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz SRR2589044_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
## Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
## Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
## Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
## ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
## Quality encoding detected as phred33
## Input Read Pairs: 1107090 Both Surviving: 885220 (79.96%) Forward Only Surviving: 216472 (19.55%) Reverse Only Surviving: 2850 (0.26%) Dropped: 2548 (0.23%)
## TrimmomaticPE: Completed successfully
Go ahead and run the for loop. It should take a few minutes for
Trimmomatic to run for each of our four input files. Once it is done
running, take a look at your directory contents. You will notice that even though we ran Trimmomatic on file SRR2584863_1
before running the for loop, there is only one set of files for it. Because we matched the ending _1.fastq.gz
, we re-ran Trimmomatic on this file, overwriting our first results. That is ok, but it is good to be aware that it happened.
cd ~/dc_workshop/data/untrimmed_fastq/
ls
## NexteraPE-PE.fa
## SRR2584863_1.fastq.gz
## SRR2584863_1.trim.fastq
## SRR2584863_1.trim.fastq.gz
## SRR2584863_1un.trim.fastq.gz
## SRR2584863_2.fastq.gz
## SRR2584863_2.trim.fastq
## SRR2584863_2.trim.fastq.gz
## SRR2584863_2un.trim.fastq.gz
## SRR2584866_1.fastq.gz
## SRR2584866_1.trim.fastq.gz
## SRR2584866_1un.trim.fastq.gz
## SRR2584866_2.fastq.gz
## SRR2584866_2.trim.fastq.gz
## SRR2584866_2un.trim.fastq.gz
## SRR2589044_1.fastq.gz
## SRR2589044_1.trim.fastq.gz
## SRR2589044_1un.trim.fastq.gz
## SRR2589044_2.fastq.gz
## SRR2589044_2.trim.fastq.gz
## SRR2589044_2un.trim.fastq.gz
4.7 Exercise
We trimmed our fastq files with Nextera adapters, but there are other adapters that are commonly used. What other adapter files came with Trimmomatic?
4.8 Solution
We have now completed the trimming and filtering steps of our quality
control process! Before we move on, let’s move our trimmed FASTQ files
to a new subdirectory within our data/
directory.
cd ~/dc_workshop/data/untrimmed_fastq
mkdir -p ../trimmed_fastq
mv *.trim* ../trimmed_fastq
cd ../trimmed_fastq
ls -lah
## total 3807696
## drwxr-xr-x 16 ggiaever staff 512B Nov 27 21:41 [34m.[39;49m[0m
## drwxr-xr-x 8 ggiaever staff 256B Nov 25 23:41 [34m..[39;49m[0m
## -rw-r--r-- 1 ggiaever staff 426M Nov 27 21:33 SRR2584863_1.trim.fastq
## -rw-r--r-- 1 ggiaever staff 134M Nov 27 21:36 SRR2584863_1.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 23M Nov 27 21:36 SRR2584863_1un.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 383M Nov 27 21:33 SRR2584863_2.trim.fastq
## -rw-r--r-- 1 ggiaever staff 131M Nov 27 21:36 SRR2584863_2.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 377K Nov 27 21:36 SRR2584863_2un.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 210M Nov 27 21:39 SRR2584866_1.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 40M Nov 27 21:39 SRR2584866_1un.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 211M Nov 27 21:39 SRR2584866_2.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 13M Nov 27 21:39 SRR2584866_2un.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 93M Nov 27 21:41 SRR2589044_1.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 17M Nov 27 21:41 SRR2589044_1un.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 91M Nov 27 21:41 SRR2589044_2.trim.fastq.gz
## -rw-r--r-- 1 ggiaever staff 271K Nov 27 21:41 SRR2589044_2un.trim.fastq.gz
4.9 Bonus exercise (advanced)
Now that our samples have gone through quality control, they should perform better on the quality tests run by FastQC. Go ahead and re-run FastQC on your trimmed FASTQ files and visualize the HTML files to see whether your per base sequence quality is higher after trimming. Save the HTML files in ~/dc_workshop/fastqc_html/trimmed, then take a look at the html files in your browser.