Chapter 5 Week 2 Part A - Quality control in Bioinformatic Sequence Analysis

5.1 Learning Objectives

At the conclusion of today’s workshop students are expected to be able to:

  • Understand the importance of quality control when dealing with bioinformatic data

  • Perform quality control using FastQC and identify any potential quality control concerns

  • Rectify sequence quality issues using common read QC tools

  • Align RNAseq reads to a reference genome

  • Quantify gene expression using common bioinformatic tools

5.2 Workshop setup

Please execute the following commands immediately after logging into the terminal. You will learn what these commands do in this and future workshop.

  • cd ~

  • mkdir week_2

  • cd week_2

  • ln -s /apps/data/bms5021/week_2/ resources

The last command in this series creates a “symbolic link”. A symbolic link is like a shortcut. This provides you access to the data, which resides in /apps/data/bms5021/week_2/, and will be used over the course of the next few weeks. Note that you can read these files but you cannot modify them (this is to prevent any inadvertent deletions or edits to the raw data!)

We will use symbolic links a few times over the next few weeks to link to large raw data files and reference genomes.

5.3 The impact of gamma radiation on gene expression

Over the next two weeks we will be evaluating RNAseq data from Maremonti and colleagues (2020). In this study, the authors evaluated the effects of radiation of the transcriptome of C Elegans. They had several experimental groups comprised of different worms exposed to increasing radiation dosages. The data that we are examining pertains to the control group (0gy radiation) and the high radiation group (100gy radiation). Our task is to examine how 100gy radiation affects gene expression of the worms.

5.3.0.0.0.1 Maremonti E, Eide DM, Rossbach LM, Lind OC, Salbu B, Brede DA. In vivo assessment of reactive oxygen species production and oxidative stress effects induced by chronic exposure to gamma radiation in Caenorhabditis elegans. Free Radic Biol Med. 2020 May 20;152:583-596. doi: 10.1016/j.freeradbiomed.2019.11.037. Epub 2019 Dec 2. PMID: 31805397.

5.4 Quality control of sequencing reads

We can use the fastqc tool to rapidly assess the quality of our sequencing and inform downstream quality control processing. Below are a list of the metrics that fastqc reports provide.

  • Per Base Sequence Quality: This metric evaluates the quality scores of each base position in the reads. Quality scores are typically represented as Phred scores, which quantify the base-calling error probability. Lower Phred scores indicate lower confidence in the base call and may suggest poor sequencing quality.

    • Potential cause: In a low-quality region, the Phred scores may drop below a certain threshold (e.g., below 20), leading to increased chances of base-calling errors. This can arise due to issues like sequencing machine malfunctions, low-quality sequencing reagents, or problems with the sample preparation process.
  • Per Sequence Quality Scores: This metric presents a graph showing the distribution of average quality scores across all sequences in the dataset. It helps identify whether the majority of sequences have high or low average quality scores.

    • Potential cause: If the graph shows a skewed distribution towards low average quality scores, it may indicate that a significant portion of the data contains poor quality reads. This could be due to issues such as over-amplification during library preparation, contamination, or sequencing artifacts.
  • Per Base Sequence Content: This metric assesses the percentage of each base position that corresponds to specific nucleotides (A, C, G, or T). It helps identify potential biases or contamination in the sequencing data.

    • Potential cause: A sudden drop in the percentage of a specific nucleotide at certain positions may suggest sequencing-specific biases or contaminants. This could arise from technical artifacts, PCR biases, or even contamination during sample handling. Tagmentation based libraries will usually show a non-uniform per base sequence content in the first 20 nucleotides of reads.
  • Per Base GC Content: This metric analyzes the distribution of GC content across all reads. Deviations from the expected GC content may indicate potential issues.

    • Potential cause: If the GC content distribution shows a bimodal pattern, it could indicate a mixed sample, adapter contamination, or PCR amplification biases.
  • Sequence Duplication Levels: This metric assesses the percentage of duplicate sequences in the dataset. High duplication levels may indicate PCR amplification biases or over-representation of certain regions in the genome.

    • Example: If a significant portion of the data comprises duplicate reads, it might suggest PCR artifacts, library preparation biases, or issues during sample indexing. Some libraries, such as RNAseq libraries, will have biological sequence duplication owing to gene expression patterns. It is important to consider this before deduplicating data.
  • Overrepresented Sequences: FastQC flags sequences that are overrepresented in the dataset. Such sequences may indicate adapter contamination, PCR artifacts, or even contamination from other sources.

    • Potential Cause: Overrepresented sequences could be adapter sequences or primer dimers that were not properly removed during library preparation or contaminants that originated from the experimental environment.
  • Adapter content

    • FastQC will perform an automatic analysis to determine the presence of common adapter sequences. If no adapters are present these lines should be flat. If you observe a steady rise in adapter content towards the end of a read, this is indicative of adapters on the reads.

In the ~/week_2/resources/data you will find six pairs of sequencing reads. We need to perform quality assessment on each set of reads.

Exercise 1: Construct a script that will loop over each .fq file and execute fastqc

  • Make a new directory in week_2 called src. This will be where you store all of your scripts.

  • Use nano to create a new script called fastqc.sh

  • write a FOR loop to loop over all the .fq files in ~/week_2/resources/data

    • Set the output to go to a new directory called fastqc_results in week_2
  • Launch an Rstudio session in your browser from https://bioinformatics-training.cloud.edu.au/ . Use the file browser to download the .html files. If you are unsure how to do this approach one of the teaching staff

  • Review the fastqc results and discuss at your table. What do they mean?

Next we will be performing quality control on these reads. Have a think about what quality control we should perform. Is there evidence of adapters? Should we quality trim reads? Should we discard any sequences? These are all decisions that we have to make in any sequencing experiment. You should justify your decisions and make notes of them in a README style document.

5.5 Trimming reads with TrimGalore!

TrimGalore! is a tool that wraps cutadapt and fastqc into a convenient, flexible and easy to use tool. When correctly configured, TrimGalore! can:

  • Perform basic quality trimming using the cutadapt method (with both default and user-set phred thresholds)

  • Remove adapter sequences

  • Perform hard trimming (removing n number of bases from either 3’ or 5’ end of the reads)

  • Remove reads below a threshold length

  • Handle paired-end sequences in a manner that ensures only reads that are paired are written to a file

  • Run Fastqc analysis on trimmed reads

Lets explore the default behaviour of TrimGalore! This tool will trim based from the 3’ end of a file (as per the BWA and cutadapt method) until it reaches a nucleotide that exceeds the quality threshold. The phred threshold is Q20, which corresponds to a 1% probability of an incorrect basecall. This can be modified using the -q option. For most analyses, Q20 represents a good trade off between quality and lost reads. TrimGalore! will also simultaneously detect and remove adapter sequences. TrimGalore! will automatically detect the following adapters:

  • Illumina standard adapters: AGATCGGAAGAGC

  • Illumina small RNA adapters: TGAATTCTCGG

  • Nextera adapters: CTGTCTTATA

We can also specify the adapter sequence if we are using non-standard adapters using the -a option. Other default parameters are as follows:

Parameter and Option Flag Default Description
–stringency 1 Overlap with adapter sequence required to trigger trimming.
–length 20 Minimum length required to retain a read after trimming. Paired-end requires both sequence meet this threshold.
-j 1 Number of cores used in analysed. Required python3 and pigz packages.
–paired Single-end reads Whether the reads are from a paired-end or single-end experiment. Note that both reads must be provided as input if this option is selected

For the purposes of this workshop and your assessment to leave the majority of the default parameters in place. Lets move onto the standard syntax of TrimGalore!

Here is an example of how to write a TrimGalore script to analyse paired end sequencing:

#!/bin/bash
RESULTS=PATH_TO_RESULTS
READ_1=PATH_TO_READ_1
READ_2=PATH_TO_READ_2
trim_galore --fastqc --paired --output_dir $RESULTS $READ_1 $READ_2
  • First we set three variables, corresponding to the desired path of our trimmed reads results files, and the location of the forward and reverse reads.

  • trim_galore - This is the tool we are calling.

  • --fastqc - This instructs TrimGalore! to run fastqc on the sequencing reads after trimming.

  • --paired - This informs TrimGalore! that we are conducting trimming on paired-end sequences and changes the behaviour of the tool accordingly.

  • --output_dir - This instructs TrimGalore! where to write the trimming report, trimmed reads and the fastqc reports.

#!/bin/bash
OUTPUT=~/week_2/fastqc_results
# Create the directory to output things to (-p creates it if it doesn't exist)
mkdir -p ~/week_2/fastqc_results
for file in ~/week_2/resources/data/*.fq; do
        # lets do fastqc on each file and output it to our new directory.
        fastqc "$file" --outdir="$OUTPUT"
done

Exercise 2: Trimming reads using TrimGalore!

  • In the ~/week2/src directory create a new script called trim_reads.sh

  • Using the above template, create a script that will run TrimGalore! the reads in the data directory.

  • You may choose to include variables within your script, or use command line arguments to direct your script to the fastqfiles. Remember to set the output_dir appropriately to ensure the trimmed reads and associated files are written to the correct location. Write these to a directory called trimmed_reads_trimGalore

#!/bin/bash
RESULTS=~/week_2/trimmed_reads_TrimGalore
#READ_1=~/week_2/resources/data/*R1.fq
#READ_2=~/week_2/resources/data/*R2.fq
# can include a script here to create the results directory if you want
# mkdir -p $RESULTS
for READ1 in ~/week_2/resources/data/*R1.fq; do
        READ2="${READ1/R1/R2}"
        trim_galore --fastqc --paired --output_dir $RESULTS $READ1 $READ2
done

5.6 Trimming reads with Trimmomatic

Trimmomatic is another tool that we can use to remove adapter sequences and perform basic read quality control. Trimmomatic uses a sliding window approach to removing bases from sequencing reads. We will not be using trimmomatic in this course, however there are a number of differences between Trimmomatic and TrimGalore!

Exercise 3: Compare and contrast TrimGalore! and Trimmomatic