Chapter 3 Installing PreProcSEQ and the tools it uses

3.1 Installing PreProcSEQ

Download the PreProcSEQ repository as a zip archive into the home directory of your Linux machine. Ubuntu is preferred, because the Bash commands in this documentation target that distribution; if you work with another distro, adapt the commands as needed:

cd ~
wget https://github.com/resendejss/PreProcSEQ/archive/refs/heads/main.zip
unzip main.zip

The commands used in this documentation assume that you downloaded the repository and unzipped it in the home directory (~). Change into the PreProcSEQ directory in the terminal before running the scripts:

cd ~/PreProcSEQ-main
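
Before going further, it can help to confirm that the archive was unpacked where the rest of this documentation expects it. The sketch below is only an illustration: check_repo_dir is a hypothetical helper name (it is not part of PreProcSEQ), and the default path assumes the unzip step above was run in the home directory.

```shell
# Hypothetical helper (not part of PreProcSEQ): report whether the
# unzipped repository directory exists. Defaults to ~/PreProcSEQ-main.
check_repo_dir() {
    local dir="${1:-$HOME/PreProcSEQ-main}"
    if [ -d "$dir" ]; then
        echo "found: $dir"
    else
        echo "missing: $dir" >&2
        return 1
    fi
}

check_repo_dir || echo "Repository not found; repeat the wget and unzip steps."
```

If the archive was unzipped somewhere else, pass that directory as the first argument instead.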

3.2 Installing the tools that PreProcSEQ uses

Each topic below gives a brief summary of one tool and some ways to install it. We will say more about each tool's functions and results when we run it.

To simplify installation, we built the installTools.sh script, which contains the installation commands for all the tools the pipeline needs. To install all of the tools at once, run the command below in the Linux terminal:

./installTools.sh
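
After an installation, a quick sanity check is to ask the shell where each tool lives. The following is a minimal sketch, not something shipped with PreProcSEQ: report_tools is a hypothetical helper, and the command names in the last line are the usual launcher names for these tools. Trimmomatic is left out because its launcher name depends on how it was installed.

```shell
# Hypothetical helper: print one line per command name, either "ok" with
# its location or "missing". Returns non-zero if any command is not found.
report_tools() {
    local rc=0 tool location
    for tool in "$@"; do
        if location=$(command -v "$tool" 2>/dev/null); then
            echo "ok       $tool -> $location"
        else
            echo "missing  $tool"
            rc=1
        fi
    done
    return "$rc"
}

report_tools fastqc multiqc salmon kallisto R || echo "Some tools are not on PATH yet."
```

A "missing" line only means the command is not on PATH in the current shell; a tool installed in a Conda environment, for example, shows up only after that environment is activated.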

To install each tool manually instead, follow the commands below.

3.2.1 FastQC

FastQC performs simple quality control (QC) on raw high-throughput sequencing data. It runs a series of tests on FASTQ and SAM/BAM files and flags issues that may need correcting before subsequent analyses. Note, however, that a warning or failure does not always mean the files are problematic: the biology of the data may itself produce the bias, as is the case with RNA-Seq.

FastQC can be run from the terminal or as an interactive graphical application. It is written in Java and runs on Windows, macOS and Linux. We will use FastQC to evaluate the quality of all FASTQ files. It is available at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

There are a few ways to install FastQC. We use APT, but feel free to install it however you like.

APT:

sudo apt-get update
sudo apt-get -y install fastqc

Conda environment:

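conda install -c bioconda fastqc
# The labelled channels below hold flagged or legacy builds; use them only if you need one specifically:
# conda install -c "bioconda/label/broken" fastqc
# conda install -c "bioconda/label/cf201901" fastqc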
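
Conda installs are often kept in a dedicated environment so that tool versions do not interfere with other projects. The sketch below is an assumption about how one might set that up, not part of PreProcSEQ: the environment name preprocseq and the helper conda_env_commands are hypothetical, and the helper only prints the commands so they can be reviewed before being run.

```shell
# Hypothetical environment name; any valid name works.
ENV_NAME="preprocseq"

# Print the conda commands for review instead of running them.
# First argument: environment name; remaining arguments: package names.
conda_env_commands() {
    local env_name="$1"
    shift
    echo "conda create -y -n $env_name -c conda-forge -c bioconda $*"
    echo "conda activate $env_name"
}

conda_env_commands "$ENV_NAME" fastqc
```

The same pattern applies to the other Bioconda packages in this section; listing several package names after the environment name puts them all in one environment.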

GitHub:

cd ~
wget https://github.com/s-andrews/FastQC/archive/refs/heads/master.zip
unzip master.zip
cd FastQC-master
chmod 755 fastqc

3.2.2 MultiQC

MultiQC compiles the logs produced by supported tools (114 as of Sep/2022) into a single HTML report and plots all samples together. If the tool you are working with is not in the list of supported tools, you can open an issue on GitHub to request it. It is available at https://multiqc.info/.

We will use MultiQC to compile the log files generated by FastQC into a single HTML file with the QC results of all analyzed files.

pip:

sudo apt install python3-pip
pip install multiqc

Conda environment:

conda install -c bioconda -c conda-forge multiqc

GitHub:

git clone https://github.com/ewels/MultiQC.git
cd MultiQC
pip install .

3.2.3 Trimmomatic

Trimmomatic is an efficient and flexible pre-processing tool that handles both single-end and paired-end data. It includes several processing steps for trimming and filtering reads, and one of its distinguishing features is that it is optimized for Illumina NGS data.

APT:

sudo apt-get update -y
sudo apt-get install -y trimmomatic

Conda environment:

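conda install -c bioconda trimmomatic
# The labelled channels below hold flagged or legacy builds; use them only if you need one specifically:
# conda install -c "bioconda/label/broken" trimmomatic
# conda install -c "bioconda/label/cf201901" trimmomatic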

Repository:

cd ~
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
unzip Trimmomatic-0.39.zip
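
Unlike the APT and Conda installs, the zip archive only provides a Java jar file, so Trimmomatic has to be launched through java. The wrapper below is a sketch under two assumptions: that the archive unpacks to ~/Trimmomatic-0.39 with trimmomatic-0.39.jar inside it, and that a Java runtime is already installed. The name trimmomatic_jar is hypothetical.

```shell
# Assumed location of the jar after unzipping in the home directory.
TRIMMOMATIC_JAR="$HOME/Trimmomatic-0.39/trimmomatic-0.39.jar"

# Hypothetical wrapper: check for Java and the jar, then pass all
# arguments through to Trimmomatic unchanged.
trimmomatic_jar() {
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found; install a Java runtime first" >&2
        return 1
    fi
    if [ ! -f "$TRIMMOMATIC_JAR" ]; then
        echo "jar not found: $TRIMMOMATIC_JAR" >&2
        return 1
    fi
    java -jar "$TRIMMOMATIC_JAR" "$@"
}

if [ -f "$TRIMMOMATIC_JAR" ]; then
    echo "jar found: $TRIMMOMATIC_JAR"
else
    echo "jar not found yet: $TRIMMOMATIC_JAR"
fi
```

Because the arguments are forwarded as-is, the wrapper is called with the same options that would otherwise follow java -jar.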

3.2.4 Salmon

Salmon is an open-source program for quantifying transcripts from RNA-Seq data. To run it, you need a FASTA file containing the reference transcripts and a set of FASTQ files containing the reads. It can also use precomputed alignments (SAM/BAM files) instead of raw reads.

APT:

sudo apt-get update
sudo apt-get -y install salmon

Conda environment:

conda install -c bioconda salmon
# Legacy channel label; only needed if you require an older build:
# conda install -c "bioconda/label/cf201901" salmon

3.2.5 Kallisto

Kallisto is a program that quantifies transcript abundances from RNA-Seq data. It is based on pseudo-alignment, which quickly determines the compatibility of reads with the reference transcriptome without performing a full alignment.

APT:

sudo apt-get update
sudo apt-get install -y kallisto

Conda environment:

conda install -c bioconda kallisto
# Legacy channel label; only needed if you require an older build:
# conda install -c "bioconda/label/cf201901" kallisto

GitHub:

git clone https://github.com/pachterlab/kallisto.git
cd kallisto
mkdir build
cd build
cmake ..
make
make install
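
By default, make install copies files to a system location, which usually requires administrator rights. One alternative, offered here as an assumption rather than the kallisto project's documented method, is to pass CMake's standard -DCMAKE_INSTALL_PREFIX option so that the install goes under a user-writable prefix, and then to make sure that prefix's bin directory is on PATH. The ~/.local prefix and the path_contains helper are illustrative choices.

```shell
# Illustrative user-writable prefix (an assumption, not from the kallisto docs).
PREFIX="$HOME/.local"

# With this prefix, the configure step above would become:
#   cmake .. -DCMAKE_INSTALL_PREFIX="$PREFIX"

# Return 0 if the given directory is one of the entries in PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_contains "$PREFIX/bin"; then
    echo "$PREFIX/bin is already on PATH"
else
    echo "add this to ~/.bashrc: export PATH=\"$PREFIX/bin:\$PATH\""
fi
```

Wrapping PATH in colons before matching avoids false positives from directories that merely share a prefix, such as /usr/local/bin versus /usr/local/bin2.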

3.2.6 R language

R is a versatile language with a wide range of packages for statistical purposes. It was developed from the need for a program that would help with the manipulation, analysis and visualization of data. Below are the commands for installing it on Ubuntu Linux. For more information about installation, visit: https://cran.r-project.org/

sudo apt update -qq
sudo apt install --no-install-recommends software-properties-common dirmngr
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
sudo apt install --no-install-recommends r-base

3.2.7 R packages

We will comment on the functions we use from each R package as we work with them. The commands below, run inside R, install the packages used in this pipeline.

CRAN repository:

install.packages(c("BiocManager","readxl","stringr","magrittr"))

Bioconductor repository:

BiocManager::install(c("tximport","tximeta","GenomicFeatures","ensembldb","SummarizedExperiment","AnnotationHub","edgeR","sva"))
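
Package installation in R can fail partway through, for example when a system library is missing, so it is worth checking afterwards which packages are actually present. The following sketch does this from the terminal with Rscript. The helper r_vector and the variable names are hypothetical; the package list is copied from the two install commands above.

```shell
# Package names taken from the CRAN and Bioconductor install commands above.
PKGS="BiocManager readxl stringr magrittr tximport tximeta GenomicFeatures ensembldb SummarizedExperiment AnnotationHub edgeR sva"

# Hypothetical helper: turn a list of names into an R character
# vector literal, e.g. c("a","b").
r_vector() {
    local sep="" pkg
    printf 'c('
    for pkg in "$@"; do
        printf '%s"%s"' "$sep" "$pkg"
        sep=","
    done
    printf ')'
}

# R expression that lists the packages that are not installed yet.
R_EXPR="absent <- setdiff($(r_vector $PKGS), rownames(installed.packages())); if (length(absent)) cat('missing:', absent, '\n') else cat('all packages installed\n')"

if command -v Rscript >/dev/null 2>&1; then
    Rscript -e "$R_EXPR"
else
    echo "Rscript not found; install R first."
fi
```

Any package reported as missing can be installed again with the matching install.packages or BiocManager::install call.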