Chapter 2 Introduction

RNA-Seq has become one of the most widely used sequencing technologies, and the downstream analysis of its raw data has become a major focus of bioinformatics. This pipeline presents the main steps for constructing a gene expression matrix from raw RNA-Seq data.

The pipeline covers the following steps:

  • quality control

  • trimming

  • quantification of transcripts

  • annotation of transcripts

  • normalization of counts

  • batch effect removal

2.1 Workflow

2.2 Prerequisites

2.2.1 Software

This pipeline was developed in Bash and R, and some of its programs depend on Linux system libraries that must be installed first. These libraries, together with the programs and R packages, are included in the installTools.sh script. The programs used in this pipeline are FastQC, MultiQC, Trimmomatic, Salmon, Kallisto and R. The R packages are tximport, tximeta, GenomicFeatures, ensembldb, SummarizedExperiment, readxl, AnnotationHub, stringr, edgeR, sva and magrittr.
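Before running the pipeline, it can help to confirm that the programs listed above are reachable from the shell. The following is a minimal sketch, not part of installTools.sh: the function name check_tools is hypothetical, and the executable names (for example trimmomatic, which some installations provide only as a .jar file) are assumptions that may differ on your system.

```shell
#!/usr/bin/env bash
# Report which of the given command-line tools are missing from PATH.
# Prints one missing tool name per line; returns 1 if any tool is missing.
# check_tools is a hypothetical helper, not part of the pipeline itself.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names below are assumptions; adjust them to your installation.
check_tools fastqc multiqc trimmomatic salmon kallisto Rscript \
    || echo "Install the tools listed above before running the pipeline." >&2
```

If every tool is found, the script prints nothing. Otherwise it lists the missing names, which can then be installed before the pipeline is run.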

2.2.2 Operating system and hardware

The pipeline was built on a machine with 24 GB of RAM, an Intel® Core™ i5-3570 CPU @ 3.40GHz × 4, an NVIDIA Corporation GK106 (GeForce GTX 660) video card and 2.2 TB of storage. The operating system was 64-bit Ubuntu 22.04.1 LTS.

One distinguishing feature of this pipeline is that it runs on a regular laptop. You do not need a server or a high-end machine with many processor cores and a large amount of RAM.

2.2.3 Input file

PreProcSEQ takes FASTQ files as input. These files are produced by converting the signal detected for each base into a base call, for every sample sequenced by Next-Generation Sequencing (NGS).

FASTQ is a file format that combines sequencing reads with their quality scores. The format originated at the Wellcome Trust Sanger Institute and was formally described in 2010; it has become the standard storage format for sequencing data. Each read consists of four kinds of lines. The first starts with ‘@’ and holds the title/identifier. The second is the nucleotide sequence, which may be wrapped across several lines. The third starts with ‘+’, optionally followed by the same title/identifier. The fourth holds the quality scores, one ASCII-encoded character per base.
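The four-line record layout can be illustrated with a short Bash sketch. It rests on two assumptions that are not stated in this chapter: each record occupies exactly four lines (no wrapped sequence lines), and quality characters use the Phred+33 offset (Sanger / Illumina 1.8+). The function names and the two example reads are made up for illustration.

```shell
#!/usr/bin/env bash
# Count the reads in an uncompressed FASTQ stream on stdin,
# assuming exactly four lines per record (no wrapped sequence lines).
count_reads() {
    awk 'END { print int(NR / 4) }'
}

# Convert a single quality character to its Phred score,
# assuming the Phred+33 (Sanger / Illumina 1.8+) offset.
phred_score() {
    local code
    code=$(printf '%d' "'$1")   # ASCII code of the character
    echo $(( code - 33 ))
}

# A made-up two-read example, written inline for illustration only.
example_fastq() {
    printf '%s\n' \
        '@read1' 'GATTACA' '+' 'IIIIIII' \
        '@read2' 'ACGT' '+read2' '!!5I'
}

example_fastq | count_reads   # prints 2
phred_score 'I'               # prints 40 (ASCII 73 - 33)
phred_score '!'               # prints 0  (ASCII 33 - 33)
```

For a gzip-compressed file, the same count could be obtained with zcat sample.fastq.gz | count_reads, where sample.fastq.gz is a placeholder file name.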