Chapter 7 Construction of the gene expression matrix

After the quantification step, we need to integrate the abundance estimates and transcript counts, compiling these data into expression matrices. There are a few ways to do this, we chose to use the R/Bioconductor packages tximport and tximeta. Both tximport and tximeta import and summarize transcript abundance estimates into expression matrices, which can be used in further studies such as differential expression, regulon inference and other functional analyses.

7.1 tximport

Some advantages of using tximport are:

  • correction of possible changes in gene length between samples.

  • developed as the next step for non-alignment based quantification methods such as Sailfish, Kallisto and Salmon.

  • avoids discarding sequences that can align with multiple genes, thus being more sensitive to data.

Let’s import the quantifications into R, using tximport, and then build the matrices.

First let’s load the necessary packages:

library(tximport)
library(GenomicFeatures)

Second, let’s set the quantification directory and check if they exist - in this example we put the Salmon directory but it could be from Kallisto:

setwd("~/PreProcSEQ-main/5-expressionMatrix/tximport")
dirquant <- "~/PreProcSEQ-main/4-quantification/salmon/quant_salmon/"
files <- list.files(dirquant)
files_import <- paste0(dirquant, files[-1], "/quant.sf")
all(file.exists(files_import))

Let’s build the objects needed to import the transcripts:

txdb <- makeTxDbFromGFF("~/PreProcSEQ-main/gencode.v40.annotation.gff3.gz")
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- ensembldb::select(txdb, k, "GENEID", "TXNAME")

Let’s import the quantifications:

mat_gse <- tximport(files_import,
                    type = "salmon",
                    tx2gene = tx2gene,
                    ignoreAfterBar = TRUE)

Finally, let’s save the arrays:

save(mat_gse, file = "matrix_salmon_tximport.RData") # objeto R
write.csv(mat_gse$abundance, file = "matrix_salmon_tximport_abundance.csv") # TPM
write.csv(mat_gse$counts, file = "matrix_salmon_tximport_counts.csv") # counts

If the quantifications were generated by Kallisto, we would need to change the value of the type argument from salmon to "kallisto".

7.2 tximeta

tximeta extends tximport, providing similar functions, but with the automatic addition of transcriptome annotation metadata used by GENCODE, Ensembl and RefSeq for humans or mice. Another difference between the two packages is their output. tximport generates a list of arrays, while tximeta generates a SummarizedExperiment object with added metadata for the genes and samples.

Let’s import the quantifications generated by Salmon with tximeta. First, let’s load the necessary packages into the R environment:

library(tximeta)
library(SummarizedExperiment)
library(readxl)
library(GenomicFeatures)

Next, let’s set the work environment and the directory that contain the quantifications, and import the samples’ metadata into the R environment:

setwd("~/PreProcSEQ-main/5-expressionMatrix/tximeta")
dirquant <- "~/PreProcSEQ-main/4-quantification/salmon/quant_salmon/"

coldata <- read_xlsx("../../metadata.xlsx")
coldata$names <- coldata$Sample_id
coldata$files <- paste0(dirquant,coldata$names,"_quant/","quant.sf")
all(file.exists(coldata$files))
coldata <- as.data.frame(coldata)
rownames(coldata) <- coldata$names

Let’s set the reference files that will be used to build the metadata:

indexDir <- file.path("../../4-quantification/salmon/gencode_v40_index")
gffPath <- file.path("../../gencode.v40.annotation.gff3.gz")
fastaFTP <- "ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gz"

Let’s import the quantifications and join the metadata of the samples and transcripts:

makeLinkedTxome(indexDir = indexDir,
                source = "GENCODE",
                organism = "Homo sapiens",
                release = "40",
                genome = "GRCh38",
                fasta = fastaFTP,
                gtf = gffPath,
                write = FALSE)

Let’s summarize the data for genes:

se <- tximeta(coldata, useHub = F)
gse <- summarizeToGene(se)

Finally, let’s save the gse object which contains the expression arrays:

save(gse, file="matrix_gse_salmon_tximeta.RData")