Chapter 7 Construction of the gene expression matrix
After the quantification step, we need to integrate the abundance estimates and transcript counts, compiling these data into expression matrices. There are a few ways to do this, we chose to use the R/Bioconductor packages tximport and tximeta. Both tximport and tximeta import and summarize transcript abundance estimates into expression matrices, which can be used in further studies such as differential expression, regulon inference and other functional analyses.
7.1 tximport
Some advantages of using tximport
are:
correction of possible changes in gene length between samples.
developed as the next step for non-alignment based quantification methods such as Sailfish, Kallisto and Salmon.
avoids discarding sequences that can align with multiple genes, thus being more sensitive to data.
Let’s import the quantifications into R, using tximport
, and then build the matrices.
First let’s load the necessary packages:
library(tximport)
library(GenomicFeatures)
Second, let’s set the quantification directory and check if they exist - in this example we put the Salmon directory but it could be from Kallisto:
setwd("~/PreProcSEQ-main/5-expressionMatrix/tximport")
<- "~/PreProcSEQ-main/4-quantification/salmon/quant_salmon/"
dirquant <- list.files(dirquant)
files <- paste0(dirquant, files[-1], "/quant.sf")
files_import all(file.exists(files_import))
Let’s build the objects needed to import the transcripts:
<- makeTxDbFromGFF("~/PreProcSEQ-main/gencode.v40.annotation.gff3.gz")
txdb <- keys(txdb, keytype = "TXNAME")
k <- ensembldb::select(txdb, k, "GENEID", "TXNAME") tx2gene
Let’s import the quantifications:
<- tximport(files_import,
mat_gse type = "salmon",
tx2gene = tx2gene,
ignoreAfterBar = TRUE)
Finally, let’s save the arrays:
save(mat_gse, file = "matrix_salmon_tximport.RData") # objeto R
write.csv(mat_gse$abundance, file = "matrix_salmon_tximport_abundance.csv") # TPM
write.csv(mat_gse$counts, file = "matrix_salmon_tximport_counts.csv") # counts
If the quantifications were generated by Kallisto, we would need to change the value of the type
argument from salmon
to "kallisto"
.
7.2 tximeta
tximeta
extends tximport
, providing similar functions, but with the automatic addition of transcriptome annotation metadata used by GENCODE, Ensembl and RefSeq for humans or mice. Another difference between the two packages is their output. tximport
generates a list of arrays, while tximeta
generates a SummarizedExperiment
object with added metadata for the genes and samples.
Let’s import the quantifications generated by Salmon with tximeta
. First, let’s load the necessary packages into the R environment:
library(tximeta)
library(SummarizedExperiment)
library(readxl)
library(GenomicFeatures)
Next, let’s set the work environment and the directory that contain the quantifications, and import the samples’ metadata into the R environment:
setwd("~/PreProcSEQ-main/5-expressionMatrix/tximeta")
<- "~/PreProcSEQ-main/4-quantification/salmon/quant_salmon/"
dirquant
<- read_xlsx("../../metadata.xlsx")
coldata $names <- coldata$Sample_id
coldata$files <- paste0(dirquant,coldata$names,"_quant/","quant.sf")
coldataall(file.exists(coldata$files))
<- as.data.frame(coldata)
coldata rownames(coldata) <- coldata$names
Let’s set the reference files that will be used to build the metadata:
<- file.path("../../4-quantification/salmon/gencode_v40_index")
indexDir <- file.path("../../gencode.v40.annotation.gff3.gz")
gffPath <- "ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gz" fastaFTP
Let’s import the quantifications and join the metadata of the samples and transcripts:
makeLinkedTxome(indexDir = indexDir,
source = "GENCODE",
organism = "Homo sapiens",
release = "40",
genome = "GRCh38",
fasta = fastaFTP,
gtf = gffPath,
write = FALSE)
Let’s summarize the data for genes:
<- tximeta(coldata, useHub = F)
se <- summarizeToGene(se) gse
Finally, let’s save the gse
object which contains the expression arrays:
save(gse, file="matrix_gse_salmon_tximeta.RData")