Chapter 8 Annotation of transcripts

The annotation of transcripts corresponds to the indexing of genomic coordinates and the classification of transcripts, for example: miRNA, coding RNAs, lncRNA and so on. As technologies are evolving, so are transcriptomics and genomics annotations, as is the case with GENCODE and RefSeq.

GENCODE is the gene annotation used as a standard for the Ensembl project. In it are noted the different biotypes of transcripts. RefSeq is a sequence bank of the National Center for Biotechnology Information (NCBI). Both are being updated continuously and it will take some time before it is completed. But as we have seen, as innovations in throughput sequencing technologies progress, there is also an acceleration in transcript annotation.

In this pipeline, we focus on GENCODE. As technologies and metrics advance, GENCODE is updated, resulting in different versions. We will be working with version 40 available at https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/. The procedure for the final annotation of the transcripts for further analysis will be carried out through the R/Bioconductor package AnnotationHub.

8.1 Annotating transcripts of arrays generated by tximport

Let’s start by loading the necessary packages:

library(AnnotationHub)
library(GenomicFeatures)
library(ensembldb)
library(SummarizedExperiment)

Next, let’s set the working directory and mount the reference object. This object will contain the data referring to the gene id (gene_id), symbol (symbol), biotype (gene_biotype), entrez id (entrezid):

setwd("~/PreProcSEQ-main/6-annotationTranscripts")

ah <- AnnotationHub()

edb <- query(ah, pattern = c("Homo sapiens", "EnsDb",106))[[1]]
gns <- genes(edb)
EnsDbAnnotation <- as.data.frame(gns)
EnsDbAnnotation <- EnsDbAnnotation[,c("gene_id","symbol","gene_biotype","entrezid")]
dim(EnsDbAnnotation)
colnames(EnsDbAnnotation) <- c("ensemblid","symbol","gene_biotype","entrezid")

Let’s import the matrix we want to annotate:

load("../5-expressionMatrix/tximport/matrix_salmon_tximport.RData")

matrix_expr <- mat_gse$abundance

So, let’s strip the version number from the transcript to align it correctly with the reference object:

nrow(matrix_expr)
rownames(matrix_expr)

rownames(matrix_expr) <- stringr::str_replace(rownames(matrix_expr), "\\...$", "")
rownames(matrix_expr) <- stringr::str_replace(rownames(matrix_expr), "\\..$", "")

all(rownames(matrix_expr)%in%rownames(EnsDbAnnotation))

Let’s select from the reference, the annotation for the transcripts present in the expression matrix:

rowAnnotation <- EnsDbAnnotation[rownames(matrix_expr),]
rowAnnotation <- data.frame(rowAnnotation, stringsAsFactors = F)
rownames(rowAnnotation) <- rowAnnotation$gene_id

The rowAnnotation object contains the information for the transcripts present in the array. Having in hand the expression matrices, the annotation of the transcripts and the metadata of the samples, we can build the SummmarizedExperiment object:

dirquant <- "~/PreProcSEQ-main/4-quantification/salmon/quant_salmon/"
files <- list.files(dirquant)
filesnames <- gsub("_quant","",files[-1])

colnames(mat_gse$abundance) <- filesnames
colnames(mat_gse$counts) <- filesnames
colnames(mat_gse$length) <- filesnames

coldata <- readxl::read_xlsx("../metadata.xlsx")
coldata <- as.data.frame(coldata)
rownames(coldata) <- coldata$Sample_id

colnames(mat_gse$abundance) <- rownames(coldata)
colnames(mat_gse$counts) <- rownames(coldata)
colnames(mat_gse$length) <- rownames(coldata)

mat_annot <- SummarizedExperiment(assays = list(counts=mat_gse$counts, 
                                                abundance=mat_gse$abundance,
                                                length=mat_gse$length),
                                  rowRanges = rowRanges,
                                  colData = coldata)

rowData(mat_annot) <- rowAnnotation

save(mat_annot, file = "matrix_gse_salmon_tximport_noted.RData")

8.2 Annotating matrix transcripts generated by tximeta

Let’s start by loading the necessary packages:

library(AnnotationHub)
library(GenomicFeatures)
library(ensembldb)
library(SummarizedExperiment)

As in the previous section, let’s build the object that will contain the annotations:

setwd("~/PreProcSEQ-main/6-annotationTranscripts")

ah <- AnnotationHub()

edb <- query(ah, pattern = c("Homo sapiens", "EnsDb",106))[[1]]
gns <- genes(edb)
EnsDbAnnotation <- as.data.frame(gns)
EnsDbAnnotation <- EnsDbAnnotation[,c("gene_id","symbol","gene_biotype","entrezid")]
dim(EnsDbAnnotation)
colnames(EnsDbAnnotation) <- c("ensemblid","symbol","gene_biotype","entrezid")

Let’s load the object to be annotated:

load("../5-expressionMatrix/tximeta/matrix_gse_salmon_tximeta.RData")

nrow(gse)
gseAnnotation <- rowData(gse)

Let’s remove the id version of the transcript so we can align it with the reference object:

rownames(gseAnnotation) <- stringr::str_replace(rownames(gseAnnotation), "\\...$", "")
rownames(gseAnnotation) <- stringr::str_replace(rownames(gseAnnotation), "\\..$", "")

all(rownames(gseAnnotation)%in%rownames(EnsDbAnnotation))

Finally, let’s replace the annotation generated by tximeta for the annotation we made:

rowAnnotation <- EnsDbAnnotation[rownames(gseAnnotation),]
rowAnnotation <- data.frame(gseAnnotation, rowAnnotation, stringsAsFactors = F)
rownames(rowAnnotation) <- rowAnnotation$gene_id
rowData(gse) <- rowAnnotation

save(gse, file = "matrix_gse_salmon_tximeta_noted.RData")