Chapter 8 Annotation of transcripts
The annotation of transcripts corresponds to the indexing of genomic coordinates and the classification of transcripts, for example: miRNA, coding RNAs, lncRNA and so on. As technologies are evolving, so are transcriptomics and genomics annotations, as is the case with GENCODE and RefSeq.
GENCODE is the gene annotation used as a standard for the Ensembl project. In it are noted the different biotypes of transcripts. RefSeq is a sequence bank of the National Center for Biotechnology Information (NCBI). Both are being updated continuously and it will take some time before it is completed. But as we have seen, as innovations in throughput sequencing technologies progress, there is also an acceleration in transcript annotation.
In this pipeline, we focus on GENCODE. As technologies and metrics advance, GENCODE is updated, resulting in different versions. We will be working with version 40 available at https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/. The procedure for the final annotation of the transcripts for further analysis will be carried out through the R/Bioconductor package AnnotationHub.
8.1 Annotating transcripts of arrays generated by tximport
Let’s start by loading the necessary packages:
library(AnnotationHub)
library(GenomicFeatures)
library(ensembldb)
library(SummarizedExperiment)
Next, let’s set the working directory and mount the reference object. This object will contain the data referring to the gene id (gene_id
), symbol (symbol
), biotype (gene_biotype
), entrez id (entrezid
):
setwd("~/PreProcSEQ-main/6-annotationTranscripts")
<- AnnotationHub()
ah
<- query(ah, pattern = c("Homo sapiens", "EnsDb",106))[[1]]
edb <- genes(edb)
gns <- as.data.frame(gns)
EnsDbAnnotation <- EnsDbAnnotation[,c("gene_id","symbol","gene_biotype","entrezid")]
EnsDbAnnotation dim(EnsDbAnnotation)
colnames(EnsDbAnnotation) <- c("ensemblid","symbol","gene_biotype","entrezid")
Let’s import the matrix we want to annotate:
load("../5-expressionMatrix/tximport/matrix_salmon_tximport.RData")
<- mat_gse$abundance matrix_expr
So, let’s strip the version number from the transcript to align it correctly with the reference object:
nrow(matrix_expr)
rownames(matrix_expr)
rownames(matrix_expr) <- stringr::str_replace(rownames(matrix_expr), "\\...$", "")
rownames(matrix_expr) <- stringr::str_replace(rownames(matrix_expr), "\\..$", "")
all(rownames(matrix_expr)%in%rownames(EnsDbAnnotation))
Let’s select from the reference, the annotation for the transcripts present in the expression matrix:
<- EnsDbAnnotation[rownames(matrix_expr),]
rowAnnotation <- data.frame(rowAnnotation, stringsAsFactors = F)
rowAnnotation rownames(rowAnnotation) <- rowAnnotation$gene_id
The rowAnnotation
object contains the information for the transcripts present in the array. Having in hand the expression matrices, the annotation of the transcripts and the metadata of the samples, we can build the SummmarizedExperiment
object:
<- "~/PreProcSEQ-main/4-quantification/salmon/quant_salmon/"
dirquant <- list.files(dirquant)
files <- gsub("_quant","",files[-1])
filesnames
colnames(mat_gse$abundance) <- filesnames
colnames(mat_gse$counts) <- filesnames
colnames(mat_gse$length) <- filesnames
<- readxl::read_xlsx("../metadata.xlsx")
coldata <- as.data.frame(coldata)
coldata rownames(coldata) <- coldata$Sample_id
colnames(mat_gse$abundance) <- rownames(coldata)
colnames(mat_gse$counts) <- rownames(coldata)
colnames(mat_gse$length) <- rownames(coldata)
<- SummarizedExperiment(assays = list(counts=mat_gse$counts,
mat_annot abundance=mat_gse$abundance,
length=mat_gse$length),
rowRanges = rowRanges,
colData = coldata)
rowData(mat_annot) <- rowAnnotation
save(mat_annot, file = "matrix_gse_salmon_tximport_noted.RData")
8.2 Annotating matrix transcripts generated by tximeta
Let’s start by loading the necessary packages:
library(AnnotationHub)
library(GenomicFeatures)
library(ensembldb)
library(SummarizedExperiment)
As in the previous section, let’s build the object that will contain the annotations:
setwd("~/PreProcSEQ-main/6-annotationTranscripts")
<- AnnotationHub()
ah
<- query(ah, pattern = c("Homo sapiens", "EnsDb",106))[[1]]
edb <- genes(edb)
gns <- as.data.frame(gns)
EnsDbAnnotation <- EnsDbAnnotation[,c("gene_id","symbol","gene_biotype","entrezid")]
EnsDbAnnotation dim(EnsDbAnnotation)
colnames(EnsDbAnnotation) <- c("ensemblid","symbol","gene_biotype","entrezid")
Let’s load the object to be annotated:
load("../5-expressionMatrix/tximeta/matrix_gse_salmon_tximeta.RData")
nrow(gse)
<- rowData(gse) gseAnnotation
Let’s remove the id version of the transcript so we can align it with the reference object:
rownames(gseAnnotation) <- stringr::str_replace(rownames(gseAnnotation), "\\...$", "")
rownames(gseAnnotation) <- stringr::str_replace(rownames(gseAnnotation), "\\..$", "")
all(rownames(gseAnnotation)%in%rownames(EnsDbAnnotation))
Finally, let’s replace the annotation generated by tximeta
for the annotation we made:
<- EnsDbAnnotation[rownames(gseAnnotation),]
rowAnnotation <- data.frame(gseAnnotation, rowAnnotation, stringsAsFactors = F)
rowAnnotation rownames(rowAnnotation) <- rowAnnotation$gene_id
rowData(gse) <- rowAnnotation
save(gse, file = "matrix_gse_salmon_tximeta_noted.RData")