Chapter 8 Functional analysis of RNA-seq data
8.0.1 Ontology and GO Terms
Ontology: An ontology is a classification of things that are detectable or directly observable and the relationships between those things.
The Gene Ontology (GO) project provides an ontology of defined terms representing gene product properties. These terms are connected to each other via formally defined relationships, allowing for consistent annotation and analysis of gene functions across different species and research areas.
GO term hierarchy Some gene products are well-researched, with vast quantities of data available regarding their biological processes and functions. However, other gene products have very little data available about their roles in the cell.
For example, the protein, “p53”, would contain a wealth of information on it’s roles in the cell, whereas another protein might only be known as a “membrane-bound protein” with no other information available.
The GO ontologies were developed to describe and query biological knowledge with differing levels of information available. To do this, GO ontologies are loosely hierarchical, ranging from general, ‘parent’, terms to more specific, ‘child’ terms. The GO ontologies are “loosely” hierarchical since ‘child’ terms can have multiple ‘parent’ terms.
Some genes with less information may only be associated with general ‘parent’ terms or no terms at all, while other genes with a lot of information be associated with many terms.
8.0.2 GO Terms (Gene Ontology Terms)
The GO terms describe three aspects of gene products:
- Biological Process: Refers to the biological objectives achieved by gene products, such as “cell division” or “response to stress.” Inferred from Direct Assay (IDA)
Molecular Function: Describes the biochemical activities at the molecular level, such as “kinase activity” or “DNA binding.”
Cellular Component: Indicates where gene products are located within the cell, like “nucleus” or “mitochondrion.”
8.0.3 GO Enrichment Methods
GO enrichment methods help interpret differentially expressed gene lists from RNA-seq data and include the following:
- Overrepresentation Analysis (ORA):
- ORA assesses whether specific GO terms are over-represented in a gene list compared to the background set of genes (e.g., all genes in the genome). Over-represented GO terms indicate biological functions or processes that are significantly associated with the genes of interest.
`
- Hypergeometric Test:
The hypergeometric test in GO enrichment analysis is used to determine whether a set of genes is enriched for specific GO terms compared to the background set of genes. The test calculates the probability of observing the overlap between the query gene set and the gene set associated with a particular GO term, given the total number of genes in the dataset and the number of genes in the background set.
Test Analogy The jar contains:
The hypergeometric test asks: “If I randomly drew this many balls from the jar, what’s the probability I’d get at least this many red balls just by chance?”
So, it’s like drawing colored balls from a jar:
🔴 Red balls = genes that belong to a specific GO term
⚫ Black balls = genes that do NOT belong to that GO term
Total balls in jar = all genes in your background (universe)
You draw from the jar:
You draw a handful of balls = your significant genes (e.g., padj < 0.05)
The test checks how surprising it is that so many of your significant genes are red.
If the probability of getting that many red balls (GO term genes) by chance is very low, then the GO term is overrepresented — meaning your experiment is likely enriched for that biological process.
Note for the curious: The hypergeometric test is based on counting combinations — the number of ways to draw q or more “successes” (genes with a given GO term) in k draws from a total of N genes, of which m belong to that GO term.
The formula for the cumulative probability of observing at least q successes (the upper-tail probability) in a sample of size \(k\) from a population with \(m\) successes and \(N - m\) failures is:
\[ P(X \geq q) = \sum_{i=q}^{k} \frac{\binom{m}{i} \binom{N - m}{k - i}}{\binom{N}{k}} \]
Where:
N
: The total number of genes in the background set (e.g., all genes in the genome).m
: The total number of genes in the background annotated with a specific GO term (e.g., all genes associated with “cell cycle regulation”).k
: The total number of genes in your gene list of interest (e.g., your differentially expressed genes).q
: The number of genes in your list that are annotated with that specific GO term (the observed overlap).
8.1 Perform the hypergeometric test: P(X >= q)
In R, the phyper() function computes the probability of observing this much overlap (or more) between our gene list and a GO term by chance.
Suppose your background population has \(N = 20,000\) genes (e.g., the entire genome). Out of these, \(m = 500\) genes are associated with a specific GO term (e.g., “cell cycle regulation”). You have a list of \(k = 1000\) genes of interest. In this list, \(q = 50\) genes are associated with the “cell cycle regulation” GO term.
A small p-value indicates that observing 50 or more genes with this GO term is unlikely by chance alone, suggesting genuine enrichment.
Question: Is 50 more than we’d expect by chance?
# Define parameters
N <- 20000 # Total genes in the background
m <- 500 # Genes with the GO term in background
k <- 1000 # Your gene list size (e.g., DE genes)
q <- 50 # Observed overlap with GO term
# Perform the hypergeometric test: P(X >= q)
# Note: We use q-1 because phyper calculates P(X > q-1) = P(X >= q)
p_value <- phyper(q - 1, m, N - m, k, lower.tail = FALSE)
p_value
## [1] 2.551199e-06
A small p-value indicates that observing 50 or more genes with this GO term is unlikely by chance alone, suggesting genuine enrichment.
- Gene Set Enrichment Analysis (GSEA): Instead of only considering genes above a certain threshold (e.g., differentially expressed genes), GSEA looks at the entire ranked list of genes (e.g., ranked by their expression levels) and checks whether genes associated with a certain GO term are statistically over-represented at the top or bottom of a ranked gene list, indicating coordinated changes in gene expression.
Here’s how GSEA works:
Gene Ranking: Genes are ranked based on a metric that reflects their differential expression, such as log2FoldChange.
Enrichment Score (ES): GSEA calculates an Enrichment Score (ES) for each gene set by walking down the ranked list, increasing the score when a gene belongs to the set and decreasing it otherwise. A positive ES suggests upregulation, while a negative ES suggests downregulation of the gene set.
The hypothesis of GSEA is that although large changes in individual genes can have significant effects on pathways (and will be detected via ORA methods), weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects. Thus, rather than setting an arbitrary threshold to identify ‘significant genes’, all genes are considered in the analysis. The gene-level statistics from the dataset are aggregated to generate a single pathway-level statistic and statistical significance of each pathway is reported. This type of analysis can be particularly helpful if the differential expression analysis only outputs a small list of significant DE genes.
- Example: The GSEA enrichment plot provides a graphical view of the enrichment score for a gene set:
- The top portion of the plot shows the running ES for the gene set as the analysis walks down the ranked list. The score at the peak of the plot (the score furthest from 0.0) is the ES for the gene set.
The middle portion of the plot shows where the members of the gene set appear in the ranked list of genes.
The leading edge subset of a gene set is the subset of members that contribute most to the ES.
The bottom portion of the plot shows the value of the ranking metric (such as log2FoldChange) as you move down the list of ranked genes.
ORA vs. GSEA
When to use each: - Use ORA when you have a clear significance threshold (e.g., padj < 0.05) and want to test specific gene lists - Use GSEA when you want to detect subtle but coordinated changes across a pathway, even if individual genes aren’t significantly DE.
ORA → Uses only significant genes (binary: in/out)
ORA → Hypergeometric test
GSEA → Uses all genes ranked by expression change
GSEA → Enrichment score
8.1.1 Over-representation Analysis (ORA) using clusterProfiler
## Load required libraries
library(org.Hs.eg.db)
library(clusterProfiler)
library(tidyverse)
library(enrichplot)
library(fgsea)
library(igraph)
library(ggraph)
library(visNetwork)
library(GO.db)
library(GOSemSim)
library(GOenrichment)
library(tidyr)
8.1.1.1 Running clusterProfiler: we first need to load the results of the differential expression analysis.
res_tableOE = readRDS("data/res_tableOE.RDS")
res_tableOE_tb <- res_tableOE %>%
data.frame() %>% rownames_to_column(var = "gene") %>%
dplyr::filter(!is.na(log2FoldChange)) %>% as_tibble()
To perform the over-representation analysis, we need a list of background genes and a list of significant genes:
## background set of genes
allOE_genes <- res_tableOE_tb$gene
sigOE = dplyr::filter(res_tableOE_tb, padj < 0.05) # Standard padj value
## significant genes
sigOE_genes = sigOE$gene
TheenrichGO()
function performs the ORA for the significant genes of interest (sigOE_genes
) compared to the background gene list (allOE_genes
) and returns the enriched GO terms and their associated p-values.
## Run GO enrichment analysis
ego <- enrichGO(gene = sigOE_genes,
universe = allOE_genes,
keyType = "SYMBOL",
OrgDb = org.Hs.eg.db,
minGSSize = 20,
maxGSSize = 300,
ont = "BP",
pAdjustMethod = "BH",
qvalueCutoff = 0.01, # slightly stringent FDR cutoff
readable = TRUE)
We can extract the gene lists associated with each enriched GO term from the ego
enrichResult object to create edges between terms based on shared genes.
## [1] "ID" "Description" "GeneRatio" "BgRatio"
## [5] "RichFactor" "FoldEnrichment" "zScore" "pvalue"
## [9] "p.adjust" "qvalue" "geneID" "Count"
ID | Description | GeneRatio | BgRatio | RichFactor | FoldEnrichment | zScore | pvalue | p.adjust | qvalue | geneID | Count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
GO:0042254 | GO:0042254 | ribosome biogenesis | 152/5254 | 278/14554 | 0.5467626 | 1.514576 | 6.511013 | 0 | 6.0e-07 | 5.0e-07 | AATF/ABT1/BMS1/BRIX1/BYSL/C1QBP/CHD7/CUL4A/CUL4B/DDX18/DDX21/DDX27/DDX28/DDX47/DDX49/DDX51/DDX54/DDX56/DHX30/DHX37/DKC1/DNTTIP2/EBNA1BP2/EIF1AX/EIF2A/EIF4A3/EIF5/EIF5B/EIF6/EMG1/ERAL1/ERCC2/EXOSC10/EXOSC3/EXOSC5/EXOSC6/EXOSC9/FAU/FBL/FBLL1/FCF1/FRG1/FTSJ3/GEMIN4/GLUL/GNL3L/GRWD1/GTPBP10/HEATR1/IMP3/IMP4/ISG20L2/KRI1/LSG1/LSM6/LTV1/MAK16/MCAT/MDN1/MPHOSPH10/MRPS2/MRPS7/MRTO4/MYBBP1A/NAT10/NGDN/NGRN/NHP2/NLE1/NMD3/NOC2L/NOC4L/NOL10/NOL11/NOL6/NOL8/NOL9/NOLC1/NOP10/NOP14/NOP16/NOP2/NSUN5/NUDT16/NUP88/PA2G4/PDCD11/PELP1/PES1/PIH1D1/POP4/PPAN/PWP1/PWP2/RAN/RBM10/REXO1/REXO4/RIOK3/RPL24/RPL26/RPL26L1/RPL7L1/RPP30/RPP40/RPS15/RPS15A/RPS19BP1/RPS23/RPS27/RPS27A/RPS27L/RPS3A/RPSA/RPUSD1/RPUSD4/RRP1/RRP15/RRP1B/RRP36/RRP7A/RRP9/RRS1/RSL24D1/SBDS/SDAD1/SIRT7/SRFBP1/SURF6/SUV39H1/TFB1M/TRMT112/TRMT2B/TSR1/TSR2/URB1/URB2/UTP14A/UTP15/UTP18/UTP20/UTP3/WDR18/WDR43/WDR46/WDR55/WDR74/XPO1/XRCC5/XRN2/YTHDF2/ZNF622 | 152 |
GO:0006403 | GO:0006403 | RNA localization | 110/5254 | 189/14554 | 0.5820106 | 1.612216 | 6.367405 | 0 | 8.0e-07 | 7.0e-07 | ALKBH5/ALYREF/ARC/ATR/ATXN2/BICD1/CAPRIN1/CASC3/CCT2/CCT3/CCT5/CCT6A/CCT7/CETN2/CETN3/CHTOP/DCP2/DDX19A/DDX39A/DHX9/DKC1/EIF4A3/ENY2/EXOSC10/FBL/FMR1/FUBP3/FXR2/FYTTD1/GLE1/HNRNPU/HSF1/IGF2BP1/IGF2BP2/KHDRBS1/KHSRP/KPNB1/MCM3AP/MVP/NCBP1/NEAT1/NHP2/NOL6/NOP10/NSUN2/NUP107/NUP133/NUP188/NUP205/NUP210/NUP214/NUP50/NUP54/NUP85/NUP88/NUP93/NUP98/NUTF2/NXF1/NXF3/NXT1/NXT2/PABPN1/PARN/PEG10/PHAX/PIH1D1/POLDIP3/POM121/POM121C/POM121L2/PRPF6/QKI/RAE1/RAN/RBM15B/RBM33/RBM8A/RFTN1/RUVBL1/RUVBL2/SARNP/SEC13/SEH1L/SENP2/SETD2/SHQ1/SIDT2/SLBP/SMG5/SMG6/SMG7/SNUPN/SRSF3/SRSF7/STAU1/SUPT6H/TGFBR2/THOC1/THOC2/THOC7/TOMM20/UPF1/XPO1/XPO5/YBX1/ZC3H11A/ZFP36/ZFP36L1/ZNF385A | 110 |
GO:0043484 | GO:0043484 | regulation of RNA splicing | 102/5254 | 174/14554 | 0.5862069 | 1.623840 | 6.222261 | 0 | 1.3e-06 | 1.1e-06 | ACIN1/AHNAK/AHNAK2/ARGLU1/ATXN7L3/C1QBP/CCNL1/CCNL2/CELF1/CELF2/CELF3/CELF4/CIRBP/CLK1/CLNS1A/DAZAP1/DDX5/ENY2/ERN1/ESRP1/EXOSC10/FAM50A/FASTK/FMR1/FUS/GRSF1/HNRNPF/HNRNPH2/HNRNPK/HNRNPL/HNRNPU/HOXB-AS3/HSPA1A/KAT2A/KHDRBS1/KHDRBS3/LARP7/MBNL1/MBNL2/MBNL3/NCBP1/NCL/NOVA1/NUP98/PIK3R1/POLR2A/PQBP1/PRDX6/PRMT5/PRPF19/PTBP1/PTBP2/PUF60/QKI/RBFOX2/RBM10/RBM11/RBM12B/RBM15/RBM15B/RBM22/RBM23/RBM24/RBM3/RBM38/RBM39/RBM4/RBM42/RBM8A/RBMX/RBPMS/RBPMS2/RNPS1/RPS26/RRP1B/SF1/SF3B3/SFSWAP/SLC38A2/SMU1/SNRNP70/SRPK1/SRPK2/SRSF3/SRSF6/SRSF7/SRSF9/SUPT3H/TADA2B/TADA3/TAF5L/TAF6L/TIA1/TMBIM6/TRRAP/U2AF2/UPF1/USP22/WDR77/ZBTB7A/ZNF326/ZNF638 | 102 |
GO:0051168 | GO:0051168 | nuclear export | 93/5254 | 156/14554 | 0.5961538 | 1.651394 | 6.148011 | 0 | 1.5e-06 | 1.3e-06 | ADAR/AHCYL1/ALKBH5/ALYREF/ANP32B/BAG3/CASC3/CDK5/CHTOP/CSE1L/DDX19A/DDX39A/DHX9/EIF4A3/EIF6/EMD/ENY2/FMR1/FYTTD1/GAS6/GLE1/KHDRBS1/LSG1/LTV1/MCM3AP/MDM2/MDN1/NCBP1/NEAT1/NEMF/NMD3/NOL6/NSUN2/NUP107/NUP133/NUP188/NUP214/NUP85/NUP88/NUP93/NUP98/NUTF2/NXF1/NXF3/NXT1/NXT2/PABPN1/PARK7/PHAX/PKD1/POLDIP3/POM121/POM121C/POM121L2/PPM1A/PRKACA/PTPN11/RAE1/RAN/RANBP3/RANBP3L/RANGAP1/RAPGEF3/RBM10/RBM15B/RBM22/RBM33/RBM8A/RPS15/SARNP/SDAD1/SETD2/SFN/SIRT6/SIRT7/SMG5/SMG6/SMG7/SRSF3/SUPT6H/TGFB1/THOC1/THOC2/THOC7/UBE2I/UPF1/XPO1/XPO4/XPO5/XPO6/XPO7/YWHAE/ZC3H11A | 93 |
GO:0000375 | GO:0000375 | RNA splicing, via transesterification reactions | 151/5254 | 284/14554 | 0.5316901 | 1.472824 | 6.048208 | 0 | 1.5e-06 | 1.3e-06 | ACIN1/ARGLU1/C1QBP/C9orf78/CASC3/CD2BP2/CELF1/CELF2/CELF3/CELF4/CIRBP/CLNS1A/COIL/CTNNBL1/CWC15/CWF19L1/DAZAP1/DBR1/DDX1/DDX23/DDX39A/DDX42/DDX5/DHX15/DHX16/DHX38/DHX40/DHX9/EFTUD2/EIF4A3/EXOSC10/FMR1/FRG1/GCFC2/GEMIN4/GEMIN5/GEMIN7/GPKOW/HNRNPA3/HNRNPF/HNRNPK/HNRNPL/HNRNPM/HNRNPR/HNRNPU/HNRNPUL1/HSPA8/HTATSF1/IK/ILF3/KHDRBS1/KHDRBS3/KHSRP/LARP7/LSM2/LSM3/LSM4/LSM5/LSM6/LSM7/LUC7L2/LUC7L3/METTL3/MPHOSPH10/NCBP1/NCL/NOVA1/NUP98/PABPC1/PLRG1/PNN/PPIL1/PQBP1/PRDX6/PRMT5/PRMT7/PRPF19/PRPF3/PRPF38A/PRPF39/PRPF4/PRPF40B/PRPF6/PRPF8/PTBP1/PTBP2/PUF60/QKI/RALY/RBFOX2/RBM10/RBM11/RBM15/RBM15B/RBM17/RBM22/RBM23/RBM24/RBM3/RBM39/RBM4/RBM42/RBM8A/RBMX/RBPMS/RBPMS2/RNF113A/RNPC3/RNPS1/SART3/SF1/SF3A1/SF3A3/SF3B1/SF3B2/SF3B3/SF3B4/SFPQ/SFSWAP/SMU1/SNRNP200/SNRNP25/SNRNP27/SNRNP40/SNRNP70/SNRPB/SNRPC/SNRPD1/SNRPD2/SNRPD3/SRPK1/SRPK2/SRSF3/SRSF6/SRSF7/SRSF9/STRAP/SYNCRIP/TFIP11/TIA1/TSSC4/TXNL4B/U2AF1/U2AF2/UBL5/UPF1/USP39/USP4/WDR77/ZBTB7A/ZMAT2 | 151 |
GO:0000377 | GO:0000377 | RNA splicing, via transesterification reactions with bulged adenosine as nucleophile | 149/5254 | 280/14554 | 0.5321429 | 1.474078 | 6.020548 | 0 | 1.5e-06 | 1.3e-06 | ACIN1/ARGLU1/C1QBP/C9orf78/CASC3/CD2BP2/CELF1/CELF2/CELF3/CELF4/CIRBP/CLNS1A/COIL/CTNNBL1/CWC15/CWF19L1/DAZAP1/DBR1/DDX1/DDX23/DDX39A/DDX42/DDX5/DHX15/DHX16/DHX38/DHX40/DHX9/EFTUD2/EIF4A3/EXOSC10/FMR1/FRG1/GCFC2/GEMIN4/GEMIN5/GEMIN7/GPKOW/HNRNPA3/HNRNPF/HNRNPK/HNRNPL/HNRNPM/HNRNPR/HNRNPU/HNRNPUL1/HSPA8/HTATSF1/IK/ILF3/KHDRBS1/KHDRBS3/LARP7/LSM2/LSM3/LSM4/LSM5/LSM6/LSM7/LUC7L2/LUC7L3/METTL3/NCBP1/NCL/NOVA1/NUP98/PABPC1/PLRG1/PNN/PPIL1/PQBP1/PRDX6/PRMT5/PRMT7/PRPF19/PRPF3/PRPF38A/PRPF39/PRPF4/PRPF40B/PRPF6/PRPF8/PTBP1/PTBP2/PUF60/QKI/RALY/RBFOX2/RBM10/RBM11/RBM15/RBM15B/RBM17/RBM22/RBM23/RBM24/RBM3/RBM39/RBM4/RBM42/RBM8A/RBMX/RBPMS/RBPMS2/RNF113A/RNPC3/RNPS1/SART3/SF1/SF3A1/SF3A3/SF3B1/SF3B2/SF3B3/SF3B4/SFPQ/SFSWAP/SMU1/SNRNP200/SNRNP25/SNRNP27/SNRNP40/SNRNP70/SNRPB/SNRPC/SNRPD1/SNRPD2/SNRPD3/SRPK1/SRPK2/SRSF3/SRSF6/SRSF7/SRSF9/STRAP/SYNCRIP/TFIP11/TIA1/TSSC4/TXNL4B/U2AF1/U2AF2/UBL5/UPF1/USP39/USP4/WDR77/ZBTB7A/ZMAT2 | 149 |
# GO term IDs
term_ids <- ego_df$ID
# Split gene lists by "/" and assign term IDs as names
gene_lists <- strsplit(ego_df$geneID, "/")
# Assign term IDs as names
names(gene_lists) <- term_ids
Using the gene lists associated with each GO term, we can create edges between terms based on the shared genes.
mgoSim()
calculates the semantic similarity of go terms based on their go ancestry terms, for example:
Term A: “DNA replication” Ancestors: DNA metabolic process → nucleic acid process → cellular process
Term B: “DNA repair”
Ancestors: DNA metabolic process → nucleic acid process → cellular process
Shared ancestry: Both have “DNA metabolic process” as close ancestor → High similarity
# Load GO data for semantic similarity calculation
# GO terms and their relationships (the ontology graph structure)
go_data <- GOSemSim::godata(annoDb = "org.Hs.eg.db",
ont = "BP",
keytype = "SYMBOL")
# Compute semantic similarity
similarity_matrix <- GOSemSim::mgoSim(term_ids, term_ids,
semData = go_data,
measure = "Wang", combine = NULL)
# Convert similarity matrix to edges
edges <- data.frame(from = rep(term_ids,
each = length(term_ids)),
to = rep(term_ids,
times = length(term_ids)),
similarity = as.vector(similarity_matrix))
# Filter out edges with zero similarity or self-loops
edges <- subset(edges, similarity > 0 & from != to)
edges <- edges %>% filter(similarity >= 0.8)
# Convert term IDs to names
edges$to <- Term(edges$to)
edges$from <- Term(edges$from)
dim(edges)
## [1] 66 3
Now we can create a network plot to visualize the relationships between the enriched GO terms based on their semantic similarity.
# Create an igraph object from the edges data frame
g <- graph_from_data_frame(edges, directed = FALSE)
# Visualize the graph using ggraph
ggraph(g, layout = 'kk') + # Kamada-Kawai layout
geom_edge_link(aes(width = similarity), alpha = 0.5) + # Edges with some transparency
geom_node_point(size = 5, color = "blue") + # Nodes sized uniformly
geom_node_text(aes(label = name), repel = TRUE, size = 3) + # Node labels
theme_void() + # Clean theme without axes
labs(title = "GO Terms Semantic Similarity Network") # Title for the plot
Edges can also be defined by Jaccard similarity between GO terms; defined as the size of the intersection each pair of GO terms divided by the size of their union.
# Define a function to compute Jaccard similarity
jaccard_index <- function(genes1, genes2) {
length(intersect(genes1, genes2)) / length(union(genes1, genes2))
}
# Create an empty data frame to store edges
edges <- data.frame(from = character(), to = character(),
similarity = numeric())
# Compute Jaccard similarity for each pair of terms
for (i in 1:(length(term_ids) - 1)) {
for (j in (i + 1):length(term_ids)) {
similarity <- jaccard_index(gene_lists[[i]], gene_lists[[j]])
if (similarity > 0) { # Only keep edges with some overlap
edges <- rbind(edges, data.frame(from = term_ids[i],
to = term_ids[j],
similarity = similarity))
}}}
edges <- edges %>%
dplyr::filter(similarity >= 0.8)
# Convert term IDs to names
edges$to <- Term(edges$to)
edges$from <- Term(edges$from)
dim(edges)
## [1] 27 3
Now we can create a network plot to visualize the relationships between the enriched GO terms based on the Jaccard similarity of their gene lists. Jaccard similarity is stricter than mgoSim.
Jaccard Requirements:
Direct gene overlap between GO terms Must share actual genes to get high similarity No similarity if terms don’t share genes, even if biologically related
mgoSim (Semantic) is More Permissive:
Considers biological relationships in GO hierarchy Can be similar even without shared genes Uses pathway knowledge and term relationships
# Create a graph from the edges
g <- graph_from_data_frame(edges, directed = FALSE)
# Visualize the graph using ggraph
ggraph(g, layout = 'kk') +
geom_edge_link(aes(alpha = similarity, width = similarity, color = edge_color)) +
scale_edge_color_identity() +
geom_node_point(size = 5,
color = "blue") +
geom_node_text(aes(label = name),
repel = TRUE,
size = 3) + # Node labels
theme_void() + # Clean theme without axes
labs(title = "GO Terms Jaccard Overlap Network")
The resulting network plots shows the relationships between the enriched GO terms based on Jaccard similarity of their gene lists.
It is clear that Jaccard similarity is more stringent than Semantic similarity.
8.1.2 Visualizing Enrichment Results
Network Plot:
The emapplot()
function generates a network plot where each node represents an enriched gene set, and edges between nodes indicate the similarity or overlap between those gene sets.
# Filter to top significant results first
ego_filtered <- ego
ego_filtered@result <- ego@result[1:50, ] # Take only top 50 results
# Then create pairwise similarity
pwt_filtered <- pairwise_termsim(ego_filtered, method = "JC")
# Now create network plot
emapplot(
pwt_filtered,
showCategory = 50,
node_label = "category",
layout = "mds"
) +
theme(legend.position = "none")
For a general overview of enrichment results and trends, use ssplot()
. For detailed exploration of pairwise similarities between enrichment terms and their functional relationship networks, use emapplot()
.
cnetplot: Visualizes connections between genes and top GO terms, highlighting shared pathways.This plot is particularly useful for hypothesis generation in identifying genes that may be important to several of the most affected processes.
cnetplot(pwt_filtered, showCategory = 3,
categorySize = "pvalue",
colorEdge = TRUE) +
theme(legend.position = "none")
8.1.3 GSEA Using ‘clusterProfiler’ `gseGO’
Prepare fold changes, sort by expression, and run GSEA with gseGO()
from clusterProfiler
## Extract the foldchanges
foldchanges <- res_tableOE_tb$log2FoldChange
## Name each fold change with the gene name
names(foldchanges) <- res_tableOE_tb$gene
## Sort fold changes in decreasing order
foldchanges <- sort(foldchanges, decreasing = TRUE)
We can use GSEA to test whether genes in specific biological processes tend to be up- or down-regulated together in our experiment:
8.1.4 Visualizing GSEA Results
Visualize GSEA results with dotplot()
for significant categories and gseaplot2()
for gene set ranks.
The emapplot clusters the 50 most significant (by padj) GO terms to visualize relationships between terms.
gseaGO_filtered <- gseaGO
gseaGO_filtered@result <- gseaGO@result[1:50, ] # Top 50 by p.adjust
# Create pairwise similarity
pwt <- pairwise_termsim(
gseaGO_filtered,
method = "JC",
semData = NULL
)
emapplot(pwt, showCategory = 50,
node_label = "category")
We are going to compare ORA enricher
to running the ORA runGORESP
.
Let’s first load the GOenrichment
package:
8.1.5 GOenrichment
Package Analysis
We’re going to run this analysis using TWO different packages. Why? In research, you often want to verify important results using independent methods. We’ll see that both give us similar answers (~90% overlap), which builds confidence. They differ slightly because of minor implementation details, which is normal and expected.
runGORESP
uses over-representation analysis to identify enriched GO terms and returns two data.frames
; of enriched GO terms (nodes) and GO term relationships (edges). We first need to get the GO annotations and specifically ‘Biological Process’ (BP).
- Generate clean GO Biological Process gene sets with gene symbols
# install.packages("msigdbr") # uncomment to install
library(msigdbr)
# Using msigdbr ensures standardized, up-to-date gene sets
go_bp <- msigdbr(species = "Homo sapiens", collection = "C5", subcollection = "GO:BP")
# Let's look at the format of go_bp
names(go_bp)
## [1] "gene_symbol" "ncbi_gene" "ensembl_gene"
## [4] "db_gene_symbol" "db_ncbi_gene" "db_ensembl_gene"
## [7] "source_gene" "gs_id" "gs_name"
## [10] "gs_collection" "gs_subcollection" "gs_collection_name"
## [13] "gs_description" "gs_source_species" "gs_pmid"
## [16] "gs_geoid" "gs_exact_source" "gs_url"
## [19] "db_version" "db_target_species"
gene_symbol | gs_id |
---|---|
AASDHPPT | M29088 |
ALDH1L1 | M29088 |
ALDH1L2 | M29088 |
MTHFD1 | M29088 |
MTHFD1L | M29088 |
MTHFD2L | M29088 |
## [1] 617648 20
## $GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS
## [1] "AASDHPPT" "ALDH1L1" "ALDH1L2" "MTHFD1" "MTHFD1L" "MTHFD2L"
##
## $GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS
## [1] "AADAT" "ADHFE1" "COL6A1" "D2HGDH" "DLD" "DLST" "GOT1" "GOT2"
## [9] "GPT2" "IDH1" "IDH2" "KGD4" "KYAT3" "L2HGDH" "OGDH" "OGDHL"
## [17] "PHYH" "TAT"
# Clean up term names (remove GOBP_ prefix and replace underscores)
names(hGOBP.gmt) <- gsub("^GOBP_", "", names(hGOBP.gmt))
names(hGOBP.gmt) <- gsub("_", " ", names(hGOBP.gmt))
hGOBP.gmt[1:2]
## $`10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS`
## [1] "AASDHPPT" "ALDH1L1" "ALDH1L2" "MTHFD1" "MTHFD1L" "MTHFD2L"
##
## $`2 OXOGLUTARATE METABOLIC PROCESS`
## [1] "AADAT" "ADHFE1" "COL6A1" "D2HGDH" "DLD" "DLST" "GOT1" "GOT2"
## [9] "GPT2" "IDH1" "IDH2" "KGD4" "KYAT3" "L2HGDH" "OGDH" "OGDHL"
## [17] "PHYH" "TAT"
Create GO ID (go_input argument) for runGORESP
: assigns GO IDs to GO terms
go_mapping <- go_bp[!duplicated(go_bp$gs_name), c("gs_name", "gs_id")]
# Let's look at the format of go_mapping
head(go_mapping)
gs_name | gs_id |
---|---|
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS | M29088 |
GOBP_2FE_2S_CLUSTER_ASSEMBLY | M23654 |
GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS | M15857 |
GOBP_3_UTR_MEDIATED_MRNA_DESTABILIZATION | M24263 |
GOBP_3_UTR_MEDIATED_MRNA_STABILIZATION | M11060 |
GOBP_4FE_4S_CLUSTER_ASSEMBLY | M45874 |
names(go_mapping) <- c("term", "GOID")
# Cleanup GO terms
go_mapping$term <- gsub("^GOBP_", "", go_mapping$term)
go_mapping$term <- gsub("_", " ", go_mapping$term)
head(go_mapping)
term | GOID |
---|---|
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS | M29088 |
2FE 2S CLUSTER ASSEMBLY | M23654 |
2 OXOGLUTARATE METABOLIC PROCESS | M15857 |
3 UTR MEDIATED MRNA DESTABILIZATION | M24263 |
3 UTR MEDIATED MRNA STABILIZATION | M11060 |
4FE 4S CLUSTER ASSEMBLY | M45874 |
enricher
uses over-representation analysis to identify enriched GO terms. Prepare data for enricher: Create TERM2GENE data frame (term = GO term, gene = gene symbol) for the GO library
gs_name | gene_symbol |
---|---|
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS | AASDHPPT |
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS | ALDH1L1 |
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS | ALDH1L2 |
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS | MTHFD1 |
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS | MTHFD1L |
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS | MTHFD2L |
names(term2gene) <- c("term", "gene")
# Cleanup GO terms
term2gene$term <- gsub("^GOBP_", "", term2gene$term)
term2gene$term <- gsub("_", " ", term2gene$term)
head(term2gene)
term | gene |
---|---|
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS | AASDHPPT |
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS | ALDH1L1 |
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS | ALDH1L2 |
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS | MTHFD1 |
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS | MTHFD1L |
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS | MTHFD2L |
- Run
enricher
analysis:
# Optimal parameters for robust analysis
enricher_result <- enricher(
gene = sigOE_genes,
TERM2GENE = term2gene,
pvalueCutoff = 1,
qvalueCutoff = 0.0001, # Stringent FDR threshold
minGSSize = 20, # Robust minimum size
maxGSSize = 300
)
- Prepare data for runGORESP
Create
scoreMat
: data frame with gene, score, and index columns
all_genes <- unique(term2gene$gene)
scoreMat <- data.frame(
gene = all_genes,
score = runif(length(all_genes)), # Random scores
index = ifelse(all_genes %in% sigOE_genes, 1, 0), # 1 = gene of interest
stringsAsFactors = FALSE
)
- Run
runGORESP
analysis:
goresp_result <- runGORESP(
scoreMat = scoreMat,
bp_input = hGOBP.gmt, # Gene sets as named list
go_input = go_mapping, # GO term to ID mapping
fdrThresh = 0.0001, # FDR threshold (same as enricher)
minSetSize = 20, # Same as enricher minGSSize
maxSetSize = 300 # Same as enricher maxGSSize
)
names(goresp_result$edgeMat)
## [1] "source" "target" "overlapCoeff" "width" "label"
## [1] "filename" "GOID" "term"
## [4] "nGenes" "nQuery" "nOverlap"
## [7] "querySetFraction" "geneSetFraction" "foldEnrichment"
## [10] "P" "FDR" "overlapGenes"
## [13] "maxOverlapGeneScore" "cluster" "id"
## [16] "size" "formattedLabel"
GOID | term | nGenes | nQuery | P |
---|---|---|---|---|
M16023 | RNA LOCALIZATION | 204 | 4998 | 0 |
M16816 | REGULATION OF RNA SPLICING | 189 | 5150 | 0 |
M24840 | VESICLE MEDIATED TRANSPORT IN SYNAPSE | 260 | 5117 | 0 |
M15413 | NUCLEAR EXPORT | 172 | 5112 | 0 |
M15304 | STEM CELL DIFFERENTIATION | 265 | 5003 | 0 |
M23968 | ESTABLISHMENT OF RNA LOCALIZATION | 165 | 5115 | 0 |
- Display top results from both methods
##
## === Top enricher results ===
## Description
## RNA LOCALIZATION RNA LOCALIZATION
## REGULATION OF RNA SPLICING REGULATION OF RNA SPLICING
## VESICLE MEDIATED TRANSPORT IN SYNAPSE VESICLE MEDIATED TRANSPORT IN SYNAPSE
## NUCLEAR EXPORT NUCLEAR EXPORT
## ESTABLISHMENT OF RNA LOCALIZATION ESTABLISHMENT OF RNA LOCALIZATION
## STEM CELL DIFFERENTIATION STEM CELL DIFFERENTIATION
## Count pvalue qvalue
## RNA LOCALIZATION 110 4.417655e-15 1.053727e-11
## REGULATION OF RNA SPLICING 103 1.293937e-14 1.543190e-11
## VESICLE MEDIATED TRANSPORT IN SYNAPSE 128 2.315881e-13 1.841329e-10
## NUCLEAR EXPORT 92 1.404827e-12 8.377206e-10
## ESTABLISHMENT OF RNA LOCALIZATION 87 1.573398e-11 7.345982e-09
## STEM CELL DIFFERENTIATION 125 1.847842e-11 7.345982e-09
##
## === Top runGORESP results ===
## term nOverlap P FDR
## 1 RNA LOCALIZATION 110 4.42e-15 1.48e-11
## 2 REGULATION OF RNA SPLICING 103 1.29e-14 2.17e-11
## 3 VESICLE MEDIATED TRANSPORT IN SYNAPSE 128 2.32e-13 2.59e-10
## 4 NUCLEAR EXPORT 92 1.40e-12 1.18e-09
## 5 STEM CELL DIFFERENTIATION 125 1.85e-11 1.03e-08
## 6 ESTABLISHMENT OF RNA LOCALIZATION 87 1.57e-11 1.03e-08
## [1] 65 12
## [1] 57 17
- Let’s check the overlap between the enriched terms found using
runGORESP
and those found usingenricher
as they used the same GO term libraries:
overlap_terms <- intersect(enricher_result$ID, goresp_result$enrichInfo$term)
cat("Overlapping terms:", length(overlap_terms), "\n")
## Overlapping terms: 57
cat("Overlap percentage:", round(100 * length(overlap_terms) / max(nrow(enricher_result), nrow(goresp_result$enrichInfo)), 1), "%\n")
## Overlap percentage: 87.7 %
Percent overlap is close to 90% that is very good.
Key differences between methods - Both use hypergeometric testing with Benjamini-Hochberg FDR correction - enricher: Standard clusterProfiler workflow - runGORESP: Adds gene set clustering and network visualization features - Expected difference: 5-15% due to minor implementation differences - Core significant results are nearly identical
We can visualize the results of the runGORESP analysis using the visNetwork
package. This package allows us to create interactive network visualizations.
We use thevisSetup
function to prepare the data for network visualization. We then run the runNetwork
function to generate the interactive network plot.
vis = visSetup(goresp_result$enrichInfo,goresp_result$edgeMat)
GOenrichment::runNetwork(vis$nodes, vis$edges, main = "MOV10 OE GO Enrichment Network")
This network analysis is based on Cytoscape, an open source bioinformatics software platform for visualizing molecular interaction networks. out of all the GO packages.
Exercise points = +4
1. Perform a GO enrichment analysis using the runGORESP
function from the GOenrichment
package using res_tableKD
with a padj cutoff of 0.05. Use a significance FDR threshold of 0.0001. Save the results in a variable called kresp
.
- Visualize the enriched GO terms using the
runNetwork
function from theGOenrichment
package. Save the visualization setup in a variable calledkvis
.
Step 1: Load and prepare the knockdown data
Step 2: Prepare the score matrix for runGORESP
Step 3: Run GO enrichment analysis with runGORESP
Step 4: Create interactive network visualization
8.1.6 Other tools and resources
GeneMANIA. GeneMANIA finds other genes that are related to a set of input genes, using a very large set of functional association data curated from the literature.
ReviGO. Revigo is an online GO enrichment tool that allows you to copy-paste your significant gene list and your background gene list. The output is a visualization of enriched GO terms in a hierarchical tree.
AmiGO. AmiGO is the current official web-based set of tools for searching and browsing the Gene Ontology database.
etc.