Chapter 8 Functional analysis of RNA-seq data

8.0.1 Ontology and GO Terms

Ontology: An ontology is a classification of things that are detectable or directly observable and the relationships between those things.

The Gene Ontology (GO) project provides an ontology of defined terms representing gene product properties. These terms are connected to each other via formally defined relationships, allowing for consistent annotation and analysis of gene functions across different species and research areas.

GO term hierarchy Some gene products are well-researched, with vast quantities of data available regarding their biological processes and functions. However, other gene products have very little data available about their roles in the cell.

For example, the protein, “p53”, would contain a wealth of information on it’s roles in the cell, whereas another protein might only be known as a “membrane-bound protein” with no other information available.

The GO ontologies were developed to describe and query biological knowledge with differing levels of information available. To do this, GO ontologies are loosely hierarchical, ranging from general, ‘parent’, terms to more specific, ‘child’ terms. The GO ontologies are “loosely” hierarchical since ‘child’ terms can have multiple ‘parent’ terms.

Some genes with less information may only be associated with general ‘parent’ terms or no terms at all, while other genes with a lot of information be associated with many terms.

8.0.2 GO Terms (Gene Ontology Terms)

The GO terms describe three aspects of gene products:

  • Biological Process: Refers to the biological objectives achieved by gene products, such as “cell division” or “response to stress.” Inferred from Direct Assay (IDA)

Alt text

  • Molecular Function: Describes the biochemical activities at the molecular level, such as “kinase activity” or “DNA binding.”

  • Cellular Component: Indicates where gene products are located within the cell, like “nucleus” or “mitochondrion.”

8.0.3 GO Enrichment Methods

GO enrichment methods help interpret differentially expressed gene lists from RNA-seq data and include the following:

  1. Overrepresentation Analysis (ORA):
  • ORA assesses whether specific GO terms are over-represented in a gene list compared to the background set of genes (e.g., all genes in the genome). Over-represented GO terms indicate biological functions or processes that are significantly associated with the genes of interest.

Alt text

Alt text`

  • Hypergeometric Test:

The hypergeometric test in GO enrichment analysis is used to determine whether a set of genes is enriched for specific GO terms compared to the background set of genes. The test calculates the probability of observing the overlap between the query gene set and the gene set associated with a particular GO term, given the total number of genes in the dataset and the number of genes in the background set.

Test Analogy The jar contains:

The hypergeometric test asks: “If I randomly drew this many balls from the jar, what’s the probability I’d get at least this many red balls just by chance?”

So, it’s like drawing colored balls from a jar:

🔴 Red balls = genes that belong to a specific GO term

⚫ Black balls = genes that do NOT belong to that GO term

Total balls in jar = all genes in your background (universe)

You draw from the jar:

You draw a handful of balls = your significant genes (e.g., padj < 0.05)

The test checks how surprising it is that so many of your significant genes are red.

If the probability of getting that many red balls (GO term genes) by chance is very low, then the GO term is overrepresented — meaning your experiment is likely enriched for that biological process.

Note for the curious: The hypergeometric test is based on counting combinations — the number of ways to draw q or more “successes” (genes with a given GO term) in k draws from a total of N genes, of which m belong to that GO term.

The formula for the cumulative probability of observing at least q successes (the upper-tail probability) in a sample of size \(k\) from a population with \(m\) successes and \(N - m\) failures is:

\[ P(X \geq q) = \sum_{i=q}^{k} \frac{\binom{m}{i} \binom{N - m}{k - i}}{\binom{N}{k}} \]

Where:

  • N: The total number of genes in the background set (e.g., all genes in the genome).

  • m: The total number of genes in the background annotated with a specific GO term (e.g., all genes associated with “cell cycle regulation”).

  • k: The total number of genes in your gene list of interest (e.g., your differentially expressed genes).

  • q: The number of genes in your list that are annotated with that specific GO term (the observed overlap).

8.1 Perform the hypergeometric test: P(X >= q)

In R, the phyper() function computes the probability of observing this much overlap (or more) between our gene list and a GO term by chance.

Suppose your background population has \(N = 20,000\) genes (e.g., the entire genome). Out of these, \(m = 500\) genes are associated with a specific GO term (e.g., “cell cycle regulation”). You have a list of \(k = 1000\) genes of interest. In this list, \(q = 50\) genes are associated with the “cell cycle regulation” GO term.

A small p-value indicates that observing 50 or more genes with this GO term is unlikely by chance alone, suggesting genuine enrichment.

Question: Is 50 more than we’d expect by chance?

# Define parameters
N <- 20000         # Total genes in the background
m <- 500           # Genes with the GO term in background
k <- 1000          # Your gene list size (e.g., DE genes)
q <- 50            # Observed overlap with GO term

# Perform the hypergeometric test: P(X >= q)
# Note: We use q-1 because phyper calculates P(X > q-1) = P(X >= q)
p_value <- phyper(q - 1, m, N - m, k, lower.tail = FALSE)
p_value
## [1] 2.551199e-06

A small p-value indicates that observing 50 or more genes with this GO term is unlikely by chance alone, suggesting genuine enrichment.

  1. Gene Set Enrichment Analysis (GSEA): Instead of only considering genes above a certain threshold (e.g., differentially expressed genes), GSEA looks at the entire ranked list of genes (e.g., ranked by their expression levels) and checks whether genes associated with a certain GO term are statistically over-represented at the top or bottom of a ranked gene list, indicating coordinated changes in gene expression.

Here’s how GSEA works:

  • Gene Ranking: Genes are ranked based on a metric that reflects their differential expression, such as log2FoldChange.

  • Enrichment Score (ES): GSEA calculates an Enrichment Score (ES) for each gene set by walking down the ranked list, increasing the score when a gene belongs to the set and decreasing it otherwise. A positive ES suggests upregulation, while a negative ES suggests downregulation of the gene set.

The hypothesis of GSEA is that although large changes in individual genes can have significant effects on pathways (and will be detected via ORA methods), weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects. Thus, rather than setting an arbitrary threshold to identify ‘significant genes’, all genes are considered in the analysis. The gene-level statistics from the dataset are aggregated to generate a single pathway-level statistic and statistical significance of each pathway is reported. This type of analysis can be particularly helpful if the differential expression analysis only outputs a small list of significant DE genes.

  • Example: The GSEA enrichment plot provides a graphical view of the enrichment score for a gene set:

Alt text - The top portion of the plot shows the running ES for the gene set as the analysis walks down the ranked list. The score at the peak of the plot (the score furthest from 0.0) is the ES for the gene set.

  • The middle portion of the plot shows where the members of the gene set appear in the ranked list of genes.

  • The leading edge subset of a gene set is the subset of members that contribute most to the ES.

  • The bottom portion of the plot shows the value of the ranking metric (such as log2FoldChange) as you move down the list of ranked genes.

ORA vs. GSEA

When to use each: - Use ORA when you have a clear significance threshold (e.g., padj < 0.05) and want to test specific gene lists - Use GSEA when you want to detect subtle but coordinated changes across a pathway, even if individual genes aren’t significantly DE.

ORA → Uses only significant genes (binary: in/out)

ORA → Hypergeometric test

GSEA → Uses all genes ranked by expression change

GSEA → Enrichment score

Alt text

8.1.1 Over-representation Analysis (ORA) using clusterProfiler

## Load required libraries
library(org.Hs.eg.db)
library(clusterProfiler)
library(tidyverse)
library(enrichplot)
library(fgsea) 
library(igraph)
library(ggraph)
library(visNetwork)
library(GO.db)
library(GOSemSim)
library(GOenrichment)
library(tidyr)

8.1.1.1 Running clusterProfiler: we first need to load the results of the differential expression analysis.

res_tableOE = readRDS("data/res_tableOE.RDS")

res_tableOE_tb <- res_tableOE %>% 
  data.frame() %>% rownames_to_column(var = "gene") %>% 
  dplyr::filter(!is.na(log2FoldChange))  %>% as_tibble()

To perform the over-representation analysis, we need a list of background genes and a list of significant genes:

## background set of genes
allOE_genes <- res_tableOE_tb$gene


sigOE = dplyr::filter(res_tableOE_tb, padj < 0.05) # Standard padj value

## significant genes
sigOE_genes = sigOE$gene

TheenrichGO() function performs the ORA for the significant genes of interest (sigOE_genes) compared to the background gene list (allOE_genes) and returns the enriched GO terms and their associated p-values.

## Run GO enrichment analysis
ego <- enrichGO(gene = sigOE_genes,
 universe = allOE_genes,
 keyType = "SYMBOL",
 OrgDb = org.Hs.eg.db,
 minGSSize = 20,
 maxGSSize = 300,
 ont = "BP",
 pAdjustMethod = "BH",
 qvalueCutoff = 0.01,   # slightly stringent FDR cutoff
 readable = TRUE)
## Dotplot 
dotplot(ego, showCategory=20)

We can extract the gene lists associated with each enriched GO term from the ego enrichResult object to create edges between terms based on shared genes.

# Extract the result as a data frame
ego_df <- as.data.frame(ego)

names(ego_df)
##  [1] "ID"             "Description"    "GeneRatio"      "BgRatio"       
##  [5] "RichFactor"     "FoldEnrichment" "zScore"         "pvalue"        
##  [9] "p.adjust"       "qvalue"         "geneID"         "Count"
head(ego_df)
ID Description GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust qvalue geneID Count
GO:0042254 GO:0042254 ribosome biogenesis 152/5254 278/14554 0.5467626 1.514576 6.511013 0 6.0e-07 5.0e-07 AATF/ABT1/BMS1/BRIX1/BYSL/C1QBP/CHD7/CUL4A/CUL4B/DDX18/DDX21/DDX27/DDX28/DDX47/DDX49/DDX51/DDX54/DDX56/DHX30/DHX37/DKC1/DNTTIP2/EBNA1BP2/EIF1AX/EIF2A/EIF4A3/EIF5/EIF5B/EIF6/EMG1/ERAL1/ERCC2/EXOSC10/EXOSC3/EXOSC5/EXOSC6/EXOSC9/FAU/FBL/FBLL1/FCF1/FRG1/FTSJ3/GEMIN4/GLUL/GNL3L/GRWD1/GTPBP10/HEATR1/IMP3/IMP4/ISG20L2/KRI1/LSG1/LSM6/LTV1/MAK16/MCAT/MDN1/MPHOSPH10/MRPS2/MRPS7/MRTO4/MYBBP1A/NAT10/NGDN/NGRN/NHP2/NLE1/NMD3/NOC2L/NOC4L/NOL10/NOL11/NOL6/NOL8/NOL9/NOLC1/NOP10/NOP14/NOP16/NOP2/NSUN5/NUDT16/NUP88/PA2G4/PDCD11/PELP1/PES1/PIH1D1/POP4/PPAN/PWP1/PWP2/RAN/RBM10/REXO1/REXO4/RIOK3/RPL24/RPL26/RPL26L1/RPL7L1/RPP30/RPP40/RPS15/RPS15A/RPS19BP1/RPS23/RPS27/RPS27A/RPS27L/RPS3A/RPSA/RPUSD1/RPUSD4/RRP1/RRP15/RRP1B/RRP36/RRP7A/RRP9/RRS1/RSL24D1/SBDS/SDAD1/SIRT7/SRFBP1/SURF6/SUV39H1/TFB1M/TRMT112/TRMT2B/TSR1/TSR2/URB1/URB2/UTP14A/UTP15/UTP18/UTP20/UTP3/WDR18/WDR43/WDR46/WDR55/WDR74/XPO1/XRCC5/XRN2/YTHDF2/ZNF622 152
GO:0006403 GO:0006403 RNA localization 110/5254 189/14554 0.5820106 1.612216 6.367405 0 8.0e-07 7.0e-07 ALKBH5/ALYREF/ARC/ATR/ATXN2/BICD1/CAPRIN1/CASC3/CCT2/CCT3/CCT5/CCT6A/CCT7/CETN2/CETN3/CHTOP/DCP2/DDX19A/DDX39A/DHX9/DKC1/EIF4A3/ENY2/EXOSC10/FBL/FMR1/FUBP3/FXR2/FYTTD1/GLE1/HNRNPU/HSF1/IGF2BP1/IGF2BP2/KHDRBS1/KHSRP/KPNB1/MCM3AP/MVP/NCBP1/NEAT1/NHP2/NOL6/NOP10/NSUN2/NUP107/NUP133/NUP188/NUP205/NUP210/NUP214/NUP50/NUP54/NUP85/NUP88/NUP93/NUP98/NUTF2/NXF1/NXF3/NXT1/NXT2/PABPN1/PARN/PEG10/PHAX/PIH1D1/POLDIP3/POM121/POM121C/POM121L2/PRPF6/QKI/RAE1/RAN/RBM15B/RBM33/RBM8A/RFTN1/RUVBL1/RUVBL2/SARNP/SEC13/SEH1L/SENP2/SETD2/SHQ1/SIDT2/SLBP/SMG5/SMG6/SMG7/SNUPN/SRSF3/SRSF7/STAU1/SUPT6H/TGFBR2/THOC1/THOC2/THOC7/TOMM20/UPF1/XPO1/XPO5/YBX1/ZC3H11A/ZFP36/ZFP36L1/ZNF385A 110
GO:0043484 GO:0043484 regulation of RNA splicing 102/5254 174/14554 0.5862069 1.623840 6.222261 0 1.3e-06 1.1e-06 ACIN1/AHNAK/AHNAK2/ARGLU1/ATXN7L3/C1QBP/CCNL1/CCNL2/CELF1/CELF2/CELF3/CELF4/CIRBP/CLK1/CLNS1A/DAZAP1/DDX5/ENY2/ERN1/ESRP1/EXOSC10/FAM50A/FASTK/FMR1/FUS/GRSF1/HNRNPF/HNRNPH2/HNRNPK/HNRNPL/HNRNPU/HOXB-AS3/HSPA1A/KAT2A/KHDRBS1/KHDRBS3/LARP7/MBNL1/MBNL2/MBNL3/NCBP1/NCL/NOVA1/NUP98/PIK3R1/POLR2A/PQBP1/PRDX6/PRMT5/PRPF19/PTBP1/PTBP2/PUF60/QKI/RBFOX2/RBM10/RBM11/RBM12B/RBM15/RBM15B/RBM22/RBM23/RBM24/RBM3/RBM38/RBM39/RBM4/RBM42/RBM8A/RBMX/RBPMS/RBPMS2/RNPS1/RPS26/RRP1B/SF1/SF3B3/SFSWAP/SLC38A2/SMU1/SNRNP70/SRPK1/SRPK2/SRSF3/SRSF6/SRSF7/SRSF9/SUPT3H/TADA2B/TADA3/TAF5L/TAF6L/TIA1/TMBIM6/TRRAP/U2AF2/UPF1/USP22/WDR77/ZBTB7A/ZNF326/ZNF638 102
GO:0051168 GO:0051168 nuclear export 93/5254 156/14554 0.5961538 1.651394 6.148011 0 1.5e-06 1.3e-06 ADAR/AHCYL1/ALKBH5/ALYREF/ANP32B/BAG3/CASC3/CDK5/CHTOP/CSE1L/DDX19A/DDX39A/DHX9/EIF4A3/EIF6/EMD/ENY2/FMR1/FYTTD1/GAS6/GLE1/KHDRBS1/LSG1/LTV1/MCM3AP/MDM2/MDN1/NCBP1/NEAT1/NEMF/NMD3/NOL6/NSUN2/NUP107/NUP133/NUP188/NUP214/NUP85/NUP88/NUP93/NUP98/NUTF2/NXF1/NXF3/NXT1/NXT2/PABPN1/PARK7/PHAX/PKD1/POLDIP3/POM121/POM121C/POM121L2/PPM1A/PRKACA/PTPN11/RAE1/RAN/RANBP3/RANBP3L/RANGAP1/RAPGEF3/RBM10/RBM15B/RBM22/RBM33/RBM8A/RPS15/SARNP/SDAD1/SETD2/SFN/SIRT6/SIRT7/SMG5/SMG6/SMG7/SRSF3/SUPT6H/TGFB1/THOC1/THOC2/THOC7/UBE2I/UPF1/XPO1/XPO4/XPO5/XPO6/XPO7/YWHAE/ZC3H11A 93
GO:0000375 GO:0000375 RNA splicing, via transesterification reactions 151/5254 284/14554 0.5316901 1.472824 6.048208 0 1.5e-06 1.3e-06 ACIN1/ARGLU1/C1QBP/C9orf78/CASC3/CD2BP2/CELF1/CELF2/CELF3/CELF4/CIRBP/CLNS1A/COIL/CTNNBL1/CWC15/CWF19L1/DAZAP1/DBR1/DDX1/DDX23/DDX39A/DDX42/DDX5/DHX15/DHX16/DHX38/DHX40/DHX9/EFTUD2/EIF4A3/EXOSC10/FMR1/FRG1/GCFC2/GEMIN4/GEMIN5/GEMIN7/GPKOW/HNRNPA3/HNRNPF/HNRNPK/HNRNPL/HNRNPM/HNRNPR/HNRNPU/HNRNPUL1/HSPA8/HTATSF1/IK/ILF3/KHDRBS1/KHDRBS3/KHSRP/LARP7/LSM2/LSM3/LSM4/LSM5/LSM6/LSM7/LUC7L2/LUC7L3/METTL3/MPHOSPH10/NCBP1/NCL/NOVA1/NUP98/PABPC1/PLRG1/PNN/PPIL1/PQBP1/PRDX6/PRMT5/PRMT7/PRPF19/PRPF3/PRPF38A/PRPF39/PRPF4/PRPF40B/PRPF6/PRPF8/PTBP1/PTBP2/PUF60/QKI/RALY/RBFOX2/RBM10/RBM11/RBM15/RBM15B/RBM17/RBM22/RBM23/RBM24/RBM3/RBM39/RBM4/RBM42/RBM8A/RBMX/RBPMS/RBPMS2/RNF113A/RNPC3/RNPS1/SART3/SF1/SF3A1/SF3A3/SF3B1/SF3B2/SF3B3/SF3B4/SFPQ/SFSWAP/SMU1/SNRNP200/SNRNP25/SNRNP27/SNRNP40/SNRNP70/SNRPB/SNRPC/SNRPD1/SNRPD2/SNRPD3/SRPK1/SRPK2/SRSF3/SRSF6/SRSF7/SRSF9/STRAP/SYNCRIP/TFIP11/TIA1/TSSC4/TXNL4B/U2AF1/U2AF2/UBL5/UPF1/USP39/USP4/WDR77/ZBTB7A/ZMAT2 151
GO:0000377 GO:0000377 RNA splicing, via transesterification reactions with bulged adenosine as nucleophile 149/5254 280/14554 0.5321429 1.474078 6.020548 0 1.5e-06 1.3e-06 ACIN1/ARGLU1/C1QBP/C9orf78/CASC3/CD2BP2/CELF1/CELF2/CELF3/CELF4/CIRBP/CLNS1A/COIL/CTNNBL1/CWC15/CWF19L1/DAZAP1/DBR1/DDX1/DDX23/DDX39A/DDX42/DDX5/DHX15/DHX16/DHX38/DHX40/DHX9/EFTUD2/EIF4A3/EXOSC10/FMR1/FRG1/GCFC2/GEMIN4/GEMIN5/GEMIN7/GPKOW/HNRNPA3/HNRNPF/HNRNPK/HNRNPL/HNRNPM/HNRNPR/HNRNPU/HNRNPUL1/HSPA8/HTATSF1/IK/ILF3/KHDRBS1/KHDRBS3/LARP7/LSM2/LSM3/LSM4/LSM5/LSM6/LSM7/LUC7L2/LUC7L3/METTL3/NCBP1/NCL/NOVA1/NUP98/PABPC1/PLRG1/PNN/PPIL1/PQBP1/PRDX6/PRMT5/PRMT7/PRPF19/PRPF3/PRPF38A/PRPF39/PRPF4/PRPF40B/PRPF6/PRPF8/PTBP1/PTBP2/PUF60/QKI/RALY/RBFOX2/RBM10/RBM11/RBM15/RBM15B/RBM17/RBM22/RBM23/RBM24/RBM3/RBM39/RBM4/RBM42/RBM8A/RBMX/RBPMS/RBPMS2/RNF113A/RNPC3/RNPS1/SART3/SF1/SF3A1/SF3A3/SF3B1/SF3B2/SF3B3/SF3B4/SFPQ/SFSWAP/SMU1/SNRNP200/SNRNP25/SNRNP27/SNRNP40/SNRNP70/SNRPB/SNRPC/SNRPD1/SNRPD2/SNRPD3/SRPK1/SRPK2/SRSF3/SRSF6/SRSF7/SRSF9/STRAP/SYNCRIP/TFIP11/TIA1/TSSC4/TXNL4B/U2AF1/U2AF2/UBL5/UPF1/USP39/USP4/WDR77/ZBTB7A/ZMAT2 149
# GO term IDs 
term_ids <- ego_df$ID  

# Split gene lists by "/" and assign term IDs as names
gene_lists <- strsplit(ego_df$geneID, "/") 

# Assign term IDs as names
names(gene_lists) <- term_ids  

Using the gene lists associated with each GO term, we can create edges between terms based on the shared genes.

mgoSim() calculates the semantic similarity of go terms based on their go ancestry terms, for example:

Term A: “DNA replication” Ancestors: DNA metabolic process → nucleic acid process → cellular process

Term B: “DNA repair”
Ancestors: DNA metabolic process → nucleic acid process → cellular process

Shared ancestry: Both have “DNA metabolic process” as close ancestor → High similarity

# Load GO data for semantic similarity calculation
# GO terms and their relationships (the ontology graph structure)
go_data <- GOSemSim::godata(annoDb = "org.Hs.eg.db", 
                            ont = "BP", 
                            keytype = "SYMBOL")

# Compute semantic similarity
similarity_matrix <- GOSemSim::mgoSim(term_ids, term_ids, 
                     semData = go_data, 
                     measure = "Wang", combine = NULL)

# Convert similarity matrix to edges
edges <- data.frame(from = rep(term_ids, 
                    each = length(term_ids)),
                    to = rep(term_ids, 
                    times = length(term_ids)),
                    similarity = as.vector(similarity_matrix))

# Filter out edges with zero similarity or self-loops
edges <- subset(edges, similarity > 0 & from != to)

edges <- edges %>% filter(similarity >= 0.8)


# Convert term IDs to names
edges$to <- Term(edges$to)
edges$from <- Term(edges$from)

dim(edges)
## [1] 66  3

Now we can create a network plot to visualize the relationships between the enriched GO terms based on their semantic similarity.

# Create an igraph object from the edges data frame
g <- graph_from_data_frame(edges, directed = FALSE)

# Visualize the graph using ggraph
ggraph(g, layout = 'kk') +  # Kamada-Kawai layout
  geom_edge_link(aes(width = similarity), alpha = 0.5) +  # Edges with some transparency
  geom_node_point(size = 5, color = "blue") +  # Nodes sized uniformly
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +  # Node labels
  theme_void() +  # Clean theme without axes
  labs(title = "GO Terms Semantic Similarity Network")  # Title for the plot

Edges can also be defined by Jaccard similarity between GO terms; defined as the size of the intersection each pair of GO terms divided by the size of their union.

# Define a function to compute Jaccard similarity
jaccard_index <- function(genes1, genes2) {
length(intersect(genes1, genes2)) / length(union(genes1, genes2))
}

# Create an empty data frame to store edges
edges <- data.frame(from = character(), to = character(), 
                    similarity = numeric())
  # Compute Jaccard similarity for each pair of terms
  for (i in 1:(length(term_ids) - 1)) {
  for (j in (i + 1):length(term_ids)) {
  similarity <- jaccard_index(gene_lists[[i]], gene_lists[[j]])
  if (similarity > 0) {  # Only keep edges with some overlap
  edges <- rbind(edges, data.frame(from = term_ids[i], 
                                   to = term_ids[j], 
                                   similarity = similarity))
  }}}

edges <- edges %>% 
  dplyr::filter(similarity >= 0.8)

# Convert term IDs to names
edges$to <- Term(edges$to)
edges$from <- Term(edges$from)

dim(edges)
## [1] 27  3
edges$edge_color <- ifelse(edges$similarity == 1, "red", "gray")

Now we can create a network plot to visualize the relationships between the enriched GO terms based on the Jaccard similarity of their gene lists. Jaccard similarity is stricter than mgoSim.

Jaccard Requirements:

Direct gene overlap between GO terms Must share actual genes to get high similarity No similarity if terms don’t share genes, even if biologically related

mgoSim (Semantic) is More Permissive:

Considers biological relationships in GO hierarchy Can be similar even without shared genes Uses pathway knowledge and term relationships

# Create a graph from the edges
g <- graph_from_data_frame(edges, directed = FALSE)

# Visualize the graph using ggraph
ggraph(g, layout = 'kk') + 
  geom_edge_link(aes(alpha = similarity, width = similarity, color = edge_color)) + 
  scale_edge_color_identity() +
  geom_node_point(size = 5, 
                  color = "blue") + 
  geom_node_text(aes(label = name), 
                 repel = TRUE, 
                 size = 3) +  # Node labels
  theme_void() +  # Clean theme without axes
  labs(title = "GO Terms Jaccard Overlap Network") 

The resulting network plots shows the relationships between the enriched GO terms based on Jaccard similarity of their gene lists.

It is clear that Jaccard similarity is more stringent than Semantic similarity.

8.1.2 Visualizing Enrichment Results

Network Plot: The emapplot() function generates a network plot where each node represents an enriched gene set, and edges between nodes indicate the similarity or overlap between those gene sets.

# Filter to top significant results first
ego_filtered <- ego
ego_filtered@result <- ego@result[1:50, ]  # Take only top 50 results

# Then create pairwise similarity
pwt_filtered <- pairwise_termsim(ego_filtered, method = "JC")

# Now create network plot
emapplot(
  pwt_filtered,
  showCategory = 50,
  node_label = "category",
  layout = "mds"
) +
  theme(legend.position = "none")

ssplot(pwt_filtered)

For a general overview of enrichment results and trends, use ssplot(). For detailed exploration of pairwise similarities between enrichment terms and their functional relationship networks, use emapplot().

cnetplot: Visualizes connections between genes and top GO terms, highlighting shared pathways.This plot is particularly useful for hypothesis generation in identifying genes that may be important to several of the most affected processes.

cnetplot(pwt_filtered, showCategory = 3,
         categorySize = "pvalue",
         colorEdge = TRUE) +
  theme(legend.position = "none")

8.1.3 GSEA Using ‘clusterProfiler’ `gseGO’

Prepare fold changes, sort by expression, and run GSEA with gseGO() from clusterProfiler

## Extract the foldchanges
foldchanges <- res_tableOE_tb$log2FoldChange

## Name each fold change with the gene name
names(foldchanges) <- res_tableOE_tb$gene

## Sort fold changes in decreasing order
foldchanges <- sort(foldchanges, decreasing = TRUE)

We can use GSEA to test whether genes in specific biological processes tend to be up- or down-regulated together in our experiment:

gseaGO <- clusterProfiler::gseGO(
  geneList = foldchanges,
  ont = "BP",
  keyType = "SYMBOL",
  minGSSize = 20,
  maxGSSize = 300,
  eps = 0,
  pAdjustMethod = "BH",
  pvalueCutoff = 0.01,
  verbose = TRUE,
  OrgDb = org.Hs.eg.db,
  by = "fgsea"         # fast gene set enrichment analysis
)

8.1.4 Visualizing GSEA Results

Visualize GSEA results with dotplot() for significant categories and gseaplot2() for gene set ranks.

dotplot(gseaGO, showCategory = 20)

gseaplot2(gseaGO, geneSetID = 1:4)

The emapplot clusters the 50 most significant (by padj) GO terms to visualize relationships between terms.

gseaGO_filtered <- gseaGO
gseaGO_filtered@result <- gseaGO@result[1:50, ]  # Top 50 by p.adjust

# Create pairwise similarity 
pwt <- pairwise_termsim(
  gseaGO_filtered,
  method = "JC",
  semData = NULL
)


emapplot(pwt, showCategory = 50,     
         node_label = "category")    

ssplot(pwt)

treeplot(pwt)

We are going to compare ORA enricher to running the ORA runGORESP.

Let’s first load the GOenrichment package:

# Uncomment the following if you haven't yet installed GOenrichment.

# devtools::install_github("gurinina/GOenrichment")

library(GOenrichment)

8.1.5 GOenrichment Package Analysis

We’re going to run this analysis using TWO different packages. Why? In research, you often want to verify important results using independent methods. We’ll see that both give us similar answers (~90% overlap), which builds confidence. They differ slightly because of minor implementation details, which is normal and expected.

runGORESP uses over-representation analysis to identify enriched GO terms and returns two data.frames; of enriched GO terms (nodes) and GO term relationships (edges). We first need to get the GO annotations and specifically ‘Biological Process’ (BP).

  1. Generate clean GO Biological Process gene sets with gene symbols
# install.packages("msigdbr") # uncomment to install
library(msigdbr)

# Using msigdbr ensures standardized, up-to-date gene sets
go_bp <- msigdbr(species = "Homo sapiens", collection = "C5", subcollection = "GO:BP")

# Let's look at the format of go_bp
names(go_bp)
##  [1] "gene_symbol"        "ncbi_gene"          "ensembl_gene"      
##  [4] "db_gene_symbol"     "db_ncbi_gene"       "db_ensembl_gene"   
##  [7] "source_gene"        "gs_id"              "gs_name"           
## [10] "gs_collection"      "gs_subcollection"   "gs_collection_name"
## [13] "gs_description"     "gs_source_species"  "gs_pmid"           
## [16] "gs_geoid"           "gs_exact_source"    "gs_url"            
## [19] "db_version"         "db_target_species"
head(go_bp[,c("gene_symbol", "gs_id")])
gene_symbol gs_id
AASDHPPT M29088
ALDH1L1 M29088
ALDH1L2 M29088
MTHFD1 M29088
MTHFD1L M29088
MTHFD2L M29088
dim(go_bp)
## [1] 617648     20
# Convert to named list format
hGOBP.gmt <- split(go_bp$gene_symbol, go_bp$gs_name)
hGOBP.gmt[1:2]
## $GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS
## [1] "AASDHPPT" "ALDH1L1"  "ALDH1L2"  "MTHFD1"   "MTHFD1L"  "MTHFD2L" 
## 
## $GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS
##  [1] "AADAT"  "ADHFE1" "COL6A1" "D2HGDH" "DLD"    "DLST"   "GOT1"   "GOT2"  
##  [9] "GPT2"   "IDH1"   "IDH2"   "KGD4"   "KYAT3"  "L2HGDH" "OGDH"   "OGDHL" 
## [17] "PHYH"   "TAT"
# Clean up term names (remove GOBP_ prefix and replace underscores)
names(hGOBP.gmt) <- gsub("^GOBP_", "", names(hGOBP.gmt))
names(hGOBP.gmt) <- gsub("_", " ", names(hGOBP.gmt))
hGOBP.gmt[1:2]
## $`10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS`
## [1] "AASDHPPT" "ALDH1L1"  "ALDH1L2"  "MTHFD1"   "MTHFD1L"  "MTHFD2L" 
## 
## $`2 OXOGLUTARATE METABOLIC PROCESS`
##  [1] "AADAT"  "ADHFE1" "COL6A1" "D2HGDH" "DLD"    "DLST"   "GOT1"   "GOT2"  
##  [9] "GPT2"   "IDH1"   "IDH2"   "KGD4"   "KYAT3"  "L2HGDH" "OGDH"   "OGDHL" 
## [17] "PHYH"   "TAT"

Create GO ID (go_input argument) for runGORESP: assigns GO IDs to GO terms

go_mapping <- go_bp[!duplicated(go_bp$gs_name), c("gs_name", "gs_id")]


# Let's look at the format of go_mapping
head(go_mapping)
gs_name gs_id
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS M29088
GOBP_2FE_2S_CLUSTER_ASSEMBLY M23654
GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS M15857
GOBP_3_UTR_MEDIATED_MRNA_DESTABILIZATION M24263
GOBP_3_UTR_MEDIATED_MRNA_STABILIZATION M11060
GOBP_4FE_4S_CLUSTER_ASSEMBLY M45874
names(go_mapping) <- c("term", "GOID")

# Cleanup GO terms
go_mapping$term <- gsub("^GOBP_", "", go_mapping$term)
go_mapping$term <- gsub("_", " ", go_mapping$term)

head(go_mapping)
term GOID
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS M29088
2FE 2S CLUSTER ASSEMBLY M23654
2 OXOGLUTARATE METABOLIC PROCESS M15857
3 UTR MEDIATED MRNA DESTABILIZATION M24263
3 UTR MEDIATED MRNA STABILIZATION M11060
4FE 4S CLUSTER ASSEMBLY M45874
  1. enricher uses over-representation analysis to identify enriched GO terms. Prepare data for enricher: Create TERM2GENE data frame (term = GO term, gene = gene symbol) for the GO library
term2gene <- go_bp[, c("gs_name", "gene_symbol")]
head(term2gene)
gs_name gene_symbol
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS AASDHPPT
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS ALDH1L1
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS ALDH1L2
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS MTHFD1
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS MTHFD1L
GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS MTHFD2L
names(term2gene) <- c("term", "gene")

# Cleanup GO terms
term2gene$term <- gsub("^GOBP_", "", term2gene$term)
term2gene$term <- gsub("_", " ", term2gene$term)

head(term2gene)
term gene
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS AASDHPPT
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS ALDH1L1
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS ALDH1L2
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS MTHFD1
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS MTHFD1L
10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS MTHFD2L
  1. Run enricher analysis:
# Optimal parameters for robust analysis
enricher_result <- enricher(
    gene = sigOE_genes,
    TERM2GENE = term2gene,
    pvalueCutoff = 1,
    qvalueCutoff = 0.0001,        # Stringent FDR threshold
    minGSSize = 20,               # Robust minimum size
    maxGSSize = 300
)
  1. Prepare data for runGORESP Create scoreMat: data frame with gene, score, and index columns
all_genes <- unique(term2gene$gene)
scoreMat <- data.frame(
    gene = all_genes,
    score = runif(length(all_genes)),      # Random scores
    index = ifelse(all_genes %in% sigOE_genes, 1, 0),  # 1 = gene of interest
    stringsAsFactors = FALSE
)
  1. Run runGORESP analysis:
goresp_result <- runGORESP(
    scoreMat = scoreMat,
    bp_input = hGOBP.gmt,                  # Gene sets as named list
    go_input = go_mapping,                 # GO term to ID mapping
    fdrThresh = 0.0001,                    # FDR threshold (same as enricher)
    minSetSize = 20,                       # Same as enricher minGSSize
    maxSetSize = 300                       # Same as enricher maxGSSize
)

names(goresp_result$edgeMat)
## [1] "source"       "target"       "overlapCoeff" "width"        "label"
names(goresp_result$enrichInfo)
##  [1] "filename"            "GOID"                "term"               
##  [4] "nGenes"              "nQuery"              "nOverlap"           
##  [7] "querySetFraction"    "geneSetFraction"     "foldEnrichment"     
## [10] "P"                   "FDR"                 "overlapGenes"       
## [13] "maxOverlapGeneScore" "cluster"             "id"                 
## [16] "size"                "formattedLabel"
head(goresp_result$enrichInfo[,c(2,3,4,5,10)])
GOID term nGenes nQuery P
M16023 RNA LOCALIZATION 204 4998 0
M16816 REGULATION OF RNA SPLICING 189 5150 0
M24840 VESICLE MEDIATED TRANSPORT IN SYNAPSE 260 5117 0
M15413 NUCLEAR EXPORT 172 5112 0
M15304 STEM CELL DIFFERENTIATION 265 5003 0
M23968 ESTABLISHMENT OF RNA LOCALIZATION 165 5115 0
  1. Display top results from both methods
cat("\n=== Top enricher results ===\n")
## 
## === Top enricher results ===
print(head(enricher_result[, c("Description", "Count", "pvalue", "qvalue")]))
##                                                                 Description
## RNA LOCALIZATION                                           RNA LOCALIZATION
## REGULATION OF RNA SPLICING                       REGULATION OF RNA SPLICING
## VESICLE MEDIATED TRANSPORT IN SYNAPSE VESICLE MEDIATED TRANSPORT IN SYNAPSE
## NUCLEAR EXPORT                                               NUCLEAR EXPORT
## ESTABLISHMENT OF RNA LOCALIZATION         ESTABLISHMENT OF RNA LOCALIZATION
## STEM CELL DIFFERENTIATION                         STEM CELL DIFFERENTIATION
##                                       Count       pvalue       qvalue
## RNA LOCALIZATION                        110 4.417655e-15 1.053727e-11
## REGULATION OF RNA SPLICING              103 1.293937e-14 1.543190e-11
## VESICLE MEDIATED TRANSPORT IN SYNAPSE   128 2.315881e-13 1.841329e-10
## NUCLEAR EXPORT                           92 1.404827e-12 8.377206e-10
## ESTABLISHMENT OF RNA LOCALIZATION        87 1.573398e-11 7.345982e-09
## STEM CELL DIFFERENTIATION               125 1.847842e-11 7.345982e-09
cat("\n=== Top runGORESP results ===\n")
## 
## === Top runGORESP results ===
print(head(goresp_result$enrichInfo[, c("term", "nOverlap", "P", "FDR")]))
##                                    term nOverlap        P      FDR
## 1                      RNA LOCALIZATION      110 4.42e-15 1.48e-11
## 2            REGULATION OF RNA SPLICING      103 1.29e-14 2.17e-11
## 3 VESICLE MEDIATED TRANSPORT IN SYNAPSE      128 2.32e-13 2.59e-10
## 4                        NUCLEAR EXPORT       92 1.40e-12 1.18e-09
## 5             STEM CELL DIFFERENTIATION      125 1.85e-11 1.03e-08
## 6     ESTABLISHMENT OF RNA LOCALIZATION       87 1.57e-11 1.03e-08
print(dim(enricher_result))
## [1] 65 12
print(dim(goresp_result$enrichInfo))
## [1] 57 17
  1. Let’s check the overlap between the enriched terms found using runGORESP and those found using enricher as they used the same GO term libraries:
overlap_terms <- intersect(enricher_result$ID, goresp_result$enrichInfo$term)
cat("Overlapping terms:", length(overlap_terms), "\n")
## Overlapping terms: 57
cat("Overlap percentage:", round(100 * length(overlap_terms) / max(nrow(enricher_result), nrow(goresp_result$enrichInfo)), 1), "%\n")
## Overlap percentage: 87.7 %

Percent overlap is close to 90% that is very good.

Key differences between methods - Both use hypergeometric testing with Benjamini-Hochberg FDR correction - enricher: Standard clusterProfiler workflow - runGORESP: Adds gene set clustering and network visualization features - Expected difference: 5-15% due to minor implementation differences - Core significant results are nearly identical

We can visualize the results of the runGORESP analysis using the visNetwork package. This package allows us to create interactive network visualizations.

We use thevisSetup function to prepare the data for network visualization. We then run the runNetwork function to generate the interactive network plot.

vis = visSetup(goresp_result$enrichInfo,goresp_result$edgeMat)
GOenrichment::runNetwork(vis$nodes, vis$edges, main = "MOV10 OE GO Enrichment Network")

This network analysis is based on Cytoscape, an open source bioinformatics software platform for visualizing molecular interaction networks. out of all the GO packages.

Exercise points = +4 1. Perform a GO enrichment analysis using the runGORESP function from the GOenrichment package using res_tableKD with a padj cutoff of 0.05. Use a significance FDR threshold of 0.0001. Save the results in a variable called kresp.

  1. Visualize the enriched GO terms using the runNetwork function from the GOenrichment package. Save the visualization setup in a variable called kvis.

Step 1: Load and prepare the knockdown data

# Your code here

Step 2: Prepare the score matrix for runGORESP

# Your code here

Step 3: Run GO enrichment analysis with runGORESP

# Your code here

Step 4: Create interactive network visualization

# Your code here

8.1.6 Other tools and resources

  • GeneMANIA. GeneMANIA finds other genes that are related to a set of input genes, using a very large set of functional association data curated from the literature.

  • ReviGO. Revigo is an online GO enrichment tool that allows you to copy-paste your significant gene list and your background gene list. The output is a visualization of enriched GO terms in a hierarchical tree.

  • AmiGO. AmiGO is the current official web-based set of tools for searching and browsing the Gene Ontology database.

  • etc.