Chapter 4 DGE analysis workflow

Differential expression analysis with DESeq2 involves multiple steps as displayed in the flowchart below in blue.

Alt text Here’s a more succinct version of your workflow:

DESeq2 Workflow Overview:

  1. Load Data: Start with raw count data and sample metadata.

  2. Normalize Counts: Estimate size factors to adjust for sequencing depth or library size differences.

  3. Estimate Dispersion: Calculate gene-wise dispersion, measuring variability in counts not explained by mean expression.

  4. Fit Dispersion Trend: Model the relationship between mean counts and dispersion, either parametrically (default) or non-parametrically.

  5. Shrink Dispersions: Use empirical Bayes shrinkage to stabilize dispersion estimates, particularly for low-count genes. Empirical Bayes is a version of Bayesian statistics where the prior distribution is estimated from the data itself, rather than being set externally. In empirical Bayes shrinkage, the idea is to estimate a common pattern or distribution (the prior) across all observations (e.g., all genes) and then use this estimated distribution to influence or “shrink” each individual observation.

  6. Fit NB Model: Fit a negative binomial model using the shrunken dispersion estimates, capturing variability and estimating log-fold changes.

  7. Shrink Log-Fold Changes: Optionally stabilize log-fold changes with regularization, especially for low-expression genes.

  8. Statistical Testing: Perform Wald tests to identify differentially expressed genes, providing adjusted p-values for multiple testing.

This final step in the differential expression analysis workflow of fitting the raw counts to the NB model and performing the statistical test for differentially expressed genes, is the step we care about. This is the step that performs statistical tests to identify significant differences in gene expression between sample groups.

Alt text

Although the DESeq2 paper was published in 2014, but the package is continually updated and available for use in R through Bioconductor.

4.1 Running DESeq2

Prior to performing the differential expression analysis, it is a good idea to know what sources of variation are present in your data, either by exploration during the QC and/or prior knowledge. Sources of variation refer to anything that causes differences in gene expression across your samples, including biological factors (treatment, sex, age) and technical factors (batch effects, sequencing depth). Ideally, these are all factors in your metadata. These factors lead to variability in the data and can affect your ability to detect meaningful biological signals (like differential expression between treated and control samples).Once you know the major sources of variation, you can remove them prior to analysis or control for them in the statistical model by including them in your design formula.

4.1.1 Set the Design Formula

The main purpose of the design formula in DESeq2 is to specify the factors that are influencing gene expression so that their effects can be accounted for or controlled during the analysis. This allows DESeq2 to isolate the effect of the variable you’re primarily interested in while adjusting for other known sources of variation. The design formula should have all of the factors in your metadata that account for major sources of variation in your data. The last factor entered in the formula should be the condition of interest.

For example:

design(dds)
## ~sampletype

This shows that sampletype is the condition of interest and as such is the only covariate. DESeq2 will test for differential expression between the two sample types, adjusting for any other variation.

Example

If you want to examine the expression differences between treatments as shown below, and you know that major sources of variation include sex and age, then your design formula would be:

design <- ~ sex + age + treatment

Alt text

This allows DESeq2 to adjust for sex and age while testing for differences in treatment.

Why is this important?
Including known sources of variation (like sex and age) ensures any gene expression differences are due to the condition of interest (e.g., treatment), not confounded by other factors.

Steps to set the design formula: 1. Identify your factor of interest (e.g., treatment). 2. Determine any confounders (e.g., sex, age). 3. Write the formula, placing the factor of interest last. 4. Use PCA to ensure all relevant factors are included.

Exercise points = +3

  1. Suppose you wanted to study the expression differences between the two age groups in the metadata shown above, and major sources of variation were sex and treatment, how would the design formula be written?
# Your code here
  1. Based on our Mov10 metadata dataframe, which factors could we include in our design formula?
  • Ans:
  1. What would you do if you wanted to include a factor in your design formula that is not in your metadata?
  • Ans:

4.1.2 MOV10 Differential Expression Analysis

Now that we understand how to specify the model in DESeq2, we can proceed with running the differential expression pipeline on the raw count data.

Running Differential Expression in Two Lines of Code

To obtain differential expression results from our raw count data, we only need to run two lines of code!

First, we create a DESeqDataSet, as we did in the ‘Count normalization’ lesson, specifying the location of our raw counts and metadata, and applying our design formula:

## Create DESeq object
dds <- DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ sampletype)

Next, we run the actual differential expression analysis with a single call to the DESeq() function. This function handles everything—from normalization to linear modeling—all in one step. During execution, DESeq2 will print messages detailing the steps being performed: estimating size factors, estimating dispersions, gene-wise dispersion estimates, modeling the mean-dispersion relationship, and statistical testing for differential expression.

## Run analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

By re-assigning the result to back to the same variable name (dds), we update our DESeqDataSet object, which will now contain the results of each step in the analysis, effectively filling in the slots of our DESeqDataSet object.

deseq1

4.2 DESeq2 differential gene expression analysis workflow

With these two lines of code, we have completed the core steps in the DESeq2 differential gene expression analysis. The key steps in this workflow are summarized below:

Alt text

In the following sections, we will explore each step in detail to better understand how DESeq2 performs the statistical analysis and what metrics we should focus on to evaluate the quality of the results.

4.2.1 Step 1: Estimate size factors

The first step in the differential expression analysis is to estimate the size factors, which is exactly what we already did to normalize the raw counts.

Alt text

DESeq2 will automatically estimate the size factors when performing the differential expression analysis. However, if you have already generated the size factors using estimateSizeFactors(), as we did earlier, then DESeq2 will use these values.

To normalize the count data, DESeq2 calculates size factors for each sample using the median of ratios method discussed previously in the Count normalization (Chapter 2) lesson.

4.2.1.1 MOV10 DE analysis: examining the size factors

Let’s take a quick look at size factor values we have for each sample:

## Check the size factors
sizeFactors(dds)
## Irrel_kd_1 Irrel_kd_2 Irrel_kd_3 Mov10_kd_2 Mov10_kd_3 Mov10_oe_1 Mov10_oe_2 Mov10_oe_3 
##  1.1224020  0.9625632  0.7477715  1.5646728  0.9351760  1.2016082  1.1205912  0.6534987

Take a look at the total number of reads for each sample:

## Total number of raw counts per sample
colSums(counts(dds))
## Irrel_kd_1 Irrel_kd_2 Irrel_kd_3 Mov10_kd_2 Mov10_kd_3 Mov10_oe_1 Mov10_oe_2 Mov10_oe_3 
##   22687366   19381680   14962754   32826936   19360003   23447317   21713289   12737889

How do the numbers correlate with the size factor?

# Your code here

4.2.2 Step 2: Estimate Gene-wise Dispersion

Dispersion Workflow

The next step in differential expression analysis is estimating gene-wise dispersions. Understanding dispersion is crucial in RNA-Seq analysis.

What is Dispersion?

Dispersion measures the variability in gene expression that cannot be explained by the mean expression levels alone. It captures the extra-Poisson variability observed in RNA-Seq data, where the variance tends to exceed the mean due to biological and technical factors.

In DESeq2, dispersion reflects the variability in gene expression for a given mean count and is characterized by:

  • Lower mean counts: Higher dispersion (more variability).
  • Higher mean counts: Lower dispersion (less variability).

The plot below illustrates the relationship between mean expression and variance: each black dot represents a gene, showing that variance can be predicted more reliably for genes with higher mean counts, while lower mean counts exhibit greater variability.

Mean vs Variance

Modeling Dispersion Using the Negative Binomial Distribution

DESeq2 utilizes the Negative Binomial (NB) distribution to model RNA-Seq count data, effectively managing overdispersion, which is common in biological data. The two key parameters of the NB model are the mean expression level and the dispersion.

For each gene \(i\), the observed count \(Y_{ij}\) in sample \(j\) is modeled as:

\[ Y_{ij} \sim \text{NB}(\mu_{ij}, \alpha_{i}) \]

Where: - \(Y_{ij}\) is the observed count for gene \(i\) in sample \(j\). - \(\mu_{ij}\) is the expected normalized count for gene \(i\). - \(\alpha_{i}\) is the gene-specific dispersion parameter.

4.2.3 Step 3: Fit Curve to Gene-wise Dispersion Estimates

After estimating gene-wise dispersions using maximum likelihood, DESeq2 fits a global trend to model how dispersion changes as a function of mean expression across all genes. This process stabilizes the dispersion estimates, especially for low-count genes.

  • Low-count genes with more uncertain dispersion estimates are pulled closer to the global trend.
  • Genes with higher counts or more reliable estimates will be shrunk less.

Why is Shrinkage Important?

Shrinkage improves the reliability of dispersion estimates, making them more stable for genes with low or moderate counts, which typically have noisier estimates. This reduces the influence of random fluctuations and results in more accurate estimates for downstream differential expression testing.

Additionally, shrinkage helps to reduce false positives in differential expression analysis, ensuring that estimates for these genes are more reliable for subsequent testing.

Genes with extremely high dispersion values (outliers) may demonstrate variability that is unexpected for their mean expression levels and are typically not shrunk. These genes are often biological outliers or affected by technical issues.

Global Dispersion Trend

4.2.4 Step 4: Assessing Model Fit with Dispersion Plots

After shrinkage, it’s essential to evaluate how well the model fits the data through dispersion plots. Ideally, most genes should scatter around the fitted curve, with dispersions decreasing as mean expression increases. Unusual patterns, such as points far from the curve, may indicate data quality issues, like outliers or contamination.

Using DESeq2’s plotDispEsts() function visualizes the fit of the model and the extent of shrinkage across genes.

4.2.4.1 Examples of Problematic Dispersion Plots

Certainly! Problematic dispersion plots in RNA-Seq analysis, particularly when using tools like DESeq2, can arise from various issues. Here are some reasons that could lead to visual discrepancies in dispersion plots:

Reasons for Problematic Dispersion Plots

  • Low Count Data:

  • High Variability: Genes with low expression levels often exhibit greater variability (higher dispersion) across replicates, leading to noisy estimates.

  • Unstable Estimates: With few counts, it’s difficult to obtain reliable dispersion estimates, resulting in significant scattering in the plot.

  • Outliers in the Data:

    • Biological Outliers: Certain genes may genuinely exhibit high variability due to biological conditions (e.g., activation in response to stimuli), leading to higher dispersion estimates.
    • Technical Outliers: Poor quality samples or technical errors in sequencing (e.g., contamination or failed PCR) can result in erroneous high counts for certain genes, affecting overall variability.
  • Batch Effects:

    • Differences in sequencing runs or sample processing can introduce variability independent of biological differences. If not normalized, batch effects can distort dispersion estimates, making it seem like certain genes are more variable than they truly are.
  • Unequal Group Sizes:

    • Uneven distribution of samples across experimental conditions can affect dispersion estimates, particularly if one group has fewer replicates or more variability.
  • Unaccounted Biological Variation:

    • Biological variability among samples or genes that is not captured by the experimental design can affect estimates. For instance, genes involved in complex pathways may show higher dispersion due to interactions.
  • Data Quality Issues:

    • Problems with sample handling, library construction, or sequencing can introduce discrepancies in data that visually manifest as unexpected dispersion.

4.2.5 Assessing and Addressing Issues

To tackle issues identified in problematic dispersion plots:

  • Quality Control: Always perform thorough quality control checks on your raw sequencing data to identify outliers and poorly performing samples.

  • Filtering: Remove low-count genes or outliers that might skew your data before running the analysis.

  • Increasing Replicates: Where possible, aim for more replicates to stabilize dispersion estimates, particularly for low-expressed genes.

  • Model Evaluation: Consider exploring different statistical models if patterns suggest that the negative binomial model is not a good fit.

  • Visual Inspection: Regularly visualize data through various methods (e.g., PCA, heatmaps, dispersion plots) to check for quality and consistency.

By understanding the potential causes for problematic dispersion plots, you can troubleshoot and improve your RNA-Seq analysis to yield more reliable results. If you have specific questions or scenarios you want to discuss, feel free to ask!

Here are examples of dispersion plots suggesting potential data quality or fitting issues:

Bad Dispersion Plot 1

Bad Dispersion Plot 2

4.2.6 MOV10 Differential Expression Analysis: Exploring Dispersion Estimates

Now, let’s explore the dispersion estimates for the MOV10 dataset:

## Plot dispersion estimates
plotDispEsts(dds)

Since we have a small sample size, for many genes we see quite a bit of shrinkage. Do you think our data are a good fit for the model?