Chapter 10 Batch effect removal

A batch effect is the accumulation of non-biological differences between groups of samples in an experiment. Many factors can contribute to it, including sample storage conditions, preservation protocols, cDNA synthesis, washing conditions and environmental conditions. These technical variations can cause significant differences between batches and distort the biological analysis.

Careful experimental design can minimize these factors. However, if the study cohort yields very large samples or a large number of samples, or if the study spans a time interval, the RNA-Seq procedure may take days, months or even years. The technical factors will change over that period, and their effect, i.e. the batch effect, will have to be removed. Some normalizations, such as the logarithm of TPM, RPKM/FPKM, TMM or RLE values, can correct differences in the overall expression distribution of each sample. However, they do not correct the batch-related effects on the expression of individual genes that remain once the coverage of each sample has been taken into account.
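To make the last point concrete, here is a minimal sketch with made-up numbers (the matrix, the batch labels and the doubling of gene1 are all assumptions for illustration, not data from this pipeline). Library-size scaling to counts per million is used as a simple stand-in for the normalizations mentioned above, and the gene-level difference between batches is still there after scaling:

```r
# Toy counts: 4 genes x 4 samples; s1-s2 belong to batch A, s3-s4 to batch B.
# Hypothetical values: gene1 is doubled in batch B, the other genes are unchanged.
counts <- matrix(
  c(100, 100, 200, 200,
    100, 100, 100, 100,
    100, 100, 100, 100,
    100, 100, 100, 100),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:4))
)
batch <- factor(c("A", "A", "B", "B"))

# Counts per million: rescale every sample by its library size
cpm <- t(t(counts) / colSums(counts)) * 1e6
log_cpm <- log2(cpm + 1)

# Mean log-CPM of each gene within each batch, and the B - A difference
batch_means <- t(apply(log_cpm, 1, function(x) tapply(x, batch, mean)))
batch_shift <- batch_means[, "B"] - batch_means[, "A"]
print(round(batch_shift, 2))
```

In this toy example gene1 keeps a positive shift in batch B and the remaining genes are pushed slightly downward, so the scaled values still differ between batches gene by gene.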

Several methods have therefore been developed to address this type of technical bias. When the batch to which each sample belongs is known, one of the most widely used methods is ComBat. When the batches are unknown, SVASeq and RUVSeq can be used. However, differential expression programs such as edgeR and DESeq2 require raw, that is, untransformed, counts and suggest including batch variables as covariates in the model. The data transformed by the conventional methods therefore cannot be used as input to edgeR or DESeq2. In 2020, the tool ComBat-seq was developed. It is based on the original ComBat method but is focused on RNA-Seq data. The adjusted data are returned as counts, which can be used in differential expression and other downstream analyses.
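The idea of including the batch as a covariate can be sketched with base R alone. The sample sheet below is hypothetical (its column names simply mirror the Batch and Tissue variables used later in this chapter), and the resulting matrix is the kind of design that model-based differential expression tools work with:

```r
# Hypothetical sample sheet: 6 samples, 2 batches, 2 tissues
samples <- data.frame(
  Batch  = factor(c("b1", "b1", "b1", "b2", "b2", "b2")),
  Tissue = factor(c("liver", "liver", "brain", "liver", "brain", "brain"))
)

# Batch enters the model as a covariate; Tissue is the effect of interest
design <- model.matrix(~ Batch + Tissue, data = samples)
print(design)

# The design must be full rank: batch and tissue must not be confounded
stopifnot(qr(design)$rank == ncol(design))
```

If every tissue had been processed in a single batch, the rank check would fail, and no method could separate the two effects.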

In this pipeline, we will remove the batch effect from the normalized data and also from the raw counts.

10.1 Removing the batch effect from normalized counts

Let’s load the necessary packages:

library(sva)
library(SummarizedExperiment)

Next, let’s remove the batch effect with the ComBat method:

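# gse.TMM is the TMM-normalized object saved in the previous step
# (it is loaded from gse_tmm.RData in section 10.2 if it is not in the session)
# Normalized expression matrix (genes x samples) and sample annotation
mat <- assay(gse.TMM, "abundance")
colAnnotation <- colData(gse.TMM)
# Design matrix holding the biological variable to preserve
modcombat <- model.matrix(~Tissue, data = colAnnotation)
# Adjust for the known batches while keeping the tissue effect
mat <- ComBat(dat = mat,
              batch = colAnnotation$Batch,
              mod = modcombat)
# Store the adjusted matrix in a copy of the original object
gse.TMM.withoutBE <- gse.TMM
assay(gse.TMM.withoutBE, "abundance") <- mat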

save(gse.TMM.withoutBE, file = "gse_withoutBatchEffect_TMM.RData")
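A quick way to judge whether the adjustment had the intended effect is to ask how much of the leading principal components is explained by the batch. The helper below is a hypothetical sketch, not part of sva: the name batch_r2 and the simulated matrix are assumptions, and it uses only base R.

```r
# Fraction of the variance of each leading principal component that is
# explained by batch (R^2 of a linear model of the PC scores on batch)
batch_r2 <- function(mat, batch, n_pcs = 2) {
  pca <- prcomp(t(mat), center = TRUE, scale. = FALSE)  # samples as rows
  n_pcs <- min(n_pcs, ncol(pca$x))
  sapply(seq_len(n_pcs), function(i) {
    summary(lm(pca$x[, i] ~ batch))$r.squared
  })
}

# Simulated example: 50 genes x 10 samples, with a shift added to batch b2
set.seed(1)
batch <- factor(rep(c("b1", "b2"), each = 5))
toy <- matrix(rnorm(50 * 10), nrow = 50)
toy_shifted <- toy
toy_shifted[, batch == "b2"] <- toy_shifted[, batch == "b2"] + 3

r2_shifted <- batch_r2(toy_shifted, batch)  # PC1 is dominated by batch
r2_clean   <- batch_r2(toy, batch)          # no systematic batch signal
print(r2_shifted)
print(r2_clean)
```

With the objects from this section, the same helper could be applied as batch_r2(assay(gse.TMM, "abundance"), gse.TMM$Batch) before the adjustment and to gse.TMM.withoutBE afterwards. A clear drop in the values suggests that the batch signal has been reduced.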

10.2 Removing the batch effect from raw counts

Let’s load the necessary packages:

library(sva)
library(SummarizedExperiment)
library(magrittr)

Next, let’s set the working directory, load the normalized object, and extract the count matrix together with the batch and tissue annotations:

setwd("~/PreProcSEQ-main/8-batchEffect_removal/")
load("../7-normalizationCounts/tmm/gse_tmm.RData")

# Raw counts (genes x samples) and the per-sample annotations
count_matrix <- assay(gse.TMM, "counts")
batch <- gse.TMM$Batch
Tissue <- gse.TMM$Tissue

Now let’s remove the batch effect from the raw counts with ComBat_seq:

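# Encode tissue as indicator (dummy) columns so that it is treated as a
# categorical variable; the intercept column is dropped, since covar_mod
# should hold only the covariates
Tissue <- Tissue %>% as.factor()
covar_mat <- model.matrix(~Tissue)[, -1, drop = FALSE]
# Adjust the counts for batch while preserving the tissue signal
adjusted_counts <- ComBat_seq(count_matrix, batch = batch, group = NULL,
                              covar_mod = covar_mat)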
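How a categorical covariate is encoded matters for the model that is fitted. The sketch below uses a hypothetical three-level tissue vector (not the tissues of this dataset) to contrast a single numeric column with indicator columns:

```r
# Hypothetical tissue labels with three levels
tissue <- factor(c("brain", "liver", "lung", "liver", "brain", "lung"))

# Numeric coding: a single column that implies brain < liver < lung
# with equal spacing between the levels
numeric_coding <- as.matrix(as.numeric(tissue))

# Indicator coding: one 0/1 column per non-reference level
indicator_coding <- model.matrix(~ tissue)[, -1, drop = FALSE]

print(numeric_coding)
print(indicator_coding)
```

With only two levels the two encodings carry the same information, but from three levels onward the numeric column forces an ordering on the tissues that the indicator columns avoid.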

# Compare the range of the counts before and after the adjustment
range(count_matrix)
range(adjusted_counts)

# Store the adjusted counts in a copy of the original object and save it
gse_adjusted <- gse.TMM
assay(gse_adjusted, "counts") <- adjusted_counts
save(gse_adjusted, file = "gse_wthoutBatchEffect_counts.RData")
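Beyond comparing ranges, a few extra checks can confirm that the adjusted matrix still looks like count data. The helper below is a hypothetical sketch in base R (the name compare_counts and the toy matrices are assumptions for illustration):

```r
# Basic sanity checks on an adjusted count matrix against the raw one
compare_counts <- function(raw, adjusted) {
  stopifnot(identical(dim(raw), dim(adjusted)))
  list(
    non_negative   = all(adjusted >= 0),                 # counts cannot be negative
    integer_valued = all(adjusted == round(adjusted)),   # counts stay whole numbers
    libsize_ratio  = colSums(adjusted) / colSums(raw)    # per-sample change in depth
  )
}

# Toy matrices: 3 genes x 2 samples
raw_toy      <- matrix(c(10, 0, 5, 20, 4, 8), nrow = 3)
adjusted_toy <- matrix(c(12, 0, 4, 18, 5, 8), nrow = 3)
res <- compare_counts(raw_toy, adjusted_toy)
str(res)
```

With the objects from this section, the call would be compare_counts(count_matrix, adjusted_counts). Library-size ratios far from 1 would be worth inspecting before moving on to differential expression.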