Chapter 1 Introduction

This is an introduction to using RUV-III-NB to remove unwanted variation from single-cell RNA sequencing data (Salim et al. 2022). The first section, RUV-III-NB model, briefly explains the statistical methodology behind RUV-III-NB. Readers who are not interested in the methodology can move straight to Preliminary settings.

1.1 RUV-III-NB model

RUV-III-NB models the raw count for gene \(g\) and cell \(c\), \(y_{gc}\), as either a negative binomial (NB) random variable, \(y_{gc} \sim NB(\mu_{gc},\psi_g)\), or a zero-inflated negative binomial (ZINB) random variable. For simplicity, in what follows we describe the NB model for UMI data.

Let \(\boldsymbol y_g=(y_{g1},y_{g2},\ldots,y_{gN})^T\) be the vector of counts for gene \(g\) and \(\mu_g\) be the corresponding vector of mean parameters. We further assume that the \(N\) cells contain \(m \mbox{<} N\) groups of cells with the same underlying biology, which we refer to as pseudo-replicates. Using a generalized linear model with a log link function to model the mean parameters as a function of the unknown unwanted factors \(\mathbf W\) and the underlying biology represented by the replicate matrix \(\mathbf M\), we have

\[\begin{equation} \log \mu_g = \boldsymbol \zeta_g + \mathbf M\beta_g + \mathbf W\alpha_g, \label{eq:NBGLM} \end{equation}\]

where \(\mathbf M (N \times m)\) is the pseudo-replicate membership matrix, with \(\mathbf M(i,j)=1\) if the \(i\)th cell belongs to the \(j\)th pseudo-replicate and 0 otherwise; \(\beta_g(m \times 1) \sim N(0,\lambda^{-1}_\beta I_m)\) is the vector of biological parameters, with a unique value for each of the \(m\) pseudo-replicates; \(\mathbf W (N \times k)\) is the matrix of \(k\) unknown unwanted factors; \(\alpha_g (k \times 1) \sim N(\alpha_\mu,\lambda^{-1}_\alpha I_k)\) is the vector of regression coefficients associated with the unwanted factors; and \(\zeta_g\) is the location parameter for gene \(g\) after adjusting for the unwanted factors.
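To make the pieces of the model concrete, the following is a minimal simulation sketch for a single gene in base R. All dimensions and parameter values are illustrative assumptions, the pseudo-replicate labels are taken as known, and the dispersion is assumed to enter through \(\mathrm{Var}(y_{gc}) = \mu_{gc} + \psi_g \mu_{gc}^2\); none of this is taken from the ruvIIInb package itself.

```r
set.seed(1)

# Hypothetical toy dimensions: N cells, m pseudo-replicates, k unwanted factors
N <- 12; m <- 3; k <- 1

# Pseudo-replicate labels (assumed known) and the N x m membership matrix M
group <- factor(rep(seq_len(m), each = N / m))
M <- model.matrix(~ 0 + group)   # M[i, j] = 1 if cell i is in pseudo-replicate j

# Unwanted factor matrix W (N x k); here just a simulated covariate
W <- matrix(rnorm(N * k), nrow = N, ncol = k)

# Parameters for one gene g (illustrative values only)
zeta_g  <- 1.5                            # location parameter
beta_g  <- rnorm(m, mean = 0, sd = 0.5)   # biology, one value per pseudo-replicate
alpha_g <- rnorm(k, mean = 1, sd = 0.2)   # effect of the unwanted factor
psi_g   <- 0.1                            # dispersion (assumed var = mu + psi * mu^2)

# Log-link mean: log(mu_g) = zeta_g + M beta_g + W alpha_g
log_mu_g <- as.vector(zeta_g + M %*% beta_g + W %*% alpha_g)
mu_g <- exp(log_mu_g)

# Simulate counts; rnbinom's 'size' is the inverse of the dispersion
y_g <- rnbinom(N, mu = mu_g, size = 1 / psi_g)
```

Each row of `M` contains exactly one 1, so every cell belongs to exactly one pseudo-replicate and shares its `beta_g` entry with the other cells in that group.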

Unwanted variation due to factors such as library size and batch effects is captured by the \(\mathbf W\) matrix. For example, when \(k=1\) and the single column of \(\mathbf W\) is approximately equal (up to a multiplicative constant) to the log library size (LS), then \(\mu_g \propto (LS)^{\alpha_g}\), which allows a possibly non-linear, gene-specific relationship between library size and raw count.
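The library-size example can be written out explicitly. As a sketch, write \(LS_c\) for the library size of cell \(c\) and suppose the single unwanted factor is \(W_c = a \log LS_c\) for some constant \(a \neq 0\) (both symbols are introduced here for illustration only). Substituting into the model gives

\[\begin{aligned} \log \mu_{gc} &= \zeta_g + (\mathbf M\beta_g)_c + a\,\alpha_g \log LS_c, \\ \mu_{gc} &= \exp\{\zeta_g + (\mathbf M\beta_g)_c\}\,(LS_c)^{a\,\alpha_g}. \end{aligned}\]

Within a pseudo-replicate the first factor is constant, so the mean count is a power function of library size. With \(a=1\) and \(\alpha_g=1\) the mean scales linearly with library size, whereas \(\alpha_g \neq 1\) gives a gene-specific, non-linear relationship.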

To estimate the unknown unwanted factors and the regression coefficients associated with the unwanted factors and the biology, we use an iteratively reweighted least squares (IRLS) algorithm (see the Supplementary Methods of Salim et al. 2022). Negative control genes (genes for which \(\beta_g \approx 0\)) are used to estimate the unknown unwanted factors \(\mathbf W\), and the pseudo-replicates are used to estimate the gene-specific effects of the unwanted factors on the sequencing counts (\(\alpha_g\)).
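For readers unfamiliar with IRLS, the sketch below shows the generic reweighting step for a single NB regression with a log link, a fully known design matrix, and a fixed dispersion. It is not the RUV-III-NB algorithm, which must additionally estimate \(\mathbf W\) and the dispersions; the helper `irls_nb` and the simulated data are hypothetical, and the variance is assumed to be \(\mu + \psi\mu^2\).

```r
# Generic IRLS for an NB log-link GLM with known dispersion psi.
# Hypothetical helper for illustration; not the ruvIIInb implementation.
irls_nb <- function(y, X, psi, max_iter = 100, tol = 1e-10) {
  mu   <- y + 0.5                  # crude starting values for the means
  eta  <- log(mu)                  # linear predictor
  coef <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    w <- mu / (1 + psi * mu)       # working weights: (dmu/deta)^2 / Var(y)
    z <- eta + (y - mu) / mu       # working response
    # Weighted least squares update: solve (X' W X) b = X' W z
    coef_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    eta <- drop(X %*% coef_new)
    mu  <- exp(eta)
    converged <- max(abs(coef_new - coef)) < tol
    coef <- coef_new
    if (converged) break
  }
  list(coef = coef, mu = mu, iterations = iter)
}

# Toy data: intercept plus one known covariate (illustrative values)
set.seed(2)
n   <- 200
X   <- cbind(1, rnorm(n))
psi <- 0.2
y   <- rnbinom(n, mu = exp(drop(X %*% c(1, 0.5))), size = 1 / psi)

fit <- irls_nb(y, X, psi)
fit$coef
```

At convergence the weighted score \(X^T\{(y-\mu)/(1+\psi\mu)\}\) is approximately zero, which is a convenient check that the iteration has reached the maximum likelihood estimate for the fixed dispersion.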

Once the parameters of the RUV-III-NB model are estimated, the effects of the unwanted factors are removed from the data. We provide two forms of adjusted data. The first is the log of the percentile-invariant adjusted count (log PAC), which is suitable for Unique Molecular Identifier (UMI) data. The second is the Pearson residual, which is suitable for read count data without UMIs. Both forms of adjusted data can be used for downstream analyses such as clustering, trajectory and differential expression analyses, with the log PAC recommended for downstream differential expression analysis.
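As a reminder of what a Pearson residual is, the sketch below computes the textbook NB version, assuming \(\mathrm{Var}(y) = \mu + \psi\mu^2\). The helper `pearson_resid_nb` and the input values are hypothetical; which terms enter the fitted mean in RUV-III-NB is specified in Salim et al. (2022) and is not reproduced here.

```r
# Textbook NB Pearson residual: (observed - fitted) / sqrt(fitted variance).
# Hypothetical helper; assumes Var(y) = mu + psi * mu^2.
pearson_resid_nb <- function(y, mu, psi) {
  (y - mu) / sqrt(mu + psi * mu^2)
}

# Illustrative counts, fitted means and dispersion for three cells of one gene
y   <- c(0, 3, 10)
mu  <- c(1, 4, 8)
psi <- 0.25

res <- pearson_resid_nb(y, mu, psi)
res
```

Counts below their fitted mean give negative residuals and counts above it give positive ones, with the NB variance putting genes of different abundance on a comparable scale.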

1.2 Preliminary settings

After completing the preliminary set-up described in this section, readers can move on to either the Application to Non-Small Cell Lung Cancer Cells (NSCLC) Data section or the Application to cell-line data section.

The RUV-III-NB package can be installed from GitHub as shown below.

#Install ruvIIInb package

Along with the RUV-III-NB package, we will also load the following packages required for pre-processing, visualising and downstream analyses.



Salim, Agus, Ramyar Molania, Jianan Wang, Alysha De Livera, Rachel Thijssen, and Terence P Speed. 2022. “RUV-III-NB: Normalization of Single Cell RNA-Seq Data.” Nucleic Acids Research.