Chapter 1 Introduction

1.1 Quick start

The package can be installed from GitHub:

install.packages("devtools")
devtools::install_github("jiananwehi/CMDdemux")

Load the package into R with

library(CMDdemux)

1.2 Method description

The CMDdemux workflow includes three major steps: (1) within-cell centered log-ratio (CLR) normalization, (2) K-medoids clustering, and (3) Mahalanobis distance (MD)-based classification.

Step 1: Within-cell CLR normalization
This step is also referred to as local CLR normalization. First, hashtag counts of each cell are normalized by their library size. Then, the geometric mean of these normalized counts is calculated. Next, each normalized count is divided by the geometric mean and log-transformed to obtain the final CLR-normalized values.

Step 2: K-medoids clustering
K-medoids clustering is performed in the local CLR space. In most cases, the number of clusters (k) equals the number of hashtags, although additional clusters may occasionally be required. Core and non-core cells are defined based on their Euclidean distance to the cluster centroid: core cells have distances smaller than a defined threshold, while non-core cells have larger distances. Defining core and non-core cells prevents non-core cells from biasing the estimation of cluster-based covariance. Each cluster is then assigned to its corresponding hashtag.

Step 3: Mahalanobis distance-based classification
The hashtag-hashtag covariance matrix is calculated using only core cells. The Mahalanobis distance (MD) is then computed using the CLR-normalized values (from Step 1), the cluster centroids (from Step 2), and the covariance matrix. In the initial demultiplexing step, outlier cells are identified based on a threshold derived from the MD distribution. Non-outlier cells are classified as singlets and assigned to the hashtag with the minimum MD value. Outlier cells are further distinguished as negatives or doublets based on their hashtag library sizes.

Figure 1.1: Schematic workflow of CMDdemux.

1.3 Hashing data types

Cell hashing experiments can sometimes produce low-quality data. For high-quality data, we expect each cell to be predominantly labelled by one hashtag, and the distributions of all hashtags to be similar to each other — typically showing a peak at larger values corresponding to true signals and a smaller peak representing background noise. Uneven labelling is a common feature of low-quality cell hashing data. This type of data can be divided into weak-signal and strong-signal cases: cells may either be insufficiently labelled by hashtags (weak signals) or excessively labelled (over-strong signals). In such cases, the affected hashtags exhibit skewed distributions compared to other standard hashtags. Contamination occurs when cells are attached to multiple hashtags, resulting in intermediate signal levels for several hashtags. This typically produces a secondary peak between the true-signal and noise peaks, referred to as contamination signals. Empty droplets are not inherently low-quality data; however, if their number is large, they can distort the expected number of cell clusters in the dataset, and thus are often considered low-quality in practice. Low-input is another type of low-quality data, where cells are bound by only a limited amount of hashtag and show generally low expression levels. In this document, we aim to demonstrate how to use CMDdemux to demultiplex both high- and low-quality datasets.

Figure 1.2: A summary of hashing data type.

1.4 Example data

Data	Quality	Number of hashtags	Number of cells	Technology	Tissue/Cell type	Gene expression data	Low quality type	Publication	Data source
Human brain	High-quality	8	2754	Single nucleus hashing	Human brain cortex	✓	NA	(Gaublomme et al. 2019)	https://hub.docker.com/r/regevlab/demuxem
BAL		8	24804	TotalSeq antibodies	Paediatric bronchoalveolar lavage fluid	✗		(Howitt et al. 2023)	https://zenodo.org/records/8304003
Vehicle mouse		3	7403	TotalSeq antibodies	Vehicle treated mouse immune cells	✓		(Virassamy et al. 2023)	https://zenodo.org/records/7325111
Treated mouse	Low-quality	3	4436	TotalSeq antibodies	Ipatasertib treated mouse immune cells	✗ (Available upon requests)	Uneven labelling with weak signals	(Virassamy et al. 2023)	https://zenodo.org/records/16976382
OT		4	9405	TotalSeq antibodies	Ovarian carcinoma	✓	Uneven labelling with strong signals	(Hippen et al. 2023)	GSE217517
EMBRYO MULTI-Seq LMO		12	9745	MULTI-Seq lipid modified oligo (LMO)	Mouse E18.5 brain	✓	Uneven labelling with unlabelled hash	(Brown et al. 2024)	https://zenodo.org/records/11134439
EMBRYO MULTI-Seq CMO		12	19494	Custom MULTI-Seq cholesterol modified oligo (CMO)	Mouse E18.5 brain	✓	Contamination
PDX CellPlex		4	4327	CellPlex	Nuclei from human ovarian carcinosarcoma patient-derived xenograft (PDX)	✓	Uneven labelling with strong signals / Low input
PDX Hashtag Ab		4	7071	TotalSeq antibodies		✓	Low-input
PDX MULTI-Seq CMO		4	4178	Custom MULTI-Seq CMO		✓	Uneven labelling with strong signals / Low input
PBMC		8	21522	Antibodies	Peripheral blood mononuclear cells (PBMC)	✓	Empty droplets	(Stoeckius et al. 2018)	GSE108313

References

Brown, Daniel V, Casey JA Anttila, Ling Ling, Patrick Grave, Tracey M Baldwin, Ryan Munnings, Anthony J Farchione, et al. 2024. “A Risk-Reward Examination of Sample Multiplexing Reagents for Single Cell RNA-Seq.” Genomics 116 (2): 110793.

Gaublomme, Jellert T, Bo Li, Cristin McCabe, Abigail Knecht, Yiming Yang, Eugene Drokhlyansky, Nicholas Van Wittenberghe, et al. 2019. “Nuclei Multiplexing with Barcoded Antibodies for Single-Nucleus Genomics.” Nature Communications 10 (1): 2907.

Hippen, Ariel A, Dalia K Omran, Lukas M Weber, Euihye Jung, Ronny Drapkin, Jennifer A Doherty, Stephanie C Hicks, and Casey S Greene. 2023. “Performance of Computational Algorithms to Deconvolve Heterogeneous Bulk Ovarian Tumor Tissue Depends on Experimental Factors.” Genome Biology 24 (1): 239.

Howitt, George, Yuzhou Feng, Lucas Tobar, Dane Vassiliadis, Peter Hickey, Mark A Dawson, Sarath Ranganathan, et al. 2023. “Benchmarking Single-Cell Hashtag Oligo Demultiplexing Methods.” NAR Genomics and Bioinformatics 5 (4): lqad086.

Stoeckius, Marlon, Shiwei Zheng, Brian Houck-Loomis, Stephanie Hao, Bertrand Z Yeung, William M Mauck, Peter Smibert, and Rahul Satija. 2018. “Cell Hashing with Barcoded Antibodies Enables Multiplexing and Doublet Detection for Single Cell Genomics.” Genome Biology 19: 1–12.

Virassamy, Balaji, Franco Caramia, Peter Savas, Sneha Sant, Jianan Wang, Susan N Christo, Ann Byrne, et al. 2023. “Intratumoral CD8+ t Cells with a Tissue-Resident Memory Phenotype Mediate Local Immunity and Immune Checkpoint Responses in Breast Cancer.” Cancer Cell 41 (3): 585–601.