Chapter 3 BAL

This dataset is high-quality (Howitt et al. 2023) and includes 24,804 cells from 8 pediatric bronchoalveolar lavage (BAL) fluid donors. Cells were labelled using TotalSeq-A antibodies. Only Batch 1 data is shown here. The gene expression data are not available.

The subset of hash count data is shown below:

balb1.hash.count[,1:10]

##       X1_AAACCCACACTTCCTG.1 X1_AAACCCACAGACAAAT.1 X1_AAACCCACAGGACGAT.1 X1_AAACCCACATCCTAAG.1 X1_AAACCCAGTAACATCC.1
## BAL-A                    18                    15                     9                     8                    18
## BAL-B                    15                    12                    17                    28                    26
## BAL-C                    12                   202                    11                    20                    14
## BAL-D                   284                     1                   239                   219                     9
## BAL-E                    11                     5                     7                    12                   746
## BAL-F                     4                     6                     3                    13                    11
## BAL-G                     8                     8                     4                     7                    11
## BAL-H                    17                    10                    15                    21                    27
##       X1_AAACCCAGTACAGTTC.1 X1_AAACCCAGTACTGTTG.1 X1_AAACCCAGTGTCACAT.1 X1_AAACCCAGTTCTCAGA.1 X1_AAACCCAGTTGTCATG.1
## BAL-A                    28                    15                    22                     7                   296
## BAL-B                    20                  1424                    26                   336                    27
## BAL-C                    11                     7                    11                     8                    11
## BAL-D                     7                     3                    19                     2                     1
## BAL-E                    10                    13                    48                     3                     7
## BAL-F                     5                     8                    10                     1                     2
## BAL-G                   768                     6                     9                     2                    10
## BAL-H                    20                    19                   776                     9                    14

3.1 Local CLR normalization

The first step of CMDdemux is to normalize the hash count data using a local CLR normalization.

balb1.clr.norm <- LocalCLRNorm(balb1.hash.count)
balb1.clr.norm[,1:10]

##       X1_AAACCCACACTTCCTG.1 X1_AAACCCACAGACAAAT.1 X1_AAACCCACAGGACGAT.1 X1_AAACCCACATCCTAAG.1 X1_AAACCCAGTAACATCC.1
## BAL-A          0.0548880871            0.31300363            -0.3231980          -0.849920557           -0.32668529
## BAL-B         -0.1169621698            0.10536427             0.2645886           0.320150696            0.02471260
## BAL-C         -0.3246015346            2.85362089            -0.1408765          -0.002622696           -0.56307407
## BAL-D          2.7629382882           -1.76643791             2.8548558           2.346482412           -0.96853917
## BAL-E         -0.4046442423           -0.66782562            -0.5463416          -0.482195777            3.34494092
## BAL-F         -1.2801129796           -0.51367494            -1.2394888          -0.408087805           -0.78621762
## BAL-G         -0.6923263147           -0.26236051            -1.0163452          -0.967703593           -0.78621762
## BAL-H          0.0008208658           -0.06168982             0.1468056           0.043897319            0.06108024
##       X1_AAACCCAGTACAGTTC.1 X1_AAACCCAGTACTGTTG.1 X1_AAACCCAGTGTCACAT.1 X1_AAACCCAGTTCTCAGA.1 X1_AAACCCAGTTGTCATG.1
## BAL-A            0.26036650           -0.13718328           -0.25946966          -0.005058491            3.13273342
## BAL-B           -0.06240689            4.35215509           -0.09912701           3.735582898            0.77120579
## BAL-C           -0.62202268           -0.83033046           -0.91005723           0.112724545           -0.07609207
## BAL-D           -1.02748778           -1.52347764           -0.39923160          -0.985887744           -1.86785154
## BAL-E           -0.70903405           -0.27071468            0.49685642          -0.698205672           -0.48155718
## BAL-F           -1.31516986           -0.71254743           -0.99706860          -1.391352852           -1.46238643
## BAL-G            3.53816164           -0.96386186           -1.09237878          -0.985887744           -0.16310345
## BAL-H           -0.06240689            0.08596027            3.26047647           0.218085060            0.14705148

The output of LocalCLRNorm is a normalized data matrix with rows representing hashtag samples and columns representing cells. Each entry is a normalized value.

3.2 K-medoids clustering

For high-quality data, we recommend that users use the default parameters directly. The number of clusters equal to the number of hashtags is sufficient for clustering, which is 8 clusters for this dataset.

balb1.kmed.cl <- KmedCluster(balb1.clr.norm)

We then used a boxplot to check whether 8 clusters are appropriate for this data.

CheckCLRBoxPlot(balb1.clr.norm, balb1.kmed.cl)

It is quite apparent that each hashtag is uniquely and highly expressed in a single cluster. Therefore, 8 clusters are sufficient for this dataset.

Next, we calculate the Euclidean distance of all cells to their respective cluster centroids.

balb1.cl.dist <- EuclideanClusterDist(balb1.clr.norm, balb1.kmed.cl)

3.3 Definition of non-core cells

The distribution of these distances is then examined using EuclideanDistPlot. The cut-off for each cluster is determined by the eu_cut_q parameter, which represents the quantile of the distribution. We recommend that users select a value in the tail of the distribution and adjust the parameter by visualizing the red line in each plot.

EuclideanDistPlot(balb1.cl.dist, eu_cut_q = c(0.79, 0.85, 0.81, 0.78, 0.83, 0.89, 0.84, 0.84))

The Euclidean distances within each cluster are bimodally distributed. The peak with smaller distances corresponds to core cells, which are closer to the cluster centroid, while the peak with larger distances corresponds to non-core cells, which are further away. The cut-off for each cluster is chosen at the anti-mode of the distribution.

The non-core cells are defined by the DefineNonCore function, using the Euclidean distance output from EuclideanClusterDist, the k-medoids clustering output from KmedCluster, and the cut-offs examined from EuclideanDistPlot.

balb1.noncore <- DefineNonCore(balb1.cl.dist, balb1.kmed.cl, eu_cut_q = c(0.79, 0.85, 0.81, 0.78, 0.83, 0.89, 0.84, 0.84))
table(balb1.noncore)

## balb1.noncore
## Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6 Cluster7 Cluster8 non-core 
##     2202     3436     3044     2490     2912     2575     1712     2166     4267

Core cells are labelled according to their cluster, while non-core cells are labelled as “non-core.”

The classified core and non-core cells are then examined using CLRPlot.

CLRPlot(balb1.clr.norm, balb1.kmed.cl, balb1.noncore)

The plots above show pairwise CLR-normalized values within each cell. The cluster centroids are indicated by black cross symbols. In these plots, non-core cells are located farther from their cluster centroids, while core cells are closer, as expected.

3.4 Label clusters by sample

For each cluster, its sample of origin can be assigned using LabelClusterHTO with CLR-normalized values from LocalCLRNorm, k-medoids clustering from KmedCluster, and non-core cells from DefineNonCore. Generally, the default value of the label_method parameter works for most datasets. If it does not, the alternative value “expression” is recommended.

balb1.cluster.assign <- LabelClusterHTO(balb1.clr.norm, balb1.kmed.cl, balb1.noncore, label_method = "medoids")
balb1.cluster.assign

## BAL-A BAL-B BAL-C BAL-D BAL-E BAL-F BAL-G BAL-H 
##     7     5     2     1     3     8     4     6

The output shows each hashtag sample and its corresponding cluster index.

3.5 Computing the Mahalanobis distance

The Mahalanobis distance for each single cell can be calculated via CalculateMD.

balb1.md.mat <- CalculateMD(balb1.clr.norm, balb1.noncore, balb1.kmed.cl, balb1.cluster.assign)
balb1.md.mat[,1:10]

##       X1_AAACCCACACTTCCTG.1 X1_AAACCCACAGACAAAT.1 X1_AAACCCACAGGACGAT.1 X1_AAACCCACATCCTAAG.1 X1_AAACCCAGTAACATCC.1
## BAL-A             10.927815             10.237574             11.667071             12.028096             12.168967
## BAL-B             11.479211             11.179391             10.945886             10.323254             12.297034
## BAL-C             11.159726              2.413062             10.980606             10.231049             12.005398
## BAL-D              1.707717             10.272518              1.882647              2.478202             10.470651
## BAL-E             11.687767             12.055713             12.129021             11.577098              1.381756
## BAL-F             10.986433              9.444398             11.074167              9.340607             10.610735
## BAL-G             10.288890              9.141810             10.859263             10.242215             10.917122
## BAL-H             10.780809             10.398264             10.657356             10.287044             11.569898
##       X1_AAACCCAGTACAGTTC.1 X1_AAACCCAGTACTGTTG.1 X1_AAACCCAGTGTCACAT.1 X1_AAACCCAGTTCTCAGA.1 X1_AAACCCAGTTGTCATG.1
## BAL-A             11.466232             12.320324             11.732997             11.209385              2.405890
## BAL-B             12.601233              1.717611             12.149781              1.681982             10.589038
## BAL-C             12.236479             12.525331             12.392677             10.207126             10.570715
## BAL-D             10.489353             11.107960              9.083088              9.470056             10.250325
## BAL-E             12.819532             12.833288             10.682070             12.691715             11.697810
## BAL-F             11.282977             10.774135             10.714846             10.820999             10.845493
## BAL-G              2.321026             11.743999             11.326663             10.822682              9.465207
## BAL-H             11.674930             11.861854              3.475674             10.625543             10.571685

The output is a Mahalanobis distance matrix. Each row of the matrix represents a hashtag sample, and each column represents a cell. Each entry of the matrix represents the Mahalanobis distance of the cell to the sample.

3.6 Detect outlier cells

Outlier cells are negatives and doublets. They are identified based on the distribution of the minimum Mahalanobis distance across all cells, which can be visualized using MDDensPlot. The cut-off is determined by the md_cut_q parameter, representing the quantile of the distribution. The red line in the plot indicates the cut-off. We recommend that users adjust the md_cut_q parameter to find an optimal cut-off between singlets and outliers.

MDDensPlot(balb1.md.mat, md_cut_q = 0.84)

The Mahalanobis distances are clearly bimodally distributed. The peak with smaller Mahalanobis distances corresponds to singlets, while the peak with larger distances corresponds to outlier cells. The cut-off should be set at the anti-mode of this distribution.

The selected cut-off can then be used to define outlier cells via AssignOutlierDrop.

balb1.outlier.assign <- AssignOutlierDrop(balb1.md.mat, md_cut_q = 0.84)
table(balb1.outlier.assign)

## balb1.outlier.assign
## Outlier Singlet 
##    3969   20835

The output of AssignOutlierDrop assigns each cell as either “Outlier” or “Singlet”.”

3.7 Demultiplexing

The HTO library sizes are used to further distinguish negatives and doublets within the outlier cells. The distribution of HTO library sizes can be visualized using OutlierHTOPlot. The library sizes are generally multimodally distributed, and the num_modes parameter specifies the number of modes in the distribution.

OutlierHTOPlot(balb1.hash.count, balb1.outlier.assign, num_modes = 4)

In this dataset, the HTO library sizes for outlier cells are bimodally distributed. The mode with smaller library sizes corresponds to potential negatives, while the mode with larger library sizes corresponds to potential doublets. The red lines represent the modes and anti-modes of the distribution. The red line with index 4 is an appropriate cut-off for distinguishing negatives and doublets among the outlier cells.

balb1.cmddemux.assign <- CMDdemuxClass(balb1.md.mat, balb1.hash.count, balb1.outlier.assign, num_modes = 4, cut_no = 4)
head(balb1.cmddemux.assign)

##                       demux_id global_class demux_global_class doublet_class
## X1_AAACCCACACTTCCTG.1    BAL-D      Singlet              BAL-D         BAL-D
## X1_AAACCCACAGACAAAT.1    BAL-C      Singlet              BAL-C         BAL-C
## X1_AAACCCACAGGACGAT.1    BAL-D      Singlet              BAL-D         BAL-D
## X1_AAACCCACATCCTAAG.1    BAL-D      Singlet              BAL-D         BAL-D
## X1_AAACCCAGTAACATCC.1    BAL-E      Singlet              BAL-E         BAL-E
## X1_AAACCCAGTACAGTTC.1    BAL-G      Singlet              BAL-G         BAL-G

Since gene expression data is not available for this dataset, only cell hashing data is used for demultiplexing. The num_modes parameter can be determined using OutlierHTOPlot, and the cut_no parameter is selected based on the index of the appropriate red line in the plot.

CMDdemux provides four types of outputs:
demux_id: the identity for each cell, determined by the sample with the minimum Mahalanobis distance for that cell.
global_class: classification of each cell as Singlet, Doublet, or Negative.
demux_global_class: singlet cells are labeled with their original sample, while negatives and doublets are labeled accordingly.
doublet_class: similar to demux_global_class, but doublets are labeled with their sample of origin. This is particularly useful for combined-hashtag experiments.

table(balb1.cmddemux.assign$demux_global_class)

## 
##    BAL-A    BAL-B    BAL-C    BAL-D    BAL-E    BAL-F    BAL-G    BAL-H  Doublet Negative 
##     1737     2883     3520     2282     3076     2176     2611     2550     3071      898

The output above shows the number of cells demultiplexed into each category.

References

Howitt, George, Yuzhou Feng, Lucas Tobar, Dane Vassiliadis, Peter Hickey, Mark A Dawson, Sarath Ranganathan, et al. 2023. “Benchmarking Single-Cell Hashtag Oligo Demultiplexing Methods.” NAR Genomics and Bioinformatics 5 (4): lqad086.