Chapter 2 Application to Non-Small Cell Lung Cancer Cells (NSCLC) Data
2.1 NSCLC dataset
The Non-Small Cell Lung Cancer Cells (NSCLC) dataset is freely available from the 10x Genomics website (10x Genomics Website 2022). The cells have been sequenced in one batch, so the dataset is not expected to contain batch effects. However, we need to remove library size effects from the sequencing data, allowing for different cell sizes. This dataset is expected to contain large epithelial cells and smaller immune cells (e.g T cells). Hence, the between-cells differences in library size consist of both biological and technical components. We will use RUV-III-NB to identify and remove the unwanted (technical) library size effects, while preserving the biological variation of interest, and demonstrate carrying out downstream analyses with the adjusted data.
2.2 Data Preprocessing
Prior to going through this section, make sure that the preliminary set-up described in section Preliminary settings has been completed.
We have prepared the NSCLC dataset and its metadata in two R objects that can be loaded using the following commands.
<-readRDS(system.file("extdata", "nsclc_obs.rds", package="ruvIIInb"))
nsclc_obs<-readRDS(system.file("extdata", "nsclc_samples.rds", package="ruvIIInb")) nsclc_samples
The nsclc_obs
object contains raw sequencing count with 33660 genes in rows and 7802 cells in columns, and the nsclc_samples
object contains cell-specific metadata. We combine these two objects onto a SingleCellExperiment
object that will enable us to take advantage of facilities in packages such as scuttle
and scater
when pre-procesing the data.
<- SingleCellExperiment(assays=list(counts=nsclc_obs),
sce colData=as.data.frame(nsclc_samples))
We now carry out quality control (QC) steps to filer low quality cells and low abundant genes using Bioconductor package scuttle
.
2.2.1 Filtering low quality cells
The cell-level quality control metrics such as the number of genes that have non-zero counts and the percentage of counts that comes from Mitochondrial genes for each cell can be computed and added to the SingleCellExperiment
object as follows:
<- addPerCellQCMetrics(x = sce,subsets=list(Mito=grep("MT-",rownames(sce)))) sce
The metrics calculated above may then be used to identify potential outliers. We recommend visualising the distribution of cells which contain a high proportion of cell-level counts for Mitochondrial genes as well as those which have low cell-level total counts, that are a certain number of median absolute deviations (MADs) away from the median value.
<- isOutlier(
libsize_drop metric = sce$total,
nmads = 2,
type = "lower",
log = TRUE)
colData(sce)$libsize_drop<-libsize_drop
<- isOutlier(
mito_drop metric = colData(sce)$subsets_Mito_percent,
nmads = 3,
type = "higher")
colData(sce)$mito_drop<-mito_drop
Figures 2.1 and 2.2 visualise the distribution of cells that are flagged this way.
<- data.frame(logtotal=log(sce$total),
plot_df libsize_drop=factor(libsize_drop),
mito_drop=factor(mito_drop),
logdetected=log(sce$detected))
$mito_drop<-relevel(plot_df$mito_drop, "TRUE")
plot_df$libsize_drop<-relevel(plot_df$libsize_drop, "TRUE")
plot_df
#A histogram showing the distribution of flagged cells
<- plot_df %>%
p ggplot( aes(x=logtotal, fill=libsize_drop)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity',bins=30) +
scale_fill_manual(values=c( "#69b3a2","#404080")) +
theme_ipsum() +
labs(fill="") +
xlab("Cell level log total count")+
ylab("Frequency")
+ggtitle('The distribution of cell-level log total counts\n flagging cells with low library size') p
#A plot showing the number of detected genes vs total cell count
<- plot_df %>%
p2arrange(desc(mito_drop)) %>%
ggplot( aes(x=logtotal,y=logdetected, fill=mito_drop)) +
geom_point(
mapping = aes(colour = mito_drop, shape = libsize_drop),size = 2,alpha = 5 / 6) + scale_color_manual(values=c("#69b3a2","#404080"))+
xlab("Log total cell-level count")+ylab("Log number of cell-level detected genes")
+ggtitle('The cell-level number of detected genes vs total cell-level count (on log scale)') p2
2.2.2 Filtering low abundant genes
Similarly, we recommend computing gene-level quality control metrics such as the mean count across all cells for each gene, and the percentage of cells with non-zero counts for each gene. The code below adds these measures to the SingleCellExperiment
object and flags genes with a low mean cell count (see Figure 2.3).
<- addPerFeatureQCMetrics(x = sce)
sce
#Remove genes with zero counts for each gene
<-subset(sce, rowData(sce)$mean>0 )
sce
# detect low abundant genes
<-log(rowData(sce)$mean)< -5 lowcount_drop
#mean count of genes across all cells
<- data.frame(mean_genecount=log(rowData(sce)$mean), lowcount_drop=factor(lowcount_drop))
plot_df2 <- plot_df2 %>%
p arrange(desc(lowcount_drop)) %>%
ggplot( aes(x=mean_genecount, fill=lowcount_drop)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity',bins=50) +
scale_fill_manual(values=c("#404080", "#69b3a2")) +
theme_ipsum() +
labs(fill="") +xlab("Log mean count across all cells")+ylab("Frequency")
+ggtitle('The distribution of log mean count across\n all cells, flagging those with a low mean count') p
Once the quality control metrics have been calculated, we retain only the selected genes and cells.
<- sce[!(lowcount_drop), !(libsize_drop | mito_drop)] sce
Following quality control, the count data matrix consists of 14585 genes in rows and 6382 cells in columns.
2.3 Normalising the data
Following the preliminary steps described above, the pre-processed NSCLC dataset consists of 14585 genes in rows and 6382 cells in columns.
2.3.0.1 Using scHK as negative control genes
RUV-III-NB uses single-cell housekeeping genes (Lin et al. 2019) as the default negative control genes. The negative control genes play a vital role in estimating the unknown unwanted factors. There are two single-cell housekeeping genesets in the RUV-III-NB package. Here, we will use the set for Human (Homo Sapiens), available as part of the package.
# Reading in the control genes
data(Hs.schk)
<- as.character(Hs.schk)
ctl # Creating a logical vector to idetify control genes
rowData(sce)$ctlLogical<-rownames(assays(sce)$counts) %in% ctl
2.3.0.2 Identification of pseudo-replicates
The pseudo-replicates are cells with similar biology (Salim et al. 2022). To define suitable pseudo-replicates, we need to first define the biological factor of interest. In this study, the biological factor of interest is the cell-type, however cell-type information is unavailable. Our strategy here is to perform clustering using scran-normalized data and identify cells of the same cluster as pseudo-replicates. This simple cluster-based approach can be used for data that come from one batch. When there are multiple batches, a more sophisticated approach is needed (see the section [Application to cell-line data]).
# Perform initial clustering to identify pseudo-replicates
<- computeSumFactors(sce,assay.type="counts")
sce <- sweep(assays(sce)$counts,2,sce$sizeFactor,'/')
data_norm_pre assays(sce, withDimnames=FALSE)$lognormcounts<- log(data_norm_pre+1)
<- buildSNNGraph(sce, assay.type = "lognormcounts")
snn_gr_init <- igraph::cluster_louvain(snn_gr_init)
clusters_init $cluster_init <- factor(clusters_init$membership)
sce
<- runUMAP(sce,exprs_values = "lognormcounts")
sce <- runPCA(sce,exprs_values = "lognormcounts") sce
Figure 2.4 visualises the initial clusters for identifying pseudo-replicates.
dittoDimPlot(sce, "cluster_init")
Figure 2.5 indicates a potential association between biology (cell type) and an unwanted factor (in this case library size). This potential association will be taken into account when RUV-III-NB estimates and subsequently removes the variation due to the unwanted factors.
$logLS <- log(colSums(assays(sce)$counts))
sceplotUMAP(sce,colour_by='logLS')
2.3.0.3 Running RUV-III-NB
The code shown below performs RUV-III-NB normalisation.
# Construct the replicate matrix M using pseudo-replicates identified using initial clustering
<- matrix(0,ncol(assays(sce)$counts),length(unique(sce$cluster_init)))
M <- sort(unique(as.numeric(unique(sce$cluster_init))))
clfor(CL in cl)
which(as.numeric(sce$cluster_init)==CL),CL] <- 1
M[
#RUV-III-NB code
<-fastruvIII.nb(Y=DelayedArray(assays(sce)$counts), # count matrix with genes as rows and cells as columns
ruv3nb_outM=M, #Replicate matrix constructed as above
ctl=rowData(sce)$ctlLogical, #A vector denoting control genes
k=2, # dimension of unwanted variation factors
ncores = 6
)
After carrying out the RUV-III-NB normalisation, we can explore the estimated unwanted variation component which are stored in the W
component of the output. In this example, we can see that the first component of unwanted variation is highly correlated with the log library size (see Fifure 2.6).
<-data.frame(W1=-ruv3nb_out$W[,1],
W1dataW2=ruv3nb_out$W[,2],
LogLS=colData(sce)$logLS)
<-ggplot(W1data, aes(LogLS, W1)) +
p1geom_point(aes(colour = LogLS)) +
scale_colour_gradient(low = "yellow", high = "dark green")+ theme(legend.position = "none")
<-ggplot(W1data, aes(W2, W1)) +
p2geom_point(aes(colour = LogLS)) +
scale_colour_gradient(low = "yellow", high = "dark green")
::grid.arrange(p1,p2,ncol=2) gridExtra
2.3.0.4 Running RUV-III-NB with pseudo-cells to strengthen pseudo-replicates
In the example above, the pseudo-replicates were identified through clustering. As an optional step, we can strengthen the pseudo-replicates by adding pseudo-cells to define pseudo-replicates. The pseudo-cells are generated using pool-and-divide strategy (Salim et al. 2022), and can be done using use.pseudosample=TRUE
option as shown below.
<-fastruvIII.nb(Y=DelayedArray(assays(sce)$counts),
ruv3nb_out_psM=M,
ctl=rowData(sce)$ctlLogical,
k=2,
use.pseudosample=TRUE, #Indicating whether pseudo-cells are to be used to to define pseudo-replicates
batch=rep(1,dim(sce)[2])# NSCLC data comes from a single batch
)
2.4 Downstream analyses using adjusted data
After estimating the parameters, either with or without use.pseudosample=TRUE
, the RUV-III-NB output can be used to a produce a SingleCellExperiment
object that will hold the adjusted data in the logPAC
and Pearson
components of the assay
slot.
#Creating a SingleCellExperiment object
<- makeSCE(ruv3nb_out,cData=colData(sce)) sce_ruv3nb
2.4.1 Clustering and annotation
Once created, the SingleCellExperiment (SCE) object that holds the adjusted data can be used for downstream analyses. Here we are presenting the codes for dimension reduction, clustering, and annotation of the clusters and cells using the log of percentile-invariant adjusted count.
# PCA of RUV-III-NB log PAC data
<- runPCA(sce_ruv3nb, exprs_values = "logPAC")
sce_ruv3nb
#UMAP
<- runUMAP(sce_ruv3nb, exprs_values = "logPAC") sce_ruv3nb
# PCA visualization
<- plotPCA(sce, ncomponents=2,colour_by = "logLS",point_alpha=0.5, point_size=1) + ggtitle('Scran normalized')+ theme(legend.position = "none")
p1
<- plotPCA(sce_ruv3nb, ncomponents=2,
p2 colour_by = "logLS",point_alpha=0.5, point_size=1) + ggtitle('RUV-III-NB log PAC')
::grid.arrange(p1,p2,ncol=2) gridExtra
<-plotUMAP(sce,colour_by='logLS')+ggtitle('Scran normalised')
pu1<- plotUMAP(sce_ruv3nb,colour_by='logLS')+ggtitle('RUV-III-NB log PAC')
pu2 ::grid.arrange(pu1,pu2,ncol=2) gridExtra
# Perform clustering
<- buildSNNGraph(sce_ruv3nb, assay.type = "logPAC")
snn_gr <- igraph::cluster_louvain(snn_gr)
clusters $cluster_ruv <- factor(clusters$membership)
sce_ruv3nb
# Identify cell-type for each cluster
<- HumanPrimaryCellAtlasData()
ref_se
# annotate cluster
<- SingleR(test = sce_ruv3nb, ref = ref_se,
pred10xnsclc_ruv labels = ref_se$label.fine,
assay.type.test='logPAC',
clusters=factor(sce_ruv3nb$cluster_ruv))
$ctype_ruv <- pred10xnsclc_ruv$labels[sce_ruv3nb$cluster_ruv]
sce_ruv3nb
# anotate cells
<- SingleR(test = sce_ruv3nb, ref = ref_se,
pred10xnsclc_ruv2 labels=ref_se$label.fine,
assay.type.test='logPAC')
$ctype_ruv2 <- pred10xnsclc_ruv2$labels sce_ruv3nb
dittoDimPlot(sce_ruv3nb, "ctype_ruv" )
2.4.2 Identification of differentially expressed genes
The log PAC adjusted data can also be used to identify differentially-expressed genes. Here, we provide an example on how to use the adjusted data for finding cluster-specific marker genes using scran
package. The code below demonstrates the identification of top differentially expressed genes for a specified cluster using RUV-III-NB adjusted data and visualising the log fold-changes in Figure 2.10.
<- findMarkers(x=as.matrix(assays(sce_ruv3nb)$logPAC),
markers_ruv3nb groups=sce_ruv3nb$ctype_ruv)#identify a combination of marker genes that together defines one cluster against the rest.
<- markers_ruv3nb[[3]] #Cluster 3 is used here for demonstration purposes.
cluster3_markers <- cluster3_markers[cluster3_markers $Top <= 5,] #Collects the top differentially expressed genes from each pairwise comparison involving cluster 3
cluster3_top5 <- getMarkerEffects(cluster3_top5) #Extracts log fold changes logFCs_cluster3_top5
pheatmap(logFCs_cluster3_top5, breaks=seq(-5, 5, length.out=101))
2.4.3 Integration with Seurat
For seamless integration with the Seurat
R package, the SingleCellExperiment
object can be converted to a Seurat
object using the code below and the functions in the Seurat
package can then be used.
<- as.Seurat(sce_ruv3nb, counts = "counts", data ="logPAC") seurat_ruv3nb