The Non-Small Cell Lung Cancer Cells (NSCLC) dataset is freely available from the 10x Genomics website (10x Genomics Website 2022). The cells have been sequenced in one batch, so the dataset is not expected to contain batch effects. However, we need to remove library size effects from the sequencing data, allowing for different cell sizes. This dataset is expected to contain large epithelial cells and smaller immune cells (e.g T cells). Hence, the between-cells differences in library size consist of both biological and technical components. We will use RUV-III-NB to identify and remove the unwanted (technical) library size effects, while preserving the biological variation of interest, and demonstrate carrying out downstream analyses with the adjusted data.
Prior to going through this section, make sure that the preliminary set-up described in section Preliminary settings has been completed.
We have prepared the NSCLC dataset and its metadata in two R objects that can be loaded using the following commands.
<-readRDS(system.file("extdata", "nsclc_obs.rds", package="ruvIIInb")) nsclc_obs<-readRDS(system.file("extdata", "nsclc_samples.rds", package="ruvIIInb"))nsclc_samples
nsclc_obs object contains raw sequencing count with 33660 genes in rows and 7802 cells in columns, and the
nsclc_samples object contains cell-specific metadata. We combine these two objects onto a
SingleCellExperiment object that will enable us to take advantage of facilities in packages such as
scater when pre-procesing the data.
<- SingleCellExperiment(assays=list(counts=nsclc_obs), sce colData=as.data.frame(nsclc_samples))
We now carry out quality control (QC) steps to filer low quality cells and low abundant genes using Bioconductor package
The cell-level quality control metrics such as the number of genes that have non-zero counts and the percentage of counts that comes from Mitochondrial genes for each cell can be computed and added to the
SingleCellExperiment object as follows:
<- addPerCellQCMetrics(x = sce,subsets=list(Mito=grep("MT-",rownames(sce))))sce
The metrics calculated above may then be used to identify potential outliers. We recommend visualising the distribution of cells which contain a high proportion of cell-level counts for Mitochondrial genes as well as those which have low cell-level total counts, that are a certain number of median absolute deviations (MADs) away from the median value.
<- isOutlier( libsize_drop metric = sce$total, nmads = 2, type = "lower", log = TRUE) colData(sce)$libsize_drop<-libsize_drop <- isOutlier( mito_drop metric = colData(sce)$subsets_Mito_percent, nmads = 3, type = "higher") colData(sce)$mito_drop<-mito_drop
<- data.frame(logtotal=log(sce$total), plot_df libsize_drop=factor(libsize_drop), mito_drop=factor(mito_drop), logdetected=log(sce$detected)) $mito_drop<-relevel(plot_df$mito_drop, "TRUE") plot_df$libsize_drop<-relevel(plot_df$libsize_drop, "TRUE") plot_df #A histogram showing the distribution of flagged cells <- plot_df %>% p ggplot( aes(x=logtotal, fill=libsize_drop)) + geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity',bins=30) + scale_fill_manual(values=c( "#69b3a2","#404080")) + theme_ipsum() + labs(fill="") + xlab("Cell level log total count")+ ylab("Frequency") +ggtitle('The distribution of cell-level log total counts\n flagging cells with low library size')p
#A plot showing the number of detected genes vs total cell count <- plot_df %>% p2arrange(desc(mito_drop)) %>% ggplot( aes(x=logtotal,y=logdetected, fill=mito_drop)) + geom_point( mapping = aes(colour = mito_drop, shape = libsize_drop),size = 2,alpha = 5 / 6) + scale_color_manual(values=c("#69b3a2","#404080"))+ xlab("Log total cell-level count")+ylab("Log number of cell-level detected genes") +ggtitle('The cell-level number of detected genes vs total cell-level count (on log scale)')p2
Similarly, we recommend computing gene-level quality control metrics such as the mean count across all cells for each gene, and the percentage of cells with non-zero counts for each gene. The code below adds these measures to the
SingleCellExperiment object and flags genes with a low mean cell count (see Figure 2.3).
<- addPerFeatureQCMetrics(x = sce) sce #Remove genes with zero counts for each gene <-subset(sce, rowData(sce)$mean>0 ) sce # detect low abundant genes <-log(rowData(sce)$mean)< -5lowcount_drop
#mean count of genes across all cells <- data.frame(mean_genecount=log(rowData(sce)$mean), lowcount_drop=factor(lowcount_drop)) plot_df2 <- plot_df2 %>% p arrange(desc(lowcount_drop)) %>% ggplot( aes(x=mean_genecount, fill=lowcount_drop)) + geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity',bins=50) + scale_fill_manual(values=c("#404080", "#69b3a2")) + theme_ipsum() + labs(fill="") +xlab("Log mean count across all cells")+ylab("Frequency") +ggtitle('The distribution of log mean count across\n all cells, flagging those with a low mean count')p
Once the quality control metrics have been calculated, we retain only the selected genes and cells.
<- sce[!(lowcount_drop), !(libsize_drop | mito_drop)]sce
Following quality control, the count data matrix consists of 14585 genes in rows and 6382 cells in columns.
Following the preliminary steps described above, the pre-processed NSCLC dataset consists of 14585 genes in rows and 6382 cells in columns.
RUV-III-NB uses single-cell housekeeping genes (Lin et al. 2019) as the default negative control genes. The negative control genes play a vital role in estimating the unknown unwanted factors. There are two single-cell housekeeping genesets in the RUV-III-NB package. Here, we will use the set for Human (Homo Sapiens), available as part of the package.
# Reading in the control genes data(Hs.schk) <- as.character(Hs.schk) ctl # Creating a logical vector to idetify control genes rowData(sce)$ctlLogical<-rownames(assays(sce)$counts) %in% ctl
The pseudo-replicates are cells with similar biology (Salim et al. 2022). To define suitable pseudo-replicates, we need to first define the biological factor of interest. In this study, the biological factor of interest is the cell-type, however cell-type information is unavailable. Our strategy here is to perform clustering using scran-normalized data and identify cells of the same cluster as pseudo-replicates. This simple cluster-based approach can be used for data that come from one batch. When there are multiple batches, a more sophisticated approach is needed (see the section [Application to cell-line data]).
# Perform initial clustering to identify pseudo-replicates <- computeSumFactors(sce,assay.type="counts") sce <- sweep(assays(sce)$counts,2,sce$sizeFactor,'/') data_norm_pre assays(sce, withDimnames=FALSE)$lognormcounts<- log(data_norm_pre+1) <- buildSNNGraph(sce, assay.type = "lognormcounts") snn_gr_init <- igraph::cluster_louvain(snn_gr_init) clusters_init $cluster_init <- factor(clusters_init$membership) sce <- runUMAP(sce,exprs_values = "lognormcounts") sce <- runPCA(sce,exprs_values = "lognormcounts")sce
Figure 2.4 visualises the initial clusters for identifying pseudo-replicates.
Figure 2.5 indicates a potential association between biology (cell type) and an unwanted factor (in this case library size). This potential association will be taken into account when RUV-III-NB estimates and subsequently removes the variation due to the unwanted factors.
$logLS <- log(colSums(assays(sce)$counts)) sceplotUMAP(sce,colour_by='logLS')
The code shown below performs RUV-III-NB normalisation.
# Construct the replicate matrix M using pseudo-replicates identified using initial clustering <- matrix(0,ncol(assays(sce)$counts),length(unique(sce$cluster_init))) M <- sort(unique(as.numeric(unique(sce$cluster_init)))) clfor(CL in cl) which(as.numeric(sce$cluster_init)==CL),CL] <- 1 M[ #RUV-III-NB code <-fastruvIII.nb(Y=DelayedArray(assays(sce)$counts), # count matrix with genes as rows and cells as columns ruv3nb_outM=M, #Replicate matrix constructed as above ctl=rowData(sce)$ctlLogical, #A vector denoting control genes k=2, # dimension of unwanted variation factors ncores = 6 )
After carrying out the RUV-III-NB normalisation, we can explore the estimated unwanted variation component which are stored in the
W component of the output. In this example, we can see that the first component of unwanted variation is highly correlated with the log library size (see Fifure 2.6).
<-data.frame(W1=-ruv3nb_out$W[,1], W1dataW2=ruv3nb_out$W[,2], LogLS=colData(sce)$logLS) <-ggplot(W1data, aes(LogLS, W1)) + p1geom_point(aes(colour = LogLS)) + scale_colour_gradient(low = "yellow", high = "dark green")+ theme(legend.position = "none") <-ggplot(W1data, aes(W2, W1)) + p2geom_point(aes(colour = LogLS)) + scale_colour_gradient(low = "yellow", high = "dark green") ::grid.arrange(p1,p2,ncol=2)gridExtra
In the example above, the pseudo-replicates were identified through clustering. As an optional step, we can strengthen the pseudo-replicates by adding pseudo-cells to define pseudo-replicates. The pseudo-cells are generated using pool-and-divide strategy (Salim et al. 2022), and can be done using
use.pseudosample=TRUE option as shown below.
<-fastruvIII.nb(Y=DelayedArray(assays(sce)$counts), ruv3nb_out_psM=M, ctl=rowData(sce)$ctlLogical, k=2, use.pseudosample=TRUE, #Indicating whether pseudo-cells are to be used to to define pseudo-replicates batch=rep(1,dim(sce))# NSCLC data comes from a single batch )
After estimating the parameters, either with or without
use.pseudosample=TRUE, the RUV-III-NB output can be used to a produce a
SingleCellExperiment object that will hold the adjusted data in the
Pearson components of the
#Creating a SingleCellExperiment object <- makeSCE(ruv3nb_out,cData=colData(sce))sce_ruv3nb
Once created, the SingleCellExperiment (SCE) object that holds the adjusted data can be used for downstream analyses. Here we are presenting the codes for dimension reduction, clustering, and annotation of the clusters and cells using the log of percentile-invariant adjusted count.
# PCA of RUV-III-NB log PAC data <- runPCA(sce_ruv3nb, exprs_values = "logPAC") sce_ruv3nb #UMAP <- runUMAP(sce_ruv3nb, exprs_values = "logPAC")sce_ruv3nb
# PCA visualization <- plotPCA(sce, ncomponents=2,colour_by = "logLS",point_alpha=0.5, point_size=1) + ggtitle('Scran normalized')+ theme(legend.position = "none") p1 <- plotPCA(sce_ruv3nb, ncomponents=2, p2 colour_by = "logLS",point_alpha=0.5, point_size=1) + ggtitle('RUV-III-NB log PAC') ::grid.arrange(p1,p2,ncol=2)gridExtra
<-plotUMAP(sce,colour_by='logLS')+ggtitle('Scran normalised') pu1<- plotUMAP(sce_ruv3nb,colour_by='logLS')+ggtitle('RUV-III-NB log PAC') pu2 ::grid.arrange(pu1,pu2,ncol=2)gridExtra
# Perform clustering <- buildSNNGraph(sce_ruv3nb, assay.type = "logPAC") snn_gr <- igraph::cluster_louvain(snn_gr) clusters $cluster_ruv <- factor(clusters$membership) sce_ruv3nb # Identify cell-type for each cluster <- HumanPrimaryCellAtlasData() ref_se # annotate cluster <- SingleR(test = sce_ruv3nb, ref = ref_se, pred10xnsclc_ruv labels = ref_se$label.fine, assay.type.test='logPAC', clusters=factor(sce_ruv3nb$cluster_ruv)) $ctype_ruv <- pred10xnsclc_ruv$labels[sce_ruv3nb$cluster_ruv] sce_ruv3nb # anotate cells <- SingleR(test = sce_ruv3nb, ref = ref_se, pred10xnsclc_ruv2 labels=ref_se$label.fine, assay.type.test='logPAC') $ctype_ruv2 <- pred10xnsclc_ruv2$labelssce_ruv3nb
dittoDimPlot(sce_ruv3nb, "ctype_ruv" )
The log PAC adjusted data can also be used to identify differentially-expressed genes. Here, we provide an example on how to use the adjusted data for finding cluster-specific marker genes using
scran package. The code below demonstrates the identification of top differentially expressed genes for a specified cluster using RUV-III-NB adjusted data and visualising the log fold-changes in Figure 2.10.
<- findMarkers(x=as.matrix(assays(sce_ruv3nb)$logPAC), markers_ruv3nb groups=sce_ruv3nb$ctype_ruv)#identify a combination of marker genes that together defines one cluster against the rest. <- markers_ruv3nb[] #Cluster 3 is used here for demonstration purposes. cluster3_markers <- cluster3_markers[cluster3_markers $Top <= 5,] #Collects the top differentially expressed genes from each pairwise comparison involving cluster 3 cluster3_top5 <- getMarkerEffects(cluster3_top5) #Extracts log fold changeslogFCs_cluster3_top5
pheatmap(logFCs_cluster3_top5, breaks=seq(-5, 5, length.out=101))
For seamless integration with the
Seurat R package, the
SingleCellExperiment object can be converted to a
Seurat object using the code below and the functions in the
Seurat package can then be used.
<- as.Seurat(sce_ruv3nb, counts = "counts", data ="logPAC")seurat_ruv3nb