Chapter 2 Application to Non-Small Cell Lung Cancer Cells (NSCLC) Data

2.1 NSCLC dataset

The Non-Small Cell Lung Cancer Cells (NSCLC) dataset is freely available from the 10x Genomics website (10x Genomics Website 2022). The cells have been sequenced in one batch, so the dataset is not expected to contain batch effects. However, we need to remove library size effects from the sequencing data, allowing for different cell sizes. This dataset is expected to contain large epithelial cells and smaller immune cells (e.g T cells). Hence, the between-cells differences in library size consist of both biological and technical components. We will use RUV-III-NB to identify and remove the unwanted (technical) library size effects, while preserving the biological variation of interest, and demonstrate carrying out downstream analyses with the adjusted data.

2.2 Data Preprocessing

Prior to going through this section, make sure that the preliminary set-up described in section Preliminary settings has been completed.

We have prepared the NSCLC dataset and its metadata in two R objects that can be loaded using the following commands.

nsclc_obs<-readRDS(system.file("extdata", "nsclc_obs.rds", package="ruvIIInb"))
nsclc_samples<-readRDS(system.file("extdata", "nsclc_samples.rds", package="ruvIIInb"))

The nsclc_obs object contains raw sequencing count with 33660 genes in rows and 7802 cells in columns, and the nsclc_samples object contains cell-specific metadata. We combine these two objects onto a SingleCellExperiment object that will enable us to take advantage of facilities in packages such as scuttle and scater when pre-procesing the data.

sce <- SingleCellExperiment(assays=list(counts=nsclc_obs),
                            colData=as.data.frame(nsclc_samples))

We now carry out quality control (QC) steps to filer low quality cells and low abundant genes using Bioconductor package scuttle.

2.2.1 Filtering low quality cells

The cell-level quality control metrics such as the number of genes that have non-zero counts and the percentage of counts that comes from Mitochondrial genes for each cell can be computed and added to the SingleCellExperiment object as follows:

sce <- addPerCellQCMetrics(x = sce,subsets=list(Mito=grep("MT-",rownames(sce))))

The metrics calculated above may then be used to identify potential outliers. We recommend visualising the distribution of cells which contain a high proportion of cell-level counts for Mitochondrial genes as well as those which have low cell-level total counts, that are a certain number of median absolute deviations (MADs) away from the median value.

libsize_drop <- isOutlier(
  metric = sce$total,
  nmads = 2,
  type = "lower",
  log = TRUE)
colData(sce)$libsize_drop<-libsize_drop


mito_drop <- isOutlier(
  metric = colData(sce)$subsets_Mito_percent,
  nmads = 3,
  type = "higher")
colData(sce)$mito_drop<-mito_drop

Figures 2.1 and 2.2 visualise the distribution of cells that are flagged this way.

plot_df <- data.frame(logtotal=log(sce$total),
                      libsize_drop=factor(libsize_drop),
                      mito_drop=factor(mito_drop),
                      logdetected=log(sce$detected))
plot_df$mito_drop<-relevel(plot_df$mito_drop, "TRUE")
plot_df$libsize_drop<-relevel(plot_df$libsize_drop, "TRUE")

#A histogram showing the distribution of flagged cells
p <- plot_df %>%
  ggplot( aes(x=logtotal, fill=libsize_drop)) +
  geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity',bins=30) +
  scale_fill_manual(values=c( "#69b3a2","#404080")) +
  theme_ipsum() +
  labs(fill="") +
  xlab("Cell level log total count")+
  ylab("Frequency")
p+ggtitle('The distribution of cell-level log total counts\n  flagging cells with low library size')

Figure 2.1: A histogram showing the distribution of log cell level total counts flagging those with low library size

#A plot showing the number of detected genes vs total cell count
p2<- plot_df %>%
  arrange(desc(mito_drop)) %>%
  ggplot( aes(x=logtotal,y=logdetected, fill=mito_drop)) +
  geom_point(
    mapping = aes(colour = mito_drop, shape = libsize_drop),size = 2,alpha = 5 / 6) +     scale_color_manual(values=c("#69b3a2","#404080"))+
  xlab("Log total cell-level count")+ylab("Log number of cell-level detected genes")
p2+ggtitle('The cell-level number of detected genes vs total cell-level count (on log scale)')

Figure 2.2: A scatterplot showing cell-level log number of detected genes against log total count

2.2.2 Filtering low abundant genes

Similarly, we recommend computing gene-level quality control metrics such as the mean count across all cells for each gene, and the percentage of cells with non-zero counts for each gene. The code below adds these measures to the SingleCellExperiment object and flags genes with a low mean cell count (see Figure 2.3).

sce <- addPerFeatureQCMetrics(x = sce)

#Remove genes with zero counts for each gene
sce<-subset(sce, rowData(sce)$mean>0 )

# detect low abundant genes
lowcount_drop <-log(rowData(sce)$mean)< -5

#mean count of genes across all cells
plot_df2 <- data.frame(mean_genecount=log(rowData(sce)$mean), lowcount_drop=factor(lowcount_drop))
p <- plot_df2 %>%
  arrange(desc(lowcount_drop)) %>%
  ggplot( aes(x=mean_genecount, fill=lowcount_drop)) +
  geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity',bins=50) +
  scale_fill_manual(values=c("#404080", "#69b3a2")) +
  theme_ipsum() +
  labs(fill="") +xlab("Log mean count across all cells")+ylab("Frequency")
p+ggtitle('The distribution of log mean count across\n all cells, flagging those with a low mean count')

Figure 2.3: A histogram showing the distribution of gene level log mean counts

Once the quality control metrics have been calculated, we retain only the selected genes and cells.

sce <- sce[!(lowcount_drop), !(libsize_drop | mito_drop)]

Following quality control, the count data matrix consists of 14585 genes in rows and 6382 cells in columns.

2.3 Normalising the data

Following the preliminary steps described above, the pre-processed NSCLC dataset consists of 14585 genes in rows and 6382 cells in columns.

2.3.0.1 Using scHK as negative control genes

RUV-III-NB uses single-cell housekeeping genes (Lin et al. 2019) as the default negative control genes. The negative control genes play a vital role in estimating the unknown unwanted factors. There are two single-cell housekeeping genesets in the RUV-III-NB package. Here, we will use the set for Human (Homo Sapiens), available as part of the package.

# Reading in the control genes
data(Hs.schk)
ctl <- as.character(Hs.schk)
# Creating a logical vector to idetify control genes
rowData(sce)$ctlLogical<-rownames(assays(sce)$counts) %in% ctl

2.3.0.2 Identification of pseudo-replicates

The pseudo-replicates are cells with similar biology (Salim et al. 2022). To define suitable pseudo-replicates, we need to first define the biological factor of interest. In this study, the biological factor of interest is the cell-type, however cell-type information is unavailable. Our strategy here is to perform clustering using scran-normalized data and identify cells of the same cluster as pseudo-replicates. This simple cluster-based approach can be used for data that come from one batch. When there are multiple batches, a more sophisticated approach is needed (see the section [Application to cell-line data]).

# Perform initial clustering to identify pseudo-replicates
sce <- computeSumFactors(sce,assay.type="counts")
data_norm_pre <- sweep(assays(sce)$counts,2,sce$sizeFactor,'/')
assays(sce, withDimnames=FALSE)$lognormcounts<- log(data_norm_pre+1)
snn_gr_init <- buildSNNGraph(sce, assay.type = "lognormcounts")
clusters_init <- igraph::cluster_louvain(snn_gr_init)
sce$cluster_init <- factor(clusters_init$membership)

sce <- runUMAP(sce,exprs_values = "lognormcounts")
sce <- runPCA(sce,exprs_values = "lognormcounts")

Figure 2.4 visualises the initial clusters for identifying pseudo-replicates.

dittoDimPlot(sce, "cluster_init")

Figure 2.4: Visualising the initial clusters for identifying pseudo-replicates

Figure 2.5 indicates a potential association between biology (cell type) and an unwanted factor (in this case library size). This potential association will be taken into account when RUV-III-NB estimates and subsequently removes the variation due to the unwanted factors.

sce$logLS <- log(colSums(assays(sce)$counts))
plotUMAP(sce,colour_by='logLS')

Figure 2.5: UMAP plot coloured by log library size

2.3.0.3 Running RUV-III-NB

The code shown below performs RUV-III-NB normalisation.

# Construct the replicate matrix M using pseudo-replicates identified using initial clustering
M <- matrix(0,ncol(assays(sce)$counts),length(unique(sce$cluster_init)))
cl<- sort(unique(as.numeric(unique(sce$cluster_init))))
for(CL in cl)
  M[which(as.numeric(sce$cluster_init)==CL),CL] <- 1


#RUV-III-NB code
ruv3nb_out<-fastruvIII.nb(Y=DelayedArray(assays(sce)$counts), # count matrix with genes as rows and cells as columns
                          M=M, #Replicate matrix constructed as above
                          ctl=rowData(sce)$ctlLogical, #A vector denoting control genes
                          k=2, # dimension of unwanted variation factors
                          ncores = 6
)

After carrying out the RUV-III-NB normalisation, we can explore the estimated unwanted variation component which are stored in the W component of the output. In this example, we can see that the first component of unwanted variation is highly correlated with the log library size (see Fifure 2.6).

W1data<-data.frame(W1=-ruv3nb_out$W[,1],
                   W2=ruv3nb_out$W[,2],
                   LogLS=colData(sce)$logLS)
p1<-ggplot(W1data, aes(LogLS, W1)) +
  geom_point(aes(colour = LogLS)) +
  scale_colour_gradient(low = "yellow", high = "dark green")+ theme(legend.position = "none")
p2<-ggplot(W1data, aes(W2, W1)) +
  geom_point(aes(colour = LogLS)) +
  scale_colour_gradient(low = "yellow", high = "dark green")
gridExtra::grid.arrange(p1,p2,ncol=2)

Figure 2.6: Scatterplots showing the log library size vs components of unwanted variation

2.3.0.4 Running RUV-III-NB with pseudo-cells to strengthen pseudo-replicates

In the example above, the pseudo-replicates were identified through clustering. As an optional step, we can strengthen the pseudo-replicates by adding pseudo-cells to define pseudo-replicates. The pseudo-cells are generated using pool-and-divide strategy (Salim et al. 2022), and can be done using use.pseudosample=TRUE option as shown below.

ruv3nb_out_ps<-fastruvIII.nb(Y=DelayedArray(assays(sce)$counts),
                             M=M,
                             ctl=rowData(sce)$ctlLogical,
                             k=2,
                             use.pseudosample=TRUE, #Indicating whether pseudo-cells are to be used to to define pseudo-replicates
                             batch=rep(1,dim(sce)[2])# NSCLC data comes from a single batch
)

2.4 Downstream analyses using adjusted data

After estimating the parameters, either with or without use.pseudosample=TRUE, the RUV-III-NB output can be used to a produce a SingleCellExperiment object that will hold the adjusted data in the logPAC and Pearson components of the assay slot.

#Creating a SingleCellExperiment object
sce_ruv3nb <- makeSCE(ruv3nb_out,cData=colData(sce))

2.4.1 Clustering and annotation

Once created, the SingleCellExperiment (SCE) object that holds the adjusted data can be used for downstream analyses. Here we are presenting the codes for dimension reduction, clustering, and annotation of the clusters and cells using the log of percentile-invariant adjusted count.

# PCA of RUV-III-NB log PAC data
sce_ruv3nb <- runPCA(sce_ruv3nb, exprs_values = "logPAC")

#UMAP
sce_ruv3nb <- runUMAP(sce_ruv3nb, exprs_values = "logPAC")

# PCA visualization
p1 <- plotPCA(sce, ncomponents=2,colour_by = "logLS",point_alpha=0.5, point_size=1) + ggtitle('Scran normalized')+ theme(legend.position = "none")

p2 <- plotPCA(sce_ruv3nb, ncomponents=2,
              colour_by = "logLS",point_alpha=0.5, point_size=1) +  ggtitle('RUV-III-NB log PAC')

gridExtra::grid.arrange(p1,p2,ncol=2)

Figure 2.7: Visualising PCA following Scran and RUV-III-NB normalisation

pu1<-plotUMAP(sce,colour_by='logLS')+ggtitle('Scran normalised')
pu2 <- plotUMAP(sce_ruv3nb,colour_by='logLS')+ggtitle('RUV-III-NB log PAC')
gridExtra::grid.arrange(pu1,pu2,ncol=2)

Figure 2.8: UMAP following Scran and RUV-III-NB normalisations

# Perform clustering
snn_gr <- buildSNNGraph(sce_ruv3nb, assay.type = "logPAC")
clusters <- igraph::cluster_louvain(snn_gr)
sce_ruv3nb$cluster_ruv <- factor(clusters$membership)

# Identify cell-type for each cluster
ref_se <- HumanPrimaryCellAtlasData()

# annotate cluster
pred10xnsclc_ruv  <- SingleR(test = sce_ruv3nb, ref = ref_se,
                             labels = ref_se$label.fine,
                             assay.type.test='logPAC',
                             clusters=factor(sce_ruv3nb$cluster_ruv))
sce_ruv3nb$ctype_ruv <- pred10xnsclc_ruv$labels[sce_ruv3nb$cluster_ruv]

# anotate cells
pred10xnsclc_ruv2  <- SingleR(test = sce_ruv3nb, ref = ref_se,
                              labels=ref_se$label.fine,
                              assay.type.test='logPAC')
sce_ruv3nb$ctype_ruv2 <- pred10xnsclc_ruv2$labels

dittoDimPlot(sce_ruv3nb, "ctype_ruv" )

Figure 2.9: Visualising the clusters following RUV-III-NB normalisation

2.4.2 Identification of differentially expressed genes

The log PAC adjusted data can also be used to identify differentially-expressed genes. Here, we provide an example on how to use the adjusted data for finding cluster-specific marker genes using scran package. The code below demonstrates the identification of top differentially expressed genes for a specified cluster using RUV-III-NB adjusted data and visualising the log fold-changes in Figure 2.10.

markers_ruv3nb <- findMarkers(x=as.matrix(assays(sce_ruv3nb)$logPAC),
                              groups=sce_ruv3nb$ctype_ruv)#identify a combination of marker genes that together defines one cluster against the rest.
cluster3_markers <- markers_ruv3nb[[3]] #Cluster 3 is used here for demonstration purposes.
cluster3_top5 <- cluster3_markers[cluster3_markers $Top <= 5,] #Collects the top differentially expressed genes from each pairwise comparison involving cluster 3
logFCs_cluster3_top5 <- getMarkerEffects(cluster3_top5) #Extracts log fold changes

pheatmap(logFCs_cluster3_top5, breaks=seq(-5, 5, length.out=101))

Figure 2.10: A heatmap of the differentially expressed log fold changes

2.4.3 Integration with Seurat

For seamless integration with the Seurat R package, the SingleCellExperiment object can be converted to a Seurat object using the code below and the functions in the Seurat package can then be used.

seurat_ruv3nb <- as.Seurat(sce_ruv3nb, counts = "counts", data ="logPAC")

References

10x Genomics Website. 2022. https://www.10xgenomics.com/resources/datasets/nsclc-tumor-5-gene-expression-1-standard-2-2-0.

Lin, Yingxin, Shila Ghazanfar, Dario Strbenac, Andy Wang, Ellis Patrick, David M Lin, Terence Speed, Jean YH Yang, and Pengyi Yang. 2019. “Evaluating Stably Expressed Genes in Single Cells.” GigaScience 8 (9): giz106.

Salim, Agus, Ramyar Molania, Jianan Wang, Alysha De Livera, Rachel Thijssen, and Terence P Speed. 2022. “RUV-III-NB: Normalization of Single Cell RNA-Seq Data.” Nucleic Acids Research. https://doi.org/10.1093/nar/gkac486.