We applied the SpaceX method to two spatial transcriptomics datasets: one from the preoptic region of the mouse hypothalamus (Moffitt et al. 2018) and one from human breast cancer tissue (Ståhl et al. 2016). Section C.1 gives details of the preprocessing and exploratory analysis of both datasets. Section C.2 illustrates the application of the community detection algorithm to the two datasets.
The MERFISH dataset is obtained from the preoptic area of the mouse hypothalamus (Moffitt et al. 2018). The dataset consists of \(160\) genes whose expression is measured at \(4975\) spatial locations. We retain \(7\) pre-determined spatial clusters, named Astrocyte, Endothelial, Ependymal, Excitatory, Inhibitory, Immature, and Mature, with sizes \(724\), \(503\), \(314\), \(1024\), \(1694\), \(168\), and \(385\) respectively. The dataset contains \(2\) more clusters, Microglia and Pericytes, with sizes \(90\) and \(73\) respectively. Because both contain fewer than \(100\) locations, these two clusters are removed from the dataset. After their removal, we have gene expression from \(4812\) locations for \(160\) genes. No gene has more than \(95\%\) zero reads. The left panel of Figure C.1 shows the violin plot of the percentage of zero reads among the genes for each cluster in the MERFISH dataset. The UMAP representation of the MERFISH data is provided in the right panel of Figure C.1.
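The two filtering steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the function names, the list/dict data layout, and the default thresholds as parameters are assumptions made for the example. Only the cluster sizes come from the text.

```python
from collections import Counter

def filter_small_clusters(labels, min_size=100):
    """Return indices of locations whose cluster has at least min_size members."""
    sizes = Counter(labels)
    return [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]

def zero_fraction(counts):
    """Fraction of zero reads in one gene's count vector."""
    return sum(1 for c in counts if c == 0) / len(counts)

def filter_sparse_genes(expr, max_zero_frac=0.95):
    """Keep genes whose zero fraction does not exceed max_zero_frac.

    expr maps a gene name to its list of counts across locations
    (a hypothetical layout chosen for this sketch).
    """
    return {g: v for g, v in expr.items() if zero_fraction(v) <= max_zero_frac}

# Cluster sizes as reported in the text; one label per spatial location.
cluster_sizes = {
    "Astrocyte": 724, "Endothelial": 503, "Ependymal": 314,
    "Excitatory": 1024, "Inhibitory": 1694, "Immature": 168,
    "Mature": 385, "Microglia": 90, "Pericytes": 73,
}
labels = [name for name, n in cluster_sizes.items() for _ in range(n)]
kept = filter_small_clusters(labels)
print(len(labels), len(kept))  # 4975 4812
```

The printed counts reproduce the \(4975\) locations before and \(4812\) locations after removing the two small clusters.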
The human breast cancer dataset contains expression levels of \(5262\) genes measured at \(250\) locations (Ståhl et al. 2016). We use the SPARK method with a \(5\%\) FDR cut-off on p-values to detect \(290\) spatially expressed genes, which we carry forward in our analysis. The violin plot of the percentage of zero reads among the genes for each spatially contiguous cluster in the breast cancer dataset is shown in the left panel of Figure C.2. The right panel of Figure C.2 shows the UMAP representation of the breast cancer data.
Community detection is a downstream analysis of the shared and cluster-specific networks obtained from the SpaceX method. The communities are detected by optimizing modularity over partitions of the network (Brandes et al. 2007). Figures C.3 and C.4 show the detected community modules from the shared and cluster-specific co-expression networks for the MERFISH and breast cancer data respectively.
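To make the objective concrete, the sketch below computes the standard (Newman) modularity score of a given partition of an undirected, unweighted graph. It only evaluates the quantity being optimized; it is not the optimization procedure of Brandes et al. (2007), and the toy graph and the function name are assumptions made for illustration.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping each node to its community label.
    Uses Q = sum_c [ L_c / m - (d_c / (2m))^2 ], where L_c is the number
    of edges inside community c and d_c is its total degree.
    """
    m = len(edges)
    intra = {}   # number of edges with both endpoints in the community
    degree = {}  # sum of node degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_module = {node: "A" for node in range(6)}
q_split = modularity(edges, two_modules)
q_single = modularity(edges, one_module)
print(q_split > q_single)  # True
```

Splitting the toy graph into its two triangles scores higher than placing all nodes in one community, which is the kind of partition a modularity-based method favours.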
In this section, we benchmark our models on two real spatial transcriptomics datasets using model fitting criteria. To this end, we use information-based criteria, a standard and well-established technique for comparing the fits of hierarchical Bayesian models (Gelman, Hwang, and Vehtari 2014). We use two information-criterion-based metrics to assess model fit: (i) the Bayesian analogue of AIC (Akaike 1998), defined as the Bayesian information criterion (BIC; Watanabe 2013); and (ii) the Watanabe-Akaike information criterion (WAIC; Watanabe 2010), an improvement on the AIC and a fully Bayesian measure of model accuracy, computed from the log pointwise posterior predictive density with an added correction for the effective number of parameters to adjust for over-fitting. These criterion-based methods are often used for model selection, specifically for spatial datasets (Banerjee, Wall, and Carlin 2003; Banerjee, Gelfand, and Polasek 2000; Lee and Ghosh 2009). In both cases, lower (relative) values indicate better model fits.
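The WAIC computation described above can be sketched from a matrix of pointwise log-likelihoods evaluated at posterior draws. The sketch assumes the variance-based estimate of the effective number of parameters from Gelman, Hwang, and Vehtari (2014) and the deviance scale (multiplication by \(-2\)). The function names and the list-of-lists input layout are illustrative, not the SpaceX implementation.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def sample_variance(values):
    """Unbiased sample variance (requires at least two values)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def waic(log_lik):
    """WAIC on the deviance scale.

    log_lik[s][i] is log p(y_i | theta_s) for posterior draw s and
    observation i; at least two posterior draws are required.
    """
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise posterior predictive density
    p_waic = 0.0  # effective number of parameters (variance form)
    for i in range(n_obs):
        column = [draw[i] for draw in log_lik]
        lppd += log_mean_exp(column)
        p_waic += sample_variance(column)
    return -2.0 * (lppd - p_waic)

# Two identical draws: no posterior variability, so the penalty is zero.
print(waic([[-1.0, -2.0], [-1.0, -2.0]]))  # 6.0
```

Lower values indicate a better fit, so two models fitted to the same data can be compared by evaluating this quantity on each model's pointwise log-likelihoods.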
Table C.1 shows the BIC and WAIC values of the SpaceX model and the non-spatial Poisson model for both the mouse hypothalamus and breast cancer data. Based on the values in Table C.1, we conclude that the SpaceX model fits both spatial transcriptomics datasets better than the non-spatial Poisson model. For example, for the MERFISH data, the SpaceX model achieves relative gains in model fit of 64.7% and 46.6% over the non-spatial model with respect to BIC and WAIC, respectively. A similar conclusion holds for the breast cancer data, where the relative gains are 66.4% and 45.5% for BIC and WAIC, respectively.
| Model | BIC (MERFISH) | WAIC (MERFISH) | BIC (Breast cancer) | WAIC (Breast cancer) |
|---|---|---|---|---|
| Non-spatial Poisson model | 38274 | 82045 | 72523 | 99474 |
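
The relative gains quoted for Table C.1 can be read as percentage reductions of a lower-is-better criterion relative to the non-spatial baseline. The text does not state the formula explicitly, so the definition below is an assumption, and the numbers in the usage line are toy values rather than entries from the table.

```python
def relative_gain(baseline, model):
    """Percentage reduction of a lower-is-better criterion (e.g. BIC or WAIC)
    achieved by `model` relative to `baseline`. Assumed definition."""
    if baseline <= 0:
        raise ValueError("baseline criterion must be positive")
    return 100.0 * (baseline - model) / baseline

# Toy values only: a criterion dropping from 200 to 50 is a 75% gain.
print(relative_gain(200.0, 50.0))  # 75.0
```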
A detailed list of hub genes and top edges for both datasets can be found at https://github.com/SatwikAch/SpaceX.
To corroborate some of our findings, we consider TCGA-based gene expression from 67 breast cancer tissues and 20,000 genes obtained using parallel high-throughput sequencing (Wirth et al. 2011; Weinstein et al. 2013). To make a fair “apples-to-apples” comparison, we used the same intersecting gene set as in the spatial transcriptomics based breast cancer data used in our paper. We used a network-based algorithm, personalized cancer-specific integrated network estimation (PRECISE; Ha et al. 2018), to obtain gene networks. PRECISE is a Bayesian method for gene-network reconstruction from bulk-sequencing data that uses a regression-based approach. The PRECISE method detected \(77\) hub genes out of a total of \(290\) genes, compared with \(59\) hub genes detected by the SpaceX method; \(19\) hub genes were detected by both methods. The list of all hub genes detected by each method, together with the hub genes common to both methods, can be found in the file BC_Hub_genes_TCGA.csv at https://github.com/SatwikAch/SpaceX/tree/main/Hub%20genes.
Interestingly, multiple collagen genes (COL16A1, COL6A2, COL5A1) are detected as hub genes by both methods. Collagen biosynthesis can be regulated by cancer cells through mutated genes, transcription factors and signaling pathways (Xu et al. 2019). A better understanding of the structural properties and functions of collagen in cancer may help in developing anticancer therapies. The LUM gene is associated with collagen genes and effectively regulates estrogen receptors and functional properties of breast cancer cells (Karamanou et al. 2017). Upregulation of the FN1 gene indicates the development of various types of tumors (Y. Sun et al. 2020). High expression of XBP1 can induce cell invasion and metastasis in breast cancer cells (S. Chen et al. 2020). The VIM gene is used as a biomarker for the early detection of cancer (Mohebi et al. 2020).
We further evaluated the ability of SpaceX to detect similarity between cell-cell interactions. To this end, we used the Hamming distance, a well-established measure of dissimilarity between two networks that has been used in several network topology based research studies (Tian and Shen 2005, 2006; Ehounou et al. 2020). In our case, the Hamming distance between two co-expression networks is the number of elements (edges) whose values differ between the two networks. A low (high) Hamming distance between two networks implies that the two networks are more (less) similar to each other.
The mouse hypothalamus data consist of 7 cell-type based clusters across 4812 spatial locations. The SpaceX method provides a gene co-expression network specific to each cell type. Using the Hamming distance as a similarity metric, we measure the similarity between the cell-type specific networks obtained from the mouse hypothalamus data analysis in Section 4.1 of the paper. The heatmap of the Hamming distances between cell-type specific networks is shown in Figure C.5. We observe that the co-expression network of the immature cell type is further from the other cell-type specific networks in terms of Hamming distance. We rescale the Hamming distances by their maximum value so that they lie in the [0,1] interval. Specifically, the Hamming distances of the immature cell-type network from the other cell-type (Endothelial, Astrocyte, Mature, Inhibitory, Excitatory, Ependymal) networks are 1, 0.73, 0.83, 0.79, 0.82, and 0.73 respectively. Based on Figure C.5, the network of the endothelial cell type is distant from the other cell-type based networks, except for that of the immature cell type. The distances of the Astrocyte cell-type network from the Ependymal and Excitatory networks are 0.27 and 0.32 respectively. The neuronal cell-type specific networks are closer to each other than to the other networks, which suggests a higher level of similarity among them. These findings on similarity and disparity align with multiple prior works on hypothalamic cell diversity (R. Chen et al. 2017; Mickelsen et al. 2020).
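The pairwise comparison and rescaling described above can be sketched as follows. The networks are represented as symmetric 0/1 adjacency matrices and each gene pair is counted once (upper triangle). Both choices, along with the function names and the toy three-gene networks, are assumptions made for illustration and do not reproduce the values in Figure C.5.

```python
def hamming_distance(a, b):
    """Number of node pairs whose edge status differs between two
    undirected networks given as symmetric 0/1 adjacency matrices."""
    n = len(a)
    return sum(a[i][j] != b[i][j]
               for i in range(n) for j in range(i + 1, n))

def rescaled_distances(networks):
    """Pairwise Hamming distances divided by their maximum, so that all
    values lie in [0, 1]. `networks` maps a name to an adjacency matrix."""
    names = list(networks)
    raw = {(p, q): hamming_distance(networks[p], networks[q])
           for p in names for q in names}
    max_d = max(raw.values())
    if max_d == 0:  # all networks identical
        return {pair: 0.0 for pair in raw}
    return {pair: d / max_d for pair, d in raw.items()}

# Toy networks on three genes (hypothetical, for illustration only).
toy = {
    "A": [[0, 1, 0], [1, 0, 1], [0, 1, 0]],  # edges 0-1 and 1-2
    "B": [[0, 1, 0], [1, 0, 0], [0, 0, 0]],  # edge 0-1 only
    "C": [[0, 0, 0], [0, 0, 0], [0, 0, 0]],  # no edges
}
dist = rescaled_distances(toy)
print(dist[("A", "B")], dist[("A", "C")])  # 0.5 1.0
```

After rescaling, the most dissimilar pair has distance 1 and every network has distance 0 from itself, which matches the scale used for the values reported above.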