C Real Data Analysis

We applied the SpaceX method to two spatial transcriptomics datasets: one from the preoptic region of the mouse hypothalamus (Moffitt et al. 2018) and one from human breast cancer tissue (Ståhl et al. 2016). Section C.1 details the preprocessing and exploratory analysis of both datasets, and Section C.2 illustrates the application of the community detection algorithm to them.

C.1 Exploratory analysis of the datasets

C.1.1 MERFISH Data

The MERFISH dataset is obtained from the preoptic area of the mouse hypothalamus (Moffitt et al. 2018). It consists of \(160\) genes whose expression is measured at \(4975\) spatial locations. The dataset contains \(7\) pre-determined spatial clusters named Astrocyte, Endothelial, Ependymal, Excitatory, Inhibitory, Immature, and Mature, with sizes \(724\), \(503\), \(314\), \(1024\), \(1694\), \(168\), and \(385\), respectively. It also contains \(2\) further clusters, Microglia and Pericytes, with sizes \(90\) and \(73\), respectively. Because both have fewer than \(100\) locations, these two clusters are removed from the dataset. After their removal, we have gene expression from \(4812\) locations for \(160\) genes. No gene has more than \(95\%\) zero reads. The left panel of Figure C.1 shows the violin plot of the percentage of zero reads among the genes for each cluster in the MERFISH dataset. The UMAP representation of the MERFISH data is shown in the right panel of Figure C.1.
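The filtering steps described above can be checked with a few lines of Python. This is a minimal sketch, not the authors' preprocessing code: the cluster sizes and the two thresholds come from the text, while the names `MIN_CLUSTER_SIZE`, `MAX_ZERO_FRACTION`, `zero_fraction`, and `drop_sparse_genes`, as well as the toy count data, are hypothetical and introduced only for illustration.

```python
# Cluster sizes as reported in the text (spatial locations per cluster).
cluster_sizes = {
    "Astrocyte": 724, "Endothelial": 503, "Ependymal": 314,
    "Excitatory": 1024, "Inhibitory": 1694, "Immature": 168,
    "Mature": 385, "Microglia": 90, "Pericytes": 73,
}

MIN_CLUSTER_SIZE = 100    # clusters with fewer locations are dropped
MAX_ZERO_FRACTION = 0.95  # genes with a larger share of zero reads are dropped

# Keep only clusters that are large enough.
kept_clusters = {name: n for name, n in cluster_sizes.items()
                 if n >= MIN_CLUSTER_SIZE}


def zero_fraction(counts):
    """Fraction of locations at which a gene has zero reads."""
    return sum(1 for c in counts if c == 0) / len(counts)


def drop_sparse_genes(expression, max_zero_fraction=MAX_ZERO_FRACTION):
    """Keep genes whose zero fraction does not exceed the threshold.

    `expression` maps a gene name to its list of counts across locations.
    """
    return {gene: counts for gene, counts in expression.items()
            if zero_fraction(counts) <= max_zero_fraction}


# Toy example with made-up counts: gene_b is zero at every location.
toy_expression = {"gene_a": [0, 3, 1, 0], "gene_b": [0, 0, 0, 0]}
filtered = drop_sparse_genes(toy_expression)

print(sum(cluster_sizes.values()), sum(kept_clusters.values()))
print(sorted(filtered))
```

The two printed totals correspond to the number of locations before and after the small clusters are removed.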

Figure C.1: Violin plot of the percentage of zero reads among the genes for each cluster in the MERFISH data (left panel) and UMAP representation of the data (right panel).

C.1.2 Breast Cancer Data

The human breast cancer dataset contains expression levels of \(5262\) genes measured at \(250\) locations (Ståhl et al. 2016). We use the SPARK method with a \(5\%\) FDR cut-off on p-values to detect \(290\) spatially expressed genes, which we carry forward in our analysis. The left panel of Figure C.2 shows the violin plot of the percentage of zero reads among the genes for each spatially contiguous cluster in the breast cancer dataset. The right panel of Figure C.2 shows the UMAP representation.
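To make the gene-selection step concrete, the sketch below shows how an FDR cut-off turns a list of per-gene p-values into a set of selected genes. The text does not state which multiple-testing adjustment underlies the \(5\%\) cut-off, so the Benjamini-Hochberg step-up adjustment is used here purely as an illustrative assumption; SPARK's own procedure may differ. The function name `bh_adjust` and the p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


FDR_LEVEL = 0.05  # cut-off stated in the text

# Made-up p-values for eight hypothetical genes.
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
toy_adjusted = bh_adjust(toy_pvalues)
selected = [i for i, q in enumerate(toy_adjusted) if q <= FDR_LEVEL]

print(selected)
```

Only genes whose adjusted p-value falls at or below the cut-off are kept for the downstream network analysis.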

Figure C.2: Violin plot of the percentage of zero reads among the genes for each cluster in the breast cancer data (left panel) and UMAP representation of the data (right panel).

C.2 Community detection

Community detection is a downstream analysis of the shared and cluster-specific networks obtained from the SpaceX method. The communities are detected by optimizing modularity over partitions of the network (Brandes et al. 2007). Figures C.3 and C.4 show the community modules detected from the shared and cluster-specific co-expression networks for the MERFISH and breast cancer data, respectively.
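For readers unfamiliar with the objective being optimized, the following sketch evaluates the standard modularity score \(Q\) of a given partition of an undirected, unweighted graph. It only computes the objective and does not reproduce the optimization routine used in the paper, which is not specified here. The function name `modularity` and the toy graph are hypothetical.

```python
def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once, no self-loops.
    membership: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = {}
    internal_edges = {}  # number of edges with both ends in a community
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if membership[u] == membership[v]:
            c = membership[u]
            internal_edges[c] = internal_edges.get(c, 0) + 1

    community_degree = {}  # sum of node degrees per community
    for node, k in degree.items():
        c = membership[node]
        community_degree[c] = community_degree.get(c, 0) + k

    # Q = sum over communities of (internal edge share) - (degree share)^2.
    return sum(internal_edges.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in community_degree.items())


# Toy graph: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "a" for node in range(6)}

q_two = modularity(toy_edges, two_modules)
q_one = modularity(toy_edges, one_module)
print(q_two, q_one)
```

In this toy graph, splitting the two triangles into separate communities yields a higher score than placing all nodes in one community, which is the behavior a modularity-based detection method exploits.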

Figure C.3: Shared and cell-type specific community detection for the MERFISH data.

Figure C.4: Shared and cell-type specific community detection for the breast cancer data.

C.3 Benchmarking on real spatial transcriptomics data

In this section, we benchmark our models on two real spatial transcriptomics datasets based on model fitting criteria. To this end, we use information-based criteria, a standard and well-established technique for comparing the fits of hierarchical Bayesian models (Gelman, Hwang, and Vehtari 2014). We use two information-criterion metrics to assess model fit: (i) the Bayesian information criterion (BIC; Watanabe 2013), a Bayesian analogue of the AIC (Akaike 1998); and (ii) the Watanabe-Akaike information criterion (WAIC; Watanabe 2010), an improvement on the AIC and a fully Bayesian measure of model accuracy, computed from the log pointwise posterior predictive density with a correction for the effective number of parameters to adjust for over-fitting. Such criterion-based methods are often used for model selection, in particular for spatial datasets (Banerjee, Wall, and Carlin 2003; Banerjee, Gelfand, and Polasek 2000; Lee and Ghosh 2009). For both criteria, lower (relative) values indicate better model fits.
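As an illustration of how WAIC is assembled from posterior draws, the sketch below follows the description above: it computes the log pointwise posterior predictive density and subtracts a correction for the effective number of parameters, here taken as the per-observation posterior variance of the log-likelihood (one of the forms given in Gelman, Hwang, and Vehtari 2014). This is a minimal sketch under that assumption and is not the SpaceX implementation; the function name `waic` and the toy log-likelihood values are hypothetical.

```python
import math


def waic(loglik):
    """WAIC on the deviance scale from pointwise log-likelihoods.

    loglik[s][i] is log p(y_i | theta_s) for posterior draw s and
    observation i. At least two draws are required.
    """
    n_draws = len(loglik)
    n_obs = len(loglik[0])
    lppd = 0.0    # log pointwise posterior predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        column = [loglik[s][i] for s in range(n_draws)]
        # Numerically stable log of the mean likelihood across draws.
        peak = max(column)
        lppd += peak + math.log(
            sum(math.exp(v - peak) for v in column) / n_draws)
        # Sample variance of the log-likelihood across draws.
        mean = sum(column) / n_draws
        p_waic += sum((v - mean) ** 2 for v in column) / (n_draws - 1)
    return -2.0 * (lppd - p_waic)


# Toy input: two draws, two observations, identical across draws,
# so the variance correction is zero.
constant_loglik = [[-1.0, -2.0], [-1.0, -2.0]]
# Toy input where the draws disagree, so the correction is positive.
varying_loglik = [[-1.0, -2.0], [-1.5, -2.5]]

print(waic(constant_loglik))
print(waic(varying_loglik))
```

Because the result is on the deviance scale, lower values indicate a better fit, matching the convention used in this section.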

Table C.1 shows the BIC and WAIC values of the SpaceX and non-spatial Poisson models for both the mouse hypothalamus and breast cancer data. Based on the values in Table C.1, we conclude that the SpaceX model fits both spatial transcriptomics datasets better than the non-spatial Poisson model. For example, for the MERFISH data, the SpaceX model achieves relative gains in model fit of 64.7% and 46.6% over the non-spatial model with respect to BIC and WAIC, respectively, where the relative gain is the reduction in the criterion as a fraction of the non-spatial model's value. A similar conclusion holds for the breast cancer data, where the relative gains are 66.4% and 45.5% for BIC and WAIC, respectively.

| Model | BIC (MERFISH) | WAIC (MERFISH) | BIC (Breast cancer) | WAIC (Breast cancer) |
|---|---|---|---|---|
| SpaceX model | 13520 | 43783 | 24346 | 54179 |
| Non-spatial Poisson model | 38274 | 82045 | 72523 | 99474 |

Table C.1: Criteria-based values for the SpaceX and non-spatial Poisson models applied to the spatial transcriptomics data, i.e., the mouse hypothalamus and breast cancer data.
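The relative gains quoted above can be reproduced directly from the entries of Table C.1. The short sketch below does so, taking the relative gain to be the reduction in the criterion as a fraction of the non-spatial model's value; this reading is inferred from the reported percentages, and the names `criteria` and `relative_gain` are hypothetical.

```python
# (SpaceX value, non-spatial Poisson value) taken from Table C.1.
criteria = {
    ("MERFISH", "BIC"): (13520, 38274),
    ("MERFISH", "WAIC"): (43783, 82045),
    ("Breast cancer", "BIC"): (24346, 72523),
    ("Breast cancer", "WAIC"): (54179, 99474),
}


def relative_gain(spacex, baseline):
    """Percentage reduction of the criterion relative to the baseline model."""
    return 100.0 * (baseline - spacex) / baseline


gains = {key: round(relative_gain(*values), 1)
         for key, values in criteria.items()}

for (dataset, criterion), gain in gains.items():
    print(f"{dataset} {criterion}: {gain}%")
```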

C.4 List of hub genes and edges

A detailed list of hub genes and top edges for both the datasets can be found at https://github.com/SatwikAch/SpaceX.

C.5 Corroboration with TCGA Breast Cancer Data

To corroborate some of our findings, we consider TCGA-based gene expression data for 20,000 genes from 67 breast cancer tissues, measured using parallel high-throughput sequencing (Wirth et al. 2011; Weinstein et al. 2013). To make a fair “apples-to-apples” comparison, we used the gene set that intersects with the spatial transcriptomics based breast cancer data used in our paper. We used a network-based algorithm, personalized cancer-specific integrated network estimation (PRECISE; Ha et al. 2018), to obtain gene networks. PRECISE is a Bayesian method for gene-network reconstruction from bulk-sequencing data that uses a regression-based approach. The PRECISE method detected \(77\) hub genes out of a total of \(290\) genes, whereas the SpaceX method detected \(59\) hub genes, with \(19\) hub genes detected by both methods. The list of all hub genes detected by each method, together with the hub genes common to both methods, is available in the file BC_Hub_genes_TCGA.csv (https://github.com/SatwikAch/SpaceX/tree/main/Hub%20genes).

Interestingly, multiple collagen genes (COL16A1, COL6A2, COL5A1) are detected as hub genes by both methods. Collagen biosynthesis can be regulated by cancer cells through mutated genes, transcription factors and signaling pathways (Xu et al. 2019). Understanding the structural properties and functions of collagen in cancer can contribute to the development of anticancer therapies. The LUM gene is associated with collagen genes and effectively regulates estrogen receptors and the functional properties of breast cancer cells (Karamanou et al. 2017). Upregulation of the FN1 gene indicates the development of various types of tumors (Y. Sun et al. 2020). XBP1 can induce cell invasion and metastasis in breast cancer cells by promoting high expression (S. Chen et al. 2020). The VIM gene is used as a biomarker for the early detection of cancer (Mohebi et al. 2020).

C.6 Network similarity between cell-type specific networks

We further evaluated the ability of SpaceX to detect similarity between cell-type specific interactions. To this end, we used the Hamming distance, a well-established measure for comparing two networks that has been used in several network topology based research studies (Tian and Shen 2005, 2006; Ehounou et al. 2020). In our case, the Hamming distance between two cell types is the distance between their two co-expression networks, i.e., the number of elements that take different values in the two networks. A low (high) Hamming distance between two networks implies that the networks are more (less) similar to each other.

The mouse hypothalamus data consists of 7 cell-type based clusters across 4812 spatial locations. The SpaceX method provides a gene co-expression network specific to each cell type. Using the Hamming distance as the similarity metric, we measure the similarity between the cell-type specific networks obtained from the mouse hypothalamus data analysis in Section 4.1 of the paper. The heatmap of the Hamming distances between cell-type specific networks is shown in Figure C.5. We rescale the Hamming distances by their maximum value so that all distances lie in the [0,1] interval. We observe that the co-expression network of the Immature cell type is farther from the other cell-type specific networks in terms of Hamming distance. Specifically, the Hamming distances between the Immature cell-type network and the other cell-type networks (Endothelial, Astrocyte, Mature, Inhibitory, Excitatory, Ependymal) are 1, 0.73, 0.83, 0.79, 0.82, and 0.73, respectively. Based on Figure C.5, the network of the Endothelial cell type is distant from the other cell-type based networks, the Immature cell type excepted. The distances of the Astrocyte cell-type network from the Ependymal and Excitatory networks are 0.27 and 0.32, respectively. The neuronal cell-type specific networks have lower pairwise distances than the others, which suggests a higher level of similarity among them. These findings on similarity and disparity align with multiple prior works on hypothalamic cell diversity (R. Chen et al. 2017; Mickelsen et al. 2020).
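The distance computation and rescaling described above can be sketched as follows. The sketch assumes that each cell-type specific network is represented as a symmetric binary adjacency matrix over the same genes and that each gene pair is counted once; how the paper binarizes the co-expression networks is not stated here. The names `hamming_distance` and `rescaled_distances` and the three toy networks are hypothetical.

```python
def hamming_distance(a, b):
    """Number of gene pairs (i < j) whose edge status differs between a and b."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if a[i][j] != b[i][j])


def rescaled_distances(networks):
    """Pairwise Hamming distances divided by their maximum, so values lie in [0, 1].

    `networks` maps a cell-type name to its adjacency matrix.
    """
    names = list(networks)
    raw = {(p, q): hamming_distance(networks[p], networks[q])
           for p in names for q in names}
    largest = max(raw.values())
    if largest == 0:  # all networks identical
        return {pair: 0.0 for pair in raw}
    return {pair: d / largest for pair, d in raw.items()}


# Toy networks on three genes (made-up edges).
toy_networks = {
    "type_A": [[0, 1, 1], [1, 0, 0], [1, 0, 0]],
    "type_B": [[0, 1, 0], [1, 0, 1], [0, 1, 0]],
    "type_C": [[0, 1, 1], [1, 0, 1], [1, 1, 0]],
}

distances = rescaled_distances(toy_networks)
print(distances[("type_A", "type_B")], distances[("type_A", "type_C")])
```

After rescaling, the most dissimilar pair of networks has distance 1, which is how a value of 1 arises for one pair in the heatmap.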

Figure C.5: Heatmap of the Hamming distances between cell-type specific networks.

References

Akaike, Hirotugu. 1998. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Selected Papers of Hirotugu Akaike, 199–213. Springer.
Banerjee, Sudipto, Alan E Gelfand, and Wolfgang Polasek. 2000. “Geostatistical Modelling for Spatial Interaction Data with Application to Postal Service Performance.” Journal of Statistical Planning and Inference 90 (1): 87–105.
Banerjee, Sudipto, Melanie M Wall, and Bradley P Carlin. 2003. “Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota.” Biostatistics 4 (1): 123–42.
Brandes, Ulrik, Daniel Delling, Marco Gaertler, Robert Gorke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. 2007. “On Modularity Clustering.” IEEE Transactions on Knowledge and Data Engineering 20 (2): 172–88.
Chen, Renchao, Xiaoji Wu, Lan Jiang, and Yi Zhang. 2017. “Single-Cell RNA-Seq Reveals Hypothalamic Cell Diversity.” Cell Reports 18 (13): 3227–41.
Chen, Shanshan, Jing Chen, Xin Hua, Yue Sun, Rui Cui, Jun Sha, and Xiaoli Zhu. 2020. “The Emerging Role of Xbp1 in Cancer.” Biomedicine & Pharmacotherapy 127: 110069.
Ehounou, Wilfried Joseph, Dominique Barth, Arnaud De Moissac, Dimitri Watel, and Marc-Antoine Weisser. 2020. “Minimizing the Hamming Distance Between a Graph and a Line-Graph to Discover the Topology of an Electrical Network.” J. Graph Algorithms Appl. 24 (3): 133–53.
Gelman, Andrew, Jessica Hwang, and Aki Vehtari. 2014. “Understanding Predictive Information Criteria for Bayesian Models.” Statistics and Computing 24 (6): 997–1016.
Ha, Min Jin, Sayantan Banerjee, Rehan Akbani, Han Liang, Gordon B Mills, Kim-Anh Do, and Veerabhadran Baladandayuthapani. 2018. “Personalized Integrated Network Modeling of the Cancer Proteome Atlas.” Scientific Reports 8 (1): 1–14.
Karamanou, Konstantina, Marco Franchi, Zoi Piperigkou, Corinne Perreau, Francois-Xavier Maquart, Demitrios H Vynios, and Stephane Brezillon. 2017. “Lumican Effectively Regulates the Estrogen Receptors-Associated Functional Properties of Breast Cancer Cells, Expression of Matrix Effectors and Epithelial-to-Mesenchymal Transition.” Scientific Reports 7 (1): 1–15.
Lee, Hyeyoung, and Sujit K Ghosh. 2009. “Performance of Information Criteria for Spatial Models.” Journal of Statistical Computation and Simulation 79 (1): 93–106.
Mickelsen, Laura E, William F Flynn, Kristen Springer, Lydia Wilson, Eric J Beltrami, Mohan Bolisetty, Paul Robson, and Alexander C Jackson. 2020. “Cellular Taxonomy and Spatial Organization of the Murine Ventral Posterior Hypothalamus.” Elife 9: e58901.
Moffitt, Jeffrey R., Dhananjay Bambah-Mukku, Stephen W. Eichhorn, Eric Vaughn, Karthik Shekhar, Julio D. Perez, Nimrod D. Rubinstein, et al. 2018. “Molecular, Spatial, and Functional Single-Cell Profiling of the Hypothalamic Preoptic Region.” Science 362 (6416). https://doi.org/10.1126/science.aau5324.
Mohebi, Mehdi, Soudeh Ghafouri-Fard, Mohammad Hossein Modarressi, Sepideh Dashti, Ali Zekri, Vahid Kholghi-Oskooei, and Mohammad Taheri. 2020. “Expression Analysis of Vimentin and the Related lncRNA Network in Breast Cancer.” Experimental and Molecular Pathology 115: 104439.
Ståhl, Patrik L, Fredrik Salmén, Sanja Vickovic, Anna Lundmark, José Fernández Navarro, Jens Magnusson, Stefania Giacomello, et al. 2016. “Visualization and Analysis of Gene Expression in Tissue Sections by Spatial Transcriptomics.” Science 353 (6294): 78–82.
Sun, Yang, Chunlin Zhao, Yanwei Ye, Zhen Wang, Yuanhang He, Yulin Li, and Haoxun Mao. 2020. “High Expression of Fibronectin 1 Indicates Poor Prognosis in Gastric Cancer.” Oncology Letters 19 (1): 93–102.
Tian, Hui, and Hong Shen. 2005. “Hamming Distance and Hop Count Based Classification for Multicast Network Topology Inference.” In 19th International Conference on Advanced Information Networking and Applications (AINA’05) Volume 1 (AINA Papers), 1:267–72. IEEE.
———. 2006. “Multicast-Based Inference for Topology and Network-Internal Loss Performance from End-to-End Measurements.” Computer Communications 29 (11): 1936–47.
Watanabe, Sumio. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11 (Dec): 3571–94.
———. 2013. “A Widely Applicable Bayesian Information Criterion.” Journal of Machine Learning Research 14 (Mar): 867–97.
Weinstein, John N, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. 2013. “The Cancer Genome Atlas Pan-Cancer Analysis Project.” Nature Genetics 45 (10): 1113–20.
Wirth, Henry, Markus Löffler, Martin von Bergen, and Hans Binder. 2011. “Expression Cartography of Human Tissues Using Self Organizing Maps.” Nature Precedings, 1–1.
Xu, Shuaishuai, Huaxiang Xu, Wenquan Wang, Shuo Li, Hao Li, Tianjiao Li, Wuhu Zhang, Xianjun Yu, and Liang Liu. 2019. “The Role of Collagen in Cancer: From Bench to Bedside.” Journal of Translational Medicine 17 (1): 1–22.