C Real Data Analysis

We applied the SpaceX method to two spatial transcriptomics datasets: one from the preoptic region of the mouse hypothalamus (Moffitt et al. 2018) and one from human breast cancer tissue (Ståhl et al. 2016). Section C.1 describes the preprocessing and exploratory analysis of both datasets, and Section C.2 illustrates the application of the community detection algorithm to them.

C.1 Exploratory analysis of the datasets

C.1.1 MERFISH Data

The MERFISH dataset is obtained from the preoptic area of the mouse hypothalamus (Moffitt et al. 2018). It contains expression measurements for $160$ genes at $4975$ spatial locations. Seven of the pre-determined clusters, named Astrocyte, Endothelial, Ependymal, Excitatory, Inhibitory, Immature, and Mature, have sizes $724$, $503$, $314$, $1024$, $1694$, $168$, and $385$, respectively. Two further clusters, Microglia and Pericytes, contain only $90$ and $73$ locations, respectively; because both have fewer than $100$ locations, we removed them from the dataset. After this removal, we have gene expression from $4812$ locations for $160$ genes. No gene has more than $95\%$ zero reads. The left panel of Figure C.1 shows the violin plot of the percentage of zero reads among the genes for each cluster in the MERFISH dataset, and the right panel shows the UMAP representation of the data.
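The filtering steps above can be sketched as follows. This is only an illustration under stated assumptions: the cluster sizes and the two thresholds ($100$ locations, $95\%$ zero reads) come from the text, while the function names `filter_clusters`, `zero_fraction`, and `keep_gene` are hypothetical and are not part of the SpaceX package.

```python
# Cluster sizes as reported in the text (nine annotated clusters in total).
cluster_sizes = {
    "Astrocyte": 724, "Endothelial": 503, "Ependymal": 314,
    "Excitatory": 1024, "Inhibitory": 1694, "Immature": 168,
    "Mature": 385, "Microglia": 90, "Pericytes": 73,
}

MIN_CLUSTER_SIZE = 100  # size threshold stated in the text


def filter_clusters(sizes, min_size=MIN_CLUSTER_SIZE):
    """Keep only clusters with at least `min_size` spatial locations."""
    return {name: n for name, n in sizes.items() if n >= min_size}


def zero_fraction(counts):
    """Fraction of zero reads in one gene's vector of counts."""
    return sum(1 for c in counts if c == 0) / len(counts)


def keep_gene(counts, max_zero_fraction=0.95):
    """Keep a gene unless more than 95% of its reads are zero."""
    return zero_fraction(counts) <= max_zero_fraction


kept = filter_clusters(cluster_sizes)
total_before = sum(cluster_sizes.values())
total_after = sum(kept.values())
print(total_before, total_after, sorted(kept))
```

With the sizes listed above, the two totals agree with the $4975$ and $4812$ locations reported in the text, and seven clusters remain.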

Figure C.1: The left panel shows the violin plot of the percentage of zero reads among the genes for each cluster in the MERFISH data; the right panel shows the UMAP representation.

C.1.2 Breast Cancer Data

The human breast cancer dataset contains expression levels for $5262$ genes measured at $250$ locations (Ståhl et al. 2016). We use the SPARK method with a $5\%$ FDR cut-off on p-values to detect $290$ spatially expressed genes, which we carry forward in our analysis. The left panel of Figure C.2 shows the violin plot of the percentage of zero reads among the genes for each spatially contiguous cluster in the breast cancer dataset, and the right panel shows the UMAP representation.

Figure C.2: The left panel shows the violin plot of the percentage of zero reads among the genes for each cluster in the breast cancer data; the right panel shows the UMAP representation.

C.2 Community detection

Community detection is a downstream analysis of the shared and cluster-specific networks obtained from the SpaceX method. The communities are detected by optimizing modularity over partitions of the network (Brandes et al. 2007). Figures C.3 and C.4 show the community modules detected from the shared and cluster-specific co-expression networks for the MERFISH and breast cancer data, respectively.
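To make the optimized quantity concrete, the sketch below evaluates the modularity of a given partition of an undirected, unweighted network. It is an illustration only: it computes the objective for a fixed partition and does not search over partitions, and the toy graph and the function name `modularity` are hypothetical. The source does not state which optimization routine was used.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    label = {node: k for k, group in enumerate(communities) for node in group}
    intra = [0] * len(communities)   # edges with both endpoints in community k
    degree = [0] * len(communities)  # summed node degrees in community k
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            intra[label[u]] += 1
    return sum(intra[k] / m - (degree[k] / (2 * m)) ** 2
               for k in range(len(communities)))


# Toy network: two triangles joined by a single bridging edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

q_split = modularity(toy_edges, [{0, 1, 2}, {3, 4, 5}])
q_single = modularity(toy_edges, [{0, 1, 2, 3, 4, 5}])
print(q_split, q_single)
```

In this toy example the two-triangle partition has a higher modularity than the trivial single-community partition, which is the behavior a modularity-based community detection method exploits.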

Figure C.3: Shared and cell-type specific community detection for MERFISH data

Figure C.4: Shared and cell-type specific community detection for breast cancer data

C.3 Benchmarking on real spatial transcriptomics data

In this section, we benchmark our models on two real spatial transcriptomics datasets using model fitting criteria. To this end, we use information-based criteria, a standard and well-established technique for comparing model fits between hierarchical Bayesian models (Gelman, Hwang, and Vehtari 2014). We use two information criteria to assess model fit: (i) the Bayesian information criterion (BIC; Watanabe 2013), a Bayesian analogue of the AIC (Akaike 1998); and (ii) the Watanabe-Akaike information criterion (WAIC; Watanabe 2010), an improvement on the AIC and a fully Bayesian measure of model accuracy that is computed from the log pointwise posterior predictive density and then corrected for the effective number of parameters to adjust for over-fitting. These criteria are often used for model selection, including for spatial datasets (Banerjee, Wall, and Carlin 2003; Banerjee, Gelfand, and Polasek 2000; Lee and Ghosh 2009). In both cases, lower (relative) values indicate better model fits.
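As an illustration of how WAIC is assembled from these two ingredients, the sketch below follows one common form described by Gelman, Hwang, and Vehtari (2014): the log pointwise posterior predictive density (lppd) minus a penalty given by the posterior variance of the pointwise log-likelihood, multiplied by $-2$. The input layout (`log_lik[s][i]` for posterior draw `s` and observation `i`) and the function names are assumptions made for this example; the source does not specify which variant of the penalty was used.

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))


def waic(log_lik):
    """WAIC on the deviance scale (lower is better).

    log_lik[s][i] is the log-likelihood of observation i under posterior
    draw s; at least two draws are needed for the variance term.
    """
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise posterior predictive density
    p_waic = 0.0  # effective number of parameters (variance form)
    for i in range(n_obs):
        column = [draw[i] for draw in log_lik]
        lppd += log_mean_exp(column)
        p_waic += variance(column)  # sample variance across draws
    return -2.0 * (lppd - p_waic)


# Toy input: two posterior draws, two observations, no posterior spread.
toy_waic = waic([[-1.0, -2.0], [-1.0, -2.0]])
print(toy_waic)
```

When the draws agree exactly, as in the toy input, the penalty vanishes and WAIC reduces to $-2$ times the summed log-likelihood.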

Table C.1 shows the BIC and WAIC values for the SpaceX and non-spatial Poisson models for both the mouse hypothalamus and breast cancer data. Based on the values in Table C.1, we conclude that the SpaceX model fits both spatial transcriptomics datasets better than the non-spatial Poisson model. For the MERFISH data, the relative gain in model fit of the SpaceX model over the non-spatial model is 64.7% with respect to BIC and 46.6% with respect to WAIC. A similar inference can be drawn for the breast cancer data, where the relative gains are 66.4% and 45.5% for BIC and WAIC, respectively.

Table C.1: BIC and WAIC values for the SpaceX and non-spatial Poisson models applied to the mouse hypothalamus (MERFISH) and breast cancer spatial transcriptomics data.
Model	BIC (MERFISH)	WAIC (MERFISH)	BIC (Breast cancer)	WAIC (Breast cancer)
SpaceX Model	13520	43783	24346	54179
Non-spatial Poisson model	38274	82045	72523	99474
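
The relative gains quoted in the text can be recovered from Table C.1 with a short calculation. The sketch assumes that the relative gain is defined as the reduction in the criterion relative to the non-spatial model, (non-spatial − SpaceX) / non-spatial; this definition is not stated explicitly in the text, but it reproduces the reported percentages. The name `relative_gain` is hypothetical.

```python
# (SpaceX value, non-spatial Poisson value) for each column of Table C.1.
table_c1 = {
    "BIC (MERFISH)": (13520, 38274),
    "WAIC (MERFISH)": (43783, 82045),
    "BIC (Breast cancer)": (24346, 72523),
    "WAIC (Breast cancer)": (54179, 99474),
}


def relative_gain(spacex, baseline):
    """Percentage reduction of the criterion relative to the baseline model."""
    return 100.0 * (baseline - spacex) / baseline


gains = {name: round(relative_gain(s, b), 1) for name, (s, b) in table_c1.items()}
for name, gain in gains.items():
    print(f"{name}: {gain}%")
```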

C.4 List of hub genes and edges

A detailed list of hub genes and top edges for both the datasets can be found at https://github.com/SatwikAch/SpaceX.

C.5 Corroboration with TCGA Breast Cancer Data

To corroborate some of our findings, we consider TCGA-based gene expression data for 20,000 genes from 67 breast cancer tissues, measured by parallel high-throughput sequencing (Wirth et al. 2011; Weinstein et al. 2013). To make a fair “apples-to-apples” comparison, we used the same intersecting gene set from the spatial transcriptomics based breast cancer data used in our paper. We used a network-based algorithm, personalized cancer-specific integrated network estimation (PRECISE; Ha et al. 2018), to obtain gene networks. PRECISE is a Bayesian method for gene-network reconstruction from bulk-sequencing data that uses a regression-based approach. The PRECISE method detected $77$ hub genes out of a total of $290$ genes, compared to $59$ hub genes detected by the SpaceX method, with $19$ hub genes detected by both methods. The list of all the hub genes detected by each method, together with the hub genes common to both methods, can be found in the file BC_Hub_genes_TCGA.csv at https://github.com/SatwikAch/SpaceX/tree/main/Hub%20genes.

Interestingly, multiple collagen genes (COL16A1, COL6A2, COL5A1) are detected as hub genes by both methods. Collagen biosynthesis can be regulated by cancer cells through mutated genes, transcription factors, and signaling pathways (Xu et al. 2019). Understanding the structural properties and functions of collagen in cancer may therefore inform anticancer therapy. The LUM gene is associated with collagen genes and effectively regulates the estrogen receptors and functional properties of breast cancer cells (Karamanou et al. 2017). Upregulation of the FN1 gene indicates the development of various types of tumors (Y. Sun et al. 2020). XBP1 can induce cell invasion and metastasis in breast cancer cells by promoting high expression (S. Chen et al. 2020). The VIM gene is used as a biomarker for the early detection of cancer (Mohebi et al. 2020).

C.6 Application of SpaceX method on Spatial Transcriptomics data from Alzheimer’s Disease

W. T. Chen et al. (2020) (C2020, henceforth) used spatial transcriptomics techniques to measure spatial gene expression from small tissue domains related to Alzheimer’s disease to characterize the molecular changes and cellular interactions. The study collected gene expression data from a total of $9,957$ spots that were annotated into $14$ brain regions in the original study (as shown in Figure 1B of C2020). The tissue domains and their contained number of spots are provided in Table C.2.

Table C.2: Tissue domains and their corresponding sizes.
RSP	PTL	SSp	AUD	TEP	ENTI	OLF
262	585	532	860	150	112	749
CNU	CTXsp	FB	TH	HY	HPs	HPd
136	709	1155	2114	1097	437	1059

For the network analyses and corresponding biological interpretations, C2020 focused on $57$ plaque-induced genes (PIGs), which are highly responsive to $\beta$-amyloid (A$\beta$) plaque deposition, as highlighted in several studies (Krasemann et al. 2017; Sala Frigerio et al. 2019). PIGs become increasingly co-expressed as the A$\beta$ load increases and are associated with endosomes and lysosomes, oxidation-reduction, and inflammation. We obtained the data from the repository listed in the data and code availability section of C2020. The authors of C2020 used a WGCNA-based analysis to discover gene modules consisting of genes with similar co-expression patterns and focused on the PIGs that were most responsive to A$\beta$, as shown in Figure S3B of their paper.

For confirmatory analyses, we analogously apply our SpaceX model to the same $57$ PIGs measured at the $9957$ spots, divided into $14$ tissue domains following C2020. SpaceX detects multiple hub genes across tissue domains and their intersections, as shown in Figure C.5. In total, we found $40$ hub genes across the $14$ tissue domains. C2020 listed 10 top hub genes: Ctsd, C4b, Cst3, Apoe, C4a, Gfap, Tyrobp, Lyz2, Trem2, and B2m (Figure 3D in C2020). Of these, we identify five (B2m, Lyz2, Ctsd, Trem2, and Tyrobp; marked in red in Figure C.5) based on a correlation cut-off of $0.5$. C2020 detect them only as overall hub genes, whereas our analysis goes a step further and detects hub genes for specific tissue domains and their intersections. The five common genes are identified as hub genes across multiple tissue domains; e.g., B2m for HPs; Lyz2 for PTL, SSp, CTXsp, AUD, FB, HY, and TH; and Ctsd for PTL, HPs, HPd, SSp, CTXsp, AUD, FB, HY, and TH.

We do not detect the remaining five of the top 10 hub genes (C4b, C4a, Cst3, Apoe, Gfap) because of their low correlation levels ($\le 0.25$) in the dataset provided. Figure C.6 shows boxplots of the marginal correlations between each of the top 10 hub genes (as detected by C2020) and the other genes across all the clusters. In Figure C.6, the last five genes are not identified as hub genes by the SpaceX method because of their low level of correlation, which can be clearly observed from the corresponding boxplots, whose values are spread around $0$. Furthermore, the differences can be attributed to the different analytical methods: C2020 employed a WGCNA-based network method to detect their hub genes, which, to the best of our understanding, does not incorporate spatial domain information. In contrast, our SpaceX method uses a factor-model based network reconstruction that leverages the spatial information to construct co-expression networks, which, as we have shown through simulations, has higher power and better signal detection (see Section 3 of the paper). This enables the identification of hub genes across multiple tissue domains, such as C1qb, C1qc, Fcrls, Hexb, and Mpeg1, which are shared across all $14$ domains and are not detected by C2020. We provide further discussion of these hub genes and their functions below; the shared and tissue-specific networks are shown in Figure C.7.

Tyrobp is a top hub gene that is detected across all the tissue domains, as shown in Figure C.5; it encodes a transmembrane signaling polypeptide that is connected to immune cell responses across multiple tissue domains (Humphrey et al. 2004). Trem2 is identified as a conserved hub gene across the tissue domains RSP, PTL, SSp, AUD, OLF, CTXsp, FB, TH, HY, HPs, and HPd, as shown in Figure C.5. Variants of Trem2 associated with Alzheimer’s disease induce partial loss of memory and alter the behaviour of microglial cells, including their response to amyloid plaques (Carmona et al. 2018). Ctsd is involved in processing proteins linked to Alzheimer’s disease, as well as in autophagy (Di Domenico, Tramutola, and Perluigi 2016). Lysozyme 2 (Lyz2) is a microglia gene associated with A$\beta$ plaque phagocytosis (Grubman et al. 2021). B2m regulates age-related cognitive dysfunction and impaired neurogenesis (Smith et al. 2015). These interpretations align with the findings of C2020. Additionally, we have detected C1qb, C1qc, Fcrls, Hexb, and Mpeg1 as hub genes across multiple tissue domains, as shown in Figure C.5. The C1q genes (C1qb, C1qc) encode components of C1q, a multifunctional protein of the classical complement pathway that is known to be expressed in AD brain tissue (Fonseca et al. 2004). Infected mouse brains exhibit downregulation of the Fc receptor-like S, scavenger receptor (Fcrls) gene (Tanaka et al. 2013).

Figure C.5: The upset plot shows hub genes for each tissue domain and different spatial intersections. The red colored hub genes are also detected as top hub genes in W. T. Chen et al. (2020).

Figure C.6: Boxplots of correlations across all the clusters for the top $10$ hub genes as detected by C2020. The first five genes (marked in red) are also detected as hub genes by the SpaceX method. The last five genes are not identified as hub genes by the SpaceX method.

Figure C.7: Shared and tissue domain specific networks.

C.7 Network similarity between cell-type specific networks

We further evaluated the ability of SpaceX to detect similarity between cell-cell interactions. To this end, we used the Hamming distance, a well-established measure of (dis)similarity between two networks that has been used in several network topology based research studies (Tian and Shen 2005, 2006; Ehounou et al. 2020). In our case, the Hamming distance between two co-expression networks is the number of elements whose values differ between the two networks. A low (high) Hamming distance between two networks implies that they are more (less) similar to each other.

The mouse hypothalamus data consist of 7 cell-type based clusters across 4812 spatial locations, and the SpaceX method provides a gene co-expression network for each cell type. Using the Hamming distance as the metric, we measure the similarity between the cell-type specific networks obtained from the mouse hypothalamus data analysis in Section 4.1 of the paper. We rescale the Hamming distances by their maximum value so that they lie in the interval [0,1]. The heatmap of the Hamming distances between cell-type specific networks is shown in Figure C.8. We observe that the co-expression network of the Immature cell type is farther from the other cell-type specific networks in terms of Hamming distance. Specifically, the distances of the Immature network from the Endothelial, Astrocyte, Mature, Inhibitory, Excitatory, and Ependymal networks are 1, 0.73, 0.83, 0.79, 0.82, and 0.73, respectively. Based on Figure C.8, and setting aside the Immature cell type, the network of the Endothelial cell type is also distant from the other cell-type based networks. The distances of the Astrocyte network from the Ependymal and Excitatory networks are 0.27 and 0.32, respectively. The neuronal cell-type specific networks are closer to one another than to the other networks, which suggests a higher level of similarity among them. These findings on similarity and disparity align with multiple prior works that discuss hypothalamic cell diversity (R. Chen et al. 2017; Mickelsen et al. 2020).
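A minimal sketch of this distance computation is given below, assuming that each cell-type specific network is represented as a symmetric binary adjacency matrix over the same set of genes. That representation, the toy matrices, and the function names are assumptions made for illustration; they are not taken from the SpaceX package.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of gene pairs (upper triangle) whose edge indicator differs."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n) if a[i][j] != b[i][j])


def rescaled_distances(networks):
    """Pairwise Hamming distances divided by their maximum, so values lie in [0, 1]."""
    raw = {(p, q): hamming_distance(networks[p], networks[q])
           for p, q in combinations(sorted(networks), 2)}
    top = max(raw.values())
    return {pair: (d / top if top else 0.0) for pair, d in raw.items()}


# Toy networks on three genes (hypothetical, for illustration only).
toy_networks = {
    "A": [[0, 1, 0], [1, 0, 0], [0, 0, 0]],  # edge (0, 1)
    "B": [[0, 1, 1], [1, 0, 0], [1, 0, 0]],  # edges (0, 1), (0, 2)
    "C": [[0, 0, 1], [0, 0, 1], [1, 1, 0]],  # edges (0, 2), (1, 2)
}

toy_distances = rescaled_distances(toy_networks)
for pair, d in toy_distances.items():
    print(pair, round(d, 2))
```

After rescaling, the most dissimilar pair always has distance 1, which matches the value of 1 reported above for the Immature and Endothelial networks.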

Figure C.8: Heatmap of Hamming distances between cell-type specific networks.

References

Akaike, Hirotugu. 1998. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Selected Papers of Hirotugu Akaike, 199–213. Springer.

Banerjee, Sudipto, Alan E Gelfand, and Wolfgang Polasek. 2000. “Geostatistical Modelling for Spatial Interaction Data with Application to Postal Service Performance.” Journal of Statistical Planning and Inference 90 (1): 87–105.

Banerjee, Sudipto, Melanie M Wall, and Bradley P Carlin. 2003. “Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota.” Biostatistics 4 (1): 123–42.

Brandes, Ulrik, Daniel Delling, Marco Gaertler, Robert Gorke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. 2007. “On Modularity Clustering.” IEEE Transactions on Knowledge and Data Engineering 20 (2): 172–88.

Carmona, Susana, Kathleen Zahs, Elizabeth Wu, Kelly Dakin, Jose Bras, and Rita Guerreiro. 2018. “The Role of TREM2 in Alzheimer’s Disease and Other Neurodegenerative Disorders.” The Lancet Neurology 17 (8): 721–30.

Chen, Renchao, Xiaoji Wu, Lan Jiang, and Yi Zhang. 2017. “Single-Cell RNA-Seq Reveals Hypothalamic Cell Diversity.” Cell Reports 18 (13): 3227–41.

Chen, Shanshan, Jing Chen, Xin Hua, Yue Sun, Rui Cui, Jun Sha, and Xiaoli Zhu. 2020. “The Emerging Role of Xbp1 in Cancer.” Biomedicine & Pharmacotherapy 127: 110069.

Chen, Wei Ting, Ashley Lu, Katleen Craessaerts, Benjamin Pavie, Carlo Sala Frigerio, Nikky Corthout, Xiaoyan Qian, et al. 2020. “Spatial Transcriptomics and In Situ Sequencing to Study Alzheimer’s Disease.” Cell 182 (4): 976–991.e19. https://doi.org/10.1016/j.cell.2020.06.038.

Di Domenico, Fabio, Antonella Tramutola, and Marzia Perluigi. 2016. “Cathepsin D as a Therapeutic Target in Alzheimer’s Disease.” Expert Opinion on Therapeutic Targets 20 (12): 1393–95.

Ehounou, Wilfried Joseph, Dominique Barth, Arnaud De Moissac, Dimitri Watel, and Marc-Antoine Weisser. 2020. “Minimizing the Hamming Distance Between a Graph and a Line-Graph to Discover the Topology of an Electrical Network.” J. Graph Algorithms Appl. 24 (3): 133–53.

Fonseca, Maria Isabel, Jun Zhou, Marina Botto, and Andrea J Tenner. 2004. “Absence of C1q Leads to Less Neuropathology in Transgenic Mouse Models of Alzheimer’s Disease.” Journal of Neuroscience 24 (29): 6457–65.

Gelman, Andrew, Jessica Hwang, and Aki Vehtari. 2014. “Understanding Predictive Information Criteria for Bayesian Models.” Statistics and Computing 24 (6): 997–1016.

Grubman, Alexandra, Xin Yi Choo, Gabriel Chew, John F. Ouyang, Guizhi Sun, Nathan P. Croft, Fernando J. Rossello, et al. 2021. “Transcriptional Signature in Microglia Associated with A$\beta$ Plaque Phagocytosis.” Nature Communications 12 (1): 1–22.

Ha, Min Jin, Sayantan Banerjee, Rehan Akbani, Han Liang, Gordon B Mills, Kim-Anh Do, and Veerabhadran Baladandayuthapani. 2018. “Personalized Integrated Network Modeling of the Cancer Proteome Atlas.” Scientific Reports 8 (1): 1–14.

Humphrey, Mary Beth, Kouetsu Ogasawara, Wei Yao, Steven C Spusta, Michael R Daws, Nancy E Lane, Lewis L Lanier, and Mary C Nakamura. 2004. “The Signaling Adapter Protein DAP12 Regulates Multinucleation During Osteoclast Development.” Journal of Bone and Mineral Research 19 (2): 224–34.

Karamanou, Konstantina, Marco Franchi, Zoi Piperigkou, Corinne Perreau, Francois-Xavier Maquart, Demitrios H Vynios, and Stephane Brezillon. 2017. “Lumican Effectively Regulates the Estrogen Receptors-Associated Functional Properties of Breast Cancer Cells, Expression of Matrix Effectors and Epithelial-to-Mesenchymal Transition.” Scientific Reports 7 (1): 1–15.

Krasemann, Susanne, Charlotte Madore, Ron Cialic, Caroline Baufeld, Narghes Calcagno, Rachid El Fatimy, Lien Beckers, et al. 2017. “The TREM2-APOE Pathway Drives the Transcriptional Phenotype of Dysfunctional Microglia in Neurodegenerative Diseases.” Immunity 47 (3): 566–581.e9. https://doi.org/10.1016/j.immuni.2017.08.008.

Lee, Hyeyoung, and Sujit K Ghosh. 2009. “Performance of Information Criteria for Spatial Models.” Journal of Statistical Computation and Simulation 79 (1): 93–106.

Mickelsen, Laura E, William F Flynn, Kristen Springer, Lydia Wilson, Eric J Beltrami, Mohan Bolisetty, Paul Robson, and Alexander C Jackson. 2020. “Cellular Taxonomy and Spatial Organization of the Murine Ventral Posterior Hypothalamus.” Elife 9: e58901.

Moffitt, Jeffrey R., Dhananjay Bambah-Mukku, Stephen W. Eichhorn, Eric Vaughn, Karthik Shekhar, Julio D. Perez, Nimrod D. Rubinstein, et al. 2018. “Molecular, Spatial, and Functional Single-Cell Profiling of the Hypothalamic Preoptic Region.” Science 362 (6416). https://doi.org/10.1126/science.aau5324.

Mohebi, Mehdi, Soudeh Ghafouri-Fard, Mohammad Hossein Modarressi, Sepideh Dashti, Ali Zekri, Vahid Kholghi-Oskooei, and Mohammad Taheri. 2020. “Expression Analysis of Vimentin and the Related lncRNA Network in Breast Cancer.” Experimental and Molecular Pathology 115: 104439.

Sala Frigerio, Carlo, Leen Wolfs, Nicola Fattorelli, Nicola Thrupp, Iryna Voytyuk, Inga Schmidt, Renzo Mancuso, et al. 2019. “The Major Risk Factors for Alzheimer’s Disease: Age, Sex, and Genes Modulate the Microglia Response to A$\beta$ Plaques.” Cell Reports 27 (4): 1293–1306.e6. https://doi.org/10.1016/j.celrep.2019.03.099.

Smith, Lucas K, Yingbo He, Jeong-Soo Park, Gregor Bieri, Cedric E Snethlage, Karin Lin, Geraldine Gontier, et al. 2015. “$\beta$2-Microglobulin Is a Systemic Pro-Aging Factor That Impairs Cognitive Function and Neurogenesis.” Nature Medicine 21 (8): 932–37.

Ståhl, Patrik L, Fredrik Salmén, Sanja Vickovic, Anna Lundmark, José Fernández Navarro, Jens Magnusson, Stefania Giacomello, et al. 2016. “Visualization and Analysis of Gene Expression in Tissue Sections by Spatial Transcriptomics.” Science 353 (6294): 78–82.

Sun, Yang, Chunlin Zhao, Yanwei Ye, Zhen Wang, Yuanhang He, Yulin Li, and Haoxun Mao. 2020. “High Expression of Fibronectin 1 Indicates Poor Prognosis in Gastric Cancer.” Oncology Letters 19 (1): 93–102.

Tanaka, Sachi, Maki Nishimura, Fumiaki Ihara, Junya Yamagishi, Yutaka Suzuki, and Yoshifumi Nishikawa. 2013. “Transcriptome Analysis of Mouse Brain Infected with Toxoplasma Gondii.” Infection and Immunity 81 (10): 3609–19.

Tian, Hui, and Hong Shen. 2005. “Hamming Distance and Hop Count Based Classification for Multicast Network Topology Inference.” In 19th International Conference on Advanced Information Networking and Applications (AINA’05) Volume 1 (AINA Papers), 1:267–72. IEEE.

———. 2006. “Multicast-Based Inference for Topology and Network-Internal Loss Performance from End-to-End Measurements.” Computer Communications 29 (11): 1936–47.

Watanabe, Sumio. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11 (Dec): 3571–94.

———. 2013. “A Widely Applicable Bayesian Information Criterion.” Journal of Machine Learning Research 14 (Mar): 867–97.

Weinstein, John N, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. 2013. “The Cancer Genome Atlas Pan-Cancer Analysis Project.” Nature Genetics 45 (10): 1113–20.

Wirth, Henry, Markus Löffler, Martin von Bergen, and Hans Binder. 2011. “Expression Cartography of Human Tissues Using Self Organizing Maps.” Nature Precedings, 1–1.

Xu, Shuaishuai, Huaxiang Xu, Wenquan Wang, Shuo Li, Hao Li, Tianjiao Li, Wuhu Zhang, Xianjun Yu, and Liang Liu. 2019. “The Role of Collagen in Cancer: From Bench to Bedside.” Journal of Translational Medicine 17 (1): 1–22.