A Methodology

In this section, we start with a brief discussion of the model and interpretations of each term of the model. Next, A detailed discussion of the estimation procedure for the SpaceX model in equation 1 in section 2 of the paper (also mentioned in equation (A.2) ) is provided. Finally we discuss the methodological novelty of our SpaceX method.

A.1 SpaceX model

The gene expression is modeled with a Poisson distribution as \[\begin{equation} y^{c}_{g}({\bf s}_{i}) \sim \text{Poi}\{ M^{c}({\bf s}_{i}) \lambda^{c}_{g}({\bf s}_{i})\}. \tag{A.1} \end{equation}\] The interpretations of the notations in equation (A.1) remains same as mentioned in the section 2.2 of the paper. We denote \(\Lambda^{c}\) as a \(G \times N_{c}\) (\(N_{c}\) denotes size of the c-th cluster) matrix containing the rate parameters for all genes and c-th cluster. Subsequently, we model the cluster specific and spatially dependent rate parameter \(\bf \Lambda^{c}\) with an additive log-linear equation, i.e.  \[\begin{equation} \text{log}({\bf \Lambda}^{c}) = {\bf B}^{c}{\bf X}^{c} + {\bf S}^{c} + {\bf \Phi F} + {\bf \Psi^{c} D^{c}} + {\boldsymbol{\mathcal{E}}}^{c}. \tag{A.2} \end{equation}\]

We clearly mention the shared and cluster-specific parameters and their interpretations table A.1.

Parameters Shared Cluster-specific Interpretations
\({\bf B}^{c}{\bf X}^{c}\) \(\times\) \(\checkmark\) Covariate effect
\({\bf S}^{c}\) \(\times\) \(\checkmark\) Spatial information
\({\bf \Phi F}\) \(\checkmark\) \(\times\) Shared loadings and factors
\(\bf \Psi^{c} D^{c}\) \(\times\) \(\checkmark\) Cluster loadings and factors
\(\boldsymbol{\mathcal{E}}^{c}\) \(\times\) \(\checkmark\) Error matrix
Table A.1: Shared and cluster-specific parameters along with their corresponding interpretations.

A full-scale MCMC will be computationally expensive on a complex hierarchical model. For computational advantage, we decompose the model into two parts (I) sPMM: spatial Poisson mixed model (S. Sun et al. 2018) and (II) MSFA: Multi-study factor analysis model (De Vito et al. 2021). We enable this model decomposition through a standard Gaussian random variable.

A.2 Poisson Mixed Model

We can break the SpaceX model (A.2) and write the spatial Poisson mixed model as

\[\begin{align} \begin{split} & \text{log}({\boldsymbol \lambda}^{c}_{g}) = {{\bf X}^{c}}^{T} {\boldsymbol \beta}^{c}_{g} + {\bf s}^{c} + {\bf z}^{c}_{g}, \\ & {\boldsymbol \lambda}^{c}_{g} = ( \lambda^{c}_{1g}, \dots , \lambda^{c}_{N_{c}g} )^{T}, \\ &{\bf s}^{c} = ( s^{c}_{1}, \dots , s^{c}_{N_{c}} )^{T} \sim \text{MVN}(0, \sigma_{1}^{2}\Omega^{c}(s)), \\ &{\bf z}^{c}_{g} = (z^{1}_{1g}, \dots , z^{c}_{N_{c}g})^{T} \sim \text{MVN}(0, \sigma_{2}^{2} I_{N_{c} \times N_{c} }). \end{split} \tag{A.3} \end{align}\]

Here \(\Omega^{c}(s_{1}, s_{2}) = \text{exp} (- \mid \mid s_{1} - s_{2} \mid \mid^{2} / 2 \rho^{2}_{c} ),\) \(c = 1, \dots, C\). We estimate the length scale parameter of spatial kernel \(\rho_{c}\) based on the steps discussed in section 1 of supplementary information in S. Sun, Zhu, and Zhou (2020). Here \(Z^{c}_{g}\) captures the cluster specific latent gene expressions and a multi-variate hierarchical modeling of \(Z^{c}_{g}(s_{i})\) will help us to identify the gene co-expression network.

A.3 Multi-Study Factor Model (MSFA)

The 2nd stage of the modeling framework is multi-study factor analysis (De Vito et al. 2021) which is provided as follows \[\begin{align} \begin{split} & \hat{\bf z}^{c}_{i} = \boldsymbol \Phi {\bf f}_{i} + \boldsymbol \Psi^{c} {\bf d}^{c}_{i} + {\bf e}^{c}_{i}, \\ & {\bf f}_{i} \sim N_{K}(0, {\bf I}_{K}), \hspace{0.5cm} {\bf d}^{c}_{i} \sim N_{K_{c}}(0,{\bf I}_{K_{c}}),\\ & {\bf e}^{c}_{i} \sim N_{G}(0,\boldsymbol \Xi_{c}), \hspace{0.5cm} \boldsymbol \Xi_{c} = \text{diag}(\xi^{c}_{1}, \dots, \xi^{c}_{G} ). \end{split} \tag{A.4} \end{align}\] The marginal distribution of \(\hat{\bf z}^{c}_{i}\) is a multivariate normal distribution with mean \(0\) and covariance matrix \(\Sigma_{c}\) s.t. \[\begin{align} \Sigma_{c} = \Phi \Phi^{T} + \Psi^{c} {\Psi^{c}}^{T} + \Xi_{c} = \Sigma_{\Phi} + \Sigma_{\Psi^{c}} + \Xi_{c} \tag{A.5} \end{align}\] \(\Sigma_{\Phi} = \Phi \Phi^{T}\) and \(\Sigma_{\Psi^{c}} = \Psi^{c} {\Psi^{c}}^{T}\) are covariance of shared and cluster specific factors respectively. The decomposition of \(\Sigma_{c}\) in (A.5) is not a unique since we can set \(\Phi^{*} = \Phi Q\) and \({\Psi^{*}}^{c} = \Psi^{c} Q_{c}\) where \(Q\) and \(Q_{c}\) are square orthonormal matrices. This will also lead to the same decomposition \(\Sigma^{c} = \Phi^{*} {\Phi^{*}}^{T} + {\Psi^{*}}^{c} {{\Psi^{*}}^{c}}^{T} = \Phi \Phi^{T} + \Psi_{c} \Psi_{c}^{T}\). To overcome the indeterminacy through orthonormal matrices, the factor loading matrices are restricted to be lower triangular matrices (Geweke and Zhou 1996; Lopes and West 2004).

A.4 Multiplicative gamma shrinkage prior

We follow the same steps from De Vito et al. (2021) and place multiplicative gamma shrinkage prior (Bhattacharya and Dunson 2011) prior on the shared and cluster specific loading matrices i.e. \(\Phi\) and \(\Psi_{c}\) \(c=1,\dots,C\). The shared and cluster specific latent factors (\(K\) and \(K_{c}\) respectively) are selected following methodology described in section 3.3 of De Vito et al. (2021). The multiplicative gamma prior on elements of shared covariance matrices are provided as follows \[\begin{equation} \begin{split} & \phi_{gk} \mid \delta_{gk}, \eta_{k} \sim N(0, \delta_{gk}^{-1} \eta_{k}^{-1}), \hspace{0.5cm} g = 1, \dots , G, \hspace{0.25cm} k = 1, \dots, \infty, \\ & \delta_{gk} \sim \Gamma(\frac{\nu}{2}, \frac{\nu}{2}) \hspace{0.5cm} \eta_{k} = \prod_{j=1}^{k} \zeta_{j} \hspace{0.5cm} \zeta_{1} \sim \Gamma(a_{1},1) \hspace{0.5cm} \zeta_{j} \sim \Gamma(a_{2},1), \hspace{0.2cm} j \ge 2. \end{split} \end{equation}\] Here \(\delta_{gk}\) is the local shrinkage parameter for \(G\) column elements of \(k\)th column and \(\eta_{k}\) is the global shrinkage parameter where \(\zeta_{j}\) \((j=1,2. \dots)\) are independent. We repeat the same process to posit prior on the elements of cluster-specific loading matrices

\[\begin{equation} \begin{split} & \psi^{c}_{gk_{c}} \mid \delta^{c}_{gk_{c}}, \eta^{c}_{k_{c}} \sim N(0, {\delta^{c^{-1}}_{gk_{c}}} {\eta^{c^{-1}}_{k_{c}}}), \hspace{0.5cm} g = 1, \dots , G, \hspace{0.2cm} k_{c} = 1, \dots, \infty \hspace{0.2cm} \text{and} \hspace{0.2cm} c = 1, \dots , C,\\ & \delta^{c}_{gk_{c}} \sim \Gamma(\frac{\nu^{c}}{2}, \frac{\nu^{c}}{2}) \hspace{0.5cm} \eta^{c}_{k_{c}} = \prod_{j=1}^{k_{c}} \zeta^{c}_{j} \hspace{0.5cm} \zeta^{c}_{1} \sim \Gamma(a^{c}_{1},1) \hspace{0.5cm} \zeta^{c}_{j} \sim \Gamma(a^{c}_{2},1), \hspace{0.2cm} j \ge 2. \end{split} \end{equation}\] Here \(\delta^{c}_{gk_{c}}\) \(\eta^{c}_{k_{c}}\) are local and global parameters respectively and \(\zeta_{j}^{c}\) \((c=1,2, \dots C)\) are independent of each other. We determine \(K\) and \(K_{c}\) following methodology described in section 3.3 of De Vito et al. (2021).

A.5 Novelty in Methodology and Estimation Procedure

The SpaceX model (A.2) incorporates the spatial information in Poisson likelihood and builds on a factor model based components to estimate the gene co-expression networks. The PQLseq algorithm (S. Sun et al. 2018) is based on a Poisson likelihood but it does not incorporate the spatial information. The hierarchical factor analysis model (MSFA, De Vito et al. (2021)) model considers a Gaussian likelihood and does not take into account the spatial information. The SpaceX model incorporates the discrete nature of the single cell sequencing data and builds on a Poisson likelihood. The model accounts for the spatial effect whereas other two models do not. One can infer the gene co-expression networks while considering the spatial information but MSFA fails to incorporate the spatial locations. The detailed comparison between 3 models is summarized in Table A.2.

An MCMC based algorithm for the SpaceX model will be computationally inefficient. We develop a tractable and computationally scalable Bayesian algorithm for the estimation procedure of the novel joint modeling framework. We decompose the whole model into two essential components (I) spatial Poisson model and (II) hierarchical factor analysis model.

SpaceX model PQLseq MSFA
Spatial information \(\checkmark\) X X
Poisson likelihood \(\checkmark\) \(\checkmark\) X
Network inference \(\checkmark\) X \(\checkmark\)
Gaussian likelihood X X \(\checkmark\)
Table A.2: Methodological comparison between SpaceX, PQLseq, MSFA models.

A.5.1 Differences from individual methods (PQLseq, SPARK and MSFA)

For spatial Poisson model, we use the scalable penalized quasi-likelihood algorithm named PQLseq. The algorithm is based on the generalized linear mixed model framework which accounts for the count nature of the single cell sequencing data. The algorithm can be used for detection of deferentially expressed genes. The PQLseq algorithm does not take into account the spatial information and the SPARK (S. Sun, Zhu, and Zhou 2020) method was developed to address this limitation. Although, the SPARK method only estimates parameters while setting the spatial variance parameter to zero and develops a score test based procedure to identify spatially expressed genes. We build on the PQLseq algorithm for our estimation procedure of spatial Poisson mixed model such that the algorithm can accommodate the spatial information and estimates the parameters without setting the spatial variance parameter to zero. We use the modified version of the PQLseq algorithm to obtain the latent gene expressions and carry it forward in our next framework.

We use the multi-study factor analysis (MSFA) model (De Vito et al. 2021) on the latent gene expression matrix to estimate shared and cluster specific gene co-expressions. The MSFA model can not adapt for the count nature of the sequencing data and corresponding spatial locations. Our joint modeling framework in the SpaceX model addresses both concerns.

A.6 Modifiend version of the model for cell-type based clusters

We aimed to modify our model to explicitly accommodate the spatial correlation between cell-type based clusters. To do so, we follow the standard notations as discussed throughout the paer and Supplementary Materials. Here, \(y^{c}_{g}(s_{i})\) represents the gene expression for g-th gene for c-th cluster w.r.t. \(s_{i}\) spatial locations. The gene expression is being modeled with a Poisson distribution as where \(M\) is the normalizing constant and \(\lambda\) is the rate parameter. The spatially dependent rate parameter \(\lambda\) is modeled with a log-linear equation \[\begin{equation} \text{log}({\bf \Lambda}^{c}) = {\bf B}^{c}{\bf X}^{c} + {\bf S}^{c} + {\bf \Phi F} + {\bf \Gamma^{c}} {\bf \Psi^{c} D^{c}} + {\boldsymbol{\mathcal{E}}}^{c}. \tag{A.6} \end{equation}\] Explanation of all the parameters except for the new parameter (\(\bf \Gamma^{c}\)) remains the same. In this model, \(\bf \Gamma^{c}\) now models the cell-type specific spatial process based on spatial distance between cell types. We adjust for spatial correlations for nearby cell-types through the term \(\bf S^{C}\) and cross cell-type specific adjustment is factored in through \(\bf \Gamma^{C}\). While the extended model looks appealing, unfortunately, we realized that there are several levels of difficulties that need careful thinking before being fully incorporated in the model. For example, we are working with a Poisson likelihood which is non-Gaussian and spatial information needs to be included in the model. Identifiability issue is the main challenge for the model in (A.6). Based on factor models literature, the loading matrices are usually restricted to be a lower triangular matrix for identifiability reasons in case of a Gaussian likelihood (Geweke and Zhou 1996; Lopes and West 2004). We need non-standard identifiable restrictions to estimate \(\bf \Gamma^{c}\) and \(\bf \Psi^{c}\) since they are incorporated in the model in multiplicative manner. Given these methodological underpinnings that require more rigorous theoretical work to overcome such difficulties, we leave this non-trivial methodological estimation algorithm as future work.

References

Bhattacharya, Anirban, and David B Dunson. 2011. “Sparse Bayesian Infinite Factor Models.” Biometrika, 291–306.
De Vito, Roberta, Ruggero Bellio, Lorenzo Trippa, and Giovanni Parmigiani. 2021. “Bayesian Multistudy Factor Analysis for High-Throughput Biological Data.” The Annals of Applied Statistics 15 (4): 1723–41.
Geweke, John, and Guofu Zhou. 1996. “Measuring the Pricing Error of the Arbitrage Pricing Theory.” The Review of Financial Studies 9 (2): 557–87.
Lopes, Hedibert Freitas, and Mike West. 2004. “Bayesian Model Assessment in Factor Analysis.” Statistica Sinica, 41–67.
Sun, Shiquan, Jiaqiang Zhu, Sahar Mozaffari, Carole Ober, Mengjie Chen, and Xiang Zhou. 2018. “Heritability Estimation and Differential Analysis of Count Data with Generalized Linear Mixed Models in Genomic Sequencing Studies.” Bioinformatics 35 (3): 487–96.
Sun, Shiquan, Jiaqiang Zhu, and Xiang Zhou. 2020. “Statistical Analysis of Spatial Expression Patterns for Spatially Resolved Transcriptomic Studies.” Nature Methods 17 (2): 193–200.