3.4 Multivariate linear regression: The conjugate normal-normal/inverse Wishart model
Let’s study the multivariate regression setting where there are N-dimensional vectors {\boldsymbol{y}}_m, for m = 1, 2, \dots, M, such that {\boldsymbol{y}}_m = {\boldsymbol{X}} \boldsymbol{\beta}_m + \boldsymbol{\mu}_m. Here, {\boldsymbol{X}} is the N \times K matrix of common regressors, and \boldsymbol{\mu}_m is the N-dimensional vector of stochastic errors of the m-th equation. We assume that {\boldsymbol{U}} = [\boldsymbol{\mu}_1 \ \boldsymbol{\mu}_2 \ \dots \ \boldsymbol{\mu}_M] \sim MN_{N,M}({\boldsymbol{0}}, {\boldsymbol{I}}_N, {\boldsymbol{\Sigma}}), a matrix variate normal distribution in which \boldsymbol{\Sigma} is the covariance matrix of each row i of {\boldsymbol{U}}, for i = 1, 2, \dots, N, and the rows are mutually independent. Consequently, vec({\boldsymbol{U}}) \sim N_{NM}({\boldsymbol{0}}, \boldsymbol{\Sigma} \otimes {\boldsymbol{I}}_N).^{23}
This framework can be written in matrix form \begin{align*} \underbrace{ \begin{bmatrix} y_{11} & y_{12} & \dots & y_{1M}\\ y_{21} & y_{22} & \dots & y_{2M}\\ \vdots & \vdots & \dots & \vdots\\ y_{N1} & y_{N2} & \dots & y_{NM}\\ \end{bmatrix}}_{\boldsymbol{Y}} &= \underbrace{\begin{bmatrix} x_{11} & x_{12} & \dots & x_{1K}\\ x_{21} & x_{22} & \dots & x_{2K}\\ \vdots & \vdots & \dots & \vdots\\ x_{N1} & x_{N2} & \dots & x_{NK}\\ \end{bmatrix}}_{\boldsymbol{X}} \underbrace{ \begin{bmatrix} \beta_{11} & \beta_{12} & \dots & \beta_{1M}\\ \beta_{21} & \beta_{22} & \dots & \beta_{2M}\\ \vdots & \vdots & \dots & \vdots\\ \beta_{K1} & \beta_{K2} & \dots & \beta_{KM}\\ \end{bmatrix}}_{\boldsymbol{B}}\\ &+ \underbrace{\begin{bmatrix} \mu_{11} & \mu_{12} & \dots & \mu_{1M}\\ \mu_{21} & \mu_{22} & \dots & \mu_{2M}\\ \vdots & \vdots & \dots & \vdots\\ \mu_{N1} & \mu_{N2} & \dots & \mu_{NM}\\ \end{bmatrix}}_{\boldsymbol{U}}. \end{align*}
Therefore, {\boldsymbol{Y}}\mid {\boldsymbol{B}},{\boldsymbol{\Sigma}},{\boldsymbol{X}}\sim MN_{N,M}({\boldsymbol{X}}{\boldsymbol{B}},{\boldsymbol{I}}_N,{\boldsymbol{\Sigma}}), equivalently vec({\boldsymbol{Y}})\sim N_{NM}(vec({\boldsymbol{X}}{\boldsymbol{B}}),\boldsymbol{\Sigma}\otimes {\boldsymbol{I}}_N), so that \begin{align*} p({\boldsymbol{Y}}\mid {\boldsymbol{B}},{\boldsymbol{\Sigma}}, {\boldsymbol{X}})&\propto |{{\boldsymbol \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\boldsymbol{Y}}-{\boldsymbol{X}}{\boldsymbol{B}})^{\top}({\boldsymbol{Y}}-{\boldsymbol{X}}{\boldsymbol{B}}){{\boldsymbol \Sigma}}^{-1}\right]\right\rbrace \\ &=|{{\boldsymbol \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[\left({\boldsymbol{S}}+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}})\right){{\boldsymbol \Sigma}}^{-1}\right]\right\rbrace, \end{align*}
where {\boldsymbol{S}}= ({\boldsymbol{Y}}-{\boldsymbol{X}}\widehat{\boldsymbol{B}})^{\top}({\boldsymbol{Y}}-{\boldsymbol{X}}\widehat{\boldsymbol{B}}), \widehat{\boldsymbol{B}}= ({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^{\top}{\boldsymbol{Y}} (see Exercise 9).
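To fix ideas, the following minimal sketch simulates data from this model and computes \widehat{\boldsymbol{B}} and {\boldsymbol{S}}. The dimensions, the "true" values of {\boldsymbol{B}} and {\boldsymbol{\Sigma}}, and all variable names are illustrative assumptions, not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(10)

# Illustrative dimensions: N observations, K common regressors, M equations.
N, K, M = 200, 3, 2

# Common regressors (first column is an intercept) and illustrative "true" parameters.
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
B_true = rng.normal(size=(K, M))
Sigma_true = np.array([[1.0, 0.5],
                       [0.5, 2.0]])

# U has independent rows, each distributed N_M(0, Sigma_true),
# so vec(U) ~ N(0, Sigma_true kron I_N).
U = rng.multivariate_normal(np.zeros(M), Sigma_true, size=N)
Y = X @ B_true + U

# Least-squares quantities appearing in the likelihood decomposition.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)         # (X'X)^{-1} X'Y
S = (Y - X @ B_hat).T @ (Y - X @ B_hat)           # residual sum-of-squares matrix
```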
The conjugate prior for this model is \pi({\boldsymbol{B}},{\boldsymbol{\Sigma}})=\pi({\boldsymbol{B}}\mid {\boldsymbol{\Sigma}})\pi({\boldsymbol{\Sigma}}), where {\boldsymbol{B}}\mid {\boldsymbol \Sigma}\sim N_{K\times M}({\boldsymbol{B}}_{0},{\boldsymbol{V}}_{0},{\boldsymbol{\Sigma}}) and {\boldsymbol{\Sigma}}\sim IW({\boldsymbol{\Psi}}_{0},\alpha_{0}), that is, \begin{align*} \pi ({\boldsymbol{B}},{\boldsymbol{\Sigma}})\propto &\left|{\boldsymbol{\Sigma}} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0}){\boldsymbol \Sigma}^{-1}\right]\right\rbrace \\ & \times \left|{\boldsymbol \Sigma} \right|^{-(\alpha_{0}+M+1)/2}\exp\left\lbrace -\frac{1}{2}tr \left[ {\boldsymbol{\Psi}}_{0} {\boldsymbol \Sigma}^{-1}\right] \right\rbrace. \end{align*}
The posterior distribution is given by \begin{align*} \pi({\boldsymbol{B}},{\boldsymbol{\Sigma}}\mid {\boldsymbol{Y}},{\boldsymbol{X}})&\propto p({\boldsymbol{Y}}\mid {\boldsymbol{B}},{\boldsymbol{\Sigma}},{\boldsymbol{X}}) \pi({\boldsymbol{B}}\mid {\boldsymbol \Sigma})\pi({\boldsymbol{\Sigma}})\\ &\propto \left|{\boldsymbol{\Sigma}} \right|^{-\frac{N+K+\alpha_{0}+M+1}{2}}\\ &\times\exp\left\lbrace -\frac{1}{2}tr\left[(\boldsymbol{\Psi}_{0}+{\boldsymbol{S}} +({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0})\right.\right.\\ &\left.\left. +({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}}))\boldsymbol{\Sigma}^{-1}\right]\right\rbrace . \end{align*} Completing the squares on {\boldsymbol{B}} and collecting the remaining terms in the bracket yields
\begin{align*} {\boldsymbol{\Psi}}_{0}+{\boldsymbol{S}} +({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0})+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}}) & = ({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)+{\boldsymbol{\Psi}}_n, \end{align*}
where \begin{align*} {\boldsymbol{B}}_n = &({\boldsymbol{V}}_{0}^{-1}+{\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}({\boldsymbol{V}}_{0}^{-1}{\boldsymbol{B}}_{0}+{\boldsymbol{X}}^{\top}{\boldsymbol{Y}})=({\boldsymbol{V}}_{0}^{-1}+{\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}({\boldsymbol{V}}_{0}^{-1}{\boldsymbol{B}}_{0}+{\boldsymbol{X}}^{\top}{\boldsymbol{X}}\widehat{\boldsymbol{B}}),\\ {\boldsymbol{V}}_n = &({\boldsymbol{V}}_{0}^{-1}+{\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1},\\ {\boldsymbol{\Psi}}_n= &{\boldsymbol{\Psi}}_{0}+{\boldsymbol{S}}+{\boldsymbol{B}}_{0}^{\top}{\boldsymbol{V}}_{0}^{-1}{\boldsymbol{B}}_{0}+\widehat{\boldsymbol{B}}^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}\widehat{\boldsymbol{B}}-{\boldsymbol{B}}_n^{\top}{\boldsymbol{V}}_n^{-1}{\boldsymbol{B}}_n. \end{align*} Thus, the posterior distribution can be written as \begin{align*} \pi({\boldsymbol{B}},{\boldsymbol \Sigma}\mid {\boldsymbol{Y}}, {\boldsymbol{X}})\propto &\left|{\boldsymbol \Sigma} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2} tr\left[({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n) {\boldsymbol \Sigma}^{-1}\right]\right\rbrace \\ \times & \left|{\boldsymbol \Sigma} \right|^{-\frac{N+\alpha_{0}+M+1}{2}}\exp\left\lbrace -\frac{1}{2} tr \left[ {\boldsymbol{\Psi}}_n{\boldsymbol \Sigma}^{-1}\right] \right\rbrace . \end{align*} That is, \pi({\boldsymbol{B}},{\boldsymbol \Sigma}\mid {\boldsymbol{Y}}, {\boldsymbol{X}})=\pi ({\boldsymbol{B}}\mid {\boldsymbol \Sigma},{\boldsymbol{Y}},{\boldsymbol{X}})\pi({\boldsymbol \Sigma}\mid {\boldsymbol{Y}},{\boldsymbol{X}}), where {\boldsymbol{B}}\mid {\boldsymbol \Sigma},{\boldsymbol{Y}}, {\boldsymbol{X}} \sim N_{K\times M}({\boldsymbol{B}}_n,{\boldsymbol{V}}_n,{\boldsymbol \Sigma}) and {\boldsymbol \Sigma}\mid {\boldsymbol{Y}},{\boldsymbol{X}} \sim IW({\boldsymbol{\Psi}}_n,{\alpha}_n) with \alpha_n= N+\alpha_{0}. Observe again that the posterior mean {\boldsymbol{B}}_n is a weighted average of prior and sample information, so that {\boldsymbol{V}}_0\rightarrow\infty implies {\boldsymbol{B}}_n\rightarrow\widehat{\boldsymbol{B}}, as we showed in the univariate linear model.
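Continuing the sketch above, the posterior hyperparameters follow directly from these formulas. The prior values below are illustrative choices, not recommendations.

```python
# Illustrative prior hyperparameters.
B0 = np.zeros((K, M))
V0 = 100.0 * np.eye(K)      # weakly informative scale for the rows of B
Psi0 = np.eye(M)
alpha0 = M + 2              # small prior "degrees of freedom"

# Posterior hyperparameters.
V0_inv = np.linalg.inv(V0)
Vn = np.linalg.inv(V0_inv + X.T @ X)
Bn = Vn @ (V0_inv @ B0 + X.T @ X @ B_hat)          # tends to B_hat as V0 -> infinity
Psin = Psi0 + S + B0.T @ V0_inv @ B0 + B_hat.T @ X.T @ X @ B_hat \
       - Bn.T @ np.linalg.inv(Vn) @ Bn
Psin = 0.5 * (Psin + Psin.T)                       # symmetrize against round-off
alphan = N + alpha0
```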
The marginal posterior for {\boldsymbol{B}} is given by \begin{align*} \pi({\boldsymbol{B}}\mid {\boldsymbol{Y}},{\boldsymbol{X}})&\propto \int_{\boldsymbol{\mathcal{S}}} \left|{\boldsymbol \Sigma} \right|^{-(\alpha_n+K+M+1)/2}\\ &\times\exp\left\lbrace -\frac{1}{2} tr\left\{\left[({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)+{\boldsymbol{\Psi}}_n \right] {\boldsymbol \Sigma}^{-1}\right\}\right\rbrace d{\boldsymbol{\Sigma}} \\ &\propto|({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)+{\boldsymbol{\Psi}}_n|^{-(K+\alpha_n)/2}\\ &=\left[|{\boldsymbol{\Psi}}_n|\times|{\boldsymbol{I}}_K+{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n){\boldsymbol{\Psi}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}|\right]^{-(\alpha_n+1-M+K+M-1)/2}\\ &\propto|{\boldsymbol{I}}_K+{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n){\boldsymbol{\Psi}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}|^{-(\alpha_n+1-M+K+M-1)/2}. \end{align*}
The second line uses the kernel of the inverse Wishart distribution to integrate {\boldsymbol{\Sigma}} out, the third line uses Sylvester's theorem, and the last line is the kernel of a matrix t-distribution, that is, {\boldsymbol{B}}\mid {\boldsymbol{Y}},{\boldsymbol{X}}\sim T_{K\times M}({\boldsymbol{B}}_n,{\boldsymbol{V}}_n,{\boldsymbol{\Psi}}_n) with \alpha_n+1-M degrees of freedom.
Observe that, based on its marginal distribution, vec({\boldsymbol{B}}) has mean vec({\boldsymbol{B}}_n) and covariance matrix ({\boldsymbol{\Psi}}_n\otimes{\boldsymbol{V}}_n)/(\alpha_n-M-1), consistent with the vectorization convention vec({\boldsymbol{U}})\sim N_{NM}({\boldsymbol{0}},\boldsymbol{\Sigma}\otimes{\boldsymbol{I}}_N) used above. On the other hand, the covariance matrix based on the conditional distribution is {\boldsymbol{\Sigma}}\otimes{\boldsymbol{V}}_n, where the posterior mean of {\boldsymbol{\Sigma}} is {\boldsymbol{\Psi}}_n/(\alpha_n-M-1).
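A convenient by-product of this conditional structure is exact posterior simulation by composition: draw {\boldsymbol{\Sigma}} from its inverse Wishart marginal and then {\boldsymbol{B}} from its matrix normal conditional. The sketch below (continuing the example above, and assuming scipy is available) also checks the marginal moments of vec({\boldsymbol{B}}) by Monte Carlo.

```python
from scipy.stats import invwishart

# Sigma | Y, X ~ IW(Psin, alphan), then B | Sigma, Y, X ~ N_{KxM}(Bn, Vn, Sigma).
n_draws = 5000
iw = invwishart(df=alphan, scale=Psin)
L_Vn = np.linalg.cholesky(Vn)
B_draws = np.empty((n_draws, K, M))
for s in range(n_draws):
    Sigma_s = iw.rvs(random_state=rng)
    B_draws[s] = Bn + L_Vn @ rng.normal(size=(K, M)) @ np.linalg.cholesky(Sigma_s).T

# Monte Carlo check of the marginal mean and covariance of vec(B) (column-stacking vec).
vec_draws = np.array([b.flatten(order="F") for b in B_draws])
cov_exact = np.kron(Psin, Vn) / (alphan - M - 1)
print(np.abs(vec_draws.mean(axis=0) - Bn.flatten(order="F")).max())
print(np.abs(np.cov(vec_draws, rowvar=False) - cov_exact).max())
```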
The marginal likelihood is
\begin{align*}
p({\boldsymbol{Y}})&=\int_{\mathcal{B}}\int_{\mathcal{S}}\left\{ (2\pi)^{-NM/2} |{{\boldsymbol \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[\left({\boldsymbol{S}}+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}})\right){{\boldsymbol \Sigma}}^{-1}\right]\right\rbrace\right.\\
&\times (2\pi)^{-KM/2}\left|{\boldsymbol V}_0 \right|^{-M/2} \left|{\boldsymbol{\Sigma}} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0}){\boldsymbol \Sigma}^{-1}\right]\right\rbrace \\
&\left. \times \frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)} \left|{\boldsymbol \Sigma} \right|^{-(\alpha_{0}+M+1)/2}\exp\left\lbrace -\frac{1}{2}tr \left[ {\boldsymbol{\Psi}}_{0} {\boldsymbol \Sigma}^{-1}\right] \right\rbrace \right\} d{\boldsymbol{\Sigma}} d{\boldsymbol B}\\
&=(2\pi)^{-M(N+K)/2}\left|{\boldsymbol V}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}\\
&\times\int_{\mathcal{B}}\int_{\mathcal{S}} \left\{ \left|{\boldsymbol \Sigma} \right|^{-(\alpha_{0}+N+K+M+1)/2}\right.\\
&\left. \exp\left\lbrace -\frac{1}{2}tr\left[\left({\boldsymbol{S}}+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}})+({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0})+{\boldsymbol{\Psi}}_0\right){{\boldsymbol \Sigma}}^{-1}\right]\right\rbrace\right\}d{\boldsymbol{\Sigma}} d{\boldsymbol B}\\
&=(2\pi)^{-M(N+K)/2}\left|{\boldsymbol V}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}2^{M(\alpha_n+K)/2}\Gamma_M((\alpha_n+K)/2)\\
&\times \int_{\mathcal{B}}\left|{\boldsymbol{S}}+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}})+({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0})+{\boldsymbol{\Psi}}_0\right|^{-(\alpha_n+K)/2}d{\boldsymbol{B}}\\
&=(2\pi)^{-M(N+K)/2}\left|{\boldsymbol V}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}2^{M(\alpha_n+K)/2}\Gamma_M((\alpha_n+K)/2)\\
&\times \int_{\mathcal{B}}\left|({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)+{\boldsymbol{\Psi}}_n\right|^{-(\alpha_n+K)/2}d{\boldsymbol{B}}\\
&=(2\pi)^{-M(N+K)/2}\left|{\boldsymbol V}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}2^{M(\alpha_n+K)/2}\Gamma_M((\alpha_n+K)/2)\\
&\times \int_{\mathcal{B}}\left[|{\boldsymbol{\Psi}}_n|\times |{\boldsymbol{I}}_{K}+{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n){\boldsymbol{\Psi}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}|\right]^{-(\alpha_n+K)/2}d{\boldsymbol{B}}\\
&=|{\boldsymbol{\Psi}}_n|^{-(\alpha_n+K)/2}(2\pi)^{-M(N+K)/2}\left|{\boldsymbol V}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}2^{M(\alpha_n+K)/2}\Gamma_M((\alpha_n+K)/2)}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}\\
&\times \int_{\mathcal{B}}\left| {\boldsymbol{I}}_{K}+{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n){\boldsymbol{\Psi}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}\right|^{-(\alpha_n+1-M+K+M-1)/2}d{\boldsymbol{B}}\\
&=|{\boldsymbol{\Psi}}_n|^{-(\alpha_n+K)/2}(2\pi)^{-M(N+K)/2}\left|{\boldsymbol V}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}2^{M(\alpha_n+K)/2}\Gamma_M((\alpha_n+K)/2)}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}\\
&\times \pi^{MK/2}\frac{\Gamma_M((\alpha_n+1-M+M-1)/2)}{\Gamma_M((\alpha_n+1-M+K+M-1)/2)}|{\boldsymbol{\Psi}}_n|^{K/2}|{\boldsymbol{V}}_n|^{M/2}\\
&=\frac{|{\boldsymbol{V}}_n|^{M/2}}{|{\boldsymbol{V}}_0|^{M/2}}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{|{\boldsymbol{\Psi}}_n|^{\alpha_n/2}}\frac{\Gamma_M(\alpha_n/2)}{\Gamma_M(\alpha_0/2)}\pi^{-MN/2}.
\end{align*}
The third equality follows from the kernel of an inverse Wishart distribution, the fifth from Sylvester’s theorem, and the seventh from the kernel of a matrix t-distribution.
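The last expression is easy to evaluate on the log scale, for instance when comparing models through Bayes factors. Here is a sketch continuing the example above; the function name is ours, and scipy.special.multigammaln evaluates \log\Gamma_M(\cdot).

```python
from scipy.special import multigammaln

def log_marginal_likelihood(V0, Vn, Psi0, Psin, alpha0, alphan, N, M):
    """Closed-form log p(Y) from the last line of the derivation above."""
    def logdet(A):
        return np.linalg.slogdet(A)[1]
    return (0.5 * M * (logdet(Vn) - logdet(V0))
            + 0.5 * alpha0 * logdet(Psi0) - 0.5 * alphan * logdet(Psin)
            + multigammaln(alphan / 2.0, M) - multigammaln(alpha0 / 2.0, M)
            - 0.5 * M * N * np.log(np.pi))

print(log_marginal_likelihood(V0, Vn, Psi0, Psin, alpha0, alphan, N, M))
```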
Observe that this last expression is the multivariate generalization of the marginal likelihood of the univariate regression model. Taking into account that \begin{align*} ({\boldsymbol{A}}+{\boldsymbol{B}})^{-1}&={\boldsymbol{A}}^{-1}-{\boldsymbol{A}}^{-1}({\boldsymbol{A}}^{-1}+{\boldsymbol{B}}^{-1})^{-1}{\boldsymbol{A}}^{-1}\\ &={\boldsymbol{B}}^{-1}-{\boldsymbol{B}}^{-1}({\boldsymbol{A}}^{-1}+{\boldsymbol{B}}^{-1})^{-1}{\boldsymbol{B}}^{-1}\\ &={\boldsymbol{A}}^{-1}({\boldsymbol{A}}^{-1}+{\boldsymbol{B}}^{-1})^{-1}{\boldsymbol{B}}^{-1}, \end{align*}
we can show that {\boldsymbol{\Psi}}_{n}={\boldsymbol{\Psi}}_{0}+{\boldsymbol{S}}+(\widehat{\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}\left[{\boldsymbol{V}}_{0}+({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}\right]^{-1}(\widehat{\boldsymbol{B}}-{\boldsymbol{B}}_{0}) (see Exercise 7). Therefore, the marginal likelihood rewards fit (a smaller sum of squares {\boldsymbol{S}}), similarity between prior and sample information regarding the location parameters, and information gains in variability from {\boldsymbol{V}}_0 to {\boldsymbol{V}}_n.
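A quick numerical check of this alternative expression for {\boldsymbol{\Psi}}_n, continuing the example above:

```python
# Psi_n written in terms of the discrepancy between B_hat and B0;
# it should agree with the update rule computed earlier (up to round-off).
W = np.linalg.inv(V0 + np.linalg.inv(X.T @ X))
Psin_alt = Psi0 + S + (B_hat - B0).T @ W @ (B_hat - B0)
print(np.abs(Psin - Psin_alt).max())
```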
Given a matrix of regressors {\boldsymbol{X}}_0 for N_0 unobserved units, the predictive density of {\boldsymbol{Y}}_0 given {\boldsymbol{Y}}, \pi({\boldsymbol{Y}}_0\mid {\boldsymbol{Y}}), is a matrix t distribution, T_{N_0,M}(\alpha_n-M+1,{\boldsymbol{X}}_0{\boldsymbol{B}}_n,{\boldsymbol{I}}_{N_0}+{\boldsymbol{X}}_0{\boldsymbol{V}}_n{\boldsymbol{X}}_0^{\top},{\boldsymbol{\Psi}}_n) (see Exercise 6). Observe that the prediction is centered at {\boldsymbol{X}}_0{\boldsymbol{B}}_n, and the covariance matrix of vec({\boldsymbol{Y}}_0) is \frac{{\boldsymbol{\Psi}}_n\otimes({\boldsymbol{I}}_{N_0}+{\boldsymbol{X}}_0{\boldsymbol{V}}_n{\boldsymbol{X}}_0^{\top})}{\alpha_n-M-1}.
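Continuing the example, and with an illustrative out-of-sample design matrix X0, the prediction can be summarized through this closed-form mean and covariance, or simulated by composition: draw ({\boldsymbol{B}},{\boldsymbol{\Sigma}}) from the posterior and then {\boldsymbol{Y}}_0 from its sampling distribution.

```python
# Illustrative design matrix for N0 unobserved units.
N0 = 5
X0 = np.column_stack([np.ones(N0), rng.normal(size=(N0, K - 1))])

Y0_mean = X0 @ Bn                                   # predictive location X0 Bn
H = np.eye(N0) + X0 @ Vn @ X0.T                     # row-scale matrix of the matrix t
cov_vecY0 = np.kron(Psin, H) / (alphan - M - 1)     # Cov[vec(Y0) | Y, X]

# One predictive draw by composition.
Sigma_s = invwishart(df=alphan, scale=Psin).rvs(random_state=rng)
B_s = Bn + L_Vn @ rng.normal(size=(K, M)) @ np.linalg.cholesky(Sigma_s).T
Y0_draw = X0 @ B_s + rng.multivariate_normal(np.zeros(M), Sigma_s, size=N0)
```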
^{23} vec denotes the vectorization operation, and \otimes denotes the Kronecker product.