## 4.4 Multivariate linear regression: The conjugate normal-normal/inverse Wishart model

Let’s study the multivariate regression setting where there are $$M$$ $$N$$-dimensional vectors $${\bf{y}}_m$$, $$m=1,2,\dots,M$$, such that $${\bf{y}}_m={\bf{X}}\beta_m+\mu_m$$, where $${\bf{X}}$$ is the $$N\times K$$ matrix of common regressors and $$\mu_m$$ is the $$N$$-dimensional vector of stochastic errors of the $$m$$-th equation. Stacking the error vectors, $${\bf{U}}=[\mu_1 \ \mu_2 \ \dots \ \mu_M]\sim MN_{N,M}({\bf{0}}, {\bf{I}}_N, {\bf\Sigma})$$, that is, a matrix variate normal distribution where $$\bf\Sigma$$ is the covariance matrix of each $$i$$-th row of $${\bf{U}}$$, $$i=1,2,\dots,N$$, and we assume independence between the rows. Then, $$vec({\bf U})\sim N_{N\times M}({\bf 0}, \bf{\Sigma}\otimes {\bf{I}}_N)$$.[^1]

This framework can be written in matrix form as

\begin{align} \underbrace{ \begin{bmatrix} y_{11} & y_{12} & \dots & y_{1M}\\ y_{21} & y_{22} & \dots & y_{2M}\\ \vdots & \vdots & \dots & \vdots\\ y_{N1} & y_{N2} & \dots & y_{NM}\\ \end{bmatrix}}_{\bf{Y}} &= \underbrace{\begin{bmatrix} x_{11} & x_{12} & \dots & x_{1K}\\ x_{21} & x_{22} & \dots & x_{2K}\\ \vdots & \vdots & \dots & \vdots\\ x_{N1} & x_{N2} & \dots & x_{NK}\\ \end{bmatrix}}_{\bf{X}} \underbrace{ \begin{bmatrix} \beta_{11} & \beta_{12} & \dots & \beta_{1M}\\ \beta_{21} & \beta_{22} & \dots & \beta_{2M}\\ \vdots & \vdots & \dots & \vdots\\ \beta_{K1} & \beta_{K2} & \dots & \beta_{KM}\\ \end{bmatrix}}_{\bf{B}}\\ &+ \underbrace{\begin{bmatrix} \mu_{11} & \mu_{12} & \dots & \mu_{1M}\\ \mu_{21} & \mu_{22} & \dots & \mu_{2M}\\ \vdots & \vdots & \dots & \vdots\\ \mu_{N1} & \mu_{N2} & \dots & \mu_{NM}\\ \end{bmatrix}}_{\bf{U}} \end{align}
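As a quick sanity check of this setup, the data generating process $${\bf{Y}}={\bf{X}}{\bf{B}}+{\bf{U}}$$ can be simulated by drawing the rows of $${\bf{U}}$$ i.i.d. from $$N_M({\bf{0}},{\bf\Sigma})$$, which is equivalent to $$vec({\bf{U}})\sim N({\bf{0}},{\bf\Sigma}\otimes{\bf{I}}_N)$$. A minimal NumPy sketch; the dimensions $$N, K, M$$ and the particular values of $${\bf{B}}$$ and $${\bf\Sigma}$$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 100, 3, 2                      # observations, regressors, equations (illustrative)

X = rng.normal(size=(N, K))              # common regressors
B = rng.normal(size=(K, M))              # coefficient matrix
Sigma = np.array([[1.0, 0.5],            # cross-equation covariance of each row of U
                  [0.5, 2.0]])

# Rows of U are i.i.d. N_M(0, Sigma), i.e. vec(U) ~ N(0, Sigma kron I_N)
U = rng.multivariate_normal(np.zeros(M), Sigma, size=N)
Y = X @ B + U                            # N x M response matrix
```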

Therefore, $${\bf{Y}}\sim N_{N\times M}({\bf{X}}{\bf{B}},\bf{\Sigma}\otimes {\bf{I}}_N)$$,[^2] and the likelihood function is

\begin{align} p({\bf{Y}}| {\bf{B}},{\bf\Sigma})&\propto |{{\bf \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\bf{Y}}-{\bf{X}}{\bf{B}})^{\top}({\bf{Y}}-{\bf{X}}{\bf{B}}){{\bf \Sigma}}^{-1}\right]\right\rbrace \\ &=|{{\bf \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[{\bf{S}}+({\bf{B}}-\widehat{\bf{B}})^{\top}{\bf{X}}^{\top}{\bf{X}}({\bf{B}}-\widehat{\bf{B}})\right]{{\bf \Sigma}}^{-1}\right\rbrace, \end{align}

where $${\bf{S}}= ({\bf{Y}}-{\bf{X}}\widehat{\bf{B}})^{\top}({\bf{Y}}-{\bf{X}}\widehat{\bf{B}})$$ and $$\widehat{\bf{B}}= ({\bf{X}}^{\top}{\bf{X}})^{-1}{\bf{X}}^{\top}{\bf{Y}}$$ (see Exercise 7).
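The statistics $$\widehat{\bf{B}}$$ and $${\bf{S}}$$ can be computed in a few lines; a minimal NumPy sketch (the function name is an assumption), using a linear solve rather than an explicit inverse of $${\bf{X}}^{\top}{\bf{X}}$$:

```python
import numpy as np

def ols_stats(Y, X):
    """Return B_hat = (X'X)^{-1} X'Y and S = (Y - X B_hat)'(Y - X B_hat)."""
    B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # K x M, columnwise OLS
    E = Y - X @ B_hat                           # N x M residual matrix
    S = E.T @ E                                 # M x M sum of squared errors
    return B_hat, S
```

In a noiseless example ($${\bf{Y}}={\bf{X}}{\bf{B}}$$ exactly), `B_hat` recovers $${\bf{B}}$$ and `S` is the zero matrix.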

The conjugate prior for this model is $$\pi({\bf{B}},{\bf{\Sigma}})=\pi({\bf{B}}|{\bf{\Sigma}})\pi({\bf{\Sigma}})$$ where $$\pi({\bf{B}}|{\bf \Sigma})\sim N_{K\times M}({\bf{B}}_{0},{\bf\Sigma} \otimes {\bf{V}}_{0})$$ and $$\pi({\bf\Sigma})\sim IW({\bf{\Psi}}_{0},\alpha_{0})$$, that is,

\begin{align} \pi ({\bf{B}},{\bf\Sigma})\propto &\left|{\bf\Sigma} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\bf{B}}-{\bf{B}}_{0})^{\top}{\bf{V}}_{0}^{-1}({\bf{B}}-{\bf{B}}_{0})\right] {\bf \Sigma}^{-1}\right\rbrace \\ & \times \left|{\bf \Sigma} \right|^{-(\alpha_{0}+M+1)/2}\exp\left\lbrace -\frac{1}{2}tr \left[ {\bf{\Psi}}_{0} {\bf \Sigma}^{-1}\right] \right\rbrace. \end{align}
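Drawing from this conjugate prior proceeds in two steps: first $${\bf\Sigma}\sim IW({\bf{\Psi}}_{0},\alpha_{0})$$, then $${\bf{B}}|{\bf\Sigma}\sim N_{K\times M}({\bf{B}}_{0},{\bf\Sigma}\otimes{\bf{V}}_{0})$$. A sketch using SciPy's `invwishart` and `matrix_normal` distributions (the hyperparameter values are illustrative assumptions; note that for `matrix_normal` the covariance $${\bf\Sigma}\otimes{\bf{V}}_{0}$$ of $$vec({\bf{B}})$$ corresponds to `colcov=Sigma`, `rowcov=V0`):

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

rng = np.random.default_rng(1)
K, M = 3, 2                          # illustrative dimensions

B0 = np.zeros((K, M))                # prior mean of B
V0 = 10.0 * np.eye(K)                # prior row covariance of B
Psi0 = np.eye(M)                     # inverse Wishart scale matrix
alpha0 = M + 2                       # inverse Wishart degrees of freedom

# Step 1: Sigma ~ IW(Psi0, alpha0)
Sigma = invwishart.rvs(df=alpha0, scale=Psi0, random_state=rng)
# Step 2: B | Sigma ~ N_{K x M}(B0, Sigma kron V0)
B = matrix_normal.rvs(mean=B0, rowcov=V0, colcov=Sigma, random_state=rng)
```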

The posterior distribution is given by

\begin{align} \pi({\bf{B}},{\bf\Sigma}|{\bf{Y}},{\bf{X}})&\propto p({\bf{Y}}|{\bf{B}},{\bf\Sigma},{\bf{X}}) \pi({\bf{B}}| {\bf \Sigma})\pi({\bf\Sigma})\\ &\propto \left|{\bf\Sigma} \right|^{-\frac{N+K+\alpha_{0}+M+1}{2}}\\ &\times\exp\left\lbrace -\frac{1}{2}tr\left[({\bf{\Psi}}_{0}+{\bf{S}} +({\bf{B}}-{\bf{B}}_{0})^{\top}{\bf{V}}_{0}^{-1}({\bf{B}}-{\bf{B}}_{0})\right.\right.\\ &\left.\left. +({\bf{B}}-\widehat{\bf{B}})^{\top}{\bf{X}}^{\top}{\bf{X}}({\bf{B}}-\widehat{\bf{B}})){\bf{\Sigma}}^{-1}\right]\right\rbrace . \end{align}

Completing the square in $${\bf{B}}$$ and collecting the remaining terms in the bracket yields

\begin{align} {\bf{\Psi}}_{0}+{\bf{S}} +({\bf{B}}-{\bf{B}}_{0})^{\top}{\bf{V}}_{0}^{-1}({\bf{B}}-{\bf{B}}_{0})+({\bf{B}}-\widehat{\bf{B}})^{\top}{\bf{X}}^{\top}{\bf{X}}({\bf{B}}-\widehat{\bf{B}}) & = ({\bf{B}}-{\bf{B}}_n)^{\top}{\bf{V}}_n^{-1}({\bf{B}}-{\bf{B}}_n)+{\bf{\Psi}}_n, \end{align}

where

\begin{align} {\bf{B}}_n = &({\bf{V}}_{0}^{-1}+{\bf{X}}^{\top}{\bf{X}})^{-1}({\bf{V}}_{0}^{-1}{\bf{B}}_{0}+{\bf{X}}^{\top}{\bf{Y}})=({\bf{V}}_{0}^{-1}+{\bf{X}}^{\top}{\bf{X}})^{-1}({\bf{V}}_{0}^{-1}{\bf{B}}_{0}+{\bf{X}}^{\top}{\bf{X}}\widehat{\bf{B}}),\\ {\bf{V}}_n = &({\bf{V}}_{0}^{-1}+{\bf{X}}^{\top}{\bf{X}})^{-1},\\ {\bf{\Psi}}_n= &{\bf{\Psi}}_{0}+{\bf{S}}+{\bf{B}}_{0}^{\top}{\bf{V}}_{0}^{-1}{\bf{B}}_{0}+\widehat{\bf{B}}^{\top}{\bf{X}}^{\top}{\bf{X}}\widehat{\bf{B}}-{\bf{B}}_n^{\top}{\bf{V}}_n^{-1}{\bf{B}}_n. \end{align}

Thus, the posterior distribution can be written as

\begin{align} \pi({\bf{B}},{\bf \Sigma}| {\bf{Y}}, {\bf{X}})\propto &\left|{\bf \Sigma} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2} tr\left[({\bf{B}}-{\bf{B}}_n)^{\top}{\bf{V}}_n^{-1}({\bf{B}}-{\bf{B}}_n) \right] {\bf \Sigma}^{-1}\right\rbrace \\ \times & \left|{\bf \Sigma} \right|^{-\frac{N+\alpha_{0}+M+1}{2}}\exp\left\lbrace -\frac{1}{2} tr \left[ {\bf{\Psi}}_n{\bf \Sigma}^{-1}\right] \right\rbrace .
\end{align}

That is, $$\pi({\bf{B}},{\bf \Sigma}| {\bf{Y}}, {\bf{X}})=\pi ({\bf{B}}| {\bf \Sigma},{\bf{Y}},{\bf{X}})\pi({\bf \Sigma}| {\bf{Y}},{\bf{X}})$$, where $$\pi({\bf{B}}| {\bf \Sigma},{\bf{Y}}, {\bf{X}}) \sim N_{K\times M}({\bf{B}}_n,{\bf \Sigma}\otimes {\bf{V}}_n )$$ and $$\pi({\bf \Sigma}| {\bf{Y}},{\bf{X}}) \sim IW({\bf{\Psi}}_n,{\alpha}_n)$$ with $$\alpha_n= N+\alpha_{0}$$.
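The conjugate updating translates directly into code: compute $${\bf{V}}_n$$, $${\bf{B}}_n$$, $${\bf{\Psi}}_n$$, and $$\alpha_n$$, then draw $${\bf\Sigma}$$ from the inverse Wishart and $${\bf{B}}|{\bf\Sigma}$$ from the matrix normal. A sketch under the formulas above (function and variable names are assumptions, not from the source):

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

def posterior_params(Y, X, B0, V0, Psi0, alpha0):
    """Conjugate update for the matrix-normal/inverse-Wishart model."""
    N = Y.shape[0]
    V0_inv = np.linalg.inv(V0)
    Vn_inv = V0_inv + X.T @ X
    Vn = np.linalg.inv(Vn_inv)
    Bn = Vn @ (V0_inv @ B0 + X.T @ Y)
    B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    S = (Y - X @ B_hat).T @ (Y - X @ B_hat)
    Psin = (Psi0 + S + B0.T @ V0_inv @ B0
            + B_hat.T @ X.T @ X @ B_hat - Bn.T @ Vn_inv @ Bn)
    return Bn, Vn, Psin, alpha0 + N

def posterior_draw(Bn, Vn, Psin, alphan, rng=None):
    """One joint draw: Sigma ~ IW(Psin, alphan), then B | Sigma."""
    Sigma = invwishart.rvs(df=alphan, scale=Psin, random_state=rng)
    B = matrix_normal.rvs(mean=Bn, rowcov=Vn, colcov=Sigma, random_state=rng)
    return B, Sigma
```

With a nearly flat prior (large $${\bf{V}}_{0}$$), $${\bf{B}}_n$$ collapses to the OLS estimate $$\widehat{\bf{B}}$$, which is a useful check of an implementation.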

The marginal posterior for $${\bf{B}}$$ is …

The marginal likelihood is …

The predictive density is …

[^1]: $$vec$$ denotes the vectorization operation, and $$\otimes$$ denotes the Kronecker product.

[^2]: We can write the former expression in a more familiar way using vectorization properties: $$\underbrace{vec({\bf{Y}})}_{\bf{y}}=\underbrace{({\bf{I}}_M\otimes {\bf{X}})}_{{\bf{Z}}}\underbrace{vec({\bf{B}})}_{\beta}+\underbrace{vec({\bf{U}})}_{\mu}$$, where $${\bf{y}}\sim N_{N\times M}({\bf{Z}}\beta,\bf{\Sigma}\otimes {\bf{I}}_N)$$.