3.4 Multivariate linear regression: The conjugate normal-normal/inverse Wishart model

Let’s study the multivariate regression setting where there are $M$ $N$ -dimensional vectors ${\bf{y}}_m$ , $m=1,2,\dots,M$ such that ${\bf{y}}_m={\bf{X}}\beta_m+\mu_m$ , ${\bf{X}}$ is the set of common regressors, and $\mu_m$ is the $N$ -dimensional vector of stochastic errors for each equation such that ${\bf{U}}=[\mu_1 \ \mu_2 \ \dots \ \mu_M]\sim MN_{N,M}({\bf{0}}, {\bf{I}}_N, {\bf\Sigma})$ , that is, a matrix variate normal distribution where $\bf\Sigma$ is the covariance matrix of each $i$ -th row of ${\bf{U}}$ , $i=1,2,\dots,N$ , and we are assuming independece between the rows. Then, $vec({\bf U})\sim N_{N\times M}({\bf 0}, \bf{\Sigma}\otimes {\bf{I}}_N)$ .²¹

This framework can be written in matricial form

$\begin{align} \underbrace{ \begin{bmatrix} y_{11} & y_{12} & \dots & y_{1M}\\ y_{21} & y_{22} & \dots & y_{2M}\\ \vdots & \vdots & \dots & \vdots\\ y_{N1} & y_{N2} & \dots & y_{NM}\\ \end{bmatrix}}_{\bf{Y}} &= \underbrace{\begin{bmatrix} x_{11} & x_{12} & \dots & x_{1K}\\ x_{21} & x_{22} & \dots & x_{2K}\\ \vdots & \vdots & \dots & \vdots\\ x_{N1} & x_{N2} & \dots & x_{NK}\\ \end{bmatrix}}_{\bf{X}} \underbrace{ \begin{bmatrix} \beta_{11} & \beta_{12} & \dots & \beta_{1M}\\ \beta_{21} & \beta_{22} & \dots & \beta_{2M}\\ \vdots & \vdots & \dots & \vdots\\ \beta_{K1} & \beta_{K2} & \dots & \beta_{KM}\\ \end{bmatrix}}_{\bf{B}}\\ &+ \underbrace{\begin{bmatrix} \mu_{11} & \mu_{12} & \dots & \mu_{1M}\\ \mu_{21} & \mu_{22} & \dots & \mu_{2M}\\ \vdots & \vdots & \dots & \vdots\\ \mu_{N1} & \mu_{N2} & \dots & \mu_{NM}\\ \end{bmatrix}}_{\bf{U}} \end{align}$

Therefore, ${\bf{Y}}\sim N_{N\times M}({\bf{X}}{\bf{B}},\bf{\Sigma}\otimes {\bf{I}}_N)$ ,²²

$\begin{align} p({\bf{Y}}| {\bf{B}},{\bf\Sigma})&\propto |{{\bf \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\bf{Y}}-{\bf{X}}{\bf{B}})^{\top}({\bf{Y}}-{\bf{X}}{\bf{B}}){{\bf \Sigma}}^{-1}\right]\right\rbrace \\ &=|{{\bf \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[{\bf{S}}+({\bf{B}}-\widehat{\bf{B}})^{\top}{\bf{X}}^{\top}{\bf{X}}({\bf{B}}-\widehat{\bf{B}})\right]{{\bf \Sigma}}^{-1}\right\rbrace, \end{align}$

where ${\bf{S}}= ({\bf{Y}}-{\bf{X}}\widehat{\bf{B}})^{\top}({\bf{Y}}-{\bf{X}}\widehat{\bf{B}})$ , $\widehat{\bf{B}}= ({\bf{X}}^{\top}{\bf{X}})^{-1}{\bf{X}}^{\top}{\bf{Y}}$ (see Exercise 7).

The conjugate prior for this models is $\pi({\bf{B}},{\bf{\Sigma}})=\pi({\bf{B}}|{\bf{\Sigma}})\pi({\bf{\Sigma}})$ where $\pi({\bf{B}}|{\bf \Sigma})\sim N_{K\times M}({\bf{B}}_{0},{\bf\Sigma} \otimes {\bf{V}}_{0})$ and $\pi({\bf\Sigma})\sim IW({\bf{\Psi}}_{0},\alpha_{0})$ , that is,

$\begin{align} \pi ({\bf{B}},{\bf\Sigma})\propto &\left|{\bf\Sigma} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\bf{B}}-{\bf{B}}_{0})^{\top}{\bf{V}}_{0}^{-1}({\bf{B}}-{\bf{B}}_{0})\right] {\bf \Sigma}^{-1}\right\rbrace \\ & \times \left|{\bf \Sigma} \right|^{-(\alpha_{0}+M+1)/2}\exp\left\lbrace -\frac{1}{2}tr \left[ {\bf{\Psi}}_{0} {\bf \Sigma}^{-1}\right] \right\rbrace. \end{align}$

The posterior distribution is given by

$\begin{align} \pi({\bf{B}},{\bf\Sigma}|{\bf{Y}},{\bf{X}})&\propto p({\bf{Y}}|{\bf{B}},{\bf\Sigma},{\bf{X}}) \pi({\bf{B}}| {\bf \Sigma})\pi({\bf\Sigma})\\ &\propto \left|{\bf\Sigma} \right|^{-\frac{N+K+\alpha_{0}+M+1}{2}}\\ &\times\exp\left\lbrace -\frac{1}{2}tr\left[(\bf{\Psi}_{0}+{\bf{S}} +({\bf{B}}-{\bf{B}}_{0})^{\top}{\bf{V}}_{0}^{-1}({\bf{B}}-{\bf{B}}_{0})\right.\right.\\ &\left.\left. +({\bf{B}}-\widehat{\bf{B}})^{\top}{\bf{X}}^{\top}{\bf{X}}({\bf{B}}-\widehat{\bf{B}}))\bf{\Sigma}^{-1}\right]\right\rbrace . \end{align}$ Completing the squares on ${\bf{B}}$ and collecting the remaining terms in the bracket yields $\begin{align} {\bf{\Psi}}_{0}+{\bf{S}} +({\bf{B}}-{\bf{B}}_{0})^{\top}{\bf{V}}_{0}^{-1}({\bf{B}}-{\bf{B}}_{0})+({\bf{B}}-\widehat{\bf{B}})^{\top}{\bf{X}}^{\top}{\bf{X}}({\bf{B}}-\widehat{\bf{B}}) & = ({\bf{B}}-{\bf{B}}_n)^{\top}{\bf{V}}_n^{-1}({\bf{B}}-{\bf{B}}_n)+{\bf{\Psi}}_n, \end{align}$ where $\begin{align} {\bf{B}}_n = &({\bf{V}}_{0}^{-1}+{\bf{X}}^{\top}{\bf{X}})^{-1}({\bf{V}}_{0}^{-1}{\bf{B}}_{0}+{\bf{X}}^{\top}{\bf{Y}})=({\bf{V}}_{0}^{-1}+{\bf{X}}^{\top}{\bf{X}})^{-1}({\bf{V}}_{0}^{-1}{\bf{B}}_{0}+{\bf{X}}^{\top}{\bf{X}}\widehat{\bf{B}}),\\ {\bf{V}}_n = &({\bf{V}}_{0}^{-1}+{\bf{X}}^{\top}{\bf{X}})^{-1},\\ {\bf{\Psi}}_n= &{\bf{\Psi}}_{0}+{\bf{S}}+{\bf{B}}_{0}^{\top}{\bf{V}}_{0}^{-1}{\bf{B}}_{0}+\widehat{\bf{B}}^{\top}{\bf{X}}^{\top}{\bf{X}}\widehat{\bf{B}}-{\bf{B}}_n^{\top}{\bf{V}}_n^{-1}{\bf{B}}_n. \end{align}$ Thus, the posterior distribution can be written as $\begin{align} \pi({\bf{B}},{\bf \Sigma}| {\bf{Y}}, {\bf{X}})\propto &\left|{\bf \Sigma} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2} tr\left[({\bf{B}}-{\bf{B}}_n)^{\top}{\bf{V}}_n^{-1}({\bf{B}}-{\bf{B}}_n) \right] {\bf \Sigma}^{-1}\right\rbrace \\ \times & \left|{\bf \Sigma} \right|^{-\frac{N+\alpha_{0}+M+1}{2}}\exp\left\lbrace -\frac{1}{2} tr \left[ {\bf{\Psi}}_n{\bf \Sigma}^{-1}\right] \right\rbrace . \end{align}$ That is $\pi({\bf{B}},{\bf \Sigma}| {\bf{Y}}, {\bf{X}})=\pi ({\bf{B}}| {\bf \Sigma},{\bf{Y}},{\bf{X}})\pi({\bf \Sigma}| {\bf{Y}},{\bf{X}})$ where $\pi({\bf{B}}| {\bf \Sigma},{\bf{Y}}, {\bf{X}}) \sim N_{K\times M}({\bf{B}}_n,{\bf \Sigma}\otimes {\bf{V}}_n )$ and $\pi({\bf \Sigma}| {\bf{Y}},{\bf{X}}) \sim IW({\bf{\Psi}}_n,{\alpha}_n)$ , where $\alpha_n= N+\alpha_{0}$ .

The marginal posterior for ${\bf{B}}$ is …

The marginal likelihood is …

The predictive density is …

$vec$ denotes the vectorization operation, and $\otimes$ denotes the kronecker product↩︎
We can write down the former expression in a more familiar way using vectorization properties, $\underbrace{vec(Y)}_{\bf{y}}=\underbrace{({\bf{I}}_M\otimes {\bf{X}})}_{{\bf{Z}}}\underbrace{vec({\bf{B}})}_{\beta}+\underbrace{vec({\bf{U}})}_{\mu}$ , where ${\bf{y}}\sim N_{N\times M}({\bf{Z}}\beta,\bf{\Sigma}\otimes {\bf{I}}_N)$ .↩︎