4.4 Multivariate linear regression: The conjugate normal-normal/inverse Wishart model

Let’s study the multivariate regression setting where there are \(M\) \(N\)-dimensional vectors \({\bf{y}}_m\), \(m=1,2,\dots,M\), such that \({\bf{y}}_m={\bf{X}}\beta_m+\mu_m\), where \({\bf{X}}\) is the set of common regressors and \(\mu_m\) is the \(N\)-dimensional vector of stochastic errors for each equation, such that \({\bf{U}}=[\mu_1 \ \mu_2 \ \dots \ \mu_M]\sim MN_{N,M}({\bf{0}}, {\bf{I}}_N, {\bf\Sigma})\), that is, a matrix variate normal distribution where \(\bf\Sigma\) is the covariance matrix of the \(i\)-th row of \({\bf{U}}\), \(i=1,2,\dots,N\), and we assume independence between the rows. Then, \(vec({\bf U})\sim N_{NM}({\bf 0}, \bf{\Sigma}\otimes {\bf{I}}_N)\).1

This framework can be written in matrix form as

\[\begin{align} \underbrace{ \begin{bmatrix} y_{11} & y_{12} & \dots & y_{1M}\\ y_{21} & y_{22} & \dots & y_{2M}\\ \vdots & \vdots & \dots & \vdots\\ y_{N1} & y_{N2} & \dots & y_{NM}\\ \end{bmatrix}}_{\bf{Y}} &= \underbrace{\begin{bmatrix} x_{11} & x_{12} & \dots & x_{1K}\\ x_{21} & x_{22} & \dots & x_{2K}\\ \vdots & \vdots & \dots & \vdots\\ x_{N1} & x_{N2} & \dots & x_{NK}\\ \end{bmatrix}}_{\bf{X}} \underbrace{ \begin{bmatrix} \beta_{11} & \beta_{12} & \dots & \beta_{1M}\\ \beta_{21} & \beta_{22} & \dots & \beta_{2M}\\ \vdots & \vdots & \dots & \vdots\\ \beta_{K1} & \beta_{K2} & \dots & \beta_{KM}\\ \end{bmatrix}}_{\bf{B}}\\ &+ \underbrace{\begin{bmatrix} \mu_{11} & \mu_{12} & \dots & \mu_{1M}\\ \mu_{21} & \mu_{22} & \dots & \mu_{2M}\\ \vdots & \vdots & \dots & \vdots\\ \mu_{N1} & \mu_{N2} & \dots & \mu_{NM}\\ \end{bmatrix}}_{\bf{U}} \end{align}\]
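To make the dimensions concrete, the model can be simulated in a few lines. The values of \(N\), \(M\), \(K\), \({\bf{X}}\), \({\bf{B}}\), and \(\bf\Sigma\) below are illustrative choices, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 50, 3, 2                     # observations, equations, regressors

X = rng.normal(size=(N, K))            # common regressors (shared by all equations)
B = rng.normal(size=(K, M))            # coefficient matrix
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])    # cross-equation error covariance

# U ~ MN_{N,M}(0, I_N, Sigma): rows are i.i.d. N_M(0, Sigma)
L = np.linalg.cholesky(Sigma)
U = rng.normal(size=(N, M)) @ L.T

Y = X @ B + U                          # the multivariate regression model
```

Each row of `U` shares the covariance `Sigma`, while distinct rows are independent, matching \({\bf{U}}\sim MN_{N,M}({\bf{0}}, {\bf{I}}_N, {\bf\Sigma})\).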

Therefore, \({\bf{Y}}\sim N_{N\times M}({\bf{X}}{\bf{B}},\bf{\Sigma}\otimes {\bf{I}}_N)\),2 and the likelihood is

\[\begin{align} p({\bf{Y}}| {\bf{B}},{\bf\Sigma})&\propto |{{\bf \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\bf{Y}}-{\bf{X}}{\bf{B}})^{\top}({\bf{Y}}-{\bf{X}}{\bf{B}}){{\bf \Sigma}}^{-1}\right]\right\rbrace \\ &=|{{\bf \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[{\bf{S}}+({\bf{B}}-\widehat{\bf{B}})^{\top}{\bf{X}}^{\top}{\bf{X}}({\bf{B}}-\widehat{\bf{B}})\right]{{\bf \Sigma}}^{-1}\right\rbrace, \end{align}\]

where \({\bf{S}}= ({\bf{Y}}-{\bf{X}}\widehat{\bf{B}})^{\top}({\bf{Y}}-{\bf{X}}\widehat{\bf{B}})\), \(\widehat{\bf{B}}= ({\bf{X}}^{\top}{\bf{X}})^{-1}{\bf{X}}^{\top}{\bf{Y}}\) (see Exercise 7).
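The quantities \(\widehat{\bf{B}}\) and \({\bf{S}}\) can be computed directly, and the decomposition used in the likelihood above can be checked numerically for an arbitrary \({\bf{B}}\). The data below are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 40, 2, 3
X = rng.normal(size=(N, K))
Y = rng.normal(size=(N, M))

XtX = X.T @ X
B_hat = np.linalg.solve(XtX, X.T @ Y)      # \hat{B} = (X'X)^{-1} X'Y
S = (Y - X @ B_hat).T @ (Y - X @ B_hat)    # residual sum-of-squares matrix

# check: (Y - XB)'(Y - XB) = S + (B - B_hat)' X'X (B - B_hat) for any B
B = rng.normal(size=(K, M))
lhs = (Y - X @ B).T @ (Y - X @ B)
rhs = S + (B - B_hat).T @ XtX @ (B - B_hat)
print(np.allclose(lhs, rhs))               # True
```

The identity holds because the residuals \({\bf{Y}}-{\bf{X}}\widehat{\bf{B}}\) are orthogonal to the columns of \({\bf{X}}\), so the cross terms vanish.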

The conjugate prior for this model is \(\pi({\bf{B}},{\bf{\Sigma}})=\pi({\bf{B}}|{\bf{\Sigma}})\pi({\bf{\Sigma}})\), where \(\pi({\bf{B}}|{\bf \Sigma})\sim N_{K\times M}({\bf{B}}_{0},{\bf\Sigma} \otimes {\bf{V}}_{0})\) and \(\pi({\bf\Sigma})\sim IW({\bf{\Psi}}_{0},\alpha_{0})\), that is,

\[\begin{align} \pi ({\bf{B}},{\bf\Sigma})\propto &\left|{\bf\Sigma} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\bf{B}}-{\bf{B}}_{0})^{\top}{\bf{V}}_{0}^{-1}({\bf{B}}-{\bf{B}}_{0})\right] {\bf \Sigma}^{-1}\right\rbrace \\ & \times \left|{\bf \Sigma} \right|^{-(\alpha_{0}+M+1)/2}\exp\left\lbrace -\frac{1}{2}tr \left[ {\bf{\Psi}}_{0} {\bf \Sigma}^{-1}\right] \right\rbrace. \end{align}\]
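A draw from this conjugate prior can be generated by composition: first \(\bf\Sigma\) from the inverse Wishart, then \({\bf{B}}\mid{\bf\Sigma}\) from the matrix normal (with row covariance \({\bf{V}}_0\) and column covariance \(\bf\Sigma\), so that \(vec({\bf{B}})\) has covariance \({\bf\Sigma}\otimes{\bf{V}}_0\)). The hyperparameter values below are illustrative, not from the text:

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

rng = np.random.default_rng(2)
K, M = 3, 2

# illustrative hyperparameters
B0 = np.zeros((K, M))
V0 = np.eye(K)                 # prior row covariance of B
Psi0 = np.eye(M)               # inverse-Wishart scale matrix
alpha0 = M + 2                 # inverse-Wishart degrees of freedom

# Sigma ~ IW(Psi0, alpha0)
Sigma = np.atleast_2d(invwishart.rvs(df=alpha0, scale=Psi0, random_state=rng))

# B | Sigma ~ N_{KxM}(B0, Sigma ⊗ V0): rowcov = V0, colcov = Sigma
B = matrix_normal.rvs(mean=B0, rowcov=V0, colcov=Sigma, random_state=rng)
```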

The posterior distribution is given by

\[\begin{align} \pi({\bf{B}},{\bf\Sigma}|{\bf{Y}},{\bf{X}})&\propto p({\bf{Y}}|{\bf{B}},{\bf\Sigma},{\bf{X}}) \pi({\bf{B}}| {\bf \Sigma})\pi({\bf\Sigma})\\ &\propto \left|{\bf\Sigma} \right|^{-\frac{N+K+\alpha_{0}+M+1}{2}}\\ &\times\exp\left\lbrace -\frac{1}{2}tr\left[({\bf{\Psi}}_{0}+{\bf{S}} +({\bf{B}}-{\bf{B}}_{0})^{\top}{\bf{V}}_{0}^{-1}({\bf{B}}-{\bf{B}}_{0})\right.\right.\\ &\left.\left. +({\bf{B}}-\widehat{\bf{B}})^{\top}{\bf{X}}^{\top}{\bf{X}}({\bf{B}}-\widehat{\bf{B}})){\bf{\Sigma}}^{-1}\right]\right\rbrace . \end{align}\]

Completing the square on \({\bf{B}}\) and collecting the remaining terms in the bracket yields

\[\begin{align} {\bf{\Psi}}_{0}+{\bf{S}} +({\bf{B}}-{\bf{B}}_{0})^{\top}{\bf{V}}_{0}^{-1}({\bf{B}}-{\bf{B}}_{0})+({\bf{B}}-\widehat{\bf{B}})^{\top}{\bf{X}}^{\top}{\bf{X}}({\bf{B}}-\widehat{\bf{B}}) & = ({\bf{B}}-{\bf{B}}_n)^{\top}{\bf{V}}_n^{-1}({\bf{B}}-{\bf{B}}_n)+{\bf{\Psi}}_n, \end{align}\]

where

\[\begin{align} {\bf{B}}_n = &({\bf{V}}_{0}^{-1}+{\bf{X}}^{\top}{\bf{X}})^{-1}({\bf{V}}_{0}^{-1}{\bf{B}}_{0}+{\bf{X}}^{\top}{\bf{Y}})=({\bf{V}}_{0}^{-1}+{\bf{X}}^{\top}{\bf{X}})^{-1}({\bf{V}}_{0}^{-1}{\bf{B}}_{0}+{\bf{X}}^{\top}{\bf{X}}\widehat{\bf{B}}),\\ {\bf{V}}_n = &({\bf{V}}_{0}^{-1}+{\bf{X}}^{\top}{\bf{X}})^{-1},\\ {\bf{\Psi}}_n= &{\bf{\Psi}}_{0}+{\bf{S}}+{\bf{B}}_{0}^{\top}{\bf{V}}_{0}^{-1}{\bf{B}}_{0}+\widehat{\bf{B}}^{\top}{\bf{X}}^{\top}{\bf{X}}\widehat{\bf{B}}-{\bf{B}}_n^{\top}{\bf{V}}_n^{-1}{\bf{B}}_n. \end{align}\]

Thus, the posterior distribution can be written as

\[\begin{align} \pi({\bf{B}},{\bf \Sigma}| {\bf{Y}}, {\bf{X}})\propto &\left|{\bf \Sigma} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2} tr\left[({\bf{B}}-{\bf{B}}_n)^{\top}{\bf{V}}_n^{-1}({\bf{B}}-{\bf{B}}_n) \right] {\bf \Sigma}^{-1}\right\rbrace \\ \times & \left|{\bf \Sigma} \right|^{-\frac{N+\alpha_{0}+M+1}{2}}\exp\left\lbrace -\frac{1}{2} tr \left[ {\bf{\Psi}}_n{\bf \Sigma}^{-1}\right] \right\rbrace . \end{align}\]

That is, \(\pi({\bf{B}},{\bf \Sigma}| {\bf{Y}}, {\bf{X}})=\pi ({\bf{B}}| {\bf \Sigma},{\bf{Y}},{\bf{X}})\pi({\bf \Sigma}| {\bf{Y}},{\bf{X}})\), where \(\pi({\bf{B}}| {\bf \Sigma},{\bf{Y}}, {\bf{X}}) \sim N_{K\times M}({\bf{B}}_n,{\bf \Sigma}\otimes {\bf{V}}_n )\) and \(\pi({\bf \Sigma}| {\bf{Y}},{\bf{X}}) \sim IW({\bf{\Psi}}_n,\alpha_n)\), with \(\alpha_n= N+\alpha_{0}\).
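The updating formulas translate directly into code. The sketch below simulates data and uses illustrative prior values (\({\bf{B}}_0={\bf{0}}\), \({\bf{V}}_0=10\,{\bf{I}}_K\), \({\bf{\Psi}}_0={\bf{I}}_M\), \(\alpha_0=M+2\)); none of these numbers come from the text:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 60, 2, 3
X = rng.normal(size=(N, K))
Y = rng.normal(size=(N, M))

# prior hyperparameters (illustrative)
B0, V0 = np.zeros((K, M)), 10.0 * np.eye(K)
Psi0, alpha0 = np.eye(M), M + 2

XtX = X.T @ X
V0inv = np.linalg.inv(V0)
B_hat = np.linalg.solve(XtX, X.T @ Y)
S = (Y - X @ B_hat).T @ (Y - X @ B_hat)

# conjugate posterior updates
Vn = np.linalg.inv(V0inv + XtX)
Bn = Vn @ (V0inv @ B0 + X.T @ Y)
Psin = (Psi0 + S + B0.T @ V0inv @ B0
        + B_hat.T @ XtX @ B_hat - Bn.T @ np.linalg.inv(Vn) @ Bn)
alphan = N + alpha0
```

Note that \({\bf{B}}_n\) is a matrix-weighted average of the prior location \({\bf{B}}_0\) and the least-squares estimate \(\widehat{\bf{B}}\), since \({\bf{X}}^{\top}{\bf{Y}}={\bf{X}}^{\top}{\bf{X}}\widehat{\bf{B}}\).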

The marginal posterior for \({\bf{B}}\) is …

The marginal likelihood is …

The predictive density is …
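Because the posterior is conjugate, draws from the joint posterior, and from the predictive by composition, require no MCMC: sample \({\bf\Sigma}\sim IW({\bf{\Psi}}_n,\alpha_n)\), then \({\bf{B}}\mid{\bf\Sigma}\) from the matrix normal, then \({\bf{Y}}_0={\bf{X}}_0{\bf{B}}+{\bf{U}}_0\). The sketch below uses simulated data and an arbitrary new design point `X0`; all numerical settings are illustrative:

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

rng = np.random.default_rng(4)
N, M, K = 60, 2, 3
X = rng.normal(size=(N, K))
Y = rng.normal(size=(N, M))

# conjugate posterior updates (prior values illustrative)
B0, V0inv = np.zeros((K, M)), 0.1 * np.eye(K)
Psi0, alpha0 = np.eye(M), M + 2
XtX = X.T @ X
B_hat = np.linalg.solve(XtX, X.T @ Y)
S = (Y - X @ B_hat).T @ (Y - X @ B_hat)
Vn = np.linalg.inv(V0inv + XtX)
Bn = Vn @ (V0inv @ B0 + X.T @ Y)
Psin = (Psi0 + S + B0.T @ V0inv @ B0
        + B_hat.T @ XtX @ B_hat - Bn.T @ np.linalg.inv(Vn) @ Bn)
alphan = N + alpha0

X0 = rng.normal(size=(1, K))           # hypothetical new design point
draws = []
for _ in range(500):
    Sigma = np.atleast_2d(invwishart.rvs(df=alphan, scale=Psin, random_state=rng))
    B = matrix_normal.rvs(mean=Bn, rowcov=Vn, colcov=Sigma, random_state=rng)
    U0 = rng.multivariate_normal(np.zeros(M), Sigma)   # predictive error
    draws.append(X0 @ B + U0)
draws = np.asarray(draws).squeeze(axis=1)   # Monte Carlo predictive sample for Y0
```

Averaging these draws approximates the predictive mean \({\bf{X}}_0{\bf{B}}_n\); quantiles of `draws` give predictive intervals.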


  1. \(vec\) denotes the vectorization operation, and \(\otimes\) denotes the Kronecker product.↩︎

  2. We can write the former expression in a more familiar way using vectorization properties, \(\underbrace{vec(Y)}_{\bf{y}}=\underbrace{({\bf{I}}_M\otimes {\bf{X}})}_{{\bf{Z}}}\underbrace{vec({\bf{B}})}_{\beta}+\underbrace{vec({\bf{U}})}_{\mu}\), where \({\bf{y}}\sim N_{NM}({\bf{Z}}\beta,\bf{\Sigma}\otimes {\bf{I}}_N)\).↩︎
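The vectorization identity used in footnote 2, \(vec({\bf{X}}{\bf{B}})=({\bf{I}}_M\otimes {\bf{X}})\,vec({\bf{B}})\), can be verified numerically (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, K = 4, 3, 2
X = rng.normal(size=(N, K))
B = rng.normal(size=(K, M))

# vec(XB) = (I_M ⊗ X) vec(B), with vec stacking columns
lhs = (X @ B).flatten(order="F")
rhs = np.kron(np.eye(M), X) @ B.flatten(order="F")
print(np.allclose(lhs, rhs))   # True
```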