4.3 Linear regression: The conjugate normal-normal/inverse gamma model

In this section we analyze the conjugate normal-normal/inverse gamma model, a workhorse in econometrics. In this model, the dependent variable \(y_i\) is related to a set of regressors \({\mathbf{x}}_i=(x_{i1},x_{i2},\ldots,x_{iK})^{\top}\) in a linear way, that is, \(y_i=\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_Kx_{iK}+\mu_i={\bf{x}}_i^{\top}\beta+\mu_i\), where \(\beta=(\beta_1,\beta_2,\ldots,\beta_K)^{\top}\) and \(\mu_i\stackrel{iid} {\thicksim}N(0,\sigma^2)\) is a stochastic error that is independent of the regressors, \({\bf{x}}_i\perp\mu_i\).

Defining \(\mathbf{y}=\begin{bmatrix} y_1\\ y_2\\ \vdots \\ y_N \end{bmatrix}\), \(\mathbf{X}=\begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1K}\\ x_{21} & x_{22} & \ldots & x_{2K}\\ \vdots & \vdots & \vdots & \vdots\\ x_{N1} & x_{N2} & \ldots & x_{NK}\\ \end{bmatrix}\) and \(\mathbf{\mu}=\begin{bmatrix} \mu_1\\ \mu_2\\ \vdots \\ \mu_N \end{bmatrix}\), we can write the model in matrix form: \({\bf{y}}={\bf{X}}\beta+\mu\), where \(\mu\sim N(\bf{0},\sigma^2{\bf{I}})\) which implies that \({\bf{y}}\sim N({\bf{X}}\beta,\sigma^2\bf{I})\). Then, the likelihood function is

\[\begin{align} p({\bf{y}}|\beta, \sigma^2, {{\bf{X}}}) & = (2\pi\sigma^2)^{-\frac{N}{2}} \exp \left\{-\frac{1}{2\sigma^2} ({\bf{y}} - {\bf{X}}\beta)^{\top}({\bf{y}} - {\bf{X}}\beta) \right\} \\ & \propto (\sigma^2)^{-\frac{N}{2}} \exp \left\{-\frac{1}{2\sigma^2} ({\bf{y}} - {\bf{X}}\beta)^{\top}({\bf{y}} - {\bf{X}}\beta) \right\}. \end{align}\]
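Before introducing the priors, the data-generating process above can be sketched numerically; the sample size, regressors, and true parameter values below are illustrative choices, not quantities from the text.

```python
import numpy as np

# Simulate y = X beta + mu with mu ~ N(0, sigma^2 I).
# N, K, beta_true, and sigma2_true are illustrative values.
rng = np.random.default_rng(10)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # intercept plus 2 regressors
beta_true = np.array([1.0, -0.5, 2.0])
sigma2_true = 1.5
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=N)

# Maximum likelihood estimator beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```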

The conjugate priors for the parameters are \[\begin{align} \beta|\sigma^2 & \sim N(\beta_0, \sigma^2 {\bf{B}}_0),\\ \sigma^2 & \sim IG(\alpha_0/2, \delta_0/2). \end{align}\]

Then, the posterior distribution is

\[\begin{align} \pi(\beta,\sigma^2|\mathbf{y},\mathbf{X})&\propto (\sigma^2)^{-\frac{N}{2}} \exp \left\{-\frac{1}{2\sigma^2} ({\bf{y}} - {\bf{X}}\beta)^{\top}({\bf{y}} - {\bf{X}}\beta) \right\} \\ & \times (\sigma^2)^{-\frac{K}{2}} \exp \left\{-\frac{1}{2\sigma^2} (\beta - \beta_0)^{\top}{\bf{B}}_0^{-1}(\beta - \beta_0)\right\} \\ & \times \frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp \left\{-\frac{\delta_0}{2\sigma^2} \right\} \\ & \propto (\sigma^2)^{-\frac{K}{2}} \exp \left\{-\frac{1}{2\sigma^2} [\beta^{\top}({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})\beta - 2\beta^{\top}({\bf{B}}_0^{-1}\beta_0 + {\bf{X}}^{\top}{\bf{X}}\hat{\beta})] \right\} \\ & \times \left(\frac{1}{\sigma^2}\right)^{(\alpha_0+N)/2+1}\exp \left\{-\frac{\delta_0+ {\bf{y}}^{\top}{\bf{y}} + \beta_0^{\top}{\bf{B}}_0^{-1}\beta_0}{2\sigma^2} \right\}, \end{align}\]

where \(\hat{\beta}=({\bf{X}}^{\top}{\bf{X}})^{-1}{\bf{X}}^{\top}{\bf{y}}\) is the maximum likelihood estimator.

Adding and subtracting \(\beta_n^{\top}{{\bf{B}}}_n^{-1} \beta_n\) to complete the square, where \({{\bf{B}}}_n = ({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1}\) and \(\beta_n = {{\bf{B}}}_n({\bf{B}}_0^{-1}\beta_0 + {\bf{X}}^{\top}{\bf{X}}\hat{\beta})\),

\[\begin{align} \pi(\beta,\sigma^2|\mathbf{y},\mathbf{X})&\propto \underbrace{(\sigma^2)^{-\frac{K}{2}} \exp \left\{-\frac{1}{2\sigma^2} (\beta-\beta_n)^{\top}{\bf{B}}^{-1}_n(\beta-\beta_n) \right\}}_1 \\ & \times \underbrace{(\sigma^2)^{-\left(\frac{\alpha_n}{2}+1 \right)} \exp \left\{-\frac{\delta_n}{2\sigma^2} \right\}}_2. \end{align}\]

The first expression is the kernel of a normal density function, \(\beta|\sigma^2, {\bf{y}}, {\bf{X}} \sim N(\beta_n, \sigma^2{\bf{B}}_n)\). The second expression is the kernel of an inverse gamma density, \(\sigma^2| {\bf{y}}, {\bf{X}}\sim IG(\alpha_n/2, \delta_n/2)\), where \(\alpha_n = \alpha_0 + N\) and \(\delta_n = \delta_0 + {\bf{y}}^{\top}{\bf{y}} + \beta_0^{\top}{\bf{B}}_0^{-1}\beta_0 - \beta_n^{\top}{\bf{B}}_n^{-1}\beta_n\).
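A minimal sketch of the posterior updating implied by these kernels; the function name, data, and prior values are illustrative.

```python
import numpy as np

def niga_update(y, X, beta0, B0, alpha0, delta0):
    """Posterior hyperparameters of the conjugate normal/inverse gamma model.

    Uses X'X beta_hat = X'y, so the MLE never needs to be formed explicitly.
    """
    B0_inv = np.linalg.inv(B0)
    Bn_inv = B0_inv + X.T @ X
    Bn = np.linalg.inv(Bn_inv)
    beta_n = Bn @ (B0_inv @ beta0 + X.T @ y)
    alpha_n = alpha0 + len(y)
    delta_n = delta0 + y @ y + beta0 @ B0_inv @ beta0 - beta_n @ Bn_inv @ beta_n
    return beta_n, Bn, alpha_n, delta_n

# Illustrative data and weakly informative priors
rng = np.random.default_rng(1)
N, K = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, -0.5]) + rng.normal(size=N)
beta_n, Bn, alpha_n, delta_n = niga_update(y, X, np.zeros(K), 100 * np.eye(K), 1.0, 1.0)
```

Note that \(\delta_n\) is always positive, since it equals \(\delta_0\) plus the minimized value of the quadratic form in \(\beta\).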

Taking into account that \[\begin{align}\beta_n & = ({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1}({\bf{B}}_0^{-1}\beta_0 + {\bf{X}}^{\top}{\bf{X}}\hat{\beta})\\ & = ({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1}{\bf{B}}_0^{-1}\beta_0 + ({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1} {\bf{X}}^{\top}{\bf{X}}\hat{\beta}, \end{align}\]

where \(({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1}{\bf{B}}_0^{-1}={\bf{I}}_K-({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1}{\bf{X}}^{\top}{\bf{X}}\) (Smith 1973). Setting \({\bf{W}}=({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1}{\bf{X}}^{\top}{\bf{X}}\), we have \(\beta_n=({\bf{I}}_K-{\bf{W}})\beta_0+{\bf{W}}\hat{\beta}\); that is, the posterior mean of \(\beta\) is a matrix-weighted average of the prior mean and the maximum likelihood estimator, where the weights depend on the precision of each piece of information. Observe that when the prior covariance matrix is highly vague (non-informative), such that \({\bf{B}}_0^{-1}\rightarrow {\bf{0}}_K\), we obtain \({\bf{W}} \rightarrow {\bf{I}}_K\), so that \(\beta_n \rightarrow \hat{\beta}\); that is, the posterior mean of the location parameters converges to the maximum likelihood estimator.
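The weighted-average form and its vague-prior limit can be checked numerically; the sizes, the prior mean, and the prior covariance scale \(c\) below are illustrative.

```python
import numpy as np

# Check (illustrative sizes) that beta_n = (I - W) beta0 + W beta_hat with
# W = Bn X'X, and that a vague prior (B0^{-1} -> 0) drives beta_n to the MLE.
rng = np.random.default_rng(2)
N, K = 80, 3
X = rng.normal(size=(N, K))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=N)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
beta0 = np.ones(K)

def posterior_mean(c):
    """Posterior mean under prior covariance B0 = c * I (c is illustrative)."""
    B0_inv = np.eye(K) / c
    Bn = np.linalg.inv(B0_inv + X.T @ X)
    W = Bn @ (X.T @ X)
    return (np.eye(K) - W) @ beta0 + W @ beta_hat

beta_n_tight = posterior_mean(0.1)   # informative prior pulls toward beta0
beta_n_vague = posterior_mean(1e8)   # vague prior: essentially the MLE
```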

In addition, the posterior conditional covariance matrix of the location parameters satisfies \(\sigma^2({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1}=\sigma^2({\bf{X}}^{\top}{\bf{X}})^{-1}-\sigma^2\left(({\bf{X}}^{\top}{\bf{X}})^{-1}({\bf{B}}_0 + ({\bf{X}}^{\top}{\bf{X}})^{-1})^{-1}({\bf{X}}^{\top}{\bf{X}})^{-1}\right)\), where the subtracted term is positive semi-definite (see footnote 2). Given that \(\sigma^2({\bf{X}}^{\top}{\bf{X}})^{-1}\) is the covariance matrix of the maximum likelihood estimator, we observe that prior information reduces estimation uncertainty.
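This variance-reduction claim can be spot-checked by looking at the eigenvalues of the gap between the two covariance matrices; the sizes and the prior covariance below are illustrative.

```python
import numpy as np

# Check (illustrative sizes) that the gap between the MLE covariance (X'X)^{-1}
# and the posterior conditional covariance (B0^{-1} + X'X)^{-1} is positive
# semi-definite, i.e., prior information never increases conditional uncertainty.
rng = np.random.default_rng(8)
N, K = 60, 4
X = rng.normal(size=(N, K))
B0 = 2.0 * np.eye(K)                  # illustrative prior covariance
post_cov = np.linalg.inv(np.linalg.inv(B0) + X.T @ X)
mle_cov = np.linalg.inv(X.T @ X)
gap_eigvals = np.linalg.eigvalsh(mle_cov - post_cov)
```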

Now, we calculate the posterior marginal distribution of \(\beta\),

\[\begin{align} \pi(\beta|{\bf{y}},{\bf{X}}) & = \int_0^{\infty} \pi(\beta, \sigma^2|{\bf{y}},{\bf{X}}) d\sigma^2 \\ & = \int_0^{\infty} \left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+K}{2} + 1} \exp \left\{-\frac{s}{2\sigma^2}\right\} d\sigma^2, \end{align}\] where \(s = \delta_n + (\beta - \beta_n)^{\top}{{\bf{B}}}_n^{-1}(\beta - \beta_n)\). Then we can write \[\begin{align} \pi(\beta|{\bf{y}},{\bf{X}}) & = \int_0^{\infty} \left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+K}{2} + 1} \exp \left\{-\frac{s}{2\sigma^2}\right\} d\sigma^2 \\ & = \frac{\Gamma((\alpha_n+K)/2)}{(s/2)^{(\alpha_n+K)/2}} \int_0^{\infty} \frac{(s/2)^{(\alpha_n+K)/2}}{\Gamma((\alpha_n+K)/2)} (\sigma^2)^{-(\alpha_n+K)/2 - 1} \exp \left\{-\frac{s}{2\sigma^2}\right\} d\sigma^2. \end{align}\]

The right-hand term is the integral of the probability density function of an inverse gamma distribution with parameters \(\nu = (\alpha_n+K)/2\) and \(\tau = s/2\). Since we are integrating over the whole support of \(\sigma^2\), the integral is equal to 1, and therefore \[\begin{align*} \pi(\beta|{\bf{y}},{\bf{X}}) & = \frac{\Gamma((\alpha_n+K)/2)}{(s/2)^{(\alpha_n+K)/2}} \\ & \propto s^{-(\alpha_n+K)/2} \\ & = [\delta_n + (\beta - \beta_n)^{\top}{{\bf{B}}}_n^{-1}(\beta - \beta_n)]^{-(\alpha_n+K)/2} \\ & = \left[1 + \frac{(\beta - \beta_n)^{\top}\left(\frac{\delta_n}{\alpha_n}{{\bf{B}}}_n\right)^{-1}(\beta - \beta_n)}{\alpha_n}\right]^{-(\alpha_n+K)/2}(\delta_n)^{-(\alpha_n+K)/2} \\ & \propto \left[1 + \frac{(\beta - \beta_n)^{\top}{\bf{H}}_n^{-1}(\beta - \beta_n)}{\alpha_n}\right]^{-(\alpha_n+K)/2}, \end{align*}\] where \({\bf{H}}_n = \frac{\delta_n}{\alpha_n}{\bf{B}}_n\). This last expression is the kernel of a multivariate Student’s \(t\) distribution for \(\beta\), \(\beta|{\bf{y}},{\bf{X}} \sim t_K(\alpha_n, \beta_n, {\bf{H}}_n)\).

Observe that, by incorporating the uncertainty about the variance, the marginal posterior of \(\beta\) changes from a normal to a Student’s \(t\) distribution, which has heavier tails.
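A Monte Carlo sketch of this marginalization, with illustrative data and priors: drawing \(\sigma^2\) from its inverse gamma posterior and then \(\beta|\sigma^2\) from the conditional normal reproduces the moments of the Student's \(t\), whose covariance is \(\frac{\alpha_n}{\alpha_n-2}{\bf{H}}_n=\frac{\delta_n}{\alpha_n-2}{\bf{B}}_n\).

```python
import numpy as np
from scipy import stats

# Monte Carlo check (illustrative data and priors): sigma^2 ~ IG(alpha_n/2, delta_n/2)
# followed by beta | sigma^2 ~ N(beta_n, sigma^2 Bn) reproduces the Student's t moments.
rng = np.random.default_rng(3)
N, K = 120, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)

beta0, B0_inv = np.zeros(K), np.eye(K)
alpha0, delta0 = 4.0, 4.0
Bn_inv = B0_inv + X.T @ X
Bn = np.linalg.inv(Bn_inv)
beta_n = Bn @ (B0_inv @ beta0 + X.T @ y)
alpha_n = alpha0 + N
delta_n = delta0 + y @ y + beta0 @ B0_inv @ beta0 - beta_n @ Bn_inv @ beta_n

S = 20000
sig2 = stats.invgamma.rvs(a=alpha_n / 2, scale=delta_n / 2, size=S, random_state=rng)
z = rng.multivariate_normal(np.zeros(K), Bn, size=S)   # N(0, Bn) draws
draws = beta_n + np.sqrt(sig2)[:, None] * z            # beta | sigma^2 draws
cov_theory = delta_n / (alpha_n - 2) * Bn              # covariance of t_K(alpha_n, beta_n, Hn)
```

Note the scipy parameterization: `invgamma(a, scale=b)` has density proportional to \(x^{-(a+1)}e^{-b/x}\), matching \(IG(\alpha_n/2,\delta_n/2)\) with \(a=\alpha_n/2\) and \(b=\delta_n/2\).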

The marginal likelihood of this model is

\[\begin{align} p({\bf{y}})=\int_0^{\infty}\int_{R^K}\pi (\beta | \sigma^2,{\bf{B}}_0,\beta_0 )\pi(\sigma^2| \alpha_0/2, \delta_0/2)p({\bf{y}}|\beta, \sigma^2, {\bf{X}})d\beta d\sigma^2. \end{align}\]

Taking into account that \(({\bf{y}}-{\bf{X}}\beta)^{\top}({\bf{y}}-{\bf{X}}\beta)+(\beta-\beta_0)^{\top}{\bf{B}}_0^{-1}(\beta-\beta_0)=(\beta-\beta_n)^{\top}{\bf{B}}_n^{-1}(\beta-\beta_n)+m\), where \(m={\bf{y}}^{\top}{\bf{y}}+\beta_0^{\top}{\bf{B}}_0^{-1}\beta_0-\beta_n^{\top}{\bf{B}}_n^{-1}\beta_n\), we have that

\[\begin{align} p({\bf{y}})&=\int_0^{\infty}\int_{R^K}\pi (\beta | \sigma^2)\pi(\sigma^2)p({\bf{y}}|\beta, \sigma^2, {\bf{X}})d\beta d\sigma^2\\ &=\int_0^{\infty}\pi(\sigma^2) \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}m \right\} \frac{1}{(2\pi\sigma^2)^{K/2}|{\bf{B}}_0|^{1/2}}\\ &\times\int_{R^K}\exp\left\{-\frac{1}{2\sigma^2}(\beta-\beta_n)^{\top}{\bf{B}}_n^{-1}(\beta-\beta_n)\right\}d\beta\, d\sigma^2\\ &=\int_0^{\infty}\pi(\sigma^2) \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}m \right\} \frac{|{\bf{B}}_n|^{1/2}}{|{\bf{B}}_0|^{1/2}}d\sigma^2\\ &=\int_{0}^{\infty} \frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\} \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}m \right\} \frac{|{\bf{B}}_n|^{1/2}}{|{\bf{B}}_0|^{1/2}} d\sigma^2\\ &= \frac{1}{(2\pi)^{N/2}}\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\frac{|{\bf{B}}_n|^{1/2}}{|{\bf{B}}_0|^{1/2}}\int_{0}^{\infty}\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{-\frac{\delta_0+m}{2\sigma^2}\right\}d\sigma^2\\ &= \frac{1}{\pi^{N/2}}\frac{\delta_0^{\alpha_0/2}}{\delta_n^{\alpha_n/2}}\frac{|{\bf{B}}_n|^{1/2}}{|{\bf{B}}_0|^{1/2}}\frac{\Gamma(\alpha_n/2)}{\Gamma(\alpha_0/2)}. \end{align}\]
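The closed form translates into a numerically stable log marginal likelihood; the function below is a sketch using log-determinants and log-gamma functions.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, X, beta0, B0, alpha0, delta0):
    """log p(y) of the conjugate normal/inverse gamma model, per the closed form."""
    N = len(y)
    B0_inv = np.linalg.inv(B0)
    Bn_inv = B0_inv + X.T @ X
    Bn = np.linalg.inv(Bn_inv)
    beta_n = Bn @ (B0_inv @ beta0 + X.T @ y)
    alpha_n = alpha0 + N
    delta_n = delta0 + y @ y + beta0 @ B0_inv @ beta0 - beta_n @ Bn_inv @ beta_n
    logdet_Bn = np.linalg.slogdet(Bn)[1]
    logdet_B0 = np.linalg.slogdet(B0)[1]
    return (-N / 2 * np.log(np.pi)
            + alpha0 / 2 * np.log(delta0) - alpha_n / 2 * np.log(delta_n)
            + 0.5 * (logdet_Bn - logdet_B0)
            + gammaln(alpha_n / 2) - gammaln(alpha0 / 2))
```

A useful exactness check on such an implementation: by the chain rule, \(p({\bf{y}})=p({\bf{y}}_{1:n})\,p({\bf{y}}_{n+1:N}|{\bf{y}}_{1:n})\), where the second factor is the marginal likelihood of the remaining data with the first-stage posterior hyperparameters as prior.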

The posterior predictive is equal to

\[\begin{align} \pi({\bf{Y}}_0|{\bf{y}})&=\int_{0}^{\infty}\int_{R^K}p({\bf{Y}}_0|\beta,\sigma^2,{\bf{y}})\pi(\beta|\sigma^2,{\bf{y}})\pi(\sigma^2|{\bf{y}})d\beta d\sigma^2\\ &=\int_{0}^{\infty}\int_{R^K}p({\bf{Y}}_0|\beta,\sigma^2)\pi(\beta|\sigma^2,{\bf{y}})\pi(\sigma^2|{\bf{y}})d\beta d\sigma^2, \end{align}\]

where we take into account the independence between \({\bf{Y}}_0\) and \({\bf{Y}}\) given the parameters. Let \({\bf{X}}_0\) be the \(N_0\times K\) matrix of regressors associated with \({\bf{Y}}_0\). Then,

\[\begin{align} \pi({\bf{Y}}_0|{\bf{y}})&=\int_{0}^{\infty}\int_{R^K}\left\{ (2\pi\sigma^2)^{-\frac{N_0}{2}} \exp \left\{-\frac{1}{2\sigma^2} ({\bf{Y}}_0 - {\bf{X}}_0\beta)^{\top}({\bf{Y}}_0 - {\bf{X}}_0\beta) \right\}\right. \\ & \times (2\pi\sigma^2)^{-\frac{K}{2}} |{\bf{B}}_n|^{-1/2} \exp \left\{-\frac{1}{2\sigma^2} (\beta - \beta_n)^{\top}{\bf{B}}_n^{-1}(\beta - \beta_n)\right\} \\ & \left. \times \frac{(\delta_n/2)^{\alpha_n/2}}{\Gamma(\alpha_n/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_n/2+1}\exp \left\{-\frac{\delta_n}{2\sigma^2} \right\}\right\}d\beta d\sigma^2. \\ \end{align}\]

Setting \({\bf{M}}=({\bf{X}}_0^{\top}{\bf{X}}_0+{\bf{B}}_n^{-1})\) and \(\beta_*={\bf{M}}^{-1}({\bf{B}}_n^{-1}\beta_n+{\bf{X}}_0^{\top}{\bf{Y}}_0)\), we have

\[\begin{align} ({\bf{Y}}_0 - {\bf{X}}_0\beta)^{\top}({\bf{Y}}_0 - {\bf{X}}_0\beta)+(\beta - \beta_n)^{\top}{\bf{B}}_n^{-1}(\beta - \beta_n)&=(\beta - \beta_*)^{\top}{\bf{M}}(\beta - \beta_*)\\ &+\beta_n^{\top}{\bf{B}}_n^{-1}\beta_n+{\bf{Y}}_0^{\top}{\bf{Y}}_0-\beta_*^{\top}{\bf{M}}\beta_*. \end{align}\]

Thus,

\[\begin{align} \pi({\bf{Y}}_0|{\bf{y}})&\propto\int_{0}^{\infty}\left\{\left(\frac{1}{\sigma^2}\right)^{\frac{K+N_0+\alpha_n}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}(\beta_n^{\top}{\bf{B}}_n^{-1}\beta_n+{\bf{Y}}_0^{\top}{\bf{Y}}_0-\beta_*^{\top}{\bf{M}}\beta_*+\delta_n)\right\}\right.\\ &\times\left.\int_{R^K}\exp\left\{-\frac{1}{2\sigma^2}(\beta - \beta_*)^{\top}{\bf{M}}(\beta - \beta_*)\right\}d\beta\right\} d\sigma^2,\\ \end{align}\] where the term in the second integral is the kernel of a multivariate normal density with mean \(\beta_*\) and covariance matrix \(\sigma^2{\bf{M}}^{-1}\). Then,

\[\begin{align} \pi({\bf{Y}}_0|{\bf{y}})&\propto\int_{0}^{\infty}\left(\frac{1}{\sigma^2}\right)^{\frac{N_0+\alpha_n}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}(\beta_n^{\top}{\bf{B}}_n^{-1}\beta_n+{\bf{Y}}_0^{\top}{\bf{Y}}_0-\beta_*^{\top}{\bf{M}}\beta_*+\delta_n)\right\}d\sigma^2,\\ \end{align}\]

which is the kernel of an inverse gamma density. Thus,

\[\begin{align} \pi({\bf{Y}}_0|{\bf{y}})&\propto \left[\frac{\beta_n^{\top}{\bf{B}}_n^{-1}\beta_n+{\bf{Y}}_0^{\top}{\bf{Y}}_0-\beta_*^{\top}{\bf{M}}\beta_*+\delta_n}{2}\right]^{-\frac{\alpha_n+N_0}{2}}. \end{align}\]

Setting \({\bf{C}}^{-1}={\bf{I}}_{N_0}+{\bf{X}}_0{\bf{B}}_n{\bf{X}}_0^{\top}\), such that \({\bf{C}}={\bf{I}}_{N_0}-{\bf{X}}_0({\bf{B}}_n^{-1}+{\bf{X}}_0^{\top}{\bf{X}}_0)^{-1}{\bf{X}}_0^{\top}={\bf{I}}_{N_0}-{\bf{X}}_0{\bf{M}}^{-1}{\bf{X}}_0^{\top}\) (see footnote 1), and \(\beta_{**}={\bf{C}}^{-1}{\bf{X}}_0{\bf{M}}^{-1}{\bf{B}}_n^{-1}\beta_n\), then

\[\begin{align} \beta_n^{\top}{\bf{B}}_n^{-1}\beta_n+{\bf{Y}}_0^{\top}{\bf{Y}}_0-\beta_*^{\top}{\bf{M}}\beta_*&= \beta_n^{\top}{\bf{B}}_n^{-1}\beta_n+{\bf{Y}}_0^{\top}{\bf{Y}}_0-(\beta_n^{\top}{\bf{B}}_n^{-1}+{\bf{Y}}_0^{\top}{\bf{X}}_0){\bf{M}}^{-1}({\bf{B}}_n^{-1}\beta_n+{\bf{X}}_0^{\top}{\bf{Y}}_0)\\ &=\beta_n^{\top}({\bf{B}}_n^{-1}-{\bf{B}}_n^{-1}{\bf{M}}^{-1}{\bf{B}}_n^{-1})\beta_n+{\bf{Y}}_0^{\top}{\bf{C}}{\bf{Y}}_0\\ &-2{\bf{Y}}_0^{\top}{\bf{C}}{\bf{C}}^{-1}{\bf{X}}_0{\bf{M}}^{-1}{\bf{B}}_n^{-1}\beta_n+{\bf{\beta}}_{**}^{\top}{\bf{C}}{\bf{\beta}}_{**}-{\bf{\beta}}_{**}^{\top}{\bf{C}}{\bf{\beta}}_{**}\\ &=\beta_n^{\top}({\bf{B}}_n^{-1}-{\bf{B}}_n^{-1}{\bf{M}}^{-1}{\bf{B}}_n^{-1})\beta_n+({\bf{Y}}_0-{\bf{\beta}}_{**})^{\top}{\bf{C}}({\bf{Y}}_0-{\bf{\beta}}_{**})\\ &-{\bf{\beta}}_{**}^{\top}{\bf{C}}{\bf{\beta}}_{**}, \end{align}\]

where \(\beta_n^{\top}({\bf{B}}_n^{-1}-{\bf{B}}_n^{-1}{\bf{M}}^{-1}{\bf{B}}_n^{-1})\beta_n={\bf{\beta}}_{**}^{\top}{\bf{C}}{\bf{\beta}}_{**}\) and \(\beta_{**}={\mathbf{X}}_0\beta_n\) (see Exercise 6).
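The two matrix identities used here, \({\bf{C}}{\bf{C}}^{-1}={\bf{I}}_{N_0}\) and \(\beta_{**}={\bf{X}}_0\beta_n\), can be spot-checked numerically; all sizes and matrices below are illustrative.

```python
import numpy as np

# Spot-check (illustrative sizes) that C = I - X0 M^{-1} X0' inverts
# I + X0 Bn X0', and that beta_** = C^{-1} X0 M^{-1} Bn^{-1} beta_n = X0 beta_n.
rng = np.random.default_rng(6)
K, N0 = 3, 5
A = rng.normal(size=(K + 2, K))
Bn = np.linalg.inv(A.T @ A)                 # a random symmetric positive definite matrix
X0 = rng.normal(size=(N0, K))
beta_n = rng.normal(size=K)

Bn_inv = np.linalg.inv(Bn)
M = X0.T @ X0 + Bn_inv
C_inv = np.eye(N0) + X0 @ Bn @ X0.T
C = np.eye(N0) - X0 @ np.linalg.solve(M, X0.T)
beta_ss = np.linalg.solve(C, X0 @ np.linalg.solve(M, Bn_inv @ beta_n))
```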

Then,

\[\begin{align} \pi({\bf{Y}}_0|{\bf{y}})&\propto\left[\frac{({\bf{Y}}_0-{\mathbf{X}}_0\beta_n)^{\top}{\bf{C}}({\bf{Y}}_0-{\mathbf{X}}_0\beta_n)+\delta_n}{2}\right]^{-\frac{\alpha_n+N_0}{2}}\\ &\propto\left[\frac{({\bf{Y}}_0-{\mathbf{X}}_0\beta_n)^{\top}\left(\frac{{\bf{C}}\alpha_n}{\delta_n}\right)({\bf{Y}}_0-{\mathbf{X}}_0\beta_n)}{\alpha_n}+1\right]^{-\frac{\alpha_n+N_0}{2}}. \end{align}\]

Then, the posterior predictive is a multivariate Student’s \(t\), \({\bf{Y}}_0|{\bf{y}}\sim t_{N_0}\left(\alpha_n,{\bf{X}}_0\beta_n,\frac{\delta_n({\bf{I}}_{N_0}+{\bf{X}}_0{\bf{B}}_n{\bf{X}}_0^{\top})}{\alpha_n}\right)\), with degrees of freedom, location, and scale written in the same order as for the marginal posterior of \(\beta\).
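A composition-sampling sketch of this predictive, with illustrative data and priors: draw \(\sigma^2\sim IG(\alpha_n/2,\delta_n/2)\), then \(\beta|\sigma^2\sim N(\beta_n,\sigma^2{\bf{B}}_n)\), then \({\bf{Y}}_0|\beta,\sigma^2\sim N({\bf{X}}_0\beta,\sigma^2{\bf{I}})\); the resulting draws should match the Student's \(t\) mean \({\bf{X}}_0\beta_n\) and covariance \(\frac{\delta_n}{\alpha_n-2}({\bf{I}}_{N_0}+{\bf{X}}_0{\bf{B}}_n{\bf{X}}_0^{\top})\).

```python
import numpy as np
from scipy import stats

# Composition sampling of the posterior predictive (illustrative data and priors):
# sigma^2 ~ IG(alpha_n/2, delta_n/2), beta | sigma^2 ~ N(beta_n, sigma^2 Bn),
# Y0 | beta, sigma^2 ~ N(X0 beta, sigma^2 I).
rng = np.random.default_rng(7)
N, K, N0 = 150, 2, 4
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=N)
X0 = np.column_stack([np.ones(N0), rng.normal(size=N0)])

beta0, B0_inv, alpha0, delta0 = np.zeros(K), np.eye(K), 4.0, 4.0
Bn_inv = B0_inv + X.T @ X
Bn = np.linalg.inv(Bn_inv)
beta_n = Bn @ (B0_inv @ beta0 + X.T @ y)
alpha_n = alpha0 + N
delta_n = delta0 + y @ y + beta0 @ B0_inv @ beta0 - beta_n @ Bn_inv @ beta_n

S = 30000
sig2 = stats.invgamma.rvs(a=alpha_n / 2, scale=delta_n / 2, size=S, random_state=rng)
beta_draw = beta_n + np.sqrt(sig2)[:, None] * rng.multivariate_normal(np.zeros(K), Bn, size=S)
Y0_draw = beta_draw @ X0.T + np.sqrt(sig2)[:, None] * rng.normal(size=(S, N0))

pred_mean = X0 @ beta_n                                             # Student's t location
pred_cov = delta_n / (alpha_n - 2) * (np.eye(N0) + X0 @ Bn @ X0.T)  # Student's t covariance
```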

References

Smith, A. F. M. 1973. “A General Bayesian Linear Model.” Journal of the Royal Statistical Society, Series B (Methodological) 35 (1): 67–75.

  1. A particular case of the Woodbury matrix identity.

  2. Using the result \(({\bf{A}}+{\bf{B}}{\bf{D}}{\bf{C}})^{-1}={\bf{A}}^{-1}-{\bf{A}}^{-1}{\bf{B}}({\bf{D}}^{-1}+{\bf{C}}{\bf{A}}^{-1}{\bf{B}})^{-1}{\bf{C}}{\bf{A}}^{-1}\).