3.3 Linear regression: The conjugate normal-normal/inverse gamma model

In this setting, we analyze the conjugate normal-normal/inverse gamma model, which is a cornerstone in econometrics. In this model, the dependent variable y_i is related to a set of regressors \boldsymbol{x}_i = [x_{i1} \ x_{i2} \ \dots \ x_{iK}]^{\top} in a linear way, that is: y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_K x_{iK} + \mu_i = \boldsymbol{x}_i^{\top} \boldsymbol{\beta} + \mu_i, where \boldsymbol{\beta} = [\beta_1 \ \beta_2 \ \dots \ \beta_K]^{\top} and \mu_i \stackrel{iid}{\sim} N(0, \sigma^2) is a stochastic error such that \mathbb{E}[\mu_i \mid \boldsymbol{x}_i] = 0.

Defining the vectors and matrices: \boldsymbol{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \quad \boldsymbol{X} = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1K} \\ x_{21} & x_{22} & \dots & x_{2K} \\ \vdots & \vdots & \vdots & \vdots \\ x_{N1} & x_{N2} & \dots & x_{NK} \end{bmatrix}, \quad \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_N \end{bmatrix},

we can write the model in matrix form as: \boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\mu},

where \boldsymbol{\mu} \sim N(\boldsymbol{0}, \sigma^2 \boldsymbol{I}). This implies that: \boldsymbol{y} \sim N(\boldsymbol{X} \boldsymbol{\beta}, \sigma^2 \boldsymbol{I}).
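To fix ideas before deriving the posterior distribution, the following minimal R sketch simulates data from this model; the sample size, the number of regressors, and the parameter values are arbitrary choices for illustration only.

set.seed(10)
N <- 500; K <- 3                                   # Illustrative sample size and number of regressors
X <- cbind(1, matrix(rnorm(N*(K - 1)), N, K - 1))  # Design matrix including an intercept
Beta <- c(1, -0.5, 2)                              # Arbitrary "true" location parameters
sig2 <- 0.5                                        # Arbitrary "true" variance
y <- X%*%Beta + rnorm(N, 0, sqrt(sig2))            # y = X*beta + mu, mu ~ N(0, sig2*I)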

To simplify notation in regression analysis, we depart from the standard statistical convention in which uppercase letters denote random variables and lowercase letters their realizations. We trust it is clear from the context when we refer to random vectors and matrices and when we refer to their realizations. Thus, we use bold lowercase letters for vectors and bold uppercase letters for matrices; this convention applies throughout the rest of the book.

Thus, the likelihood function is: \begin{align*} p({\boldsymbol{y}}\mid \boldsymbol{\beta}, \sigma^2, {{\boldsymbol{X}}}) & = (2\pi\sigma^2)^{-\frac{N}{2}} \exp \left\{-\frac{1}{2\sigma^2} ({\boldsymbol{y}} - {\boldsymbol{X}}\boldsymbol{\beta})^{\top}({\boldsymbol{y}} - {\boldsymbol{X}}\boldsymbol{\beta}) \right\} \\ & \propto (\sigma^2)^{-\frac{N}{2}} \exp \left\{-\frac{1}{2\sigma^2} ({\boldsymbol{y}} - {\boldsymbol{X}}\boldsymbol{\beta})^{\top}({\boldsymbol{y}} - {\boldsymbol{X}}\boldsymbol{\beta}) \right\}. \end{align*}

The conjugate priors for the parameters are \begin{align*} \boldsymbol{\beta}\mid \sigma^2 & \sim N(\boldsymbol{\beta}_0, \sigma^2 {\boldsymbol{B}}_0),\\ \sigma^2 & \sim IG(\alpha_0/2, \delta_0/2). \end{align*}

Then, the posterior distribution is \begin{align*} \pi(\boldsymbol{\beta},\sigma^2\mid \boldsymbol{y},\boldsymbol{X})&\propto (\sigma^2)^{-\frac{N}{2}} \exp \left\{-\frac{1}{2\sigma^2} ({\boldsymbol{y}} - {\boldsymbol{X}}\boldsymbol{\beta})^{\top}({\boldsymbol{y}} - {\boldsymbol{X}}\boldsymbol{\beta}) \right\} \\ & \times (\sigma^2)^{-\frac{K}{2}} \exp \left\{-\frac{1}{2\sigma^2} (\boldsymbol{\beta} - \boldsymbol{\beta}_0)^{\top}{\boldsymbol{B}}_0^{-1}(\boldsymbol{\beta} - \boldsymbol{\beta}_0)\right\} \\ & \times \frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp \left\{-\frac{\delta_0}{2\sigma^2} \right\} \\ & \propto (\sigma^2)^{-\frac{K}{2}} \exp \left\{-\frac{1}{2\sigma^2} [\boldsymbol{\beta}^{\top}({\boldsymbol{B}}_0^{-1} + {\boldsymbol{X}}^{\top}{\boldsymbol{X}})\boldsymbol{\beta} - 2\boldsymbol{\beta}^{\top}({\boldsymbol{B}}_0^{-1}\boldsymbol{\beta}_0 + {\boldsymbol{X}}^{\top}{\boldsymbol{X}}\hat{\boldsymbol{\beta}})] \right\} \\ & \times \left(\frac{1}{\sigma^2}\right)^{(\alpha_0+N)/2+1}\exp \left\{-\frac{\delta_0+ {\boldsymbol{y}}^{\top}{\boldsymbol{y}} + \boldsymbol{\beta}_0^{\top}{\boldsymbol{B}}_0^{-1}\boldsymbol{\beta}_0}{2\sigma^2} \right\}, \end{align*}

where \hat{\boldsymbol{\beta}}=({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^{\top}{\boldsymbol{y}} is the maximum likelihood estimator.

Adding and subtracting \boldsymbol{\beta}_n^{\top}{{\boldsymbol{B}}}_n^{-1} \boldsymbol{\beta}_n to complete the square, where \boldsymbol{B}_n = (\boldsymbol{B}_0^{-1} + \boldsymbol{X}^{\top}\boldsymbol{X})^{-1} and \boldsymbol{\beta}_n = \boldsymbol{B}_n(\boldsymbol{B}_0^{-1}\boldsymbol{\beta}_0 + \boldsymbol{X}^{\top}\boldsymbol{X}\hat{\boldsymbol{\beta}}), \begin{align*} \pi(\boldsymbol{\beta},\sigma^2\mid \boldsymbol{y},\boldsymbol{X})&\propto \underbrace{(\sigma^2)^{-\frac{K}{2}} \exp \left\{-\frac{1}{2\sigma^2} (\boldsymbol{\beta}-\boldsymbol{\beta}_n)^{\top}{\boldsymbol{B}}^{-1}_n(\boldsymbol{\beta}-\boldsymbol{\beta}_n) \right\}}_1 \\ & \times \underbrace{(\sigma^2)^{-\left(\frac{\alpha_n}{2}+1 \right)} \exp \left\{-\frac{\delta_n}{2\sigma^2} \right\}}_2. \end{align*}

The first expression is the kernel of a normal density function, \boldsymbol{\beta}\mid \sigma^2, \boldsymbol{y}, \boldsymbol{X} \sim N(\boldsymbol{\beta}_n, \sigma^2\boldsymbol{B}_n). The second expression is the kernel of an inverse gamma density, \sigma^2\mid \boldsymbol{y}, \boldsymbol{X}\sim IG(\alpha_n/2, \delta_n/2), where \alpha_n = \alpha_0 + N and \delta_n = \delta_0 + \boldsymbol{y}^{\top}\boldsymbol{y} + \boldsymbol{\beta}_0^{\top}\boldsymbol{B}_0^{-1}\boldsymbol{\beta}_0 - \boldsymbol{\beta}_n^{\top}\boldsymbol{B}_n^{-1}\boldsymbol{\beta}_n.
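Continuing the simulated sketch above, the posterior parameters have closed forms that can be computed directly; the hyperparameter values below are arbitrary illustrative choices.

# Arbitrary illustrative hyperparameters
b0 <- rep(0, K)                                    # Prior mean of beta
B0 <- 100*diag(K)                                  # Prior covariance factor of beta
a0 <- 0.01; d0 <- 0.01                             # Prior parameters of sigma^2
# Posterior parameters
bhat <- solve(t(X)%*%X)%*%t(X)%*%y                 # Maximum likelihood estimator
Bn <- solve(solve(B0) + t(X)%*%X)                  # B_n
bn <- Bn%*%(solve(B0)%*%b0 + t(X)%*%X%*%bhat)      # beta_n
an <- a0 + N                                       # alpha_n
dn <- as.numeric(d0 + t(y)%*%y + t(b0)%*%solve(B0)%*%b0 - t(bn)%*%solve(Bn)%*%bn)  # delta_n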

Taking into account that \begin{align*}\boldsymbol{\beta}_n & = (\boldsymbol{B}_0^{-1} + \boldsymbol{X}^{\top}\boldsymbol{X})^{-1}(\boldsymbol{B}_0^{-1}\boldsymbol{\beta}_0 + \boldsymbol{X}^{\top}\boldsymbol{X}\hat{\boldsymbol{\beta}})\\ & = (\boldsymbol{B}_0^{-1} + \boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{B}_0^{-1}\boldsymbol{\beta}_0 + (\boldsymbol{B}_0^{-1} + \boldsymbol{X}^{\top}\boldsymbol{X})^{-1} \boldsymbol{X}^{\top}\boldsymbol{X}\hat{\boldsymbol{\beta}}, \end{align*}

where ({\boldsymbol{B}}_0^{-1} + {\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}{\boldsymbol{B}}_0^{-1}=\boldsymbol{I}_K-({\boldsymbol{B}}_0^{-1} + {\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^{\top}{\boldsymbol{X}} (A. F. M. Smith 1973). Setting {\boldsymbol{W}}=({\boldsymbol{B}}_0^{-1} + {\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}, we have \boldsymbol{\beta}_n=(\boldsymbol{I}_K-{\boldsymbol{W}})\boldsymbol{\beta}_0+{\boldsymbol{W}}\hat{\boldsymbol{\beta}}; that is, the posterior mean of \boldsymbol{\beta} is a weighted average of the prior and sample information, where the weights depend on the precision of each piece of information. Observe that when the prior covariance matrix is highly vague (non-informative), such that {\boldsymbol{B}}_0^{-1}\rightarrow \boldsymbol{0}_K, we obtain {\boldsymbol{W}} \rightarrow \boldsymbol{I}_K, so that \boldsymbol{\beta}_n \rightarrow \hat{\boldsymbol{\beta}}; that is, the posterior mean of the location parameters converges to the maximum likelihood estimator.
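Continuing the sketch, the weighted-average representation of the posterior mean can be checked numerically.

W <- solve(solve(B0) + t(X)%*%X)%*%t(X)%*%X        # Weight matrix W
bnW <- (diag(K) - W)%*%b0 + W%*%bhat               # Weighted average of prior mean and MLE
max(abs(bnW - bn))                                 # Numerically zero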

In addition, the posterior conditional covariance matrix of the location parameters satisfies \sigma^2({\boldsymbol{B}}_0^{-1} + {\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}=\sigma^2({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}-\sigma^2\left(({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}({\boldsymbol{B}}_0 + ({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1})^{-1}({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}\right), where the subtracted matrix is positive semi-definite.^1 Given that \sigma^2({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1} is the covariance matrix of the maximum likelihood estimator, we observe that prior information reduces estimation uncertainty.

Another way to see this model is by considering that both \boldsymbol{y} and \boldsymbol{\beta} are treated as random vectors under the Bayesian framework. Thus, we can express the joint distribution of these two vectors as follows: \begin{align*} \begin{bmatrix} \boldsymbol{\beta}\\ \boldsymbol{y} \end{bmatrix}\sim N\left [ \begin{pmatrix} \boldsymbol{\beta}_{0} \\ \boldsymbol{X}\boldsymbol{\beta}_{0} \end{pmatrix} , \sigma^2\begin{pmatrix} \boldsymbol{B}_{0} & \boldsymbol{B}_{0} \boldsymbol{X}^{\top} \\ \boldsymbol{X}\boldsymbol{B}_{0} & \boldsymbol{X}\boldsymbol{B}_{0}\boldsymbol{X}^{\top}+\boldsymbol{I}_N \end{pmatrix}\right ], \end{align*} where we use that \begin{align*} Cov[\boldsymbol{\beta},\boldsymbol{y}]&=\mathbb{E}[\boldsymbol{\beta}\boldsymbol{y}^{\top}]-\mathbb{E}[\boldsymbol{\beta}]\mathbb{E}[\boldsymbol{y}^{\top}]=\mathbb{E}[\boldsymbol{\beta}(\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\mu})^{\top}]-\mathbb{E}[\boldsymbol{\beta}]\mathbb{E}[\boldsymbol{y}^{\top}]\\ &=\left[Var[\boldsymbol{\beta}]+\mathbb{E}[\boldsymbol{\beta}]\mathbb{E}[\boldsymbol{\beta}^{\top}]\right]\boldsymbol{X}^{\top}-\mathbb{E}[\boldsymbol{\beta}]\mathbb{E}[\boldsymbol{y}^{\top}]\\ &=\sigma^2\boldsymbol{B}_0\boldsymbol{X}^{\top}+\boldsymbol{\beta}_0\boldsymbol{\beta}_0^{\top}\boldsymbol{X}^{\top}-\boldsymbol{\beta}_0\boldsymbol{\beta}_0^{\top}\boldsymbol{X}^{\top}=\sigma^2\boldsymbol{B}_0\boldsymbol{X}^{\top}. \end{align*}

Then, we can obtain the conditional distribution of \boldsymbol{\beta} \mid \boldsymbol{y} using the properties of the multivariate normal distribution. This distribution is normal with mean equal to \boldsymbol{\beta}_{0} + \boldsymbol{B}_{0} \boldsymbol{X}^{\top} \left( \boldsymbol{X} \boldsymbol{B}_{0} \boldsymbol{X}^{\top} + \boldsymbol{I}_N \right)^{-1} (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta}_{0}),

and covariance matrix \sigma^2 \left( \boldsymbol{B}_{0} - \boldsymbol{B}_{0} \boldsymbol{X}^{\top} \left( \boldsymbol{X} \boldsymbol{B}_{0} \boldsymbol{X}^{\top} + \boldsymbol{I}_N \right)^{-1} \boldsymbol{X} \boldsymbol{B}_{0}^{\top} \right).

Observe that in this representation, the posterior mean is equal to the prior mean plus a correction term that takes into account the deviation between the observations and the prior expected value (\boldsymbol{X} \boldsymbol{\beta}_{0}). The weight of this correction is given by the matrix \boldsymbol{B}_{0} \boldsymbol{X}^{\top} \left( \boldsymbol{X} \boldsymbol{B}_{0} \boldsymbol{X}^{\top} + \boldsymbol{I}_N \right)^{-1}.

This form of expressing the posterior distribution is relevant for gaining intuition about Bayesian inference in time series models within the Gaussian linear state-space representation of Chapter @ref{Chap8}, where this sequential updating scheme is known as the Kalman filter in the time series literature.

We can show that both conditional posterior distributions are the same. In particular, the posterior mean in this representation is [\boldsymbol{I}_K-\boldsymbol{B}_{0}\boldsymbol{X}^{\top}(\boldsymbol{X}\boldsymbol{B}_{0}\boldsymbol{X}^{\top}+ \boldsymbol{I}_N)^{-1}\boldsymbol{X}]\boldsymbol{\beta}_{0}+\boldsymbol{B}_{0}\boldsymbol{X}^{\top}(\boldsymbol{X}\boldsymbol{B}_{0}\boldsymbol{X}^{\top}+ \boldsymbol{I}_N)^{-1}\boldsymbol{y}, where \begin{align*} \boldsymbol{B}_{0}\boldsymbol{X}^{\top}(\boldsymbol{X}\boldsymbol{B}_{0}\boldsymbol{X}^{\top}+ \boldsymbol{I}_N)^{-1} &=\boldsymbol{B}_{0}\boldsymbol{X}^{\top}[\boldsymbol{I}_N-\boldsymbol{I}_N\boldsymbol{X}(\boldsymbol{B}_0^{-1}+\boldsymbol{X}^{\top}\boldsymbol{I}_N\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{I}_N]\\ &=\boldsymbol{B}_{0}[\boldsymbol{I}_K-\boldsymbol{X}^{\top}\boldsymbol{I}_N\boldsymbol{X}(\boldsymbol{B}_0^{-1}+\boldsymbol{X}^{\top}\boldsymbol{I}_N\boldsymbol{X})^{-1}]\boldsymbol{X}^{\top}\\ &=\boldsymbol{B}_{0}[\boldsymbol{I}_K-[\boldsymbol{I}_K-\boldsymbol{B}_0^{-1}(\boldsymbol{B}_0^{-1}+\boldsymbol{X}^{\top}\boldsymbol{I}_N\boldsymbol{X})^{-1}]]\boldsymbol{X}^{\top}\\ &=(\boldsymbol{B}_0^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}, \end{align*} where the first equality uses the Woodbury matrix identity (matrix inversion lemma), and the third equality uses \boldsymbol{D}(\boldsymbol{D}+\boldsymbol{E})^{-1}=\boldsymbol{I}-\boldsymbol{E}(\boldsymbol{D}+\boldsymbol{E})^{-1}.

Thus, [\boldsymbol{I}_K-\boldsymbol{B}_{0}\boldsymbol{X}^{\top}(\boldsymbol{X}\boldsymbol{B}_{0}\boldsymbol{X}^{\top}+ \boldsymbol{I}_N)^{-1}\boldsymbol{X}]\boldsymbol{\beta}_{0}+\boldsymbol{B}_{0}\boldsymbol{X}^{\top}(\boldsymbol{X}\boldsymbol{B}_{0}\boldsymbol{X}^{\top}+ \boldsymbol{I}_N)^{-1}\boldsymbol{y}=[\boldsymbol{I}_K-(\boldsymbol{B}_0^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{X}]\boldsymbol{\beta}_{0}+(\boldsymbol{B}_0^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{y}=[\boldsymbol{I}_K-(\boldsymbol{B}_0^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{X}]\boldsymbol{\beta}_{0}+(\boldsymbol{B}_0^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{X}\hat{\boldsymbol{\beta}}. Again, we see that the posterior mean is a weighted average between the prior mean, and the maximum likelihood estimator.

The equality of variances of both approaches is as follows: \begin{align*} Var[\boldsymbol{\beta}\mid \boldsymbol{y}]& = \sigma^2(\boldsymbol{B}_{0}-\boldsymbol{B}_{0}\boldsymbol{X}^{\top}(\boldsymbol{X}\boldsymbol{B}_{0}\boldsymbol{X}^\top+\boldsymbol{I}_N)^{-1} \boldsymbol{X}\boldsymbol{B}_{0})\\ &=\sigma^2(\boldsymbol{B}_{0}-\boldsymbol{B}_{0}\boldsymbol{X}^{\top}(\boldsymbol{I}_N- \boldsymbol{I}_N\boldsymbol{X}(\boldsymbol{B}_{0}^{-1}+\boldsymbol{X}^{\top}\boldsymbol{I}_N\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{I}_N)\boldsymbol{X}\boldsymbol{B}_{0})\\ &=\sigma^2(\boldsymbol{B}_{0}-\boldsymbol{B}_{0}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{B}_{0}+ \boldsymbol{B}_{0}\boldsymbol{X}^{\top}\boldsymbol{X}(\boldsymbol{B}_{0}^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{B}_{0})\\ &=\sigma^2(\boldsymbol{B}_{0}-\boldsymbol{B}_{0}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{B}_{0}+ \boldsymbol{B}_{0}\boldsymbol{X}^{\top}\boldsymbol{X}[\boldsymbol{I}_K-(\boldsymbol{B}_{0}^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{B}_{0}^{-1}]\boldsymbol{B}_{0})\\ &=\sigma^2(\boldsymbol{B}_{0}-\boldsymbol{B}_{0}\boldsymbol{X}^{\top}\boldsymbol{X}(\boldsymbol{B}_{0}^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1})\\ &=\sigma^2(\boldsymbol{B}_{0}[\boldsymbol{I}_K-\boldsymbol{X}^{\top}\boldsymbol{X}(\boldsymbol{B}_{0}^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}])\\ &=\sigma^2(\boldsymbol{B}_{0}[\boldsymbol{I}_K-(\boldsymbol{I}_K-\boldsymbol{B}_{0}^{-1}(\boldsymbol{B}_{0}^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1})])\\ &=\sigma^2(\boldsymbol{B}_{0}^{-1}+\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}, \end{align*} where the second equality uses the Woodbury matrix identity, the fourth equality uses (\boldsymbol{D}+\boldsymbol{E})^{-1}\boldsymbol{D}=\boldsymbol{I}-(\boldsymbol{D}+\boldsymbol{E})^{-1}\boldsymbol{E}, and the seventh equality uses \boldsymbol{D}(\boldsymbol{D}+\boldsymbol{E})^{-1}=\boldsymbol{I}-\boldsymbol{E}(\boldsymbol{D}+\boldsymbol{E})^{-1}.
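The equivalence of the two representations can also be verified numerically with the quantities from the earlier simulated sketch; the N by N inversion below is only intended for a small illustrative sample.

A <- B0%*%t(X)%*%solve(X%*%B0%*%t(X) + diag(N))    # Gain matrix B_0 X'(X B_0 X' + I_N)^(-1)
bnAlt <- b0 + A%*%(y - X%*%b0)                     # Conditional mean of beta given y
BnAlt <- B0 - A%*%X%*%B0                           # Conditional covariance factor
c(max(abs(bnAlt - bn)), max(abs(BnAlt - Bn)))      # Both differences are numerically negligible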

Now, we calculate the posterior marginal distribution of \boldsymbol{\beta} following the standard approach, \begin{align*} \pi(\boldsymbol{\beta}\mid {\boldsymbol{y}},{\boldsymbol{X}}) & = \int_0^{\infty} \pi(\boldsymbol{\beta}, \sigma^2\mid {\boldsymbol{y}},{\boldsymbol{X}}) d\sigma^2 \\ & \propto \int_0^{\infty} \left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+K}{2} + 1} \exp \left\{-\frac{s}{2\sigma^2}\right\} d\sigma^2, \end{align*} where s = \delta_n + (\boldsymbol{\beta} - \boldsymbol{\beta}_n)^{\top}{{\boldsymbol{B}}}_n^{-1}(\boldsymbol{\beta} - \boldsymbol{\beta}_n). Then we can write \begin{align*} \pi(\boldsymbol{\beta}\mid {\boldsymbol{y}},{\boldsymbol{X}}) & \propto \int_0^{\infty} \left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+K}{2} + 1} \exp \left\{-\frac{s}{2\sigma^2}\right\} d\sigma^2 \\ & = \frac{\Gamma((\alpha_n+K)/2)}{(s/2)^{(\alpha_n+K)/2}} \int_0^{\infty} \frac{(s/2)^{(\alpha_n+K)/2}}{\Gamma((\alpha_n+K)/2)} (\sigma^2)^{-(\alpha_n+K)/2 - 1} \exp \left\{-\frac{s}{2\sigma^2}\right\} d\sigma^2. \end{align*}

The right term is the integral of the probability density function of an inverse gamma distribution with parameters \nu = (\alpha_n+K)/2 and \tau = s/2. Since we are integrating over the whole support of \sigma^2, the integral is equal to 1, and therefore \begin{align*} \pi(\boldsymbol{\beta}\mid {\boldsymbol{y}},{\boldsymbol{X}}) & \propto \frac{\Gamma((\alpha_n+K)/2)}{(s/2)^{(\alpha_n+K)/2}} \\ & \propto s^{-(\alpha_n+K)/2} \\ & = [\delta_n + (\boldsymbol{\beta} - \boldsymbol{\beta}_n)^{\top}{{\boldsymbol{B}}}_n^{-1}(\boldsymbol{\beta} - \boldsymbol{\beta}_n)]^{-(\alpha_n+K)/2} \\ & = \left[1 + \frac{(\boldsymbol{\beta} - \boldsymbol{\beta}_n)^{\top}\left(\frac{\delta_n}{\alpha_n}{{\boldsymbol{B}}}_n\right)^{-1}(\boldsymbol{\beta} - \boldsymbol{\beta}_n)}{\alpha_n}\right]^{-(\alpha_n+K)/2}(\delta_n)^{-(\alpha_n+K)/2} \\ & \propto \left[1 + \frac{(\boldsymbol{\beta} - \boldsymbol{\beta}_n)^{\top}{\boldsymbol{H}}_n^{-1}(\boldsymbol{\beta} - \boldsymbol{\beta}_n)}{\alpha_n}\right]^{-(\alpha_n+K)/2}, \end{align*} where {\boldsymbol{H}}_n = \frac{\delta_n}{\alpha_n}{\boldsymbol{B}}_n. This last expression is the kernel of a multivariate t distribution for \boldsymbol{\beta}, that is, \boldsymbol{\beta}\mid {\boldsymbol{y}},{\boldsymbol{X}} \sim t_K(\alpha_n, \boldsymbol{\beta}_n, {\boldsymbol{H}}_n).
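The marginal t posterior of \boldsymbol{\beta} can be sampled either directly or by composition (drawing \sigma^2 from its inverse gamma posterior and then \boldsymbol{\beta} given \sigma^2 from its conditional normal); both routes target the same distribution. The sketch below continues the simulated example and assumes the MCMCpack, LaplacesDemon, and MASS packages are installed.

S <- 5000                                          # Number of posterior draws
Hn <- dn*Bn/an                                     # Scale matrix of the marginal t distribution
BetaDirect <- LaplacesDemon::rmvt(S, c(bn), Hn, an)      # Direct draws from t_K(alpha_n, beta_n, H_n)
sig2Draws <- MCMCpack::rinvgamma(S, an/2, dn/2)          # sigma^2 | y ~ IG(alpha_n/2, delta_n/2)
BetaComp <- t(sapply(sig2Draws, function(s2) MASS::mvrnorm(1, c(bn), s2*Bn)))  # beta | sigma^2, y
rbind(colMeans(BetaDirect), colMeans(BetaComp))    # Similar posterior means from both routes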

Observe that as we have incorporated the uncertainty of the variance, the posterior for \boldsymbol{\beta} changes from a normal to a t distribution, which has heavier tails, indicating more uncertainty.

The marginal likelihood of this model is \begin{align*} p({\boldsymbol{y}})=\int_0^{\infty}\int_{R^K}\pi (\boldsymbol{\beta} \mid \sigma^2,{\boldsymbol{B}}_0,\boldsymbol{\beta}_0 )\pi(\sigma^2\mid \alpha_0/2, \delta_0/2)p({\boldsymbol{y}}\mid \boldsymbol{\beta}, \sigma^2, {\boldsymbol{X}})d\sigma^2 d\boldsymbol{\beta}. \end{align*}

Taking into account that ({\boldsymbol{y}}-{\boldsymbol{X}}\boldsymbol{\beta})^{\top}({\boldsymbol{y}}-{\boldsymbol{X}}\boldsymbol{\beta})+(\boldsymbol{\beta}-\boldsymbol{\beta}_0)^{\top}{\boldsymbol{B}}_0^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_0)=(\boldsymbol{\beta}-\boldsymbol{\beta}_n)^{\top}{\boldsymbol{B}}_n^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)+m, where m={\boldsymbol{y}}^{\top}{\boldsymbol{y}}+\boldsymbol{\beta}_0^{\top}{\boldsymbol{B}}_0^{-1}\boldsymbol{\beta}_0-\boldsymbol{\beta}_n^{\top}{\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n, we have that

\begin{align*} p({\boldsymbol{y}})&=\int_0^{\infty}\int_{R^K}\pi (\boldsymbol{\beta} \mid \sigma^2)\pi(\sigma^2)p({\boldsymbol{y}}\mid \boldsymbol{\beta}, \sigma^2, {\boldsymbol{X}})d\sigma^2 d\boldsymbol{\beta}\\ &=\int_0^{\infty}\pi(\sigma^2) \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}m \right\} \frac{1}{(2\pi\sigma^2)^{K/2}|{\boldsymbol{B}}_0|^{1/2}}\\ &\times\int_{R^K}\exp\left\{-\frac{1}{2\sigma^2}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)^{\top}{\boldsymbol{B}}_n^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)\right\}d\sigma^2 d\boldsymbol{\beta}\\ &=\int_0^{\infty}\pi(\sigma^2) \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}m \right\} \frac{|{\boldsymbol{B}}_n|^{1/2}}{|{\boldsymbol{B}}_0|^{1/2}}d\sigma^2\\ &=\int_{0}^{\infty} \frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp\left\{\left(-\frac{\delta_0}{2\sigma^2}\right)\right\} \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}m \right\} \frac{|{\boldsymbol{B}}_n|^{1/2}}{|{\boldsymbol{B}}_0|^{1/2}} d\sigma^2\\ &= \frac{1}{(2\pi)^{N/2}}\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\frac{|{\boldsymbol{B}}_n|^{1/2}}{|{\boldsymbol{B}}_0|^{1/2}}\int_{0}^{\infty}\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{\left(-\frac{\delta_0+m}{2\sigma^2}\right)\right\}d\sigma^2\\ &= \frac{1}{\pi^{N/2}}\frac{\delta_0^{\alpha_0/2}}{\delta_n^{\alpha_n/2}}\frac{|{\boldsymbol{B}}_n|^{1/2}}{|{\boldsymbol{B}}_0|^{1/2}}\frac{\Gamma(\alpha_n/2)}{\Gamma(\alpha_0/2)}. \end{align*}
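In the simulated sketch, this closed form can be evaluated directly on the log scale; the application code later in this section implements essentially the same computation for the electricity data.

logML <- (N/2)*log(1/pi) + (a0/2)*log(d0) - (an/2)*log(dn) +
    0.5*log(det(Bn)/det(B0)) + lgamma(an/2) - lgamma(a0/2)  # log p(y) in closed form
logML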

We can show that \delta_n=\delta_0 + {\boldsymbol{y}}^{\top}{\boldsymbol{y}} + \boldsymbol{\beta}_0^{\top}{\boldsymbol{B}}_0^{-1}\boldsymbol{\beta}_0 - \boldsymbol{\beta}_n^{\top}{\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n=\delta_0+({\boldsymbol{y}}-{\boldsymbol{X}}\hat{\boldsymbol{\beta}})^{\top}({\boldsymbol{y}}-{\boldsymbol{X}}\hat{\boldsymbol{\beta}})+(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}_0)^{\top}(({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}+{\boldsymbol{B}}_0)^{-1}(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}_0) (see Exercise 7). Therefore, if we want to compare two models under this setting, the Bayes factor is \begin{align*} BF_{12}&=\frac{p(\boldsymbol{y}\mid \mathcal{M}_1)}{p(\boldsymbol{y}\mid \mathcal{M}_2)}\\ &=\frac{\frac{\delta_{10}^{\alpha_{10}/2}}{\delta_{1n}^{\alpha_{1n}/2}}\frac{|{\boldsymbol{B}}_{1n}|^{1/2}}{|{\boldsymbol{B}}_{10}|^{1/2}}\frac{\Gamma(\alpha_{1n}/2)}{\Gamma(\alpha_{10}/2)}}{\frac{\delta_{20}^{\alpha_{20}/2}}{\delta_{2n}^{\alpha_{2n}/2}}\frac{|{\boldsymbol{B}}_{2n}|^{1/2}}{|{\boldsymbol{B}}_{20}|^{1/2}}\frac{\Gamma(\alpha_{2n}/2)}{\Gamma(\alpha_{20}/2)}}, \end{align*}

where subscripts 1 and 2 refer to each model.

Observe that, ceteris paribus, the Bayes factor favors the model with better fit, greater coherence between sample and prior information regarding the location parameters, a larger gain in precision from prior to posterior, and fewer parameters. The Bayes factor rewards model fit because the sum of squared errors enters \delta_n: the lower the sum of squared errors, the stronger the support for the model. In addition, a weighted distance between the sample and prior location parameters also enters \delta_n; the greater this distance, the weaker the support for the model. Finally, the ratio of the determinants of the posterior and prior covariance matrices is also present; the higher this ratio, the more the Bayes factor favors a model, reflecting information gains.

To see the effect of a model's parsimony, let us consider the common situation in applications where \boldsymbol{B}_{j0} = c \boldsymbol{I}_{K_j}, so that | \boldsymbol{B}_{j0} | = c^{K_j}. Hence, \left( \frac{| \boldsymbol{B}_{20} |}{| \boldsymbol{B}_{10} |} \right)^{1/2} = \frac{c^{K_2/2}}{c^{K_1/2}} = c^{(K_2-K_1)/2},

if K_2 > K_1 and c \to \infty (the latter implying a non-informative prior), then BF_{12} \to \infty. This means infinite evidence supporting model 1, the more parsimonious model, no matter what the sample information says.

Comparing models with the same number of regressors (K_1 = K_2) is not safe ground either, as | \boldsymbol{B}_0 | depends on the measurement units of the regressors. Conclusions regarding model selection would then depend on these units, which is not a desirable property. This prevents using non-informative priors when performing model selection in the Bayesian framework. However, this is not the case when \alpha_0 \to 0 and \delta_0 \to 0, which implies a non-informative prior for the variance parameter.^2

We observe that \Gamma(\alpha_{j0}/2) cancels out (as does \delta_{j0}^{\alpha_{j0}/2}, since these hyperparameters are common to both models), \alpha_{jn} \to N, and \delta_{jn} \to ({\boldsymbol{y}} - {\boldsymbol{X}}_j \hat{\boldsymbol{\beta}}_j)^{\top} ({\boldsymbol{y}} - {\boldsymbol{X}}_j \hat{\boldsymbol{\beta}}_j) + (\hat{\boldsymbol{\beta}}_j - \boldsymbol{\beta}_{j0})^{\top} \left( ({\boldsymbol{X}}_j^{\top} {\boldsymbol{X}}_j)^{-1} + \boldsymbol{B}_{j0} \right)^{-1} (\hat{\boldsymbol{\beta}}_j - \boldsymbol{\beta}_{j0}),

therefore, this limit has no adverse effect on the Bayes factor. This is because \sigma^2 is a common parameter in both models.

In general, we can use non-informative priors for common parameters across all models, but we cannot use non-informative priors for non-common parameters when performing model selection using the Bayes factor. This issue raises the question of how to set informative priors. On one hand, we have those who advocate for subjective priors (Ramsey 1926; Finetti 1937; Savage 1954; D. V. Lindley 2000); on the other hand, those who prefer objective priors (T. Bayes 1763; P. Laplace 1812; H. Jeffreys 1961; J. Berger 2006).

Regarding the former, eliciting subjective priors, i.e., “formulating a person’s knowledge and beliefs about one or more uncertain quantities into a (joint) probability distribution for those quantities” (Garthwaite, Kadane, and O’Hagan 2005), is a very difficult task due to human beings’ heuristics and biases associated with representativeness, information availability, conservatism, overconfidence, and anchoring-and-adjustment issues (Tversky and Kahneman 1974). However, there have been good efforts using predictive and structural elicitation procedures (J. B. Kadane 1980; J. Kadane and Wolfson 1998).

Regarding the latter, there are reference priors that are designed to have minimal impact on the posterior distribution and to be invariant to different parametrizations of the model (J. M. Bernardo and Smith 2009). A remarkable example of reference priors is the Jeffreys’ prior (Harold Jeffreys 1946), which originated from the critique of non-informative priors that were not invariant to transformations of the parameter space. In particular, the Jeffreys’ prior is given by: \pi(\boldsymbol{\theta}) \propto |I(\boldsymbol{\theta})|^{1/2},

where I(\boldsymbol{\theta}) = \mathbb{E}\left(-\frac{\partial^2 \log p(\boldsymbol{y} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\top}}\right), i.e., I(\boldsymbol{\theta}) is the Fisher information matrix. However, Jeffreys' prior is often improper, which makes it unsuitable for model selection based on Bayes factors.
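To illustrate the formula, consider a standard single-parameter example (unrelated to the regression model above): an i.i.d. Bernoulli sample y_i\mid \theta \stackrel{iid}{\sim} Bernoulli(\theta), i=1,\dots,N. Then \begin{align*} I(\theta)=\mathbb{E}\left[-\frac{\partial^2 \log p(\boldsymbol{y}\mid \theta)}{\partial \theta^2}\right]=\frac{N}{\theta(1-\theta)}, \qquad \pi(\theta)\propto |I(\theta)|^{1/2}\propto \theta^{-1/2}(1-\theta)^{-1/2}, \end{align*} which is the kernel of a Beta(1/2,1/2) distribution and hence proper in this simple case; for location and scale parameters, however, Jeffreys' prior is typically improper.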

Thus, a standard objective approach is to use intrinsic priors (J. O. Berger and Pericchi 1996), where a minimal training dataset is used with a reference prior to obtain a proper posterior distribution. This proper distribution is then used as a prior, and the standard Bayesian procedures are followed using the remaining dataset. In this way, we end up with meaningful Bayes factors for model selection.

Regardless of using a subjective or objective approach to define a prior distribution, it is always a good idea to assess the sensitivity of the posterior results to the prior assumptions. This is commonly done using local or pointwise assessments, such as partial derivatives (Giordano et al. 2022; Jacobi, Zhu, and Joshi 2022; Gustafson 2000) or, more often, in terms of multiple evaluations (scenario analysis) (Richardson and Green 1997; Kim and Nelson 1999; An and Schorfheide 2007). Recently, Jacobi et al. (2024) extend these approaches to perform sensitivity analysis in high-dimensional hyperparameter settings.
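In the conjugate setting of this section, a minimal scenario-type sensitivity check simply recomputes the posterior mean over a grid of prior scales; the grid values in the sketch below, which continues the simulated example, are arbitrary.

c0Grid <- c(0.1, 1, 10, 100, 1000)                 # Arbitrary grid of prior scale values
SensBeta <- sapply(c0Grid, function(c0) {
    B0c <- c0*diag(K)                              # Prior covariance factor under this scenario
    Bnc <- solve(solve(B0c) + t(X)%*%X)
    Bnc%*%(solve(B0c)%*%b0 + t(X)%*%X%*%bhat)      # Posterior mean of beta under this scenario
})
colnames(SensBeta) <- paste0("c0=", c0Grid)
round(SensBeta, 3)                                 # Compare posterior means across prior scales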

Returning to the linear model, the posterior predictive is equal to \begin{align*} \pi({\boldsymbol{y}}_0\mid {\boldsymbol{y}})&=\int_{0}^{\infty}\int_{R^K}p({\boldsymbol{y}}_0\mid \boldsymbol{\beta},\sigma^2,{\boldsymbol{y}})\pi(\boldsymbol{\beta}\mid \sigma^2,{\boldsymbol{y}})\pi(\sigma^2\mid {\boldsymbol{y}})d\boldsymbol{\beta} d\sigma^2\\ &=\int_{0}^{\infty}\int_{R^K}p({\boldsymbol{y}}_0\mid \boldsymbol{\beta},\sigma^2)\pi(\boldsymbol{\beta}\mid \sigma^2,{\boldsymbol{y}})\pi(\sigma^2\mid {\boldsymbol{y}})d\boldsymbol{\beta} d\sigma^2, \end{align*}

where we take into account the independence between {\boldsymbol{y}}_0 and {\boldsymbol{y}} given the parameters. Letting {\boldsymbol{X}}_0 denote the N_0\times K matrix of regressors associated with {\boldsymbol{y}}_0, we have \begin{align*} \pi({\boldsymbol{y}}_0\mid {\boldsymbol{y}})&=\int_{0}^{\infty}\int_{R^K}\left\{ (2\pi\sigma^2)^{-\frac{N_0}{2}} \exp \left\{-\frac{1}{2\sigma^2} ({\boldsymbol{y}}_0 - {\boldsymbol{X}}_0\boldsymbol{\beta})^{\top}({\boldsymbol{y}}_0 - {\boldsymbol{X}}_0\boldsymbol{\beta}) \right\}\right. \\ & \times (2\pi\sigma^2)^{-\frac{K}{2}} |{\boldsymbol{B}}_n|^{-1/2} \exp \left\{-\frac{1}{2\sigma^2} (\boldsymbol{\beta} - \boldsymbol{\beta}_n)^{\top}{\boldsymbol{B}}_n^{-1}(\boldsymbol{\beta} - \boldsymbol{\beta}_n)\right\} \\ & \left. \times \frac{(\delta_n/2)^{\alpha_n/2}}{\Gamma(\alpha_n/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_n/2+1}\exp \left\{-\frac{\delta_n}{2\sigma^2} \right\}\right\}d\boldsymbol{\beta} d\sigma^2. \\ \end{align*}

Setting {\boldsymbol{M}}=({\boldsymbol{X}}_0^{\top}{\boldsymbol{X}}_0+{\boldsymbol{B}}_n^{-1}) and \boldsymbol{\beta}_*={\boldsymbol{M}}^{-1}({\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n+{\boldsymbol{X}}_0^{\top}{\boldsymbol{y}}_0), we have ({\boldsymbol{y}}_0 - {\boldsymbol{X}}_0\boldsymbol{\beta})^{\top}({\boldsymbol{y}}_0 - {\boldsymbol{X}}_0\boldsymbol{\beta})+(\boldsymbol{\beta} - \boldsymbol{\beta}_n)^{\top}{\boldsymbol{B}}_n^{-1}(\boldsymbol{\beta} - \boldsymbol{\beta}_n)=(\boldsymbol{\beta} - \boldsymbol{\beta}_*)^{\top}{\boldsymbol{M}}(\boldsymbol{\beta} - \boldsymbol{\beta}_*)+\boldsymbol{\beta}_n^{\top}{\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n+{\boldsymbol{y}}_0^{\top}{\boldsymbol{y}}_0-\boldsymbol{\beta}_*^{\top}{\boldsymbol{M}}\boldsymbol{\beta}_*. Thus,

\begin{align*} \pi({\boldsymbol{y}}_0\mid {\boldsymbol{y}})&\propto\int_{0}^{\infty}\left\{\left(\frac{1}{\sigma^2}\right)^{\frac{K+N_0+\alpha_n}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}(\boldsymbol{\beta}_n^{\top}{\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n+{\boldsymbol{y}}_0^{\top}{\boldsymbol{y}}_0-\boldsymbol{\beta}_*^{\top}{\boldsymbol{M}}\boldsymbol{\beta}_*+\delta_n)\right\}\right.\\ &\times\left.\int_{R^K}\exp\left\{-\frac{1}{2\sigma^2}(\boldsymbol{\beta} - \boldsymbol{\beta}_*)^{\top}{\boldsymbol{M}}(\boldsymbol{\beta} - \boldsymbol{\beta}_*)\right\}d\boldsymbol{\beta}\right\} d\sigma^2,\\ \end{align*}

where the term in the second integral is the kernel of a multivariate normal density with mean \boldsymbol{\beta}_* and covariance matrix \sigma^2{\boldsymbol{M}}^{-1}. Then, \begin{align*} \pi({\boldsymbol{y}}_0\mid {\boldsymbol{y}})&\propto\int_{0}^{\infty}\left(\frac{1}{\sigma^2}\right)^{\frac{N_0+\alpha_n}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}(\boldsymbol{\beta}_n^{\top}{\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n+{\boldsymbol{y}}_0^{\top}{\boldsymbol{y}}_0-\boldsymbol{\beta}_*^{\top}{\boldsymbol{M}}\boldsymbol{\beta}_*+\delta_n)\right\}d\sigma^2,\\ \end{align*}

which is the kernel of an inverse gamma density. Thus, \begin{align*} \pi({\boldsymbol{y}}_0\mid {\boldsymbol{y}})&\propto \left[\frac{\boldsymbol{\beta}_n^{\top}{\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n+{\boldsymbol{y}}_0^{\top}{\boldsymbol{y}}_0-\boldsymbol{\beta}_*^{\top}{\boldsymbol{M}}\boldsymbol{\beta}_*+\delta_n}{2}\right]^{-\frac{\alpha_n+N_0}{2}}. \end{align*}

Setting {\boldsymbol{C}}^{-1}={\boldsymbol{I}}_{N_0}+{\boldsymbol{X}}_0{\boldsymbol{B}}_n{\boldsymbol{X}}_0^{\top} such that {\boldsymbol{C}}={\boldsymbol{I}}_{N_0}-{\boldsymbol{X}}_0({\boldsymbol{B}}_n^{-1}+{\boldsymbol{X}}_0^{\top}{\boldsymbol{X}}_0)^{-1}{\boldsymbol{X}}_0^{\top}={\boldsymbol{I}}_{N_0}-{\boldsymbol{X}}_0{\boldsymbol{M}}^{-1}{\boldsymbol{X}}_0^{\top}, and {\boldsymbol{\boldsymbol{\beta}}}_{**}={\boldsymbol{C}}^{-1}{\boldsymbol{X}}_0{\boldsymbol{M}}^{-1}{\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n, then

\begin{align*} \boldsymbol{\beta}_n^{\top}{\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n+{\boldsymbol{y}}_0^{\top}{\boldsymbol{y}}_0-\boldsymbol{\beta}_*^{\top}{\boldsymbol{M}}\boldsymbol{\beta}_*&= \boldsymbol{\beta}_n^{\top}{\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n+{\boldsymbol{y}}_0^{\top}{\boldsymbol{y}}_0-(\boldsymbol{\beta}_n^{\top}{\boldsymbol{B}}_n^{-1}+{\boldsymbol{y}}_0^{\top}{\boldsymbol{X}}_0){\boldsymbol{M}}^{-1}({\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n+{\boldsymbol{X}}_0^{\top}{\boldsymbol{y}}_0)\\ &=\boldsymbol{\beta}_n^{\top}({\boldsymbol{B}}_n^{-1}-{\boldsymbol{B}}_n^{-1}{\boldsymbol{M}}^{-1}{\boldsymbol{B}}_n^{-1})\boldsymbol{\beta}_n+{\boldsymbol{y}}_0^{\top}{\boldsymbol{C}}{\boldsymbol{y}}_0\\ &-2{\boldsymbol{y}}_0^{\top}{\boldsymbol{C}}{\boldsymbol{C}}^{-1}{\boldsymbol{X}}_0{\boldsymbol{M}}^{-1}{\boldsymbol{B}}_n^{-1}\boldsymbol{\beta}_n+{\boldsymbol{\boldsymbol{\beta}}}_{**}^{\top}{\boldsymbol{C}}{\boldsymbol{\boldsymbol{\beta}}}_{**}-{\boldsymbol{\boldsymbol{\beta}}}_{**}^{\top}{\boldsymbol{C}}{\boldsymbol{\boldsymbol{\beta}}}_{**}\\ &=\boldsymbol{\beta}_n^{\top}({\boldsymbol{B}}_n^{-1}-{\boldsymbol{B}}_n^{-1}{\boldsymbol{M}}^{-1}{\boldsymbol{B}}_n^{-1})\boldsymbol{\beta}_n+({\boldsymbol{y}}_0-{\boldsymbol{\boldsymbol{\beta}}}_{**})^{\top}{\boldsymbol{C}}({\boldsymbol{y}}_0-{\boldsymbol{\boldsymbol{\beta}}}_{**})\\ &-{\boldsymbol{\boldsymbol{\beta}}}_{**}^{\top}{\boldsymbol{C}}{\boldsymbol{\boldsymbol{\beta}}}_{**}, \end{align*}

where \boldsymbol{\beta}_n^{\top}({\boldsymbol{B}}_n^{-1}-{\boldsymbol{B}}_n^{-1}{\boldsymbol{M}}^{-1}{\boldsymbol{B}}_n^{-1})\boldsymbol{\beta}_n={\boldsymbol{\boldsymbol{\beta}}}_{**}^{\top}{\boldsymbol{C}}{\boldsymbol{\boldsymbol{\beta}}}_{**} and \boldsymbol{\beta}_{**}={\boldsymbol{X}}_0\boldsymbol{\beta}_n (see Exercise 8).
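Using the quantities from the earlier simulated sketch and an arbitrary (hypothetical) out-of-sample design matrix X0, these identities can be checked numerically.

N0 <- 4                                            # Hypothetical number of out-of-sample observations
X0 <- cbind(1, matrix(rnorm(N0*(K - 1)), N0, K - 1))   # Hypothetical out-of-sample regressors
M <- t(X0)%*%X0 + solve(Bn)                        # M = X_0'X_0 + B_n^(-1)
Cinv <- diag(N0) + X0%*%Bn%*%t(X0)                 # C^(-1) = I_{N_0} + X_0 B_n X_0'
bss <- Cinv%*%X0%*%solve(M, solve(Bn, bn))         # beta_** = C^(-1) X_0 M^(-1) B_n^(-1) beta_n
max(abs(bss - X0%*%bn))                            # Numerically zero: beta_** = X_0 beta_n
c(t(bn)%*%(solve(Bn) - solve(Bn)%*%solve(M)%*%solve(Bn))%*%bn,
  t(bss)%*%solve(Cinv)%*%bss)                      # The two quadratic forms coincide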

Then, \begin{align*} \pi({\boldsymbol{y}}_0\mid {\boldsymbol{y}})&\propto\left[\frac{({\boldsymbol{y}}_0-{\boldsymbol{X}}_0\boldsymbol{\beta}_n)^{\top}{\boldsymbol{C}}({\boldsymbol{y}}_0-{\boldsymbol{X}}_0\boldsymbol{\beta}_n)+\delta_n}{2}\right]^{-\frac{\alpha_n+N_0}{2}}\\ &\propto\left[\frac{({\boldsymbol{y}}_0-{\boldsymbol{X}}_0\boldsymbol{\beta}_n)^{\top}\left(\frac{{\boldsymbol{C}}\alpha_n}{\delta_n}\right)({\boldsymbol{y}}_0-{\boldsymbol{X}}_0\boldsymbol{\beta}_n)}{\alpha_n}+1\right]^{-\frac{\alpha_n+N_0}{2}}. \end{align*}

The posterior predictive is a multivariate t distribution, {\boldsymbol{y}}_0\mid {\boldsymbol{y}}\sim t_{N_0}\left(\alpha_n, {\boldsymbol{X}}_0\boldsymbol{\beta}_n,\frac{\delta_n({\boldsymbol{I}}_{N_0}+{\boldsymbol{X}}_0{\boldsymbol{B}}_n{\boldsymbol{X}}_0^{\top})}{\alpha_n}\right), centered at {\boldsymbol{X}}_0\boldsymbol{\beta}_n.

Example: Demand for electricity

In this example, we study the determinants of the monthly demand for electricity by Colombian households. The data consist of information from 2103 households, including the following variables: the average price (USD/kWh), indicators of the socioeconomic conditions of the neighborhood where the household is located (with IndSocio1 being the lowest and IndSocio3 being the highest), an indicator for whether the household is located in a municipality more than 1000 meters above sea level, the number of rooms in the house, the number of household members, the presence of children in the household (where 1 indicates yes), and the monthly income (USD). The specification is as follows: \begin{align*} \log(\text{Electricity}_i) & = \beta_1\log(\text{price}_i) + \beta_2\text{IndSocio1}_i + \beta_3\text{IndSocio2}_i + \beta_4\text{Altitude}_i \\ & + \beta_5\text{Nrooms}_i + \beta_6\text{HouseholdMem}_i + \beta_7\text{Children}_i\\ & + \beta_8\log(\text{Income}_i) + \beta_9 + \mu_i. \end{align*}

We use a vague prior setting such that \alpha_0=\delta_0=0.001, \boldsymbol{\beta}_0=\boldsymbol{0}, and \boldsymbol{B}_0=c_0\boldsymbol{I}_k, where c_0=1000 and k is the number of regressors.

The results from the R code (see below) indicate that the posterior mean of the own-price elasticity of electricity demand is -1.09, and the 95% symmetric credible interval is (-1.47, -0.71). Households in neighborhoods with low socioeconomic conditions and those located in municipalities situated more than 1000 meters above sea level consume less electricity, with reductions of 32.7% and 19.7% on average, respectively. An additional room leads to an 8.7% increase in electricity consumption, and each additional household member increases consumption by 5.9% on average. The mean estimate for income elasticity is 0.074, meaning that a 10% increase in income results in a 0.74% increase in electricity demand.

We want to check the results of the Bayes factor comparing the previous specification (model 1) with an alternative specification that excludes the price of electricity (model 2), that is, \begin{align*} \log(\text{Electricity}_i) & = \beta_1\text{IndSocio1}_i + \beta_2\text{IndSocio2}_i + \beta_3\text{Altitude}_i + \beta_4\text{Nrooms}_i\\ & + \beta_5\text{HouseholdMem}_i + \beta_6\text{Children}_i + \beta_7\log(\text{Income}_i)\\ & + \beta_8 + \mu_i. \end{align*}

In particular, we examine what happens as c_0 increases from 10^{0} to 10^{20}. We observe that when c_0 = 1, BF_{12} = 8.68 \times 10^{+16}, which indicates very strong evidence in favor of the model including the price of electricity. However, as c_0 increases, the Bayes factor decreases, which suggests evidence supporting model 2. For instance, when c_0 = 10^{20}, BF_{12} = 3.11 \times 10^{-4}. This is an example of the issue with using non-informative priors to calculate the Bayes factor: there is very strong evidence supporting the parsimonious model as c_0 \rightarrow \infty.

We can obtain the posterior predictive distribution of the monthly electricity demand for a household in the lowest socioeconomic condition, located in a municipality below 1000 meters above sea level, with 2 rooms, 3 members (including children), a monthly income of USD 500, and an electricity price of USD 0.15/kWh. The following figure shows the histogram of the posterior predictive distribution. The 95% highest posterior density credible interval is between 44.4 kWh and 373.9 kWh, and the posterior mean is 169.4 kWh.

rm(list = ls())
set.seed(010101)
# Electricity demand
DataUt <- read.csv("https://raw.githubusercontent.com/besmarter/BSTApp/refs/heads/master/DataApp/Utilities.csv", sep = ",", header = TRUE, quote = "")
library(dplyr)
DataUtEst <- DataUt %>% 
    filter(Electricity != 0)
attach(DataUtEst)
# Dependent variable: Monthly consumption (kWh) in log
Y <- log(Electricity) 
# Regressors quantity including intercept
X <- cbind(LnPriceElect, IndSocio1, IndSocio2, Altitude, Nrooms, HouseholdMem, Children, Lnincome, 1)
# LnPriceElect: Price per kWh (USD) in log
# IndSocio1, IndSocio2, IndSocio3: Indicators of socioeconomic condition, (1) is the lowest and (3) the highest
# Altitude: Indicator of household location (1 if more than 1000 meters above sea level)
# Nrooms: Number of rooms in house
# HouseholdMem: Number of household members
# Children: Indicator for presence of children in household (1 indicates yes)
# Lnincome: Monthly income (USD) in log
k <- dim(X)[2]
N <- dim(X)[1]
# Hyperparameters
d0 <- 0.001
a0 <- 0.001
b0 <- rep(0, k)
B0 <- 1000*diag(k)
# Posterior parameters
bhat <- solve(t(X)%*%X)%*%t(X)%*%Y
Bn <- as.matrix(Matrix::forceSymmetric(solve(solve(B0) + t(X)%*%X))) # Force this matrix to be symmetric
bn <- Bn%*%(solve(B0)%*%b0 + t(X)%*%X%*%bhat)
dn <- as.numeric(d0 + t(Y)%*%Y+t(b0)%*%solve(B0)%*%b0-t(bn)%*%solve(Bn)%*%bn)
an <- a0 + N
Hn <- Bn*dn/an
# Posterior draws
S <- 10000 # Number of draws from posterior distributions
sig2 <- MCMCpack::rinvgamma(S,an/2,dn/2)
summary(coda::mcmc(sig2))
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##      2.361e-01      7.617e-03      7.617e-05      7.617e-05 
## 
## 2. Quantiles for each variable:
## 
##   2.5%    25%    50%    75%  97.5% 
## 0.2217 0.2309 0.2360 0.2412 0.2513
Betas <- LaplacesDemon::rmvt(S, bn, Hn, an)
summary(coda::mcmc(Betas))
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                  Mean      SD  Naive SE Time-series SE
## LnPriceElect -1.09043 0.19459 0.0019459      0.0019459
## IndSocio1    -0.32783 0.05294 0.0005294      0.0005294
## IndSocio2    -0.05737 0.04557 0.0004557      0.0004557
## Altitude     -0.19780 0.02386 0.0002386      0.0002429
## Nrooms        0.08731 0.01094 0.0001094      0.0001119
## HouseholdMem  0.05987 0.01334 0.0001334      0.0001334
## Children      0.05696 0.03043 0.0003043      0.0003043
## Lnincome      0.07447 0.01223 0.0001223      0.0001223
##               2.52296 0.35077 0.0035077      0.0035077
## 
## 2. Quantiles for each variable:
## 
##                   2.5%      25%      50%      75%    97.5%
## LnPriceElect -1.472069 -1.22432 -1.08961 -0.95703 -0.71429
## IndSocio1    -0.435957 -0.36228 -0.32731 -0.29133 -0.22588
## IndSocio2    -0.147252 -0.08744 -0.05757 -0.02650  0.03254
## Altitude     -0.244759 -0.21372 -0.19783 -0.18164 -0.15094
## Nrooms        0.066432  0.07985  0.08709  0.09480  0.10864
## HouseholdMem  0.033623  0.05089  0.05975  0.06889  0.08596
## Children     -0.002259  0.03637  0.05698  0.07736  0.11681
## Lnincome      0.050536  0.06614  0.07449  0.08283  0.09852
##               1.835506  2.28703  2.52165  2.76364  3.21199
# Log marginal likelihood function (returns the negative value; multiply by -1 to recover it)
LogMarLikLM <- function(X, c0){
    k <- dim(X)[2]
    N <- dim(X)[1]  
    # Hyperparameters
    B0 <- c0*diag(k)
    b0 <- rep(0, k)
    # Posterior parameters
    bhat <- solve(t(X)%*%X)%*%t(X)%*%Y
    # Force this matrix to be symmetric
    Bn <- as.matrix(Matrix::forceSymmetric(solve(solve(B0) + t(X)%*%X))) 
    bn <- Bn%*%(solve(B0)%*%b0 + t(X)%*%X%*%bhat)
    dn <- as.numeric(d0 + t(Y)%*%Y+t(b0)%*%solve(B0)%*%b0-t(bn)%*%solve(Bn)%*%bn)
    an <- a0 + N
    # Log marginal likelihood
    logpy <- (N/2)*log(1/pi)+(a0/2)*log(d0)-(an/2)*log(dn) + 0.5*log(det(Bn)/det(B0)) + lgamma(an/2)-lgamma(a0/2)
    return(-logpy)
}
cs <- c(10^0, 10^3, 10^6, 10^10, 10^12, 10^15, 10^20)
# Observe -1 to recover the right sign
LogML <- sapply(cs, function(c) {-LogMarLikLM(c0=c, X = X)}) 
# Regressor without price
Xnew <- cbind(IndSocio1, IndSocio2, Altitude, Nrooms, HouseholdMem, Children, Lnincome, 1)
# Observe -1 to recover the right sign
LogMLnew <- sapply(cs, function(c) {-LogMarLikLM(c0=c,X = Xnew)})
# Bayes factor
BF <- exp(LogML - LogMLnew)
BF
## [1] 8.687289e+16 1.006665e+05 3.108374e+03 3.108299e+01 3.108303e+00
## [6] 9.829315e-02 3.108302e-04
# Predictive distribution
Xpred <- c(log(0.15), 1, 0, 0, 2, 3, 1, log(500), 1)
Mean <- Xpred%*%bn
Hn <- dn*(1+t(Xpred)%*%Bn%*%Xpred)/an
ExpKwH <- exp(LaplacesDemon::rmvt(S, Mean, Hn, an))
summary(ExpKwH)
##        V1         
##  Min.   :  27.32  
##  1st Qu.: 119.94  
##  Median : 168.74  
##  Mean   : 190.18  
##  3rd Qu.: 236.20  
##  Max.   :1032.62
HDI <- HDInterval::hdi(ExpKwH, credMass = 0.95) # Highest posterior density credible interval
HDI
##            [,1]
## lower  47.94745
## upper 387.64843
## attr(,"credMass")
## [1] 0.95
hist(ExpKwH, main = "Histogram: Monthly demand of electricity", xlab = "Monthly kWh", col = "blue", breaks = 50)

References

An, Sungbae, and Frank Schorfheide. 2007. “Bayesian Analysis of DSGE Models.” Econometric Reviews 26 (2-4): 113–72.
Bayes, T. 1763. “An Essay Towards Solving a Problem in the Doctrine of Chances.” Philosophical Transactions of the Royal Society of London 53: 370–416.
Berger, James O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
Berger, James O, and Luis R Pericchi. 1996. “The Intrinsic Bayes Factor for Model Selection and Prediction.” Journal of the American Statistical Association 91 (433): 109–22.
Bernardo, José M, and Adrian FM Smith. 2009. Bayesian Theory. Vol. 405. John Wiley & Sons.
Finetti, Bruno de. 1937. “Foresight: Its Logical Laws, Its Subjective Sources.” In Studies in Subjective Probability, edited by H. E. Kyburg and H. E. Smokler. New York: Krieger.
Garthwaite, P., J. Kadane, and A. O’Hagan. 2005. “Statistical Methods for Eliciting Probability Distributions.” Journal of the American Statistical Association 100 (470): 680–701.
Giordano, Ryan, Runjing Liu, Michael I Jordan, and Tamara Broderick. 2022. “Evaluating Sensitivity to the Stick-Breaking Prior in Bayesian Nonparametrics.” Bayesian Analysis 1 (1): 1–34.
Gustafson, Paul. 2000. “Local Robustness in Bayesian Analysis.” In Robust Bayesian Analysis, 71–88. Springer.
Jacobi, Liana, Chun Fung Kwok, Andrés Ramı́rez-Hassan, and Nhung Nghiem. 2024. “Posterior Manifolds over Prior Parameter Regions: Beyond Pointwise Sensitivity Assessments for Posterior Statistics from MCMC Inference.” Studies in Nonlinear Dynamics & Econometrics 28 (2): 403–34.
Jacobi, Liana, Dan Zhu, and Mark Joshi. 2022. “Estimating Posterior Sensitivities with Application to Structural Analysis of Bayesian Vector Autoregressions.” Available at SSRN 3347399. https://ssrn.com/abstract=3347399 or http://dx.doi.org/10.2139/ssrn.3347399.
Jeffreys, H. 1961. Theory of Probability. London: Oxford University Press.
Jeffreys, Harold. 1946. “An Invariant Form for the Prior Probability in Estimation Problems.” Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences 186 (1007): 453–61.
Kadane, J. B. 1980. “Predictive and Structural Methods for Eliciting Prior Distributions.” In Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Harold Jeffreys, edited by A. Zellner, 89–93. Amsterdam: North-Holland Publishing Company.
Kadane, Joseph, and Lara Wolfson. 1998. “Experiences in Elicitation.” The Statistician 47 (1): 3–19.
Kim, Chang-Jin, and Charles R Nelson. 1999. “Has the US Economy Become More Stable? A Bayesian Approach Based on a Markov-Switching Model of the Business Cycle.” Review of Economics and Statistics 81 (4): 608–16.
Laplace, P. 1812. Théorie Analytique Des Probabilités. Courcier.
Lindley, D. V. 2000. “The Philosophy of Statistics.” The Statistician 49 (3): 293–337.
Ramsey, F. 1926. “Truth and Probability.” In The Foundations of Mathematics and Other Logical Essays. London: Kegan Paul; New York: Harcourt, Brace and Company.
Richardson, Sylvia, and Peter J Green. 1997. “On Bayesian Analysis of Mixtures with an Unknown Number of Components (with Discussion).” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (4): 731–92.
Savage, L. J. 1954. The Foundations of Statistics. New York: John Wiley & Sons, Inc.
Smith, A. F. M. 1973. “A General Bayesian Linear Model.” Journal of the Royal Statistical Society. Series B (Methodological) 35 (1): 67–75.
Tversky, A., and D. Kahneman. 1974. “Judgement Under Uncertainty: Heuristics and Biases.” Science 185: 1124–31.

  1. A particular case of the Woodbury matrix identity, (\boldsymbol{A}+\boldsymbol{U}\boldsymbol{C}\boldsymbol{V})^{-1}=\boldsymbol{A}^{-1}-\boldsymbol{A}^{-1}\boldsymbol{U}(\boldsymbol{C}^{-1}+\boldsymbol{V}\boldsymbol{A}^{-1}\boldsymbol{U})^{-1}\boldsymbol{V}\boldsymbol{A}^{-1}.↩︎

  2. See for advice against this common practice.↩︎