4.2 Conjugate prior to exponential family

Theorem 4.2.1

The prior distribution \(\pi(\mathbf{\theta})\propto C(\mathbf{\theta})^{b_0}\exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{a}_0\right\}\) is conjugate to the exponential family (equation (4.4)).

Proof

\[\begin{align} \pi(\mathbf{\theta}|\mathbf{y})& \propto C(\mathbf{\theta})^{b_0}\exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{a}_0\right\} \times h(\mathbf{y}) C(\mathbf{\theta})^N\exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{T}(\mathbf{y})\right\}\\ & \propto C(\mathbf{\theta})^{N+b_0} \exp\left\{\eta(\mathbf{\theta})^{\top}\left(\mathbf{T}(\mathbf{y})+\mathbf{a}_0\right)\right\}. \end{align}\]

Observe that the posterior has the same form as the prior, \(\pi(\mathbf{\theta}|\mathbf{y})\propto C(\mathbf{\theta})^{\beta_n} \exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{\alpha}_n\right\}\), with \(\beta_n=N+b_0\) and \(\mathbf{\alpha}_n=\mathbf{T}(\mathbf{y})+\mathbf{a}_0\), which establishes conjugacy.

Remarks

Comparing the prior with the likelihood, we see that \(b_0\) plays the role of a hypothetical sample size and \(\mathbf{a}_0\) plays the role of hypothetical sufficient statistics. This view helps the elicitation process.

In addition, we established the result in the standard form of the exponential family; the result can also be established in the canonical form. Observe that, given \(\mathbf{\eta}=\mathbf{\eta}(\mathbf{\theta})\), another way to obtain a prior for \(\mathbf{\eta}\) is to use the change of variables theorem, provided the mapping is bijective.

In the setting of a regular conjugate prior, Diaconis and Ylvisaker (1979) show that the posterior expectation of the sufficient statistics is a weighted average of the prior expectation and the maximum likelihood estimate.
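
As an illustration of the theorem, the following minimal sketch (in Python with NumPy; all names and numbers are purely illustrative assumptions) expresses conjugate updating as bookkeeping on the hyperparameters: the data enter only through the sufficient statistic \(\mathbf{T}(\mathbf{y})\) and the sample size \(N\).

```python
# A minimal sketch of Theorem 4.2.1 as an update rule: the conjugate prior is indexed
# by (a_0, b_0), and conditioning on the data only shifts these hyperparameters by the
# sufficient statistic T(y) and the sample size N. All names and values are illustrative.
import numpy as np

def conjugate_update(a0, b0, T_y, N):
    """Return the posterior hyperparameters (a_n, b_n) = (a_0 + T(y), b_0 + N)."""
    return np.asarray(a0, dtype=float) + np.asarray(T_y, dtype=float), b0 + N

# Poisson case: T(y) = sum(y), so (a_0, b_0) -> (a_0 + sum(y), b_0 + N).
y = np.array([2, 0, 3, 1])
a_n, b_n = conjugate_update(a0=1.0, b0=1.0, T_y=y.sum(), N=y.size)
print(a_n, b_n)
```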

Examples: Theorem 4.2.1

  1. Likelihood functions from discrete distributions
  • The Poisson-gamma model

Given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\) from a Poisson distribution, a conjugate prior density for \(\lambda\) has the form

\[\begin{align} \pi(\lambda)&\propto \left(\exp(-\lambda)\right)^{b_0} \exp\left\{a_0\log(\lambda)\right\}\\ & = \exp(-\lambda b_0) \lambda^{a_0}\\ & = \exp(-\lambda \beta_0) \lambda^{\alpha_0-1}. \end{align}\]

This is the kernel of a gamma density in the rate parametrization, \(G(\alpha_0,\beta_0)\), with \(\alpha_0=a_0+1\) and \(\beta_0=b_0\). Then, a conjugate prior distribution for the Poisson likelihood is a gamma distribution.

Taking into account that \(\sum_{i=1}^N y_i\) is a sufficient statistic for the Poisson distribution, then we can think about \(a_0\) as the number of occurrences in \(b_0\) experiments. Observe that

\[\begin{align} \pi(\lambda|\mathbf{y})&\propto \exp(-\lambda \beta_0) \lambda^{\alpha_0-1} \times \exp(-N\lambda)\lambda^{\sum_{i=1}^Ny_i}\\ &= \exp(-\lambda(N+\beta_0)) \lambda^{\sum_{i=1}^Ny_i+\alpha_0-1}. \end{align}\]

As expected, this is the kernel of a gamma distribution, which means \(\lambda|\mathbf{y}\sim G(\alpha_n,\beta_n)\), \(\alpha_n=\sum_{i=1}^Ny_i+\alpha_0\) and \(\beta_n=N+\beta_0\).

Observe that \(\alpha_0/\beta_0\) is the prior mean, and \(\alpha_0/\beta_0^2\) is the prior variance. Then, \(\alpha_0\rightarrow 0\) and \(\beta_0\rightarrow 0\) imply a non-informative prior such that the posterior mean converges to the maximum likelihood estimator \(\bar{y}=\frac{\sum_{i=1}^N y_i}{N}\),

\[\begin{align} \mathbb{E}\left[\lambda|\mathbf{y}\right]&=\frac{\alpha_n}{\beta_n}\\ &=\frac{\sum_{i=1}^Ny_i+\alpha_0}{N+\beta_0}\\ &=\frac{N\bar{y}}{N+\beta_0}+\frac{\alpha_0}{N+\beta_0}. \end{align}\]

The posterior mean is a weighted average between sample and prior information. This is a general result for regular conjugate priors (Diaconis and Ylvisaker 1979). Observe that \(\mathbb{E}\left[\lambda|\mathbf{y}\right]\rightarrow\bar{y}\) as \(N\rightarrow\infty\).

In addition, \(\alpha_0\rightarrow 0\) and \(\beta_0\rightarrow 0\) correspond to \(\pi(\lambda)\propto \frac{1}{\lambda}\), which is an improper prior. Improper priors have undesirable consequences for Bayes factors (hypothesis testing). In this setting, we can obtain analytical solutions for the marginal likelihood and the predictive distribution (see the health insurance example and Exercise 3 in Chapter 1).
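
A minimal numerical sketch of the Poisson-gamma update, assuming SciPy and purely hypothetical counts and hyperparameters, verifies that the posterior mean is the weighted average derived above.

```python
# A minimal sketch of the Poisson-gamma update with hypothetical counts y and
# hypothetical prior hyperparameters (alpha0, beta0) in the rate parametrization.
import numpy as np
from scipy import stats

y = np.array([2, 0, 3, 1, 4])              # hypothetical Poisson counts
alpha0, beta0 = 2.0, 1.0                    # prior: lambda ~ G(alpha0, beta0)

alpha_n = alpha0 + y.sum()                  # posterior shape
beta_n = beta0 + y.size                     # posterior rate

posterior = stats.gamma(a=alpha_n, scale=1.0 / beta_n)
post_mean = alpha_n / beta_n
w = y.size / (y.size + beta0)               # weight on the sample information
print(post_mean, w * y.mean() + (1 - w) * alpha0 / beta0)   # identical numbers
print(posterior.interval(0.95))             # 95% credible interval for lambda
```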

  • The Bernoulli-beta model

Given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\) from a Bernoulli distribution, a conjugate prior density for \(\theta\) has the form

\[\begin{align} \pi(\theta)&\propto (1-\theta)^{b_0} \exp\left\{a_0\log\left(\frac{\theta}{1-\theta}\right)\right\}\\ & = (1-\theta)^{b_0-a_0}\theta^{a_0}\\ & = \theta^{\alpha_0-1}(1-\theta)^{\beta_0-1}. \end{align}\]

This is the kernel of a beta density, \(B(\alpha_0,\beta_0)\), with \(\alpha_0=a_0+1\) and \(\beta_0=b_0-a_0+1\). Thus, a conjugate prior distribution for the Bernoulli likelihood is a beta distribution. Given that \(b_0\) is the hypothetical sample size and \(a_0\) is the hypothetical sufficient statistic, that is, the number of successes, then \(b_0-a_0\) is the number of failures. This implies that \(\alpha_0\) is the number of prior successes plus one, and \(\beta_0\) is the number of prior failures plus one. Given that the mode of a beta distributed random variable is \(\frac{\alpha_0-1}{\alpha_0+\beta_0-2}=\frac{a_0}{b_0}\), the prior mode is the a priori probability of success. Setting \(\alpha_0=1\) and \(\beta_0=1\), which implies a uniform distribution on \((0,1)\), corresponds to a setting with 0 successes (and 0 failures) in 0 experiments.

Observe that

\[\begin{align} \pi(\theta|\mathbf{y})&\propto \theta^{\alpha_0-1}(1-\theta)^{\beta_0-1} \times \theta^{\sum_{i=1}^N y_i}(1-\theta)^{N-\sum_{i=1}^Ny_i}\\ &= \theta^{\alpha_0+\sum_{i=1}^N y_i-1}(1-\theta)^{\beta_0+N-\sum_{i=1}^Ny_i-1}. \end{align}\]

The posterior distribution is beta, \(\theta|\mathbf{y}\sim B(\alpha_n,\beta_n)\), with \(\alpha_n=\alpha_0+\sum_{i=1}^N y_i\) and \(\beta_n=\beta_0+N-\sum_{i=1}^Ny_i\). The posterior mean is \(\mathbb{E}[\theta|\mathbf{y}]=\frac{\alpha_n}{\alpha_n+\beta_n}=\frac{\alpha_0+N\bar{y}}{\alpha_0+\beta_0+N}=\frac{\alpha_0+\beta_0}{\alpha_0+\beta_0+N}\frac{\alpha_0}{\alpha_0+\beta_0}+\frac{N}{\alpha_0+\beta_0+N}\bar{y}\), a weighted average between the prior mean and the maximum likelihood estimate.
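
The following sketch, with hypothetical 0/1 data and a uniform prior (both are assumptions for illustration), checks numerically that the posterior mean equals the weighted average of the prior mean and \(\bar{y}\).

```python
# A minimal sketch of the Bernoulli-beta update, with hypothetical 0/1 data and a
# uniform prior (alpha0 = beta0 = 1); all values are illustrative.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])            # hypothetical Bernoulli draws
alpha0, beta0 = 1.0, 1.0                     # B(1, 1): uniform prior on theta

alpha_n = alpha0 + y.sum()                   # prior "successes" + observed successes
beta_n = beta0 + y.size - y.sum()            # prior "failures" + observed failures

post_mean = alpha_n / (alpha_n + beta_n)
w = (alpha0 + beta0) / (alpha0 + beta0 + y.size)   # weight on the prior mean
print(post_mean, w * alpha0 / (alpha0 + beta0) + (1 - w) * y.mean())
```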

The marginal likelihood in this setting is

\[\begin{align} p(\mathbf{y})=&\int_{0}^1 \frac{\theta^{\alpha_0-1}(1-\theta)^{\beta_0-1}}{B(\alpha_0,\beta_0)}\times \theta^{\sum_{i=1}^N y_i}(1-\theta)^{N-\sum_{i=1}^N y_i}d\theta\\ =& \frac{B(\alpha_n,\beta_n)}{B(\alpha_0,\beta_0)}, \end{align}\]

where \(B(\cdot ,\cdot)\) is the beta function.

In addition, the predictive density is

\[\begin{align} p(Y_0|\mathbf{y})&=\int_0^1 \theta^{y_0}(1-\theta)^{1-y_0}\times \frac{\theta^{\alpha_n-1}(1-\theta)^{\beta_n-1}}{B(\alpha_n,\beta_n)}d\theta\\ &=\frac{B(\alpha_n+y_0,\beta_n+1-y_0)}{B(\alpha_n,\beta_n)}\\ &=\frac{\Gamma(\alpha_n+\beta_n)\Gamma(\alpha_n+y_0)\Gamma(\beta_n+1-y_0)}{\Gamma(\alpha_n+\beta_n+1)\Gamma(\alpha_n)\Gamma(\beta_n)}\\ &=\begin{cases} \frac{\alpha_n}{\alpha_n+\beta_n}, & y_0=1\\ \frac{\beta_n}{\alpha_n+\beta_n}, & y_0=0. \end{cases} \end{align}\]

This is a Bernoulli distribution with probability of success equal to \(\frac{\alpha_n}{\alpha_n+\beta_n}\).
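
A short sketch of the marginal likelihood and the predictive probability, computed with the log-beta function for numerical stability (SciPy and the hypothetical data above are assumptions):

```python
# Marginal likelihood and one-step-ahead predictive probability for the Bernoulli-beta
# model, coded directly from the beta-function expressions above (hypothetical data).
import numpy as np
from scipy.special import betaln

y = np.array([1, 0, 1, 1, 0, 1])
alpha0, beta0 = 1.0, 1.0
alpha_n = alpha0 + y.sum()
beta_n = beta0 + y.size - y.sum()

log_marginal = betaln(alpha_n, beta_n) - betaln(alpha0, beta0)              # log p(y)
p_y0_one = np.exp(betaln(alpha_n + 1.0, beta_n) - betaln(alpha_n, beta_n))  # p(Y0 = 1 | y)
print(np.exp(log_marginal), p_y0_one, alpha_n / (alpha_n + beta_n))         # last two coincide
```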

  • The multinomial-Dirichlet model

Given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\) from a multinomial distribution, a conjugate prior density for \(\mathbf{\theta}=\left[\theta_1,\theta_2,\dots,\theta_m\right]\) has the form

\[\begin{align} \pi(\mathbf{\theta})&\propto \theta_m^{b_0} \exp\left\{\mathbf{\eta}(\mathbf{\theta})^{\top}\mathbf{a}_0\right\}\\ & = \prod_{l=1}^{m-1}\theta_l^{a_{0l}}\theta_m^{b_0-\sum_{l=1}^{m-1}a_{0l}}\\ & = \prod_{l=1}^{m}\theta_l^{\alpha_{0l}-1}, \end{align}\]

where \(\mathbf{\eta}(\mathbf{\theta})=\left[\log\left(\frac{\theta_1}{\theta_m}\right),\dots,\log\left(\frac{\theta_{m-1}}{\theta_m}\right)\right]\), \(\mathbf{a}_0=\left[a_{01},\dots,a_{0m-1}\right]^{\top}\), \(\mathbf{\alpha}_0=\left[\alpha_{01},\alpha_{02},\dots,\alpha_{0m}\right]\), \(\alpha_{0l}=a_{0l}+1\), \(l=1,2,\dots,m-1\) and \(\alpha_{0m}=b_0-\sum_{l=1}^{m-1} a_{0l}+1\).

This is the kernel of a Dirichlet distribution, that is, the prior distribution is \(D(\mathbf{\alpha}_0)\).

Observe that \(a_{0l}\) is the hypothetical number of times outcome \(l\) is observed in the \(b_0\) hypothetical trials. Setting \(\alpha_{0l}=1\) for all \(l\), that is, a uniform distribution over the open standard simplex, implicitly sets \(a_{0l}=0\), which means that there are 0 occurrences of category \(l\) in \(b_0=0\) experiments.

The posterior distribution of the multinomial-Dirichlet model is given by

\[\begin{align} \pi(\mathbf{\theta}|\mathbf{y})&\propto \prod_{l=1}^m \theta_l^{\alpha_{0l}-1}\times\prod_{l=1}^m \theta_l^{\sum_{i=1}^{N} y_{il}}\\ &=\prod_{l=1}^m \theta_l^{\alpha_{0l}+\sum_{i=1}^{N} y_{il}-1} \end{align}\]

This is the kernel of a Dirichlet distribution \(D(\mathbf{\alpha}_n)\), \(\mathbf{\alpha}_n=\left[\alpha_{n1},\alpha_{n2},\dots,\alpha_{nm}\right]\), \(\alpha_{nl}=\alpha_{0l}+\sum_{i=1}^{N}y_{il}\), \(l=1,2,\dots,m\). Observe that

\[\begin{align} \mathbb{E}[\theta_{j}|\mathbf{y}]&=\frac{\alpha_{nj}}{\sum_{l=1}^m \left[\alpha_{0l}+\sum_{i=1}^N y_{il}\right]}\\ &=\frac{\sum_{l=1}^m \alpha_{0l}}{\sum_{l=1}^m \left[\alpha_{0l}+\sum_{i=1}^N y_{il}\right]}\frac{\alpha_{0j}}{\sum_{l=1}^m \alpha_{0l}}+\frac{\sum_{l=1}^m\sum_{i=1}^N y_{il}}{\sum_{l=1}^m \left[\alpha_{0l}+\sum_{i=1}^N y_{il}\right]}\frac{\sum_{i=1}^N y_{ij}}{\sum_{l=1}^m\sum_{i=1}^N y_{il}}. \end{align}\]

We have again that the posterior mean is a weighted average between the prior mean and the maximum likelihood estimate.

The marginal likelihood is

\[\begin{align} p(\mathbf{y})&=\int_{\mathbf{\Theta}}\frac{\prod_{l=1}^m \theta_l^{\alpha_{0l}-1}}{B(\mathbf{\alpha}_0)}\times \prod_{i=1}^N\frac{n!}{\prod_{l=1}^m y_{il}!}\prod_{l=1}^m \theta_{l}^{y_{il}}d\mathbf{\theta}\\ &=\frac{(n!)^N}{B(\mathbf{\alpha}_0)\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\int_{\mathbf{\Theta}} \prod_{l=1}^m \theta_l^{\alpha_{0l}+\sum_{i=1}^N y_{il}-1} d\mathbf{\theta}\\ &=\frac{(n!)^N}{B(\mathbf{\alpha}_0)\prod_{i=1}^N\prod_{l=1}^m y_{il}!}B(\mathbf{\alpha}_n)\\ &=\frac{(n!)^N \Gamma\left(\sum_{l=1}^m \alpha_{0l}\right)}{\Gamma\left(\sum_{l=1}^m \alpha_{0l}+N\times n\right)}\prod_{l=1}^m \frac{\Gamma\left( \alpha_{nl}\right)}{\Gamma\left(\alpha_{0l}\right)\prod_{i=1}^N y_{il}!}, \end{align}\]

where \(B(\mathbf{\alpha})=\frac{\prod_{l=1}^m\Gamma(\alpha_l)}{\Gamma\left(\sum_{l=1}^m \alpha_l\right)}\).

Following similar steps, we obtain the predictive density

\[\begin{align} p(Y_0|\mathbf{y})&=\frac{ n! \Gamma\left(\sum_{l=1}^m \alpha_{nl}\right)}{\Gamma\left(\sum_{l=1}^m \alpha_{nl}+ n\right)}\prod_{l=1}^m \frac{\Gamma\left( \alpha_{nl}+y_{0l}\right)}{\Gamma\left(\alpha_{nl}\right) y_{0l}!}. \end{align}\]

This is a Dirichlet-multinomial distribution with parameters \(\mathbf{\alpha}_n\).
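
The following sketch, with hypothetical counts and a uniform Dirichlet prior (both are assumptions), computes the posterior Dirichlet parameters and evaluates the Dirichlet-multinomial predictive mass of a new count vector via log-gamma functions (SciPy assumed).

```python
# A minimal sketch of the multinomial-Dirichlet update and the Dirichlet-multinomial
# predictive; each row of Y holds the m category counts of one experiment with n trials.
# Data and hyperparameters are hypothetical.
import numpy as np
from scipy.special import gammaln

Y = np.array([[3, 1, 1],                    # hypothetical counts, m = 3, n = 5 trials
              [2, 2, 1],
              [4, 0, 1]])
alpha0 = np.ones(Y.shape[1])                # uniform Dirichlet prior on the simplex

alpha_n = alpha0 + Y.sum(axis=0)            # posterior Dirichlet parameters
post_mean = alpha_n / alpha_n.sum()         # E[theta_l | y]

def log_predictive(y0, alpha):
    """Log Dirichlet-multinomial mass of a new count vector y0 given Dirichlet alpha."""
    n = y0.sum()
    return (gammaln(n + 1) - gammaln(y0 + 1).sum()
            + gammaln(alpha.sum()) - gammaln(alpha.sum() + n)
            + (gammaln(alpha + y0) - gammaln(alpha)).sum())

print(post_mean, np.exp(log_predictive(np.array([2, 2, 1]), alpha_n)))
```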

  2. Likelihood functions from continuous distributions
  • The normal-normal/inverse-gamma model

Given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\) from a normal distribution, the conjugate prior density has the form

\[\begin{align} \pi(\mu,\sigma^2)&\propto \exp\left\{b_0\left(-\frac{\mu^2}{2\sigma^2}-\frac{\log \sigma^2}{2}\right)\right\}\exp\left\{a_{01}\frac{\mu}{\sigma^2}-a_{02}\frac{1}{\sigma^2}\right\}\\ &=\exp\left\{b_0\left(-\frac{\mu^2}{2\sigma^2}-\frac{\log \sigma^2}{2}\right)\right\}\exp\left\{a_{01}\frac{\mu}{\sigma^2}-a_{02}\frac{1}{\sigma^2}\right\}\exp\left\{-\frac{a_{01}^2}{2\sigma^2b_0}\right\}\exp\left\{\frac{a_{01}^2}{2\sigma^2b_0}\right\}\\ &=\exp\left\{-\frac{b_0}{2\sigma^2}\left(\mu-\frac{a_{01}}{b_0}\right)^2\right\}\left(\frac{1}{\sigma^2}\right)^{\frac{b_0+1-1}{2}}\exp\left\{\frac{1}{\sigma^2}\frac{-2b_0a_{02}+a_{01}^2}{2b_0}\right\}\\ &=\underbrace{\left(\frac{1}{\sigma^2}\right)^{\frac{1}{2}}\exp\left\{-\frac{b_0}{2\sigma^2}\left(\mu-\frac{a_{01}}{b_0}\right)^2\right\}}_{1}\underbrace{\left(\frac{1}{\sigma^2}\right)^{\frac{b_0-1}{2}}\exp\left\{-\frac{1}{\sigma^2}\frac{2b_0a_{02}-a_{01}^2}{2b_0}\right\}}_{2}. \end{align}\]

The first part is the kernel of a normal density with mean \(\mu_0=a_{01}/\beta_0\) and variance \(\sigma^2/\beta_0\), where \(\beta_0=b_0\); that is, \(\mu|\sigma^2\sim N(\mu_0,\sigma^2/\beta_0)\). The second part is the kernel of an inverse gamma density with shape parameter \(\alpha_0/2=\frac{\beta_0-3}{2}\) and scale parameter \(\delta_0/2=\frac{2\beta_0a_{02}-a_{01}^2}{2\beta_0}\), that is, \(\sigma^2\sim IG(\alpha_0/2,\delta_0/2)\). Observe that \(b_0=\beta_0\) is the hypothetical sample size and \(a_{01}\) is the hypothetical sum of prior observations, so it makes sense that \(a_{01}/\beta_0\) and \(\sigma^2/\beta_0\) are the prior mean and variance, respectively.

Therefore, the posterior distribution is also a normal-inverse gamma distribution,

\[\begin{align} \pi(\mu,\sigma^2|\mathbf{y})&\propto \left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{\beta_0}{2\sigma^2}(\mu-\mu_0)^2\right\}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\}\\ &\times(\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i-\mu)^2\right\}\\ & = \left(\frac{1}{\sigma^2}\right)^{1/2}\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}\left(\beta_0(\mu-\mu_0)^2+\sum_{i=1}^N (y_i-\bar{y})^2+N(\mu-\bar{y})^2+\delta_0\right)\right\}\\ & = \underbrace{\left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{1}{2\sigma^2}\left((\beta_0+N)\left(\mu-\left(\frac{\beta_0\mu_0+N\bar{y}}{\beta_0+N}\right)\right)^2\right)\right\}}_{1}\\ & \times \underbrace{\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^N (y_i-\bar{y})^2+\delta_0+\frac{\beta_0N}{\beta_0+N}(\bar{y}-\mu_0)^2\right)\right\}}_{2}, \end{align}\]

where we use \(\sum_{i=1}^N(y_i-\mu)^2=\sum_{i=1}^N(y_i-\bar{y})^2+N(\mu-\bar{y})^2\) and complete the square in \(\mu\): \(\beta_0(\mu-\mu_0)^2+N(\mu-\bar{y})^2=(\beta_0+N)\left(\mu-\frac{\beta_0\mu_0+N\bar{y}}{\beta_0+N}\right)^2+\frac{\beta_0 N}{\beta_0+N}(\bar{y}-\mu_0)^2\).

The first term is the kernel of a normal density, \(\mu|\sigma^2,\mathbf{y}\sim N \left(\mu_n, \sigma_n^2\right)\), where \(\mu_n=\frac{\beta_0\mu_0+N\bar{y}}{\beta_0+N}\) and \(\sigma_n^2=\frac{\sigma^2}{\beta_n}\), \(\beta_n=\beta_0+N\). The second term is the kernel of an inverse gamma density, \(\sigma^2|\mathbf{y}\sim IG(\alpha_n/2,\delta_n/2)\), where \(\alpha_n=\alpha_0+N\) and \(\delta_n=\sum_{i=1}^N (y_i-\bar{y})^2+\delta_0+\frac{\beta_0N}{\beta_0+N}(\bar{y}-\mu_0)^2\). Observe that the posterior mean is a weighted average between prior and sample information. The weights depend on the sample sizes (\(\beta_0\) and \(N\)).

The marginal posterior for \(\sigma^2\) is inverse gamma with shape and scale parameters \(\alpha_n/2\) and \(\delta_n/2\), respectively. The marginal posterior of \(\mu\) is

\[\begin{align} \pi(\mu|\mathbf{y})&\propto \int_{0}^{\infty}\left\{ \left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+1}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}(\beta_n(\mu-\mu_n)^2+\delta_n)\right\}\right\}d\sigma^2\\ &=\frac{\Gamma\left(\frac{\alpha_n+1}{2}\right)}{\left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{\frac{\alpha_n+1}{2}}}\\ &\propto \left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{-\frac{\alpha_n+1}{2}}\left(\frac{\delta_n}{\delta_n}\right)^{-\frac{\alpha_n+1}{2}}\\ &\propto \left[\frac{\alpha_n\beta_n(\mu-\mu_n)^2}{\alpha_n\delta_n}+1\right]^{-\frac{\alpha_n+1}{2}}, \end{align}\]

where the second line follows because the integrand is the kernel of an inverse gamma density with parameters \((\alpha_n+1)/2\) and \(\left(\beta_n(\mu-\mu_n)^2+\delta_n\right)/2\).

This is the kernel of a Student’s t distribution, \(\mu|\mathbf{y}\sim t\left(\mu_n,\delta_n/(\beta_n\alpha_n),\alpha_n\right)\), where \(\mathbb{E}[\mu|\mathbf{y}]=\mu_n\) and \(Var[\mu|\mathbf{y}]=\frac{\alpha_n}{\alpha_n-2}\left(\frac{\delta_n}{\beta_n\alpha_n}\right)=\frac{\delta_n}{(\alpha_n-2)\beta_n}\), \(\alpha_n>2\). Observe that the marginal posterior distribution for \(\mu\) has heavier tails than the conditional posterior distribution due to incorporating uncertainty regarding \(\sigma^2\).
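
A minimal sketch of the normal/inverse-gamma update, assuming SciPy and hypothetical data and hyperparameters, computes \((\mu_n,\beta_n,\alpha_n,\delta_n)\) and the Student's t marginal posterior of \(\mu\).

```python
# A minimal sketch of the normal/inverse-gamma update and the Student's t marginal
# posterior of mu; the data and hyperparameters below are hypothetical.
import numpy as np
from scipy import stats

y = np.array([1.2, 0.7, 2.1, 1.5, 0.9, 1.8])        # hypothetical sample
mu0, beta0, alpha0, delta0 = 0.0, 1.0, 2.0, 2.0      # prior hyperparameters

N, ybar = y.size, y.mean()
beta_n = beta0 + N
mu_n = (beta0 * mu0 + N * ybar) / beta_n
alpha_n = alpha0 + N
delta_n = ((y - ybar) ** 2).sum() + delta0 + beta0 * N / beta_n * (ybar - mu0) ** 2

# sigma^2 | y ~ IG(alpha_n/2, delta_n/2)  and  mu | y ~ t(mu_n, delta_n/(beta_n*alpha_n), alpha_n)
sigma2_post = stats.invgamma(a=alpha_n / 2, scale=delta_n / 2)
mu_post = stats.t(df=alpha_n, loc=mu_n, scale=np.sqrt(delta_n / (beta_n * alpha_n)))
print(mu_n, sigma2_post.mean(), mu_post.interval(0.95))
```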

The marginal likelihood is

\[\begin{align} p(\mathbf{y})&=\int_{-\infty}^{\infty}\int_{0}^{\infty}\left\{ (2\pi\sigma^2/\beta_0)^{-1/2}\exp\left\{-\frac{1}{2\sigma^2/\beta_0}(\mu-\mu_0)^2\right\}\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\right.\\ &\times\left.\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\}(2\pi\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-\mu)^2\right\}\right\}d\sigma^2d\mu\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\int_{-\infty}^{\infty}\int_{0}^{\infty}\left\{\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N+1}{2}+1}\right.\\ &\times\left.\exp\left\{-\frac{1}{2\sigma^2}(\beta_0(\mu-\mu_0)^2+\sum_{i=1}^N (y_i-\mu)^2+\delta_0)\right\}\right\}d\sigma^2d\mu\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\Gamma\left(\frac{N+1+\alpha_0}{2}\right)\\ &\times \int_{-\infty}^{\infty} \left[\frac{\beta_0(\mu-\mu_0)^2+\sum_{i=1}^N(y_i-\mu)^2+\delta_0}{2}\right]^{-\frac{\alpha_0+N+1}{2}}d\mu\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\Gamma\left(\frac{N+1+\alpha_0}{2}\right)\\ &\times \int_{-\infty}^{\infty} \left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{-\frac{\alpha_n+1}{2}}d\mu\left(\frac{\delta_n/2}{\delta_n/2}\right)^{-\frac{\alpha_n+1}{2}}\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\Gamma\left(\frac{\alpha_n+1}{2}\right)\left(\frac{\delta_n}{2}\right)^{-\frac{\alpha_n+1}{2}}\frac{\left(\frac{\delta_n\pi}{\beta_n}\right)^{1/2}\Gamma\left(\frac{\alpha_n}{2}\right)}{\Gamma\left(\frac{\alpha_n+1}{2}\right)}\\ &=\frac{\Gamma\left(\frac{\alpha_n}{2}\right)}{\Gamma\left(\frac{\alpha_0}{2}\right)}\frac{(\delta_0/2)^{\alpha_0/2}}{(\delta_n/2)^{\alpha_n/2}}\left(\frac{\beta_0}{\beta_n}\right)^{1/2}(2\pi)^{-N/2}, \end{align}\]

where we take into account that \(\int_{-\infty}^{\infty} \left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{-\frac{\alpha_n+1}{2}}d\mu\left(\frac{\delta_n/2}{\delta_n/2}\right)^{-\frac{\alpha_n+1}{2}}=\int_{-\infty}^{\infty} \left[\frac{\beta_n\alpha_n(\mu-\mu_n)^2}{\delta_n\alpha_n}+1\right]^{-\frac{\alpha_n+1}{2}}d\mu\left(\frac{\delta_n}{2}\right)^{-\frac{\alpha_n+1}{2}}\). The term in the integral is the kernel of a Student’s t density; this means that the integral is equal to \(\frac{\left(\frac{\delta_n\pi}{\beta_n}\right)^{1/2}\Gamma\left(\frac{\alpha_n}{2}\right)}{\Gamma\left(\frac{\alpha_n+1}{2}\right)}\).

The predictive density is

\[\begin{align} \pi(Y_0|\mathbf{y})&\propto\int_{-\infty}^{\infty}\int_0^{\infty}\left\{ \left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{1}{2\sigma^2}(y_0-\mu)^2\right\}\left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{\beta_n}{2\sigma^2}(\mu-\mu_n)^2\right\}\right.\\ &\times \left.\left(\frac{1}{\sigma^2}\right)^{\alpha_n/2+1}\exp\left\{-\frac{\delta_n}{2\sigma^2}\right\}\right\}d\sigma^2d\mu\\ &=\int_{-\infty}^{\infty}\int_0^{\infty}\left\{ \left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+2}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}((y_0-\mu)^2+\beta_n(\mu-\mu_n)^2+\delta_n)\right\}\right\}d\sigma^2d\mu\\ &\propto\int_{-\infty}^{\infty}\left[\beta_n(\mu-\mu_n)^2+(y_0-\mu)^2+\delta_n\right]^{-\left(\frac{\alpha_n}{2}+1\right)}d\mu\\ &=\int_{-\infty}^{\infty}\left[(\beta_n+1)\left(\mu-\left(\frac{\beta_n\mu_n+y_0}{\beta_n+1}\right)\right)^2+\frac{\beta_n(y_0-\mu_n)^2}{\beta_n+1}+\delta_n\right]^{-\left(\frac{\alpha_n}{2}+1\right)}d\mu\\ &=\int_{-\infty}^{\infty}\left[1+\frac{(\beta_n+1)^2\left(\mu-\left(\frac{\beta_n\mu_n+y_0}{\beta_n+1}\right)\right)^2}{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}\right]^{-\left(\frac{\alpha_n}{2}+1\right)}d\mu\\ &\times\left(\frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{\beta_n+1}\right)^{-\left(\frac{\alpha_n}{2}+1\right)}\\ &\propto\left(\frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{(\beta_n+1)^2(\alpha_n+1)}\right)^{\frac{1}{2}}\left(\frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{\beta_n+1}\right)^{-\left(\frac{\alpha_n}{2}+1\right)}\\ &\propto \left(\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n\right)^{-\left(\frac{\alpha_n+1}{2}\right)}\\ &\propto\left[1+\frac{\beta_n\alpha_n}{(\beta_n+1)\delta_n\alpha_n}(y_0-\mu_n)^2\right]^{-\left(\frac{\alpha_n+1}{2}\right)}, \end{align}\]

where we have that \(\left[1+\frac{(\beta_n+1)^2\left(\mu-\left(\frac{\beta_n\mu_n+y_0}{\beta_n+1}\right)\right)^2}{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}\right]^{-\left(\frac{\alpha_n}{2}+1\right)}\) is the kernel of a Student’s t density with degrees of freedom \(\alpha_n+1\) and scale \(\frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{(\beta_n+1)^2(\alpha_n+1)}\).

The last expression is the kernel of a Student’s t density, that is, \(Y_0|\mathbf{y}\sim t\left(\mu_n,\frac{(\beta_n+1)\delta_n}{\beta_n\alpha_n},\alpha_n\right)\).
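
The closed forms above can be evaluated directly; the following sketch (same hypothetical setup as the previous block, SciPy assumed) computes the log marginal likelihood and the predictive density at \(\bar{y}\).

```python
# Log marginal likelihood and predictive density of the normal/inverse-gamma model,
# coded from the closed-form expressions above (hypothetical data and hyperparameters).
import numpy as np
from scipy import stats
from scipy.special import gammaln

y = np.array([1.2, 0.7, 2.1, 1.5, 0.9, 1.8])
mu0, beta0, alpha0, delta0 = 0.0, 1.0, 2.0, 2.0
N, ybar = y.size, y.mean()
beta_n, alpha_n = beta0 + N, alpha0 + N
mu_n = (beta0 * mu0 + N * ybar) / beta_n
delta_n = ((y - ybar) ** 2).sum() + delta0 + beta0 * N / beta_n * (ybar - mu0) ** 2

log_marginal = (gammaln(alpha_n / 2) - gammaln(alpha0 / 2)
                + (alpha0 / 2) * np.log(delta0 / 2) - (alpha_n / 2) * np.log(delta_n / 2)
                + 0.5 * np.log(beta0 / beta_n) - (N / 2) * np.log(2 * np.pi))

# Y_0 | y ~ t(mu_n, (beta_n + 1) * delta_n / (beta_n * alpha_n), alpha_n)
predictive = stats.t(df=alpha_n, loc=mu_n,
                     scale=np.sqrt((beta_n + 1) * delta_n / (beta_n * alpha_n)))
print(log_marginal, predictive.pdf(ybar))
```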

  • The multivariate normal-normal/inverse-Wishart model

We show in Subsection 4.1 that the multivariate normal distribution is in the exponential family, where \(h(\mathbf{y})=(2\pi)^{-pN/2}\), \(\eta(\mathbf{\mu},\mathbf{\Sigma})^{\top}=\left[\left(vec\left(\mathbf{\Sigma}^{-1}\right)\right)^{\top} \ \ \left(vec\left(\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)\right)^{\top}\right]\), \(T(\mathbf{y})=\left[-\frac{1}{2}\left(vec\left(\mathbf{S}\right)^{\top}+N vec\left(\hat{\mathbf{\mu}}\hat{\mathbf{\mu}}^{\top}\right)^{\top}\right) \ \ -N\hat{\mathbf{\mu}}^{\top}\right]^{\top}\) and \(C(\mathbf{\mu},\mathbf{\Sigma})=\exp\left\{-\frac{1}{2}\left(tr\left(\mathbf{\mu}\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)+\log(|\Sigma|)\right)\right\}\).

Then, its conjugate prior distribution should have the form

\[\begin{align} \pi(\mathbf{\mu},\mathbf{\Sigma})&\propto \exp\left\{-\frac{b_0}{2}\left(tr\left(\mathbf{\mu}\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)+\log(|\Sigma|)\right)\right\}\\ &\times \exp\left\{\mathbf{a}_{01}^{\top} vec\left(\mathbf{\Sigma}^{-1}\right)+\mathbf{a}_{02}^{\top}vec\left(\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)\right\}\\ &=|\Sigma|^{-b_0/2}\exp\left\{-\frac{b_0}{2}\left(tr\left(\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\mathbf{\mu}\right)\right)+tr\left(\mathbf{a}_{02}^{\top}\mathbf{\Sigma}^{-1}\mathbf{\mu}\right)\right\}\\ &\times \exp\left\{\mathbf{a}_{01}^{\top} vec\left(\mathbf{\Sigma}^{-1}\right)+\frac{\mathbf{a}_{02}^{\top}\mathbf{\Sigma}^{-1}\mathbf{a}_{02}}{2b_0}-\frac{\mathbf{a}_{02}^{\top}\mathbf{\Sigma}^{-1}\mathbf{a}_{02}}{2b_0}\right\}\\ &=|\Sigma|^{-b_0/2}\exp\left\{-\frac{b_0}{2}\left(\mathbf{\mu}-\frac{\mathbf{a}_{02}}{b_0}\right)^{\top}\mathbf{\Sigma}^{-1}\left(\mathbf{\mu}-\frac{\mathbf{a}_{02}}{b_0}\right)\right\}\\ &\times \exp\left\{-\frac{1}{2}tr\left(\left(\mathbf{A}_{01}-\frac{\mathbf{a}_{02}\mathbf{a}_{02}^{\top}}{b_0}\right)\mathbf{\Sigma}^{-1}\right)\right\}\\ &=\underbrace{|\Sigma|^{-1/2}\exp\left\{-\frac{b_0}{2}\left(\mathbf{\mu}-\frac{\mathbf{a}_{02}}{b_0}\right)^{\top}\mathbf{\Sigma}^{-1}\left(\mathbf{\mu}-\frac{\mathbf{a}_{02}}{b_0}\right)\right\}}_1\\ &\times \underbrace{|\Sigma|^{-(\alpha_0+p+1)/2}\exp\left\{-\frac{1}{2}tr\left(\left(\mathbf{A}_{01}-\frac{\mathbf{a}_{02}\mathbf{a}_{02}^{\top}}{b_0}\right)\mathbf{\Sigma}^{-1}\right)\right\}}_2, \end{align}\]

where \(b_0\) is the hypothetical sample size, \(\mathbf{a}_{01}\) and \(\mathbf{a}_{02}\) are \(p^2\)- and \(p\)-dimensional vectors of prior sufficient statistics, and \(\mathbf{a}_{01}=-\frac{1}{2}vec(\mathbf{A}_{01})\) such that \(\mathbf{A}_{01}\) is a \(p\times p\) positive semi-definite matrix. Setting \(b_0=1+\alpha_0+p+1\), the first part in the last expression is the kernel of a multivariate normal density with mean \(\mathbf{\mu}_0=\mathbf{a}_{02}/b_0\) and covariance \(\frac{\mathbf{\Sigma}}{b_0}\), that is, \(\mathbf{\mu}|\mathbf{\Sigma}\sim N_p\left(\mathbf{\mu}_0,\frac{\mathbf{\Sigma}}{\beta_0}\right)\), \(b_0=\beta_0\). These hyperparameters make sense because \(\mathbf{a}_{02}\) is the hypothetical sum of prior observations and \(b_0\) is the hypothetical prior sample size. On the other hand, the second expression in the last line is the kernel of an inverse-Wishart distribution with scale matrix \(\mathbf{\Psi}_0=\left(\mathbf{A}_{01}-\frac{\mathbf{a}_{02}\mathbf{a}_{02}^{\top}}{b_0}\right)\) and degrees of freedom \(\alpha_0\), that is, \(\mathbf{\Sigma}\sim IW_p(\mathbf{\Psi}_0,\alpha_0)\). Observe that \(\mathbf{\Psi}_0\) has the same structure as the first part of the sufficient statistics in \(T(\mathbf{y})\), except that it should be understood as coming from prior hypothetical observations.

Therefore, the prior distribution in this setting is normal/inverse-Wishart, and given conjugacy, the posterior distribution is in the same family.

\[\begin{align} \pi(\mathbf{\mu},\mathbf{\Sigma}|\mathbf{Y})&\propto (2\pi)^{-p N/2}|\Sigma|^{-N/2}\exp\left\{-\frac{1}{2}tr\left[\left(\mathbf{S}+N\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)^{\top}\right)\mathbf{\Sigma}^{-1}\right]\right\}\\ &\times |\mathbf{\Sigma}|^{-1/2}\exp\left\{-\frac{\beta_0}{2}tr\left[(\mathbf{\mu}-\mathbf{\mu}_0)(\mathbf{\mu}-\mathbf{\mu}_0)^{\top}\mathbf{\Sigma}^{-1}\right]\right\}|\mathbf{\Sigma}|^{-(\alpha_0+p+1)/2}\exp\left\{-\frac{1}{2}tr(\mathbf{\Psi}_0\mathbf{\Sigma}^{-1})\right\}. \end{align}\]

Taking into account that \[\begin{align} N\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)^{\top}+\beta_0\left(\mathbf{\mu}-\mathbf{\mu}_0\right)\left(\mathbf{\mu}-\mathbf{\mu}_0\right)^{\top}&=(N+\beta_0)\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}\\ &+\frac{N\beta_0}{N+\beta_0}\left(\hat{\mathbf{\mu}}-\mathbf{\mu}_0\right)\left(\hat{\mathbf{\mu}}-\mathbf{\mu}_0\right)^{\top}, \end{align}\]

where \(\mathbf{\mu}_n=\frac{N}{N+\beta_0}\hat{\mathbf{\mu}}+\frac{\beta_0}{N+\beta_0}\mathbf{\mu}_0\) is the posterior mean. We have

\[\begin{align} \pi(\mathbf{\mu},\mathbf{\Sigma}|\mathbf{Y})&\propto |\Sigma|^{-1/2}\exp\left\{-\frac{N+\beta_0}{2}tr\left[\left(\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}\right)\mathbf{\Sigma}^{-1}\right]\right\}\\ &\times |\mathbf{\Sigma}|^{-(N+\alpha_0+p+1)/2}\exp\left\{-\frac{1}{2}tr\left[\left(\mathbf{\Psi}_0+\mathbf{S}+\frac{N\beta_0}{N+\beta_0}(\hat{\mathbf{\mu}}-\mathbf{\mu}_0)(\hat{\mathbf{\mu}}-\mathbf{\mu}_0)^{\top}\right)\mathbf{\Sigma}^{-1}\right]\right\}. \end{align}\]

Then, \(\mathbf{\mu}|\mathbf{\Sigma},\mathbf{Y}\sim N_p\left(\mathbf{\mu}_n,\frac{1}{\beta_n}\mathbf{\Sigma}\right)\), and \(\mathbf{\Sigma}|\mathbf{Y}\sim IW_p\left(\mathbf{\Psi}_n,\alpha_n\right)\), where \(\beta_n=N+\beta_0\), \(\alpha_n=N+\alpha_0\) and \(\mathbf{\Psi}_n=\mathbf{\Psi}_0+\mathbf{S}+\frac{N\beta_0}{N+\beta_0}(\hat{\mathbf{\mu}}-\mathbf{\mu}_0)(\hat{\mathbf{\mu}}-\mathbf{\mu}_0)^{\top}\).

The marginal posterior of \(\mathbf{\mu}\) is given by \(\int_{\mathcal{S}} \pi(\mathbf{\mu},\mathbf{\Sigma}|\mathbf{Y})d\mathbf{\Sigma}\), where \(\mathcal{S}\) is the space of positive definite matrices. Then,

\[\begin{align} \pi(\mathbf{\mu}|\mathbf{Y})&\propto\int_{\mathcal{S}}\left\{|\mathbf{\Sigma}|^{-(\alpha_n+p+2)/2} \exp\left\{-\frac{1}{2}tr\left[\left(\beta_n\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}+\mathbf{\Psi}_n\right)\mathbf{\Sigma}^{-1}\right]\right\} \right\}d\mathbf{\Sigma}\\ &\propto \big\lvert\left(\beta_n\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}+\mathbf{\Psi}_n\right)\big\lvert^{-(\alpha_n+1)/2}\\ &=\left[\big\lvert\mathbf{\Psi}_n\big\lvert\times \big\lvert1+\beta_n\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}\mathbf{\Psi}_n^{-1}\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\big\lvert\right]^{-(\alpha_n+1)/2}\\ &\propto \left(1+\frac{1}{\alpha_n+1-p}\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}\left(\frac{\mathbf{\Psi}_n}{(\alpha_n+1-p)\beta_n}\right)^{-1}\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\right)^{-(\alpha_n+1-p+p)/2}, \end{align}\]

where the second line uses properties of the inverse Wishart distribution, and the third line uses a particular case of Sylvester's determinant theorem.

We observe that the last line is the kernel of a Multivariate Student’s t distribution, that is, \(\mathbf{\mu}|\mathbf{Y}\sim t_p(v_n,\mathbf{\mu}_n,\mathbf{\Sigma}_n)\) where \(v_n=\alpha_n+1-p\) and \(\mathbf{\Sigma}_n=\frac{\mathbf{\Psi}_n}{(\alpha_n+1-p)\beta_n}\).

The marginal likelihood is given by

\[\begin{align} p(\mathbf{Y})=\frac{\Gamma_p\left(\frac{\alpha_n}{2}\right)}{\Gamma_p\left(\frac{\alpha_0}{2}\right)}\frac{|\mathbf{\Psi}_0|^{\alpha_0/2}}{|\mathbf{\Psi}_n|^{\alpha_n/2}}\left(\frac{\beta_0}{\beta_n}\right)^{p/2}\pi^{-Np/2}, \end{align}\]

where \(\Gamma_p\) is the multivariate gamma function (see Exercise 4).

The posterior predictive distribution is \(\mathbf{Y}_0|\mathbf{Y}\sim t_p(v_n,\mathbf{\mu}_n,(\beta_n+1)\mathbf{\Sigma}_n)\) (see Exercise 5).
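
A minimal sketch of the multivariate update, assuming SciPy and hypothetical data and hyperparameters, computes \((\mathbf{\mu}_n,\beta_n,\alpha_n,\mathbf{\Psi}_n)\), the posterior mean of \(\mathbf{\Sigma}\), and one posterior draw of \(\mathbf{\Sigma}\).

```python
# A minimal sketch of the multivariate normal/inverse-Wishart update; rows of Y are
# observations, and all data and hyperparameter values are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
Y = rng.normal(size=(20, 3))                          # hypothetical N x p sample
N, p = Y.shape
mu_hat = Y.mean(axis=0)
S = (Y - mu_hat).T @ (Y - mu_hat)                     # centered sum of squares

mu0, beta0 = np.zeros(p), 1.0                         # prior hyperparameters
alpha0, Psi0 = p + 2.0, np.eye(p)                     # alpha0 > p + 1 so E[Sigma] exists

beta_n, alpha_n = beta0 + N, alpha0 + N
mu_n = (N * mu_hat + beta0 * mu0) / beta_n
d = (mu_hat - mu0).reshape(-1, 1)
Psi_n = Psi0 + S + (N * beta0 / beta_n) * (d @ d.T)

# Sigma | Y ~ IW_p(Psi_n, alpha_n); its mean is Psi_n / (alpha_n - p - 1).
Sigma_post_mean = Psi_n / (alpha_n - p - 1)
Sigma_draw = stats.invwishart(df=alpha_n, scale=Psi_n).rvs(random_state=rng)
print(mu_n, Sigma_post_mean, Sigma_draw.shape)
```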

References

Diaconis, Persi, and Donald Ylvisaker. 1979. “Conjugate Priors for Exponential Families.” The Annals of Statistics 7 (2): 269–81.