4.2 Conjugate prior to exponential family
Theorem 4.2.1
The prior distribution \pi(\theta)\propto C(\theta)^{b_0}\exp\left\{\eta(\theta)^{\top}a_0\right\} is conjugate to the exponential family (equation (4.4)).
Proof
\begin{align} \pi(\theta|\mathbf{y})&\propto C(\theta)^{b_0}\exp\left\{\eta(\theta)^{\top}a_0\right\}\times h(\mathbf{y})C(\theta)^{N}\exp\left\{\eta(\theta)^{\top}T(\mathbf{y})\right\}\\ &\propto C(\theta)^{N+b_0}\exp\left\{\eta(\theta)^{\top}\left(T(\mathbf{y})+a_0\right)\right\}. \end{align}
Observe that the posterior is in the exponential family, \pi(\theta|\mathbf{y})\propto C(\theta)^{\beta_n}\exp\left\{\eta(\theta)^{\top}\alpha_n\right\}, where \beta_n=N+b_0 and \alpha_n=T(\mathbf{y})+a_0.
Remarks
Comparing the prior and the likelihood, we see that b_0 plays the role of a hypothetical sample size, and a_0 plays the role of a hypothetical sufficient statistic. This view helps in the elicitation process.
In addition, we established the result in the standard form of the exponential family; it can also be established in the canonical form of the exponential family. Observe that, given \eta=\eta(\theta), another way to obtain a prior for \eta is to use the change of variables theorem, provided \eta(\cdot) is a bijective function.
In the setting of a regular conjugate prior, Diaconis, Ylvisaker, and others (1979) show that the posterior expectation of the sufficient statistics is a weighted average of the prior expectation and the likelihood estimate.
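As a quick illustration of Theorem 4.2.1, the following minimal Python sketch applies the hyperparameter update \beta_n=N+b_0 and \alpha_n=T(\mathbf{y})+a_0 once the sufficient statistic of the sample is supplied; the function name and the numbers are ours, purely for illustration.

```python
import numpy as np

def conjugate_update(b0, a0, N, T_y):
    """Generic exponential-family conjugate update (Theorem 4.2.1):
    beta_n = N + b0 and alpha_n = T(y) + a0."""
    return N + b0, np.asarray(a0) + np.asarray(T_y)

# Illustrative example: Poisson data, where T(y) is the sum of the observations
y = np.array([2, 0, 3, 1, 4])
beta_n, alpha_n = conjugate_update(b0=1.0, a0=2.0, N=y.size, T_y=y.sum())
print(beta_n, alpha_n)  # 6.0 and 12.0
```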
Examples: Theorem 4.2.1
- Likelihood functions from discrete distributions
- The Poisson-gamma model
Given a random sample \mathbf{y}=[y_1,y_2,\dots,y_N]^{\top} from a Poisson distribution, then a conjugate prior density for \lambda has the form
\begin{align} \pi(\lambda)&\propto \left(\exp(-\lambda)\right)^{b_0}\exp\left\{a_0\log(\lambda)\right\}\\ &=\exp(-\lambda b_0)\lambda^{a_0}\\ &=\exp(-\lambda\beta_0)\lambda^{\alpha_0-1}. \end{align}
This is the kernel of a gamma density in the rate parametrization, G(\alpha_0,\beta_0), where \alpha_0=a_0+1 and \beta_0=b_0. Then, a conjugate prior distribution for the Poisson likelihood is a gamma distribution.
Taking into account that \sum_{i=1}^N y_i is a sufficient statistic for the Poisson distribution, we can think of a_0 as the number of occurrences in b_0 experiments. Observe that
\begin{align} \pi(\lambda|\mathbf{y})&\propto \exp(-\lambda\beta_0)\lambda^{\alpha_0-1}\times\exp(-N\lambda)\lambda^{\sum_{i=1}^N y_i}\\ &=\exp(-\lambda(N+\beta_0))\lambda^{\sum_{i=1}^N y_i+\alpha_0-1}. \end{align}
As expected, this is the kernel of a gamma distribution, which means \lambda|\mathbf{y}\sim G(\alpha_n,\beta_n), \alpha_n=\sum_{i=1}^N y_i+\alpha_0 and \beta_n=N+\beta_0.
Observe that \alpha_0/\beta_0 is the prior mean, and \alpha_0/\beta_0^2 is the prior variance. Then, \alpha_0\rightarrow 0 and \beta_0\rightarrow 0 imply a non-informative prior such that the posterior mean converges to the maximum likelihood estimator \bar{y}=\frac{\sum_{i=1}^N y_i}{N},
\begin{align} \mathbb{E}[\lambda|\mathbf{y}]=\frac{\alpha_n}{\beta_n}=\frac{\sum_{i=1}^N y_i+\alpha_0}{N+\beta_0}=\frac{N\bar{y}}{N+\beta_0}+\frac{\alpha_0}{N+\beta_0}. \end{align}
The posterior mean is a weighted average between sample and prior information. This is a general result for regular conjugate priors (Diaconis, Ylvisaker, and others 1979). Observe that \lim_{\alpha_0,\beta_0\rightarrow 0}\mathbb{E}[\lambda|\mathbf{y}]=\bar{y}.
In addition, \alpha_0\rightarrow 0 and \beta_0\rightarrow 0 correspond to \pi(\lambda)\propto \frac{1}{\lambda}, which is an improper prior. Improper priors have undesirable consequences for Bayes factors (hypothesis testing). In this setting, we can get analytical solutions for the marginal likelihood and the predictive distribution (see the health insurance example and exercise 3 in Chapter 1).
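A short numerical sketch of the Poisson-gamma update may help to fix ideas; it uses simulated data and illustrative hyperparameters \alpha_0,\beta_0 (our choices, not from the text), and checks that the posterior mean is the weighted average given above and approaches \bar{y} as \alpha_0,\beta_0\rightarrow 0.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.poisson(lam=2.5, size=50)       # simulated Poisson data (illustrative)
N, ybar = y.size, y.mean()

alpha0, beta0 = 1.0, 1.0                # illustrative prior hyperparameters
alpha_n = alpha0 + y.sum()              # posterior shape
beta_n = beta0 + N                      # posterior rate

post_mean = alpha_n / beta_n
weighted = N * ybar / (N + beta0) + alpha0 / (N + beta0)
print(np.isclose(post_mean, weighted))  # True: weighted average of sample and prior information

# As alpha0, beta0 -> 0 the posterior mean approaches the MLE ybar
for eps in (1e-1, 1e-3, 1e-6):
    print((eps + y.sum()) / (eps + N) - ybar)
```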
- The Bernoulli-beta model
Given a random sample \mathbf{y}=[y_1,y_2,\dots,y_N]^{\top} from a Bernoulli distribution then a conjugate prior density for \theta has the form
\begin{align} \pi(\theta)&\propto (1-\theta)^{b_0} \exp\left\{a_0\log\left(\frac{\theta}{1-\theta}\right)\right\}\\ & = (1-\theta)^{b_0-a_0}\theta^{a_0}\\ & = \theta^{\alpha_0-1}(1-\theta)^{\beta_0-1}. \end{align}
This is the kernel of a beta density, B(\alpha_0,\beta_0), \alpha_0=a_0+1 and \beta_0=b_0-a_0+1. A conjugate prior distribution for the Bernoulli likelihood is a beta distribution. Given that b_0 is the hypothetical sample size and a_0 is the hypothetical sufficient statistic, which is the number of successes, then b_0-a_0 is the number of failures. This implies that \alpha_0 is the number of prior successes plus one, and \beta_0 is the number of prior failures plus one. Given that the mode of a beta-distributed random variable is \frac{\alpha_0-1}{\alpha_0+\beta_0-2}=\frac{a_0}{b_0}, this is the a priori probability of success. Setting \alpha_0=1 and \beta_0=1, which implies a uniform distribution on (0,1), corresponds to a setting with 0 successes (and 0 failures) in 0 experiments.
Observe that
\begin{align} \pi(\theta|\mathbf{y})&\propto \theta^{\alpha_0-1}(1-\theta)^{\beta_0-1} \times \theta^{\sum_{i=1}^N y_i}(1-\theta)^{N-\sum_{i=1}^Ny_i}\\ &= \theta^{\alpha_0+\sum_{i=1}^N y_i-1}(1-\theta)^{\beta_0+N-\sum_{i=1}^Ny_i-1}. \end{align}
The posterior distribution is beta, \theta|\mathbf{y}\sim B(\alpha_n,\beta_n), \alpha_n=\alpha_0+\sum_{i=1}^N y_i and \beta_n=\beta_0+N-\sum_{i=1}^Ny_i, where the posterior mean \mathbf{E}[\theta|\mathbf{y}]=\frac{\alpha_n}{\alpha_n+\beta_n}=\frac{\alpha_0+N\bar{y}}{\alpha_0+\beta_0+N}=\frac{\alpha_0+\beta_0}{\alpha_0+\beta_0+N}\frac{\alpha_0}{\alpha_0+\beta_0}+\frac{N}{\alpha_0+\beta_0+N}\bar{y}. The posterior mean is a weighted average between the prior mean and the maximum likelihood estimate.
The marginal likelihood in this setting is
\begin{align} p(\mathbf{y})=&\int_{0}^1 \frac{\theta^{\alpha_0-1}(1-\theta)^{\beta_0-1}}{B(\alpha_0,\beta_0)}\times \theta^{\sum_{i=1}^N y_i}(1-\theta)^{N-\sum_{i=1}^N y_i}d\theta\\ =& \frac{B(\alpha_n,\beta_n)}{B(\alpha_0,\beta_0)}, \end{align}
where B(\cdot ,\cdot) is the beta function.
In addition, the predictive density is
\begin{align} p(Y_0|\mathbf{y})&=\int_0^1 \theta^{y_0}(1-\theta)^{1-y_0}\times \frac{\theta^{\alpha_n-1}(1-\theta)^{\beta_n-1}}{B(\alpha_n,\beta_n)}d\theta\\ &=\frac{B(\alpha_n+y_0,\beta_n+1-y_0)}{B(\alpha_n,\beta_n)}\\ &=\frac{\Gamma(\alpha_n+\beta_n)\Gamma(\alpha_n+y_0)\Gamma(\beta_n+1-y_0)}{\Gamma(\alpha_n+\beta_n+1)\Gamma(\alpha_n)\Gamma(\beta_n)}\\ &=\begin{cases} \frac{\alpha_n}{\alpha_n+\beta_n}, & y_0=1\\ \frac{\beta_n}{\alpha_n+\beta_n}, & y_0=0. \end{cases} \end{align}
This is a Bernoulli distribution with probability of success equal to \frac{\alpha_n}{\alpha_n+\beta_n}.
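The beta-Bernoulli quantities above are straightforward to compute; here is a minimal sketch (simulated data and the uniform prior \alpha_0=\beta_0=1, both illustrative) that forms the posterior, the log marginal likelihood \log\left[B(\alpha_n,\beta_n)/B(\alpha_0,\beta_0)\right], and the predictive probability of success \alpha_n/(\alpha_n+\beta_n).

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=40)             # simulated Bernoulli data (illustrative)
N, s = y.size, y.sum()

alpha0, beta0 = 1.0, 1.0                      # uniform prior on theta
alpha_n, beta_n = alpha0 + s, beta0 + N - s   # posterior hyperparameters

log_marginal = betaln(alpha_n, beta_n) - betaln(alpha0, beta0)  # log p(y)
pred_success = alpha_n / (alpha_n + beta_n)                     # p(Y0 = 1 | y)
print(log_marginal, pred_success)
```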
- The multinomial-Dirichlet model
Given a random sample \mathbf{y}=[y_1,y_2,\dots,y_N]^{\top} from a multinomial distribution then a conjugate prior density for \mathbf{\theta}=\left[\theta_1,\theta_2,\dots,\theta_m\right] has the form
\begin{align} \pi(\mathbf{\theta})&\propto \theta_m^{b_0} \exp\left\{\mathbf{\eta}(\mathbf{\theta})^{\top}\mathbf{a}_0\right\}\\ & = \prod_{l=1}^{m-1}\theta_l^{a_{0l}}\theta_m^{b_0-\sum_{l=1}^{m-1}a_{0l}}\\ & = \prod_{l=1}^{m}\theta_l^{\alpha_{0l}-1}, \end{align}
where \mathbf{\eta}(\mathbf{\theta})=\left[\log\left(\frac{\theta_1}{\theta_m}\right),\dots,\log\left(\frac{\theta_{m-1}}{\theta_m}\right)\right], \mathbf{a}_0=\left[a_{01},\dots,a_{0,m-1}\right]^{\top}, \mathbf{\alpha}_0=\left[\alpha_{01},\alpha_{02},\dots,\alpha_{0m}\right], \alpha_{0l}=a_{0l}+1, l=1,2,\dots,m-1 and \alpha_{0m}=b_0-\sum_{l=1}^{m-1} a_{0l}+1.
This is the kernel of a Dirichlet distribution, that is, the prior distribution is D(\mathbf{\alpha}_0).
Observe that a_{0l} is the hypothetical number of times outcome l is observed over the hypothetical b_0 trials. Setting \alpha_{0l}=1, that is, a uniform distribution over the open standard simplex, implicitly sets a_{0l}=0, which means that there are 0 occurrences of category l in b_0=0 experiments.
The posterior distribution of the multinomial-Dirichlet model is given by
\begin{align} \pi(\mathbf{\theta}|\mathbf{y})&\propto \prod_{l=1}^m \theta_l^{\alpha_{0l}-1}\times\prod_{l=1}^m \theta_l^{\sum_{i=1}^{N} y_{il}}\\ &=\prod_{l=1}^m \theta_l^{\alpha_{0l}+\sum_{i=1}^{N} y_{il}-1} \end{align}
This is the kernel of a Dirichlet distribution D(\mathbf{\alpha}_n), \mathbf{\alpha}_n=\left[\alpha_{n1},\alpha_{n2},\dots,\alpha_{nm}\right], \alpha_{nl}=\alpha_{0l}+\sum_{i=1}^{N}y_{il}, l=1,2,\dots,m. Observe that
\begin{align} \mathbb{E}[\theta_{j}|\mathbf{y}]&=\frac{\alpha_{nj}}{\sum_{l=1}^m \left[\alpha_{0l}+\sum_{i=1}^N y_{il}\right]}\\ &=\frac{\sum_{l=1}^m \alpha_{0l}}{\sum_{l=1}^m \left[\alpha_{0l}+\sum_{i=1}^N y_{il}\right]}\frac{\alpha_{0j}}{\sum_{l=1}^m \alpha_{0l}}+\frac{\sum_{l=1}^m\sum_{i=1}^N y_{il}}{\sum_{l=1}^m \left[\alpha_{0l}+\sum_{i=1}^N y_{il}\right]}\frac{\sum_{i=1}^N y_{ij}}{\sum_{l=1}^m\sum_{i=1}^N y_{il}}. \end{align}
We have again that the posterior mean is a weighted average between the prior mean and the maximum likelihood estimate.
The marginal likelihood is
\begin{align} p(\mathbf{y})&=\int_{\mathbf{\Theta}}\frac{\prod_{l=1}^m \theta_l^{\alpha_{0l}-1}}{B(\mathbf{\alpha}_0)}\times \prod_{i=1}^N\frac{n!}{\prod_{l=1}^m y_{il}!}\prod_{l=1}^m \theta_{l}^{y_{il}}d\mathbf{\theta}\\ &=\frac{(n!)^N}{B(\mathbf{\alpha}_0)\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\int_{\mathbf{\Theta}} \prod_{l=1}^m \theta_l^{\alpha_{0l}+\sum_{i=1}^N y_{il}-1} d\mathbf{\theta}\\ &=\frac{(n!)^N}{B(\mathbf{\alpha}_0)\prod_{i=1}^N\prod_{l=1}^m y_{il}!}B(\mathbf{\alpha}_n)\\ &=\frac{(n!)^N \Gamma\left(\sum_{l=1}^m \alpha_{0l}\right)}{\Gamma\left(\sum_{l=1}^m \alpha_{0l}+N\times n\right)}\prod_{l=1}^m \frac{\Gamma\left( \alpha_{nl}\right)}{\Gamma\left(\alpha_{0l}\right)\prod_{i=1}^N y_{il}!}, \end{align}
where B(\mathbf{\alpha})=\frac{\prod_{l=1}^m\Gamma(\alpha_l)}{\Gamma\left(\sum_{l=1}^m \alpha_l\right)}.
Following similar steps we get the predictive density
\begin{align} p(Y_0|\mathbf{y})&=\frac{ n! \Gamma\left(\sum_{l=1}^m \alpha_{nl}\right)}{\Gamma\left(\sum_{l=1}^m \alpha_{nl}+ n\right)}\prod_{l=1}^m \frac{\Gamma\left( \alpha_{nl}+y_{0l}\right)}{\Gamma\left(\alpha_{nl}\right) y_{0l}!}. \end{align}
This is a Dirichlet-multinomial distribution with parameters \mathbf{\alpha}_n.
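A minimal sketch for the multinomial-Dirichlet model, with simulated counts and a uniform Dirichlet prior (both illustrative): it computes the posterior parameters \mathbf{\alpha}_n, the posterior mean, and the log marginal likelihood written as the sum of the log multinomial coefficients plus \log B(\mathbf{\alpha}_n)-\log B(\mathbf{\alpha}_0).

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
n, m, N = 10, 3, 30                           # trials per observation, categories, sample size
theta_true = np.array([0.5, 0.3, 0.2])
Y = rng.multinomial(n, theta_true, size=N)    # N x m matrix of counts (illustrative)

alpha0 = np.ones(m)                           # uniform Dirichlet prior
alpha_n = alpha0 + Y.sum(axis=0)              # posterior Dirichlet parameters
post_mean = alpha_n / alpha_n.sum()           # E[theta | y]

def log_mv_beta(a):
    # multivariate beta function: B(a) = prod_l Gamma(a_l) / Gamma(sum_l a_l)
    return gammaln(a).sum() - gammaln(a.sum())

log_coef = (gammaln(n + 1) - gammaln(Y + 1).sum(axis=1)).sum()  # sum of log multinomial coefficients
log_marginal = log_coef + log_mv_beta(alpha_n) - log_mv_beta(alpha0)
print(post_mean, log_marginal)
```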
- Likelihood functions from continuous distributions
- The normal-normal/inverse-gamma model
Given a random sample \mathbf{y}=[y_1,y_2,\dots,y_N]^{\top} from a normal distribution, then the conjugate prior density has the form
\begin{align} \pi(\mu,\sigma^2)&\propto \exp\left\{b_0\left(-\frac{\mu^2}{2\sigma^2}-\frac{\log \sigma^2}{2}\right)\right\}\exp\left\{a_{01}\frac{\mu}{\sigma^2}-a_{02}\frac{1}{\sigma^2}\right\}\\ &=\exp\left\{b_0\left(-\frac{\mu^2}{2\sigma^2}-\frac{\log \sigma^2}{2}\right)\right\}\exp\left\{a_{01}\frac{\mu}{\sigma^2}-a_{02}\frac{1}{\sigma^2}\right\}\exp\left\{-\frac{a_{01}^2}{2\sigma^2b_0}\right\}\exp\left\{\frac{a_{01}^2}{2\sigma^2b_0}\right\}\\ &=\exp\left\{-\frac{b_0}{2\sigma^2}\left(\mu-\frac{a_{01}}{b_0}\right)^2\right\}\left(\frac{1}{\sigma^2}\right)^{\frac{b_0+1-1}{2}}\exp\left\{\frac{1}{\sigma^2}\frac{-2b_0a_{02}+a_{01}^2}{2b_0}\right\}\\ &=\underbrace{\left(\frac{1}{\sigma^2}\right)^{\frac{1}{2}}\exp\left\{-\frac{b_0}{2\sigma^2}\left(\mu-\frac{a_{01}}{b_0}\right)^2\right\}}_{1}\underbrace{\left(\frac{1}{\sigma^2}\right)^{\frac{b_0-1}{2}}\exp\left\{-\frac{1}{\sigma^2}\frac{2b_0a_{02}-a_{01}^2}{2b_0}\right\}}_{2}. \end{align}
The first part is the kernel of a normal density with mean \mu_0=a_{01}/\beta_0 and variance \sigma^2/\beta_0, where \beta_0=b_0; that is, \mu|\sigma^2\sim N(\mu_0,\sigma^2/\beta_0). The second part is the kernel of an inverse gamma density with shape parameter \alpha_0/2=\frac{\beta_0-3}{2} and scale parameter \delta_0/2=\frac{2\beta_0a_{02}-a_{01}^2}{2\beta_0}, that is, \sigma^2\sim IG(\alpha_0/2,\delta_0/2). Observe that b_0=\beta_0 is the hypothetical sample size and a_{01} is the hypothetical sum of prior observations; thus, it makes sense that a_{01}/\beta_0 and \sigma^2/\beta_0 are the prior mean and variance, respectively.
Therefore, the posterior distribution is also a normal-inverse gamma distribution,
\begin{align} \pi(\mu,\sigma^2|\mathbf{y})&\propto \left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{\beta_0}{2\sigma^2}(\mu-\mu_0)^2\right\}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\}\\ &\times(\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i-\mu)^2\right\}\\ & = \left(\frac{1}{\sigma^2}\right)^{1/2}\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}\left(\beta_0(\mu-\mu_0)^2+\sum_{i=1}^N (y_i-\bar{y})^2+N(\mu-\bar{y})^2+\delta_0\right)\right\}\\ & = \underbrace{\left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{1}{2\sigma^2}(\beta_0+N)\left(\mu-\frac{\beta_0\mu_0+N\bar{y}}{\beta_0+N}\right)^2\right\}}_{1}\\ & \times \underbrace{\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^N (y_i-\bar{y})^2+\delta_0+\frac{\beta_0N}{\beta_0+N}(\bar{y}-\mu_0)^2\right)\right\}}_{2}, \end{align} where the last step completes the square in \mu by adding and subtracting \frac{(\beta_0\mu_0+N\bar{y})^2}{\beta_0+N} inside the exponent.
The first term is the kernel of a normal density, \mu|\sigma^2,\mathbf{y}\sim N \left(\mu_n, \sigma_n^2\right), where \mu_n=\frac{\beta_0\mu_0+N\bar{y}}{\beta_0+N} and \sigma_n^2=\frac{\sigma^2}{\beta_n}, \beta_n=\beta_0+N. The second term is the kernel of an inverse gamma density, \sigma^2|\mathbf{y}\sim IG(\alpha_n/2,\delta_n/2) where \alpha_n=\alpha_0+N and \delta_n=\sum_{i=1}^N (y_i-\bar{y})^2+\delta_0+\frac{\beta_0N}{\beta_0+N}(\bar{y}-\mu_0)^2. Observe that the posterior mean is a weighted average between prior and sample information. The weights depend on the sample sizes (\beta_0 and N).
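A minimal sketch of the normal-normal/inverse-gamma update (simulated data and illustrative hyperparameters \mu_0,\beta_0,\alpha_0,\delta_0 of our choosing) that computes \mu_n,\beta_n,\alpha_n and \delta_n as defined above:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=1.0, scale=2.0, size=100)      # simulated data (illustrative)
N, ybar = y.size, y.mean()

mu0, beta0, alpha0, delta0 = 0.0, 1.0, 3.0, 3.0   # illustrative prior hyperparameters

beta_n = beta0 + N
mu_n = (beta0 * mu0 + N * ybar) / beta_n          # weighted average of prior and sample means
alpha_n = alpha0 + N
delta_n = delta0 + ((y - ybar) ** 2).sum() + beta0 * N / beta_n * (ybar - mu0) ** 2
print(mu_n, beta_n, alpha_n, delta_n)
```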
The marginal posterior for \sigma^2 is inverse gamma with shape and scale parameters \alpha_n/2 and \delta_n/2, respectively. The marginal posterior of \mu is
\begin{align} \pi(\mu|\mathbf{y})&\propto \int_{0}^{\infty}\left\{ \left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+1}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}(\beta_n(\mu-\mu_n)^2+\delta_n)\right\}\right\}d\sigma^2\\ &=\frac{\Gamma\left(\frac{\alpha_n+1}{2}\right)}{\left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{\frac{\alpha_n+1}{2}}}\\ &\propto \left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{-\frac{\alpha_n+1}{2}}\left(\frac{\delta_n/2}{\delta_n/2}\right)^{-\frac{\alpha_n+1}{2}}\\ &\propto \left[\frac{\alpha_n\beta_n(\mu-\mu_n)^2}{\alpha_n\delta_n}+1\right]^{-\frac{\alpha_n+1}{2}}, \end{align}
where the second line follows because the integrand is the kernel of an inverse gamma density with parameters (\alpha_n+1)/2 and \frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}.
This is the kernel of a Student’s t distribution, \mu|\mathbf{y}\sim t(\mu_n,\delta_n/\beta_n\alpha_n,\alpha_n), where \mathbb{E}[\mu|\mathbf{y}]=\mu_n and Var[\mu|\mathbf{y}]=\frac{\alpha_n}{\alpha_n-2}\left(\frac{\delta_n}{\beta_n\alpha_n}\right)=\frac{\delta_n}{(\alpha_n-2)\beta_n}, \alpha_n>2. Observe that the marginal posterior distribution for \mu has heavier tails than the conditional posterior distribution due to incorporating uncertainty regarding \sigma^2.
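Because \mu|\mathbf{y} is a Student's t with location \mu_n, squared scale \delta_n/(\beta_n\alpha_n) and \alpha_n degrees of freedom, it can be evaluated or sampled with scipy.stats.t; a minimal sketch follows, where the numerical values of \mu_n,\beta_n,\alpha_n,\delta_n are placeholders that would come from the update above.

```python
import numpy as np
from scipy import stats

# Illustrative posterior hyperparameters (in practice, computed from the data as above)
mu_n, beta_n, alpha_n, delta_n = 0.95, 101.0, 103.0, 420.0

scale = np.sqrt(delta_n / (beta_n * alpha_n))       # scale of the marginal posterior of mu
mu_marginal = stats.t(df=alpha_n, loc=mu_n, scale=scale)

print(mu_marginal.mean())                           # posterior mean of mu
print(mu_marginal.interval(0.95))                   # 95% credible interval for mu
draws = mu_marginal.rvs(size=5000, random_state=4)  # posterior draws of mu
```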
The marginal likelihood is
\begin{align} p(\mathbf{y})&=\int_{-\infty}^{\infty}\int_{0}^{\infty}\left\{ (2\pi\sigma^2/\beta_0)^{-1/2}\exp\left\{-\frac{1}{2\sigma^2/\beta_0}(\mu-\mu_0)^2\right\}\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\right.\\ &\times\left.\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\}(2\pi\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-\mu)^2\right\}\right\}d\sigma^2d\mu\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\int_{-\infty}^{\infty}\int_{0}^{\infty}\left\{\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N+1}{2}+1}\right.\\ &\times\left.\exp\left\{-\frac{1}{2\sigma^2}(\beta_0(\mu-\mu_0)^2+\sum_{i=1}^N (y_i-\mu)^2+\delta_0)\right\}\right\}d\sigma^2d\mu\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\Gamma\left(\frac{N+1+\alpha_0}{2}\right)\\ &\times \int_{-\infty}^{\infty} \left[\frac{\beta_0(\mu-\mu_0)^2+\sum_{i=1}^N(y_i-\mu)^2+\delta_0}{2}\right]^{-\frac{\alpha_0+N+1}{2}}d\mu\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\Gamma\left(\frac{N+1+\alpha_0}{2}\right)\\ &\times \int_{-\infty}^{\infty} \left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{-\frac{\alpha_n+1}{2}}d\mu\left(\frac{\delta_n/2}{\delta_n/2}\right)^{-\frac{\alpha_n+1}{2}}\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\Gamma\left(\frac{\alpha_n+1}{2}\right)\left(\frac{\delta_n}{2}\right)^{-\frac{\alpha_n+1}{2}}\frac{\left(\frac{\delta_n\pi}{\beta_n}\right)^{1/2}\Gamma\left(\frac{\alpha_n}{2}\right)}{\Gamma\left(\frac{\alpha_n+1}{2}\right)}\\ &=\frac{\Gamma\left(\frac{\alpha_n}{2}\right)}{\Gamma\left(\frac{\alpha_0}{2}\right)}\frac{(\delta_0/2)^{\alpha_0/2}}{(\delta_n/2)^{\alpha_n/2}}\left(\frac{\beta_0}{\beta_n}\right)^{1/2}(2\pi)^{-N/2}, \end{align}
where we take into account that \int_{-\infty}^{\infty} \left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{-\frac{\alpha_n+1}{2}}d\mu\left(\frac{\delta_n/2}{\delta_n/2}\right)^{-\frac{\alpha_n+1}{2}}=\int_{-\infty}^{\infty} \left[\frac{\beta_n\alpha_n(\mu-\mu_n)^2}{\delta_n\alpha_n}+1\right]^{-\frac{\alpha_n+1}{2}}d\mu\left(\frac{\delta_n}{2}\right)^{-\frac{\alpha_n+1}{2}}. The term in the integral is the kernel of a Student’s t density, which means that the integral is equal to \frac{\left(\frac{\delta_n\pi}{\beta_n}\right)^{1/2}\Gamma\left(\frac{\alpha_n}{2}\right)}{\Gamma\left(\frac{\alpha_n+1}{2}\right)}.
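The closed-form expression above translates directly into a numerically stable log marginal likelihood; a minimal sketch (function name and hyperparameter values are ours, for illustration):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, mu0, beta0, alpha0, delta0):
    """Log marginal likelihood of the normal-normal/inverse-gamma model (closed form above)."""
    y = np.asarray(y, dtype=float)
    N, ybar = y.size, y.mean()
    beta_n, alpha_n = beta0 + N, alpha0 + N
    delta_n = delta0 + ((y - ybar) ** 2).sum() + beta0 * N / beta_n * (ybar - mu0) ** 2
    return (gammaln(alpha_n / 2) - gammaln(alpha0 / 2)
            + (alpha0 / 2) * np.log(delta0 / 2) - (alpha_n / 2) * np.log(delta_n / 2)
            + 0.5 * np.log(beta0 / beta_n) - (N / 2) * np.log(2 * np.pi))

y = np.random.default_rng(5).normal(1.0, 2.0, size=50)
print(log_marginal_likelihood(y, mu0=0.0, beta0=1.0, alpha0=3.0, delta0=3.0))
```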
The predictive density is
\begin{align} \pi(Y_0|\mathbf{y})&\propto\int_{-\infty}^{\infty}\int_0^{\infty}\left\{ \left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{1}{2\sigma^2}(y_0-\mu)^2\right\}\left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{\beta_n}{2\sigma^2}(\mu-\mu_n)^2\right\}\right.\\ &\times \left.\left(\frac{1}{\sigma^2}\right)^{\alpha_n/2+1}\exp\left\{-\frac{\delta_n}{2\sigma^2}\right\}\right\}d\sigma^2d\mu\\ &=\int_{-\infty}^{\infty}\int_0^{\infty}\left\{ \left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+2}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}((y_0-\mu)^2+\beta_n(\mu-\mu_n)^2+\delta_n)\right\}\right\}d\sigma^2d\mu\\ &\propto\int_{-\infty}^{\infty}\left[\beta_n(\mu-\mu_n)^2+(y_0-\mu)^2+\delta_n\right]^{-\left(\frac{\alpha_n}{2}+1\right)}d\mu\\ &=\int_{-\infty}^{\infty}\left[(\beta_n+1)\left(\mu-\left(\frac{\beta_n\mu_n+y_0}{\beta_n+1}\right)\right)^2+\frac{\beta_n(y_0-\mu_n)^2}{\beta_n+1}+\delta_n\right]^{-\left(\frac{\alpha_n}{2}+1\right)}d\mu\\ &=\int_{-\infty}^{\infty}\left[1+\frac{(\beta_n+1)^2\left(\mu-\left(\frac{\beta_n\mu_n+y_0}{\beta_n+1}\right)\right)^2}{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}\right]^{-\left(\frac{\alpha_n}{2}+1\right)}d\mu\\ &\times\left(\frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{\beta_n+1}\right)^{-\left(\frac{\alpha_n}{2}+1\right)}\\ &\propto\left(\frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{(\beta_n+1)^2(\alpha_n+1)}\right)^{\frac{1}{2}}\left(\frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{\beta_n+1}\right)^{-\left(\frac{\alpha_n}{2}+1\right)}\\ &\propto (\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n)^{-\left(\frac{\alpha_n+1}{2}\right)}\\ &\propto\left[1+\frac{\beta_n\alpha_n}{(\beta_n+1)\delta_n\alpha_n}(y_0-\mu_n)^2\right]^{-\left(\frac{\alpha_n+1}{2}\right)}, \end{align}
where we have that \left[1+\frac{(\beta_n+1)^2\left(\mu-\left(\frac{\beta_n\mu_n+y_0}{\beta_n+1}\right)\right)^2}{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}\right]^{-\left(\frac{\alpha_n}{2}+1\right)} is the kernel of a Student’s t density with degrees of freedom \alpha_n+1 and scale \frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{(\beta_n+1)^2(\alpha_n+1)}.
The last expression is the kernel of a Student’s t density, that is, Y_0|\mathbf{y}\sim t\left(\mu_n,\frac{(\beta_n+1)\delta_n}{\beta_n\alpha_n},\alpha_n\right).
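Analogously, the predictive distribution is a Student's t with location \mu_n, squared scale (\beta_n+1)\delta_n/(\beta_n\alpha_n) and \alpha_n degrees of freedom; a minimal sketch with placeholder hyperparameter values follows.

```python
import numpy as np
from scipy import stats

# Illustrative posterior hyperparameters (in practice, computed from the data)
mu_n, beta_n, alpha_n, delta_n = 0.95, 101.0, 103.0, 420.0

pred_scale = np.sqrt((beta_n + 1) * delta_n / (beta_n * alpha_n))
predictive = stats.t(df=alpha_n, loc=mu_n, scale=pred_scale)

print(predictive.pdf(1.5))        # predictive density at a hypothetical new point y0 = 1.5
print(predictive.interval(0.95))  # 95% predictive interval for Y0
```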
- The multivariate normal-normal/inverse-Wishart model
We showed in subsection 4.1 that the multivariate normal distribution is in the exponential family, where h(\mathbf{y})=(2\pi)^{-pN/2}, \eta(\mathbf{\mu},\mathbf{\Sigma})^{\top}=\left[\left(vec\left(\mathbf{\Sigma}^{-1}\right)\right)^{\top} \ \ \left(vec\left(\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)\right)^{\top}\right], T(\mathbf{y})=\left[-\frac{1}{2}\left(vec\left(\mathbf{S}\right)^{\top}+N vec\left(\hat{\mathbf{\mu}}\hat{\mathbf{\mu}}^{\top}\right)^{\top}\right) \ \ -N\hat{\mathbf{\mu}}^{\top}\right]^{\top} and C(\mathbf{\mu},\mathbf{\Sigma})=\exp\left\{-\frac{1}{2}\left(tr\left(\mathbf{\mu}\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)+\log(|\Sigma|)\right)\right\}.
Then, its conjugate prior distribution should have the form
\begin{align} \pi(\mathbf{\mu},\mathbf{\Sigma})&\propto \exp\left\{-\frac{b_0}{2}\left(tr\left(\mathbf{\mu}\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)+\log(|\Sigma|)\right)\right\}\\ &\times \exp\left\{\mathbf{a}_{01}^{\top} vec\left(\mathbf{\Sigma}^{-1}\right)+\mathbf{a}_{02}^{\top}vec\left(\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)\right\}\\ &=|\Sigma|^{-b_0/2}\exp\left\{-\frac{b_0}{2}\left(tr\left(\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\mathbf{\mu}\right)\right)+tr\left(\mathbf{a}_{02}^{\top}\mathbf{\Sigma}^{-1}\mathbf{\mu}\right)\right\}\\ &\times \exp\left\{\mathbf{a}_{01}^{\top} vec\left(\mathbf{\Sigma}^{-1}\right)+\frac{\mathbf{a}_{02}^{\top}\mathbf{\Sigma}^{-1}\mathbf{a}_{02}}{2b_0}-\frac{\mathbf{a}_{02}^{\top}\mathbf{\Sigma}^{-1}\mathbf{a}_{02}}{2b_0}\right\}\\ &=|\Sigma|^{-b_0/2}\exp\left\{-\frac{b_0}{2}\left(\mathbf{\mu}-\frac{\mathbf{a}_{02}}{b_0}\right)^{\top}\mathbf{\Sigma}^{-1}\left(\mathbf{\mu}-\frac{\mathbf{a}_{02}}{b_0}\right)\right\}\\ &\times \exp\left\{-\frac{1}{2}tr\left(\left(\mathbf{A}_{01}-\frac{\mathbf{a}_{02}\mathbf{a}_{02}^{\top}}{b_0}\right)\mathbf{\Sigma}^{-1}\right)\right\}\\ &=\underbrace{|\Sigma|^{-1/2}\exp\left\{-\frac{b_0}{2}\left(\mathbf{\mu}-\frac{\mathbf{a}_{02}}{b_0}\right)^{\top}\mathbf{\Sigma}^{-1}\left(\mathbf{\mu}-\frac{\mathbf{a}_{02}}{b_0}\right)\right\}}_1\\ &\times \underbrace{|\Sigma|^{-(\alpha_0+p+1)/2}\exp\left\{-\frac{1}{2}tr\left(\left(\mathbf{A}_{01}-\frac{\mathbf{a}_{02}\mathbf{a}_{02}^{\top}}{b_0}\right)\mathbf{\Sigma}^{-1}\right)\right\}}_2, \end{align}
where b_0 is the hypothetical sample size, \mathbf{a}_{01} and \mathbf{a}_{02} are p^2- and p-dimensional vectors of prior sufficient statistics, and \mathbf{a}_{01}=-\frac{1}{2}vec(\mathbf{A}_{01}) such that \mathbf{A}_{01} is a p\times p positive semi-definite matrix. Setting b_0=1+\alpha_0+p+1, we have that the first part in the last expression is the kernel of a multivariate normal density with mean \mathbf{\mu}_0=\mathbf{a}_{02}/b_0 and covariance \frac{\mathbf{\Sigma}}{b_0}, that is, \mathbf{\mu}|\mathbf{\Sigma}\sim N_p\left(\mathbf{\mu}_0,\frac{\mathbf{\Sigma}}{\beta_0}\right), b_0=\beta_0. These hyperparameters make sense because \mathbf{a}_{02} is the hypothetical sum of prior observations and b_0 is the hypothetical prior sample size. On the other hand, the second expression in the last line is the kernel of an inverse-Wishart distribution with scale matrix \mathbf{\Psi}_0=\left(\mathbf{A}_{01}-\frac{\mathbf{a}_{02}\mathbf{a}_{02}^{\top}}{b_0}\right) and degrees of freedom \alpha_0, that is, \mathbf{\Sigma}\sim IW_p(\mathbf{\Psi}_0,\alpha_0). Observe that \mathbf{\Psi}_0 has the same structure as the first part of the sufficient statistics in T(\mathbf{y}), except that it should be understood as coming from hypothetical prior observations.
Therefore, the prior distribution in this setting is normal/inverse-Wishart, and given conjugacy, the posterior distribution is in the same family.
\begin{align} \pi(\mathbf{\mu},\mathbf{\Sigma}|\mathbf{Y})&\propto (2\pi)^{-p N/2}|\Sigma|^{-N/2}\exp\left\{-\frac{1}{2}tr\left[\left(\mathbf{S}+N\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)^{\top}\right)\mathbf{\Sigma}^{-1}\right]\right\}\\ &\times |\mathbf{\Sigma}|^{-1/2}\exp\left\{-\frac{\beta_0}{2}tr\left[(\mathbf{\mu}-\mathbf{\mu}_0)(\mathbf{\mu}-\mathbf{\mu}_0)^{\top}\mathbf{\Sigma}^{-1}\right]\right\}|\mathbf{\Sigma}|^{-(\alpha_0+p+1)/2}\exp\left\{-\frac{1}{2}tr(\mathbf{\Psi}_0\mathbf{\Sigma}^{-1})\right\}. \end{align}
Taking into account that \begin{align} N\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)^{\top}+\beta_0\left(\mathbf{\mu}-\mathbf{\mu}_0\right)\left(\mathbf{\mu}-\mathbf{\mu}_0\right)^{\top}&=(N+\beta_0)\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}\\ &+\frac{N\beta_0}{N+\beta_0}\left(\hat{\mathbf{\mu}}-\mathbf{\mu}_0\right)\left(\hat{\mathbf{\mu}}-\mathbf{\mu}_0\right)^{\top}, \end{align}
where \mathbf{\mu}_n=\frac{N}{N+\beta_0}\hat{\mathbf{\mu}}+\frac{\beta_0}{N+\beta_0}\mathbf{\mu}_0 is the posterior mean, we have
\begin{align} \pi(\mathbf{\mu},\mathbf{\Sigma}|\mathbf{Y})&\propto |\Sigma|^{-1/2}\exp\left\{-\frac{N+\beta_0}{2}tr\left[\left(\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}\right)\mathbf{\Sigma}^{-1}\right]\right\}\\ &\times |\mathbf{\Sigma}|^{-(N+\alpha_0+p+1)/2}\exp\left\{-\frac{1}{2}tr\left[\left(\mathbf{\Psi}_0+\mathbf{S}+\frac{N\beta_0}{N+\beta_0}(\hat{\mathbf{\mu}}-\mathbf{\mu}_0)(\hat{\mathbf{\mu}}-\mathbf{\mu}_0)^{\top}\right)\mathbf{\Sigma}^{-1}\right]\right\}. \end{align}
Then, \mathbf{\mu}|\mathbf{\Sigma},\mathbf{Y}\sim N_p\left(\mathbf{\mu}_n,\frac{1}{\beta_n}\mathbf{\Sigma}\right), and \mathbf{\Sigma}|\mathbf{Y}\sim IW_p\left(\mathbf{\Psi}_n,\alpha_n\right), where \beta_n=N+\beta_0, \alpha_n=N+\alpha_0 and \mathbf{\Psi}_n=\mathbf{\Psi}_0+\mathbf{S}+\frac{N\beta_0}{N+\beta_0}(\hat{\mathbf{\mu}}-\mathbf{\mu}_0)(\hat{\mathbf{\mu}}-\mathbf{\mu}_0)^{\top}.
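A minimal sketch of the multivariate normal/inverse-Wishart update (simulated data and illustrative hyperparameters of our choosing), computing \mathbf{\mu}_n, \beta_n, \alpha_n and \mathbf{\Psi}_n as defined above:

```python
import numpy as np

rng = np.random.default_rng(6)
p, N = 3, 200
Sigma_true = np.array([[1.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 1.0]])
Y = rng.multivariate_normal(mean=[1.0, -0.5, 0.0], cov=Sigma_true, size=N)  # illustrative data

mu_hat = Y.mean(axis=0)
S = (Y - mu_hat).T @ (Y - mu_hat)            # scatter matrix around the sample mean

mu0 = np.zeros(p)                            # illustrative prior hyperparameters
beta0, alpha0 = 1.0, p + 2.0
Psi0 = np.eye(p)

beta_n, alpha_n = beta0 + N, alpha0 + N
mu_n = (beta0 * mu0 + N * mu_hat) / beta_n   # weighted average of prior and sample means
Psi_n = Psi0 + S + (beta0 * N / beta_n) * np.outer(mu_hat - mu0, mu_hat - mu0)
print(mu_n, alpha_n, beta_n)
```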
The marginal posterior of \mathbf{\mu} is given by \int_{\mathcal{S}} \pi(\mathbf{\mu},\mathbf{\Sigma}|\mathbf{Y})d\mathbf{\Sigma}, where \mathcal{S} is the space of positive semi-definite matrices. Then,
\begin{align} \pi(\mathbf{\mu}|\mathbf{Y})&\propto\int_{\mathcal{S}}\left\{|\mathbf{\Sigma}|^{-(\alpha_n+p+2)/2} \exp\left\{-\frac{1}{2}tr\left[\left(\beta_n\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}+\mathbf{\Psi}_n\right)\mathbf{\Sigma}^{-1}\right]\right\} \right\}d\mathbf{\Sigma}\\ &\propto \big\lvert\left(\beta_n\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}+\mathbf{\Psi}_n\right)\big\lvert^{-(\alpha_n+1)/2}\\ &=\left[\big\lvert\mathbf{\Psi}_n\big\lvert\times \big\lvert1+\beta_n\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}\mathbf{\Psi}_n^{-1}\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\big\lvert\right]^{-(\alpha_n+1)/2}\\ &\propto \left(1+\frac{1}{\alpha_n+1-p}\left(\mathbf{\mu}-\mathbf{\mu}_n\right)^{\top}\left(\frac{\mathbf{\Psi}_n}{(\alpha_n+1-p)\beta_n}\right)^{-1}\left(\mathbf{\mu}-\mathbf{\mu}_n\right)\right)^{-(\alpha_n+1-p+p)/2}, \end{align}
where the second line uses properties of the inverse Wishart distribution, and the third line uses a particular case of Sylvester’s determinant theorem.
We observe that the last line is the kernel of a Multivariate Student’s t distribution, that is, \mathbf{\mu}|\mathbf{Y}\sim t_p(v_n,\mathbf{\mu}_n,\mathbf{\Sigma}_n) where v_n=\alpha_n+1-p and \mathbf{\Sigma}_n=\frac{\mathbf{\Psi}_n}{(\alpha_n+1-p)\beta_n}.
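Given the hyperparameters above, the marginal posterior of \mathbf{\mu} can be handled with scipy.stats.multivariate_t (available in SciPy 1.6 or later); a minimal sketch with placeholder values for \mathbf{\mu}_n, \beta_n, \alpha_n and \mathbf{\Psi}_n:

```python
import numpy as np
from scipy import stats

# Illustrative posterior hyperparameters (in practice, computed from the data as above)
p = 3
mu_n = np.array([0.9, -0.4, 0.1])
beta_n, alpha_n = 201.0, 205.0
Psi_n = np.eye(p) * 200.0

v_n = alpha_n + 1 - p              # degrees of freedom
Sigma_n = Psi_n / (v_n * beta_n)   # scale matrix of the multivariate Student's t
mu_marginal = stats.multivariate_t(loc=mu_n, shape=Sigma_n, df=v_n)

print(mu_marginal.pdf(mu_n))                        # posterior density at the location mu_n
draws = mu_marginal.rvs(size=1000, random_state=7)  # posterior draws of mu
```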
The marginal likelihood is given by
\begin{align} p(\mathbf{Y})=\frac{\Gamma_p\left(\frac{\alpha_n}{2}\right)}{\Gamma_p\left(\frac{\alpha_0}{2}\right)}\frac{|\mathbf{\Psi}_0|^{\alpha_0/2}}{|\mathbf{\Psi}_n|^{\alpha_n/2}}\left(\frac{\beta_0}{\beta_n}\right)^{p/2}\pi^{-Np/2}, \end{align}
where \Gamma_p is the multivariate gamma function (see Exercise 4).
The posterior predictive distribution is \mathbf{Y}_0|\mathbf{Y}\sim t_p(v_n,\mathbf{\mu}_n,(\beta_n+1)\mathbf{\Sigma}_n) (see Exercise 5).