4.1 Motivation of conjugate families

Consider three fundamental objects of Bayesian analysis: the posterior distribution (parameter inference), the marginal likelihood (hypothesis testing), and the predictive distribution (prediction), given by equations (4.1), (4.2) and (4.3), respectively,

\[\begin{align} \pi(\mathbf{\theta}|\mathbf{y})&=\frac{p(\mathbf{y}|\mathbf{\theta}) \times \pi(\mathbf{\theta})}{p(\mathbf{y})}, \tag{4.1} \end{align}\]

\[\begin{equation} p(\mathbf{y})=\int_{\mathbf{\Theta}}p(\mathbf{y}|\mathbf{\theta})\pi(\mathbf{\theta})d\mathbf{\theta}, \tag{4.2} \end{equation}\]

and

\[\begin{equation} p(\mathbf{Y}_0|\mathbf{y})=\int_{\mathbf{\Theta}}p(\mathbf{Y}_0|\mathbf{\theta})\pi(\mathbf{\theta}|\mathbf{y})d\mathbf{\theta}, \tag{4.3} \end{equation}\]

From these equations we can see that some of the initial limitations on the application of Bayesian analysis were associated with the absence of algorithms to draw from non-standard posterior distributions (equation (4.1)), and with the lack of analytical solutions for the marginal likelihood (equation (4.2)) and the predictive distribution (equation (4.3)). Both issues require computational power.
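To make the computational issue concrete, the following minimal sketch (Python with NumPy/SciPy; the Bernoulli data and the uniform Beta(1,1) prior are illustrative choices, not taken from the text) evaluates equations (4.2) and (4.3) by one-dimensional numerical integration; models with more parameters require multidimensional integrals that rarely have closed forms.

```python
# Illustrative sketch: marginal likelihood (4.2) and predictive distribution (4.3)
# by numerical integration for a Bernoulli likelihood with a uniform Beta(1,1) prior.
import numpy as np
from scipy import integrate

y = np.array([1, 0, 1, 1, 0, 1])                                  # observed sample
lik = lambda th: th**y.sum() * (1 - th)**(len(y) - y.sum())       # p(y | theta)
prior = lambda th: 1.0                                            # pi(theta) on (0, 1)

# Equation (4.2): p(y) = integral of p(y|theta) * pi(theta) over the parameter space
p_y, _ = integrate.quad(lambda th: lik(th) * prior(th), 0, 1)

# Equation (4.1): posterior density; equation (4.3): predictive probability of Y0 = 1
posterior = lambda th: lik(th) * prior(th) / p_y
p_y0, _ = integrate.quad(lambda th: th * posterior(th), 0, 1)

print(p_y, p_y0)   # marginal likelihood and P(Y0 = 1 | y)
```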

Although there have been algorithms to sample from non-standard posterior distributions since the second half of the last century (Metropolis et al. 1953), (Hastings 1970), (Geman and Geman 1984), their application in the Bayesian framework emerged later (A. E. Gelfand and Smith 1990), (Tierney 1994), perhaps awaiting the increasing computational power of desktop computers. However, it is also common practice nowadays to use models that have standard conditional posterior distributions to mitigate computational requirements. In addition, clever mathematical tricks combined with computational algorithms (Alan E. Gelfand and Dey 1994), (Chib 1995), (Chib and Jeliazkov 2001) and approximations (Tierney and Kadane 1986), (Jordan et al. 1999) are used to obtain the marginal likelihood (prior predictive).

Despite these advances, equations (4.1), (4.2) and (4.3) point to two potentially conflicting desirable features of a model specification: analytical solutions, and a posterior distribution in the same family as the prior distribution for a given likelihood. The latter defines conjugate priors, a family of priors that is closed under sampling (Schlaifer and Raiffa 1961, pp. 43–57).

These features are desirable because the former makes hypothesis testing and predictive analysis easy to perform, and the latter implies invariance of the prior-to-posterior updating. Both features imply less computational burden.

We can easily achieve each of these features separately, for instance, using improper priors for analytical tractability, or defining the family of prior distributions in a broad sense to obtain prior conjugacy. However, these approaches are in conflict.

Fortunately, we can achieve these two nice features if we assume that the data generating process is given by a distribution function in the exponential family. That is, given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\), a probability density function \(p(\mathbf{y}|\mathbf{\theta})\) belongs to the exponential family if it has the form

\[\begin{align} p(\mathbf{y}|\mathbf{\theta})&=\prod_{i=1}^N h(y_i) C(\mathbf{\theta}) \exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{T}(y_i)\right\}\\ &=h(\mathbf{y}) C(\mathbf{\theta})^N\exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{T}(\mathbf{y})\right\} \tag{4.4}\\ &=h(\mathbf{y})\exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{T}(\mathbf{y})-A(\mathbf{\theta})\right\}, \end{align}\]

where \(h(\mathbf{y})=\prod_{i=1}^N h(y_i)\) is a non-negative function, \(\eta(\mathbf{\theta})\) is a known function of the parameters, \(A(\mathbf{\theta})=\log\left\{\int_{\mathbf{Y}}h(\mathbf{y})\exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{T}(\mathbf{y})\right\}d\mathbf{y}\right\}=-N\log(C(\mathbf{\theta}))\) is a normalization factor, and \(\mathbf{T}(\mathbf{y})=\sum_{i=1}^N\mathbf{T}(y_i)\) is the vector of sufficient statistics of the distribution (by the factorization theorem).

If the support of \(\mathbf{y}\) is independent of \(\mathbf{\theta}\), then the family is said to be regular, otherwise it is irregular. In addition, if we set \(\eta=\eta(\mathbf{\theta})\), then the exponential family is said to be in the canonical form

\[\begin{align} p(\mathbf{y}|\mathbf{\theta})&=h(\mathbf{y})D(\mathbf{\eta})^N\exp\left\{\eta^{\top}\mathbf{T}(\mathbf{y})\right\}\\ &=h(\mathbf{y})\exp\left\{\eta^{\top}\mathbf{T}(\mathbf{y})-B(\mathbf{\eta})\right\}. \end{align}\]

A nice feature of this representation is that \(\mathbb{E}[\mathbf{T}(\mathbf{y})|\mathbf{\eta}]=\nabla B(\mathbf{\eta})\) and \(Var[\mathbf{T}(\mathbf{y})|\mathbf{\eta}]=\nabla^2 B(\mathbf{\eta})\).
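A short sketch of why the mean identity holds (assuming differentiation under the integral sign, which is valid for regular exponential families): differentiating the normalization condition with respect to \(\mathbf{\eta}\),

\[\begin{align} \mathbf{0}&=\nabla_{\mathbf{\eta}}\int_{\mathbf{Y}}h(\mathbf{y})\exp\left\{\mathbf{\eta}^{\top}\mathbf{T}(\mathbf{y})-B(\mathbf{\eta})\right\}d\mathbf{y}\\ &=\int_{\mathbf{Y}}\left(\mathbf{T}(\mathbf{y})-\nabla B(\mathbf{\eta})\right)p(\mathbf{y}|\mathbf{\eta})d\mathbf{y}=\mathbb{E}[\mathbf{T}(\mathbf{y})|\mathbf{\eta}]-\nabla B(\mathbf{\eta}), \end{align}\]

and differentiating once more yields the variance identity.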

Examples of exponential family distributions

  1. Discrete distributions
  • Given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\) from a Poisson distribution, let’s show that \(p(\mathbf{y}|\lambda)\) is in the exponential family.

\[\begin{align} p(\mathbf{y}|\lambda)&=\prod_{i=1}^N \frac{\lambda^{y_i} \exp(-\lambda)}{y_i!}\\ &=\frac{\lambda^{\sum_{i=1}^N y_i}\exp(-N\lambda)}{\prod_{i=1}^N y_i!}\\ &=\frac{\exp(-N\lambda)\exp(\sum_{i=1}^Ny_i\log(\lambda))}{\prod_{i=1}^N y_i!}, \end{align}\]

then \(h(\mathbf{y})=\left[\prod_{i=1}^N y_i!\right]^{-1}\), \(\eta(\lambda)=\log(\lambda)\), \(T(\mathbf{y})=\sum_{i=1}^N y_i\) (sufficient statistic) and \(C(\lambda)=\exp(-\lambda)\).

If we set \(\eta=\log(\lambda)\), then

\[\begin{align} p(\mathbf{y}|\eta)&=\frac{\exp(\eta\sum_{i=1}^Ny_i-N\exp(\eta))}{\prod_{i=1}^N y_i!}, \end{align}\]

such that \(B(\eta)=N\exp(\eta)\), then \(\nabla(B(\eta))=N\exp(\eta)=N\lambda=\mathbb{E}\left[\sum_{i=1}^N y_i\biggr\rvert\lambda\right]\), that is, \(\mathbb{E}\left[\frac{\sum_{i=1}^N y_i}{N}\biggr\rvert\lambda\right]=\mathbb{E}[\bar{y}|\lambda]=\lambda\), and \(\nabla^2(B(\eta))=N\exp(\eta)=N\lambda=Var\left[\sum_{i=1}^N y_i\biggr\rvert\lambda\right]=N^2 \times Var\left[\bar{y}\rvert\lambda\right]\), then \(Var\left[\bar{y}\rvert\lambda\right]=\frac{\lambda}{N}\).
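A quick Monte Carlo sketch of these two results (Python with NumPy; the values of \(\lambda\), \(N\) and the number of replications are illustrative):

```python
# Illustrative check: for the Poisson model, E[y_bar | lambda] = lambda and
# Var[y_bar | lambda] = lambda / N, as implied by the gradient and Hessian of
# B(eta) = N * exp(eta).
import numpy as np

rng = np.random.default_rng(0)
lam, N, reps = 3.0, 50, 200_000
ybar = rng.poisson(lam, size=(reps, N)).mean(axis=1)   # replicated sample means

print(ybar.mean(), lam)        # close to lambda = 3
print(ybar.var(), lam / N)     # close to lambda / N = 0.06
```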

  • Given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\) from a Bernoulli distribution, let’s show that \(p(\mathbf{y}|\theta)\) is in the exponential family.

\[\begin{align} p(\mathbf{y}|\theta)&=\prod_{i=1}^N \theta^{y_i}(1-\theta)^{1-y_i}\\ &=\theta^{\sum_{i=1}^N y_i}(1-\theta)^{N-\sum_{i=1}^N y_i}\\ &=(1-\theta)^N\exp\left\{\sum_{i=1}^N y_i\log\left(\frac{\theta}{1-\theta}\right)\right\}, \end{align}\]

then \(h(\mathbf{y})=\prod_{i=1}^N\mathbb{I}[y_i\in\left\{0,1\right\}]\), \(\eta(\theta)=\log\left(\frac{\theta}{1-\theta}\right)\), \(T(\mathbf{y})=\sum_{i=1}^N y_i\) and \(C(\theta)=1-\theta\).

Write this distribution in the canonical form, and find the mean and variance of the sufficient statistic (exercise 1).
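The factorization above can also be checked numerically; a small sketch (Python with NumPy; the parameter value and sample size are illustrative) compares the product of Bernoulli pmfs with \(C(\theta)^N\exp\left\{\eta(\theta)T(\mathbf{y})\right\}\):

```python
# Illustrative check of the Bernoulli exponential-family factorization:
# prod_i theta^{y_i} (1-theta)^{1-y_i}  ==  (1-theta)^N * exp{ eta(theta) * T(y) }.
import numpy as np

rng = np.random.default_rng(1)
theta, N = 0.3, 20
y = rng.binomial(1, theta, size=N)

direct = np.prod(theta**y * (1 - theta)**(1 - y))     # product of Bernoulli pmfs
eta, T = np.log(theta / (1 - theta)), y.sum()         # natural parameter, sufficient statistic
factored = (1 - theta)**N * np.exp(eta * T)

assert np.isclose(direct, factored)
```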

  • Given a random sample \(\mathbf{y}=[\mathbf{y}_1,\mathbf{y}_2,\dots,\mathbf{y}_N]\) from an \(m\)-dimensional multinomial distribution, where \(\mathbf{y}_i=\left[y_{i1},\dots,y_{im}\right]\) records \(n\) independent trials, each of which results in a success for exactly one of \(m\) categories with probabilities \(\mathbf{\theta}=[\theta_1,\theta_2,\dots,\theta_m]\), so that \(\sum_{l=1}^m y_{il}=n\) and \(\sum_{l=1}^m \theta_l=1\), let’s show that \(p(\mathbf{y}|\mathbf{\theta})\) is in the exponential family.

\[\begin{align} p(\mathbf{y}|\mathbf{\theta})&=\prod_{i=1}^N \frac{n!}{\prod_{l=1}^m y_{il}!} \prod_{l=1}^m\theta_l^{y_{il}}\\ &=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\exp\left\{\sum_{i=1}^N\sum_{l=1}^m y_{il}\log(\theta_l)\right\}\\ &=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\exp\left\{\left(N\times n-\sum_{i=1}^N\sum_{l=1}^{m-1}y_{il}\right)\log(\theta_m)+\sum_{i=1}^N\sum_{l=1}^{m-1}y_{il}\log(\theta_l)\right\}\\ &=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\theta_m^{N\times n}\exp\left\{\sum_{i=1}^N\sum_{l=1}^{m-1}y_{il}\log(\theta_l/\theta_m)\right\}, \end{align}\]

then \(h(\mathbf{y})=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\), \(\eta(\mathbf{\theta})=\left[\log\left(\frac{\theta_1}{\theta_m}\right)\dots \log\left(\frac{\theta_{m-1}}{\theta_m}\right)\right]\), \(T(\mathbf{y})=\left[\sum_{i=1}^N y_{i1}\dots \sum_{i=1}^N y_{im-1}\right]\) and \(C(\mathbf{\theta})=\theta_m^n\).
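As with the previous examples, this factorization can be verified numerically; a sketch in Python (NumPy assumed; the probabilities, \(n\) and \(N\) are illustrative) compares log-densities to avoid underflow:

```python
# Illustrative check of the multinomial exponential-family factorization (in logs):
# sum of multinomial log-pmfs  ==  log h(y) + N*n*log(theta_m) + eta' T(y).
import numpy as np
from math import lgamma

rng = np.random.default_rng(2)
theta = np.array([0.2, 0.3, 0.5])             # m = 3 category probabilities
n, N = 10, 15                                 # trials per observation, sample size
Y = rng.multinomial(n, theta, size=N)         # N x m matrix of counts

log_h = N * lgamma(n + 1) - sum(lgamma(c + 1) for c in Y.ravel())
eta = np.log(theta[:-1] / theta[-1])          # eta_l = log(theta_l / theta_m)
T = Y[:, :-1].sum(axis=0)                     # sufficient statistics
log_factored = log_h + N * n * np.log(theta[-1]) + eta @ T

log_direct = log_h + (Y * np.log(theta)).sum()   # sum of multinomial log-pmfs
assert np.isclose(log_direct, log_factored)
```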

  2. Continuous distributions
  • Given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\) from a normal distribution, let’s show that \(p(\mathbf{y}|\mu,\sigma^2)\) is in the exponential family.

\[\begin{align} p(\mathbf{y}|\mu,\sigma^2)&=\prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}\left(y_i-\mu\right)^2\right\}\\ &= (2\pi)^{-N/2}(\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N\left(y_i-\mu\right)^2\right\}\\ &= (2\pi)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^Ny_i^2+\frac{\mu}{\sigma^2}\sum_{i=1}^N y_i-N\frac{\mu^2}{2\sigma^2}-\frac{N}{2}\log(\sigma^2)\right\}, \end{align}\]

then \(h(\mathbf{y})=(2\pi)^{-N/2}\), \(\eta(\mu,\sigma^2)=\left[\frac{\mu}{\sigma^2} \ \frac{-1}{2\sigma^2}\right]\), \(T(\mathbf{y})=\left[\sum_{i=1}^N y_i \ \sum_{i=1}^N y_i^2\right]\) and \(C(\mu,\sigma^2)=\exp\left\{-\frac{\mu^2}{2\sigma^2}-\frac{\log(\sigma^2)}{2}\right\}\).

Observe that

\[\begin{align} p(\mathbf{y}|\mu,\sigma^2)&= (2\pi)^{-N/2}\exp\left\{\eta_1\sum_{i=1}^N y_i+\eta_2\sum_{i=1}^Ny_i^2+\frac{N}{2}\log(-2\eta_2)+\frac{N}{4}\frac{\eta_1^2}{\eta_2}\right\}, \end{align}\]

where \(B(\mathbf{\eta})=-\frac{N}{2}\log(-2\eta_2)-\frac{N}{4}\frac{\eta_1^2}{\eta_2}\). Then,

\[\begin{align} \nabla B(\mathbf{\eta}) & = \begin{bmatrix} -\frac{N}{2}\frac{\eta_1}{\eta_2}\\ -\frac{N}{2}\frac{1}{\eta_2}+\frac{N}{4}\frac{\eta_1^2}{\eta_2^2} \end{bmatrix} = \begin{bmatrix} N\times\mu\\ N\times(\mu^2+\sigma^2) \end{bmatrix} = \begin{bmatrix} \mathbb{E}\left[\sum_{i=1}^N y_i\bigr\rvert \mu,\sigma^2\right]\\ \mathbb{E}\left[\sum_{i=1}^N y_i^2\bigr\rvert \mu,\sigma^2\right] \end{bmatrix}. \end{align}\]
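A Monte Carlo sketch of this result (Python with NumPy; the values of \(\mu\), \(\sigma^2\), \(N\) and the number of replications are illustrative):

```python
# Illustrative check: the gradient of B(eta) for the normal model matches
# E[sum(y_i)] = N*mu and E[sum(y_i^2)] = N*(mu^2 + sigma^2).
import numpy as np

rng = np.random.default_rng(3)
mu, sig2, N, reps = 1.5, 2.0, 30, 200_000
y = rng.normal(mu, np.sqrt(sig2), size=(reps, N))

eta1, eta2 = mu / sig2, -1.0 / (2.0 * sig2)
grad_B = np.array([-N / 2 * eta1 / eta2,
                   -N / (2 * eta2) + N / 4 * eta1**2 / eta2**2])

print(grad_B)                                              # [N*mu, N*(mu^2 + sigma^2)]
print(y.sum(axis=1).mean(), (y**2).sum(axis=1).mean())     # Monte Carlo estimates
```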

  • Given \(\mathbf{Y}\), an \(N\times p\) matrix whose \(i\)-th row is \(\mathbf{y}_i^{\top}\), where \(\mathbf{y}_i\sim N_p(\mathbf{\mu},\mathbf{\Sigma})\), \(i=1,2,\dots,N\), that is, each row of \(\mathbf{Y}\) follows a \(p\)-dimensional multivariate normal distribution. Then, assuming independence between rows, let’s show that \(p(\mathbf{y}_1,\mathbf{y}_2,\dots,\mathbf{y}_N|\mathbf{\mu},\mathbf{\Sigma})\) is in the exponential family.

\[\begin{align} p(\mathbf{y}_1,\mathbf{y}_2,\dots,\mathbf{y}_N|\mathbf{\mu},\mathbf{\Sigma})&=\prod_{i=1}^N (2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}\left(\mathbf{y}_i-\mathbf{\mu}\right)^{\top}\mathbf{\Sigma}^{-1}\left(\mathbf{y}_i-\mathbf{\mu}\right)\right\}\\ &= (2\pi)^{-pN/2}|\Sigma|^{-N/2}\exp\left\{-\frac{1}{2}tr\left[\sum_{i=1}^N\left(\mathbf{y}_i-\mathbf{\mu}\right)^{\top}\mathbf{\Sigma}^{-1}\left(\mathbf{y}_i-\mathbf{\mu}\right)\right]\right\}\\ &= (2\pi)^{-p N/2}|\Sigma|^{-N/2}\exp\left\{-\frac{1}{2}tr\left[\left(\mathbf{S}+N\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)^{\top}\right)\mathbf{\Sigma}^{-1}\right]\right\}\\ &= (2\pi)^{-p N/2}\exp\left\{-\frac{1}{2}\left[\left(vec\left(\mathbf{S}\right)^{\top}+N vec\left(\hat{\mathbf{\mu}}\hat{\mathbf{\mu}}^{\top}\right)^{\top}\right)vec \left(\mathbf{\Sigma}^{-1}\right)\right.\right.\\ &\left.\left.-2N\hat{\mathbf{\mu}}^{\top}vec\left(\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)+N tr\left(\mathbf{\mu}\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)+N\log (|\mathbf{\Sigma}|)\right]\right\}\\ \end{align}\]

where the second line uses the trace operator (\(tr\)), and the third line uses its invariance under cyclic permutations. In addition, we add and subtract \(\hat{\mathbf{\mu}}=\frac{1}{N}\sum_{i=1}^N\mathbf{y}_i\) inside each factor so that \(\mathbf{S}=\sum_{i=1}^N\left(\mathbf{y}_i-\hat{\mathbf{\mu}}\right)\left(\mathbf{y}_i-\hat{\mathbf{\mu}}\right)^{\top}\) appears. We obtain the fourth line after using some properties of the trace operator to introduce the vectorization operator (\(vec\)), and collecting terms.

Then \(h(\mathbf{y})=(2\pi)^{-pN/2}\), \(\eta(\mathbf{\mu},\mathbf{\Sigma})^{\top}=\left[\left(vec\left(\mathbf{\Sigma}^{-1}\right)\right)^{\top} \ \ \left(vec\left(\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)\right)^{\top}\right]\), \(T(\mathbf{y})=\left[-\frac{1}{2}\left(vec\left(\mathbf{S}\right)^{\top}+N vec\left(\hat{\mathbf{\mu}}\hat{\mathbf{\mu}}^{\top}\right)^{\top}\right) \ \ N\hat{\mathbf{\mu}}^{\top}\right]^{\top}\) and \(C(\mathbf{\mu},\mathbf{\Sigma})=\exp\left\{-\frac{1}{2}\left(tr\left(\mathbf{\mu}\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)+\log(|\mathbf{\Sigma}|)\right)\right\}\).
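The trace and vectorization algebra above is easy to get wrong, so a numerical sketch may help (Python with NumPy/SciPy; the dimensions and parameter values are illustrative): it compares the multivariate normal log-density with \(-\frac{pN}{2}\log(2\pi)+\eta(\mathbf{\mu},\mathbf{\Sigma})^{\top}T(\mathbf{y})+N\log C(\mathbf{\mu},\mathbf{\Sigma})\).

```python
# Illustrative check of the multivariate normal exponential-family decomposition.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
p, N = 3, 20
mu = np.array([1.0, -0.5, 2.0])
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)                     # a valid (positive definite) covariance
Y = rng.multivariate_normal(mu, Sigma, size=N)      # N draws from N_p(mu, Sigma)

mu_hat = Y.mean(axis=0)
S = (Y - mu_hat).T @ (Y - mu_hat)
Sig_inv = np.linalg.inv(Sigma)

# Row-major flattening is used for both eta and T; a consistent ordering keeps
# the inner product equal to the vec-based expression in the text.
eta = np.concatenate([Sig_inv.ravel(), Sig_inv @ mu])   # [vec(Sigma^-1), vec(mu' Sigma^-1)]
T = np.concatenate([-0.5 * (S + N * np.outer(mu_hat, mu_hat)).ravel(), N * mu_hat])
log_C = -0.5 * (mu @ Sig_inv @ mu + np.log(np.linalg.det(Sigma)))
log_factored = -0.5 * p * N * np.log(2 * np.pi) + eta @ T + N * log_C

log_direct = multivariate_normal(mu, Sigma).logpdf(Y).sum()
assert np.isclose(log_direct, log_factored)
```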

References

Chib, Siddhartha. 1995. “Marginal Likelihood from the Gibbs Output.” Journal of the American Statistical Association 90 (432): 1313–21.
Chib, Siddhartha, and Ivan Jeliazkov. 2001. “Marginal Likelihood from the Metropolis–Hastings Output.” Journal of the American Statistical Association 96 (453): 270–81.
Gelfand, A. E., and A. F. M. Smith. 1990. “Sampling-Based Approaches to Calculating Marginal Densities.” Journal of the American Statistical Association 85: 398–409.
Gelfand, Alan E, and Dipak K Dey. 1994. “Bayesian Model Choice: Asymptotics and Exact Calculations.” Journal of the Royal Statistical Society: Series B (Methodological) 56 (3): 501–14.
Geman, S., and D. Geman. 1984. “Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721–41.
Hastings, W. 1970. “Monte Carlo Sampling Methods Using Markov Chains and Their Applications.” Biometrika 57: 97–109.
Jordan, M. I., Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. “An Introduction to Variational Methods for Graphical Models.” Machine Learning 37: 183–233.
Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. 1953. “Equation of State Calculations by Fast Computing Machines.” The Journal of Chemical Physics 21: 1087–92.
Schlaifer, Robert, and Howard Raiffa. 1961. Applied Statistical Decision Theory.
Tierney, Luke. 1994. “Markov Chains for Exploring Posterior Distributions.” The Annals of Statistics 22 (4): 1701–28.
Tierney, Luke, and Joseph B Kadane. 1986. “Accurate Approximations for Posterior Moments and Marginal Densities.” Journal of the American Statistical Association 81 (393): 82–86.