4.1 Motivation of conjugate families

Consider three fundamental objects of Bayesian analysis: the posterior distribution (parameter inference), the marginal likelihood (hypothesis testing), and the predictive distribution (prediction), given by equations (4.1), (4.2) and (4.3), respectively,

\[\begin{align} \pi(\mathbf{\theta}|\mathbf{y})&=\frac{p(\mathbf{y}|\mathbf{\theta}) \times \pi(\mathbf{\theta})}{p(\mathbf{y})}, \tag{4.1} \end{align}\]

\[\begin{equation} p(\mathbf{y})=\int_{\mathbf{\Theta}}p(\mathbf{y}|\mathbf{\theta})\pi(\mathbf{\theta})d\mathbf{\theta}, \tag{4.2} \end{equation}\]

and

\[\begin{equation} p(\mathbf{Y}_0|\mathbf{y})=\int_{\mathbf{\Theta}}p(\mathbf{Y}_0|\mathbf{\theta})\pi(\mathbf{\theta}|\mathbf{y})d\mathbf{\theta}, \tag{4.3} \end{equation}\]

From these equations we can see that some of the initial limitations on the application of Bayesian analysis were associated with the absence of algorithms to draw from non-standard posterior distributions (equation (4.1)), and with the lack of analytical solutions for the marginal likelihood (equation (4.2)) and the predictive distribution (equation (4.3)). Both issues require computational power.
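To make the computational issue concrete, the following minimal sketch (Python with NumPy/SciPy; the Bernoulli data and the uniform Beta(1,1) prior are illustrative choices, not taken from the text) evaluates equations (4.2) and (4.3) by one-dimensional numerical integration; models with more parameters require multidimensional integrals that rarely have closed forms.

```python
# Illustrative sketch: marginal likelihood (4.2) and predictive distribution (4.3)
# by numerical integration for a Bernoulli likelihood with a uniform Beta(1,1) prior.
import numpy as np
from scipy import integrate

y = np.array([1, 0, 1, 1, 0, 1])                                  # observed sample
lik = lambda th: th**y.sum() * (1 - th)**(len(y) - y.sum())       # p(y | theta)
prior = lambda th: 1.0                                            # pi(theta) on (0, 1)

# Equation (4.2): p(y) = integral of p(y|theta) * pi(theta) over the parameter space
p_y, _ = integrate.quad(lambda th: lik(th) * prior(th), 0, 1)

# Equation (4.1): posterior density; equation (4.3): predictive probability of Y0 = 1
posterior = lambda th: lik(th) * prior(th) / p_y
p_y0, _ = integrate.quad(lambda th: th * posterior(th), 0, 1)

print(p_y, p_y0)   # marginal likelihood and P(Y0 = 1 | y)
```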

Although there have been algorithms to sample from non-standard posterior distributions since the second half of the last century (Metropolis et al. 1953), (Hastings 1970), (Geman and Geman 1984), their application in the Bayesian framework emerged later (A. E. Gelfand and Smith 1990), (Tierney 1994), perhaps awaiting the increasing computational power of desktop computers. However, it is also common practice nowadays to use models that have standard conditional posterior distributions to mitigate computational requirements. In addition, clever mathematical tricks combined with computational algorithms (Alan E. Gelfand and Dey 1994), (Chib 1995), (Chib and Jeliazkov 2001) and approximations (Tierney and Kadane 1986), (Jordan et al. 1999) are used to obtain the marginal likelihood (prior predictive).

Despite these advances, equations (4.1), (4.2) and (4.3) point to two potentially conflicting desirable features of a model specification: analytical solutions, and a posterior distribution in the same family as the prior distribution for a given likelihood. The latter defines conjugate priors, a family of priors that is closed under sampling (Schlaifer and Raiffa 1961, pp. 43–57).

These features are desirable because the former makes hypothesis testing and predictive analysis easy to perform, and the latter implies invariance of the prior-to-posterior updating. Both features imply less computational burden.

We can easily achieve each of these features separately, for instance, using improper priors for analytical tractability, or defining the family of prior distributions in a broad sense to obtain prior conjugacy. However, these approaches are in conflict.

Fortunately, we can achieve these two nice features if we assume that the data generating process is given by a distribution function in the exponential family. That is, given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\), a probability density function \(p(\mathbf{y}|\mathbf{\theta})\) belongs to the exponential family if it has the form

\[\begin{align} p(\mathbf{y}|\mathbf{\theta})&=\prod_{i=1}^N h(y_i) C(\mathbf{\theta}) \exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{T}(y_i)\right\}\\ &=h(\mathbf{y}) C(\mathbf{\theta})^N\exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{T}(\mathbf{y})\right\} \tag{4.4}\\ &=h(\mathbf{y})\exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{T}(\mathbf{y})-A(\mathbf{\theta})\right\}, \end{align}\]

where \(h(\mathbf{y})=\prod_{i=1}^N h(y_i)\) is a non-negative function, \(\eta(\mathbf{\theta})\) is a known function of the parameters, \(A(\mathbf{\theta})=\log\left\{\int_{\mathbf{Y}}h(\mathbf{y})\exp\left\{\eta(\mathbf{\theta})^{\top}\mathbf{T}(\mathbf{y})\right\}d\mathbf{y}\right\}=-N\log(C(\mathbf{\theta}))\) is a normalization factor, and \(\mathbf{T}(\mathbf{y})=\sum_{i=1}^N\mathbf{T}(y_i)\) is the vector of sufficient statistics of the distribution (by the factorization theorem).

If the support of \(\mathbf{y}\) is independent of \(\mathbf{\theta}\), then the family is said to be regular, otherwise it is irregular. In addition, if we set \(\eta=\eta(\mathbf{\theta})\), then the exponential family is said to be in the canonical form

\[\begin{align} p(\mathbf{y}|\mathbf{\theta})&=h(\mathbf{y})D(\mathbf{\eta})^N\exp\left\{\eta^{\top}\mathbf{T}(\mathbf{y})\right\}\\ &=h(\mathbf{y})\exp\left\{\eta^{\top}\mathbf{T}(\mathbf{y})-B(\mathbf{\eta})\right\}. \end{align}\]

A nice feature of this representation is that \(\mathbb{E}[\mathbf{T}(\mathbf{y})|\mathbf{\eta}]=\nabla B(\mathbf{\eta})\) and \(Var[\mathbf{T}(\mathbf{y})|\mathbf{\eta}]=\nabla^2 B(\mathbf{\eta})\).
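A short sketch of why the mean identity holds (assuming differentiation under the integral sign, which is valid for regular exponential families): differentiating the normalization condition with respect to \(\mathbf{\eta}\),

\[\begin{align} \mathbf{0}&=\nabla_{\mathbf{\eta}}\int_{\mathbf{Y}}h(\mathbf{y})\exp\left\{\mathbf{\eta}^{\top}\mathbf{T}(\mathbf{y})-B(\mathbf{\eta})\right\}d\mathbf{y}\\ &=\int_{\mathbf{Y}}\left(\mathbf{T}(\mathbf{y})-\nabla B(\mathbf{\eta})\right)p(\mathbf{y}|\mathbf{\eta})d\mathbf{y}=\mathbb{E}[\mathbf{T}(\mathbf{y})|\mathbf{\eta}]-\nabla B(\mathbf{\eta}), \end{align}\]

and differentiating once more yields the variance identity.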

Examples of exponential family distributions

  1. Discrete distributions
  • Given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\) from a Poisson distribution, let’s show that \(p(\mathbf{y}|\lambda)\) is in the exponential family.

\[\begin{align} p(\mathbf{y}|\lambda)&=\prod_{i=1}^N \frac{\lambda^{y_i} \exp(-\lambda)}{y_i!}\\ &=\frac{\lambda^{\sum_{i=1}^N y_i}\exp(-N\lambda)}{\prod_{i=1}^N y_i!}\\ &=\frac{\exp(-N\lambda)\exp(\sum_{i=1}^Ny_i\log(\lambda))}{\prod_{i=1}^N y_i!}, \end{align}\]

then \(h(\mathbf{y})=\left[\prod_{i=1}^N y_i!\right]^{-1}\), \(\eta(\lambda)=\log(\lambda)\), \(T(\mathbf{y})=\sum_{i=1}^N y_i\) (sufficient statistic) and \(C(\lambda)=\exp(-\lambda)\).

If we set \(\eta=\log(\lambda)\), then

\[\begin{align} p(\mathbf{y}|\eta)&=\frac{\exp(\eta\sum_{i=1}^Ny_i-N\exp(\eta))}{\prod_{i=1}^N y_i!}, \end{align}\]

such that \(B(\eta)=N\exp(\eta)\), then \(\nabla(B(\eta))=N\exp(\eta)=N\lambda=\mathbb{E}\left[\sum_{i=1}^N y_i\biggr\rvert\lambda\right]\), that is, \(\mathbb{E}\left[\frac{\sum_{i=1}^N y_i}{N}\biggr\rvert\lambda\right]=\mathbb{E}[\bar{y}|\lambda]=\lambda\), and \(\nabla^2(B(\eta))=N\exp(\eta)=N\lambda=Var\left[\sum_{i=1}^N y_i\biggr\rvert\lambda\right]=N^2 \times Var\left[\bar{y}\rvert\lambda\right]\), then \(Var\left[\bar{y}\rvert\lambda\right]=\frac{\lambda}{N}\).
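A quick Monte Carlo sketch of these two results (Python with NumPy; the values of \(\lambda\), \(N\) and the number of replications are illustrative):

```python
# Illustrative check: for the Poisson model, E[y_bar | lambda] = lambda and
# Var[y_bar | lambda] = lambda / N, as implied by the gradient and Hessian of
# B(eta) = N * exp(eta).
import numpy as np

rng = np.random.default_rng(0)
lam, N, reps = 3.0, 50, 200_000
ybar = rng.poisson(lam, size=(reps, N)).mean(axis=1)   # replicated sample means

print(ybar.mean(), lam)        # close to lambda = 3
print(ybar.var(), lam / N)     # close to lambda / N = 0.06
```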

  • Given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\) from a Bernoulli distribution, let’s show that \(p(\mathbf{y}|\theta)\) is in the exponential family.

\[\begin{align} p(\mathbf{y}|\theta)&=\prod_{i=1}^N \theta^{y_i}(1-\theta)^{1-y_i}\\ &=\theta^{\sum_{i=1}^N y_i}(1-\theta)^{N-\sum_{i=1}^N y_i}\\ &=(1-\theta)^N\exp\left\{\sum_{i=1}^N y_i\log\left(\frac{\theta}{1-\theta}\right)\right\}, \end{align}\]

then \(h(\mathbf{y})=\prod_{i=1}^N\mathbb{I}[y_i\in\left\{0,1\right\}]\), \(\eta(\theta)=\log\left(\frac{\theta}{1-\theta}\right)\), \(T(\mathbf{y})=\sum_{i=1}^N y_i\) and \(C(\theta)=1-\theta\).

Write this distribution in the canonical form, and find the mean and variance of the sufficient statistic (exercise 1).
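The factorization above can also be checked numerically; a small sketch (Python with NumPy; the parameter value and sample size are illustrative) compares the product of Bernoulli pmfs with \(C(\theta)^N\exp\left\{\eta(\theta)T(\mathbf{y})\right\}\):

```python
# Illustrative check of the Bernoulli exponential-family factorization:
# prod_i theta^{y_i} (1-theta)^{1-y_i}  ==  (1-theta)^N * exp{ eta(theta) * T(y) }.
import numpy as np

rng = np.random.default_rng(1)
theta, N = 0.3, 20
y = rng.binomial(1, theta, size=N)

direct = np.prod(theta**y * (1 - theta)**(1 - y))     # product of Bernoulli pmfs
eta, T = np.log(theta / (1 - theta)), y.sum()         # natural parameter, sufficient statistic
factored = (1 - theta)**N * np.exp(eta * T)

assert np.isclose(direct, factored)
```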

  • Given a random sample \(\mathbf{y}=[\mathbf{y}_1,\mathbf{y}_2,\dots,\mathbf{y}_N]\) from an \(m\)-dimensional multinomial distribution, where \(\mathbf{y}_i=\left[y_{i1},\dots,y_{im}\right]\) records \(n\) independent trials, each of which results in a success for exactly one of \(m\) categories with probabilities \(\mathbf{\theta}=[\theta_1,\theta_2,\dots,\theta_m]\), so that \(\sum_{l=1}^m y_{il}=n\) and \(\sum_{l=1}^m \theta_l=1\), let’s show that \(p(\mathbf{y}|\mathbf{\theta})\) is in the exponential family.

\[\begin{align} p(\mathbf{y}|\mathbf{\theta})&=\prod_{i=1}^N \frac{n!}{\prod_{l=1}^m y_{il}!} \prod_{l=1}^m\theta_l^{y_{il}}\\ &=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\exp\left\{\sum_{i=1}^N\sum_{l=1}^m y_{il}\log(\theta_l)\right\}\\ &=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\exp\left\{\left(N\times n-\sum_{i=1}^N\sum_{l=1}^{m-1}y_{il}\right)\log(\theta_m)+\sum_{i=1}^N\sum_{l=1}^{m-1}y_{il}\log(\theta_l)\right\}\\ &=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\theta_m^{N\times n}\exp\left\{\sum_{i=1}^N\sum_{l=1}^{m-1}y_{il}\log(\theta_l/\theta_m)\right\}, \end{align}\]

then \(h(\mathbf{y})=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\), \(\eta(\mathbf{\theta})=\left[\log\left(\frac{\theta_1}{\theta_m}\right)\dots \log\left(\frac{\theta_{m-1}}{\theta_m}\right)\right]\), \(T(\mathbf{y})=\left[\sum_{i=1}^N y_{i1}\dots \sum_{i=1}^N y_{im-1}\right]\) and \(C(\mathbf{\theta})=\theta_m^n\).
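As with the previous examples, this factorization can be verified numerically; a sketch in Python (NumPy assumed; the probabilities, \(n\) and \(N\) are illustrative) compares log-densities to avoid underflow:

```python
# Illustrative check of the multinomial exponential-family factorization (in logs):
# sum of multinomial log-pmfs  ==  log h(y) + N*n*log(theta_m) + eta' T(y).
import numpy as np
from math import lgamma

rng = np.random.default_rng(2)
theta = np.array([0.2, 0.3, 0.5])             # m = 3 category probabilities
n, N = 10, 15                                 # trials per observation, sample size
Y = rng.multinomial(n, theta, size=N)         # N x m matrix of counts

log_h = N * lgamma(n + 1) - sum(lgamma(c + 1) for c in Y.ravel())
eta = np.log(theta[:-1] / theta[-1])          # eta_l = log(theta_l / theta_m)
T = Y[:, :-1].sum(axis=0)                     # sufficient statistics
log_factored = log_h + N * n * np.log(theta[-1]) + eta @ T

log_direct = log_h + (Y * np.log(theta)).sum()   # sum of multinomial log-pmfs
assert np.isclose(log_direct, log_factored)
```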

  2. Continuous distributions
  • Given a random sample \(\mathbf{y}=[y_1,y_2,\dots,y_N]^{\top}\) from a normal distribution, let’s show that \(p(\mathbf{y}|\mu,\sigma^2)\) is in the exponential family.

\[\begin{align} p(\mathbf{y}|\mu,\sigma^2)&=\prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}\left(y_i-\mu\right)^2\right\}\\ &= (2\pi)^{-N/2}(\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N\left(y_i-\mu\right)^2\right\}\\ &= (2\pi)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^Ny_i^2+\frac{\mu}{\sigma^2}\sum_{i=1}^N y_i-N\frac{\mu^2}{2\sigma^2}-\frac{N}{2}\log(\sigma^2)\right\}, \end{align}\]

then \(h(\mathbf{y})=(2\pi)^{-N/2}\), \(\eta(\mu,\sigma^2)=\left[\frac{\mu}{\sigma^2} \ \frac{-1}{2\sigma^2}\right]\), \(T(\mathbf{y})=\left[\sum_{i=1}^N y_i \ \sum_{i=1}^N y_i^2\right]\) and \(C(\mu,\sigma^2)=\exp\left\{-\frac{\mu^2}{2\sigma^2}-\frac{\log(\sigma^2)}{2}\right\}\).

Observe that

\[\begin{align} p(\mathbf{y}|\mu,\sigma^2)&= (2\pi)^{-N/2}\exp\left\{\eta_1\sum_{i=1}^N y_i+\eta_2\sum_{i=1}^Ny_i^2+\frac{N}{2}\log(-2\eta_2)+\frac{N}{4}\frac{\eta_1^2}{\eta_2}\right\}, \end{align}\]

where \(B(\mathbf{\eta})=-\frac{N}{2}\log(-2\eta_2)-\frac{N}{4}\frac{\eta_1^2}{\eta_2}\). Then,

\[\begin{align} \nabla B(\mathbf{\eta}) & = \begin{bmatrix} -\frac{N}{2}\frac{\eta_1}{\eta_2}\\ -\frac{N}{2}\frac{1}{\eta_2}+\frac{N}{4}\frac{\eta_1^2}{\eta_2^2} \end{bmatrix} = \begin{bmatrix} N\times\mu\\ N\times(\mu^2+\sigma^2) \end{bmatrix} = \begin{bmatrix} \mathbb{E}\left[\sum_{i=1}^N y_i\bigr\rvert \mu,\sigma^2\right]\\ \mathbb{E}\left[\sum_{i=1}^N y_i^2\bigr\rvert \mu,\sigma^2\right] \end{bmatrix}. \end{align}\]
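A Monte Carlo sketch of this result (Python with NumPy; the values of \(\mu\), \(\sigma^2\), \(N\) and the number of replications are illustrative):

```python
# Illustrative check: the gradient of B(eta) for the normal model matches
# E[sum(y_i)] = N*mu and E[sum(y_i^2)] = N*(mu^2 + sigma^2).
import numpy as np

rng = np.random.default_rng(3)
mu, sig2, N, reps = 1.5, 2.0, 30, 200_000
y = rng.normal(mu, np.sqrt(sig2), size=(reps, N))

eta1, eta2 = mu / sig2, -1.0 / (2.0 * sig2)
grad_B = np.array([-N / 2 * eta1 / eta2,
                   -N / (2 * eta2) + N / 4 * eta1**2 / eta2**2])

print(grad_B)                                              # [N*mu, N*(mu^2 + sigma^2)]
print(y.sum(axis=1).mean(), (y**2).sum(axis=1).mean())     # Monte Carlo estimates
```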

  • Given \(\mathbf{Y}\), an \(N\times p\) matrix whose \(i\)-th row is \(\mathbf{y}_i^{\top}\), where \(\mathbf{y}_i\sim N_p(\mathbf{\mu},\mathbf{\Sigma})\), \(i=1,2,\dots,N\), that is, each row of \(\mathbf{Y}\) follows a \(p\)-dimensional multivariate normal distribution. Then, assuming independence between rows, let’s show that \(p(\mathbf{y}_1,\mathbf{y}_2,\dots,\mathbf{y}_N|\mathbf{\mu},\mathbf{\Sigma})\) is in the exponential family.

\[\begin{align} p(\mathbf{y}_1,\mathbf{y}_2,\dots,\mathbf{y}_N|\mathbf{\mu},\mathbf{\Sigma})&=\prod_{i=1}^N (2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}\left(\mathbf{y}_i-\mathbf{\mu}\right)^{\top}\mathbf{\Sigma}^{-1}\left(\mathbf{y}_i-\mathbf{\mu}\right)\right\}\\ &= (2\pi)^{-pN/2}|\Sigma|^{-N/2}\exp\left\{-\frac{1}{2}tr\left[\sum_{i=1}^N\left(\mathbf{y}_i-\mathbf{\mu}\right)^{\top}\mathbf{\Sigma}^{-1}\left(\mathbf{y}_i-\mathbf{\mu}\right)\right]\right\}\\ &= (2\pi)^{-p N/2}|\Sigma|^{-N/2}\exp\left\{-\frac{1}{2}tr\left[\left(\mathbf{S}+N\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)\left(\mathbf{\mu}-\hat{\mathbf{\mu}}\right)^{\top}\right)\mathbf{\Sigma}^{-1}\right]\right\}\\ &= (2\pi)^{-p N/2}\exp\left\{-\frac{1}{2}\left[\left(vec\left(\mathbf{S}\right)^{\top}+N vec\left(\hat{\mathbf{\mu}}\hat{\mathbf{\mu}}^{\top}\right)^{\top}\right)vec \left(\mathbf{\Sigma}^{-1}\right)\right.\right.\\ &\left.\left.-2N\hat{\mathbf{\mu}}^{\top}vec\left(\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)+N tr\left(\mathbf{\mu}\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)+N\log (|\mathbf{\Sigma}|)\right]\right\}\\ \end{align}\]

where the second line uses the trace operator (\(tr\)), and the third line uses its invariance under cyclic permutations. In addition, we add and subtract \(\hat{\mathbf{\mu}}=\frac{1}{N}\sum_{i=1}^N\mathbf{y}_i\) inside each factor so that \(\mathbf{S}=\sum_{i=1}^N\left(\mathbf{y}_i-\hat{\mathbf{\mu}}\right)\left(\mathbf{y}_i-\hat{\mathbf{\mu}}\right)^{\top}\) appears. We obtain the fourth line after using some properties of the trace operator to introduce the vectorization operator (\(vec\)), and collecting terms.

Then \(h(\mathbf{y})=(2\pi)^{-pN/2}\), \(\eta(\mathbf{\mu},\mathbf{\Sigma})^{\top}=\left[\left(vec\left(\mathbf{\Sigma}^{-1}\right)\right)^{\top} \ \ \left(vec\left(\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)\right)^{\top}\right]\), \(T(\mathbf{y})=\left[-\frac{1}{2}\left(vec\left(\mathbf{S}\right)^{\top}+N vec\left(\hat{\mathbf{\mu}}\hat{\mathbf{\mu}}^{\top}\right)^{\top}\right) \ \ N\hat{\mathbf{\mu}}^{\top}\right]^{\top}\) and \(C(\mathbf{\mu},\mathbf{\Sigma})=\exp\left\{-\frac{1}{2}\left(tr\left(\mathbf{\mu}\mathbf{\mu}^{\top}\mathbf{\Sigma}^{-1}\right)+\log(|\mathbf{\Sigma}|)\right)\right\}\).
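The trace and vectorization algebra above is easy to get wrong, so a numerical sketch may help (Python with NumPy/SciPy; the dimensions and parameter values are illustrative): it compares the multivariate normal log-density with \(-\frac{pN}{2}\log(2\pi)+\eta(\mathbf{\mu},\mathbf{\Sigma})^{\top}T(\mathbf{y})+N\log C(\mathbf{\mu},\mathbf{\Sigma})\).

```python
# Illustrative check of the multivariate normal exponential-family decomposition.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
p, N = 3, 20
mu = np.array([1.0, -0.5, 2.0])
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)                     # a valid (positive definite) covariance
Y = rng.multivariate_normal(mu, Sigma, size=N)      # N draws from N_p(mu, Sigma)

mu_hat = Y.mean(axis=0)
S = (Y - mu_hat).T @ (Y - mu_hat)
Sig_inv = np.linalg.inv(Sigma)

# Row-major flattening is used for both eta and T; a consistent ordering keeps
# the inner product equal to the vec-based expression in the text.
eta = np.concatenate([Sig_inv.ravel(), Sig_inv @ mu])   # [vec(Sigma^-1), vec(mu' Sigma^-1)]
T = np.concatenate([-0.5 * (S + N * np.outer(mu_hat, mu_hat)).ravel(), N * mu_hat])
log_C = -0.5 * (mu @ Sig_inv @ mu + np.log(np.linalg.det(Sigma)))
log_factored = -0.5 * p * N * np.log(2 * np.pi) + eta @ T + N * log_C

log_direct = multivariate_normal(mu, Sigma).logpdf(Y).sum()
assert np.isclose(log_direct, log_factored)
```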

References

Chib, Siddhartha. 1995. “Marginal Likelihood from the Gibbs Output.” Journal of the American Statistical Association 90 (432): 1313–21.
Chib, Siddhartha, and Ivan Jeliazkov. 2001. “Marginal Likelihood from the Metropolis–Hastings Output.” Journal of the American Statistical Association 96 (453): 270–81.
Gelfand, A. E., and A. F. M. Smith. 1990. “Sampling-Based Approaches to Calculating Marginal Densities.” Journal of the American Statistical Association 85: 398–409.
Gelfand, Alan E, and Dipak K Dey. 1994. “Bayesian Model Choice: Asymptotics and Exact Calculations.” Journal of the Royal Statistical Society: Series B (Methodological) 56 (3): 501–14.
Geman, S., and D. Geman. 1984. “Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721–41.
Hastings, W. 1970. “Monte Carlo Sampling Methods Using Markov Chains and Their Applications.” Biometrika 57: 97–109.
Jordan, M. I., Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. “An Introduction to Variational Methods for Graphical Models.” Machine Learning 37: 183–233.
Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. 1953. “Equation of State Calculations by Fast Computing Machines.” The Journal of Chemical Physics 21: 1087–92.
Schlaifer, Robert, and Howard Raiffa. 1961. Applied Statistical Decision Theory.
Tierney, Luke. 1994. “Markov Chains for Exploring Posterior Distributions.” The Annals of Statistics 22 (4): 1701–28.
Tierney, Luke, and Joseph B Kadane. 1986. “Accurate Approximations for Posterior Moments and Marginal Densities.” Journal of the American Statistical Association 81 (393): 82–86.