3.1 Motivation of conjugate families
By examining the three fundamental pieces of Bayesian analysis –the posterior distribution (parameter inference), the marginal likelihood (hypothesis testing), and the predictive distribution (prediction)– given in equations (3.1), (3.2), and (3.3), respectively, we can see that some of the initial limitations of Bayesian analysis were due to the absence of algorithms for sampling from non-standard posterior distributions (equation (3.1)), and to the lack of analytical solutions for the marginal likelihood (equation (3.2)) and the predictive distribution (equation (3.3)), both of which are computationally demanding to obtain.
\begin{align} \pi(\boldsymbol{\theta}\mid \boldsymbol{y})&=\frac{p(\boldsymbol{y}\mid \boldsymbol{\theta}) \times \pi(\boldsymbol{\theta})}{p(\boldsymbol{y})}, \tag{3.1} \end{align}
\begin{equation} p(\boldsymbol{y})=\int_{\boldsymbol{\Theta}}p(\boldsymbol{y}\mid \boldsymbol{\theta})\pi(\boldsymbol{\theta})d\boldsymbol{\theta}, \tag{3.2} \end{equation}
and
\begin{equation} p(\boldsymbol{y}_0\mid \boldsymbol{y})=\int_{\boldsymbol{\Theta}}p(\boldsymbol{y}_0\mid \boldsymbol{\theta})\pi(\boldsymbol{\theta}\mid \boldsymbol{y})d\boldsymbol{\theta}. \tag{3.3} \end{equation}
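To see these three quantities in action before turning to analytical solutions, the following minimal numerical sketch evaluates equations (3.1), (3.2), and (3.3) on a grid; the Bernoulli likelihood, the uniform Beta(1,1) prior, and the data below are illustrative assumptions rather than part of the development.

```python
# Minimal numerical sketch of equations (3.1)-(3.3) for an assumed toy model:
# Bernoulli likelihood with a uniform Beta(1,1) prior on theta.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])      # observed sample (illustrative)
theta = np.linspace(0.0005, 0.9995, 2000)   # grid over the parameter space
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                 # Beta(1,1) prior density

s, n = y.sum(), y.size
likelihood = theta**s * (1 - theta)**(n - s)            # p(y | theta)

marginal = np.sum(likelihood * prior) * dtheta          # equation (3.2)
posterior = likelihood * prior / marginal               # equation (3.1)
pred_new_success = np.sum(theta * posterior) * dtheta   # equation (3.3): P(y_0 = 1 | y)

print(marginal)            # close to the exact value 6! * 2! / 9!
print(pred_new_success)    # close to the exact value (s + 1) / (n + 2) = 0.7
```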
Although algorithms for sampling from non-standard posterior distributions have existed since the second half of the last century (Metropolis et al. 1953; Hastings 1970; Geman and Geman 1984), their application within the Bayesian framework emerged later (Gelfand and Smith 1990; Tierney 1994), likely coinciding with the rise in computational power of desktop computers. However, it is still common practice today to use models with standard conditional posterior distributions in order to reduce computational requirements. In addition, mathematical techniques coupled with computational algorithms (Gelfand and Dey 1994; Chib 1995; Chib and Jeliazkov 2001) and approximations (Tierney and Kadane 1986; Jordan et al. 1999) are employed to obtain the marginal likelihood (prior predictive).
Despite these advances, two desirable but potentially conflicting features of model specification are evident from equations (3.1), (3.2), and (3.3): (1) analytical solutions, and (2) a posterior distribution that belongs to the same family as the prior distribution for a given likelihood. The latter property defines conjugate priors, a family of priors that is closed under sampling (Schlaifer and Raiffa 1961).
These features are desirable because the former facilitates hypothesis testing and predictive analysis, while the latter ensures invariance in the prior-to-posterior updating process. Both features reduce computational burden.
Although each of these features can be achieved on its own –for instance, improper priors may yield analytical tractability, and defining the family of priors broadly enough trivially guarantees closure under sampling– the two features are generally in conflict when pursued together.
Fortunately, we can achieve both characteristics if we assume that the data-generating process follows a distribution function in the exponential family. That is, given a random sample \boldsymbol{Y}=[Y_1 \ Y_2 \ \dots \ Y_N]^{\top}, a probability density function p(\boldsymbol{y}\mid \boldsymbol{\theta}) belongs to the exponential family if it has the form: \begin{align} p(\boldsymbol{y}\mid \boldsymbol{\theta})&=\prod_{i=1}^N h(y_i) C(\boldsymbol{\theta}) \exp\left\{\eta(\boldsymbol{\theta})^{\top}\boldsymbol{T}(y_i)\right\}\tag{3.4}\\ &=h(\boldsymbol{y}) C(\boldsymbol{\theta})^N\exp\left\{\eta(\boldsymbol{\theta})^{\top}\boldsymbol{T}(\boldsymbol{y})\right\}\nonumber \\ &=h(\boldsymbol{y})\exp\left\{\eta(\boldsymbol{\theta})^{\top}\boldsymbol{T}(\boldsymbol{y})-A(\boldsymbol{\theta})\right\}\nonumber, \end{align}
where h(\boldsymbol{y}) = \prod_{i=1}^N h(y_i) is a non-negative function, \eta(\boldsymbol{\theta}) is a known function of the parameters, and A(\boldsymbol{\theta}) = \log\left\{ \int_{\boldsymbol{Y}} h(\boldsymbol{y}) \exp\left\{ \eta(\boldsymbol{\theta})^{\top} \boldsymbol{T}(\boldsymbol{y}) \right\} d\boldsymbol{y} \right\} = -N \log\left(C(\boldsymbol{\theta})\right) is the logarithm of the normalization factor. Additionally, \boldsymbol{T}(\boldsymbol{y}) = \sum_{i=1}^N \boldsymbol{T}(y_i) is the vector of sufficient statistics for the distribution (by the factorization theorem).
If the support of \boldsymbol{Y} is independent of \boldsymbol{\theta}, the family is said to be regular; otherwise, it is irregular. Furthermore, if we set \boldsymbol{\eta} = \eta(\boldsymbol{\theta}), the exponential family is said to be in the canonical form. \begin{align} p(\boldsymbol{y}\mid \boldsymbol{\eta})&=h(\boldsymbol{y})D(\boldsymbol{\eta})^N\exp\left\{\boldsymbol{\eta}^{\top}\boldsymbol{T}(\boldsymbol{y})\right\}\nonumber\\ &=h(\boldsymbol{y})\exp\left\{\boldsymbol{\eta}^{\top}\boldsymbol{T}(\boldsymbol{y})-B(\boldsymbol{\eta})\right\},\nonumber \end{align} where B(\boldsymbol{\eta})=-N\log\left(D(\boldsymbol{\eta})\right).
A nice feature of this representation is that \mathbb{E}[\boldsymbol{T}(\boldsymbol{y})\mid \boldsymbol{\eta}]=\nabla B(\boldsymbol{\eta}) and Var[\boldsymbol{T}(\boldsymbol{y})\mid \boldsymbol{\eta}]=\nabla^2 B(\boldsymbol{\eta}).
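To see why the first identity holds, differentiate the normalization condition of the canonical form with respect to \boldsymbol{\eta} (a sketch, assuming that differentiation and integration can be interchanged). Since \int_{\boldsymbol{Y}} h(\boldsymbol{y})\exp\left\{\boldsymbol{\eta}^{\top}\boldsymbol{T}(\boldsymbol{y})-B(\boldsymbol{\eta})\right\}d\boldsymbol{y}=1 for every \boldsymbol{\eta}, \begin{align*} \boldsymbol{0}&=\nabla_{\boldsymbol{\eta}}\int_{\boldsymbol{Y}} h(\boldsymbol{y})\exp\left\{\boldsymbol{\eta}^{\top}\boldsymbol{T}(\boldsymbol{y})-B(\boldsymbol{\eta})\right\}d\boldsymbol{y}\\ &=\int_{\boldsymbol{Y}} \left(\boldsymbol{T}(\boldsymbol{y})-\nabla B(\boldsymbol{\eta})\right)h(\boldsymbol{y})\exp\left\{\boldsymbol{\eta}^{\top}\boldsymbol{T}(\boldsymbol{y})-B(\boldsymbol{\eta})\right\}d\boldsymbol{y}=\mathbb{E}[\boldsymbol{T}(\boldsymbol{y})\mid \boldsymbol{\eta}]-\nabla B(\boldsymbol{\eta}), \end{align*} so that \mathbb{E}[\boldsymbol{T}(\boldsymbol{y})\mid \boldsymbol{\eta}]=\nabla B(\boldsymbol{\eta}); differentiating once more yields Var[\boldsymbol{T}(\boldsymbol{y})\mid \boldsymbol{\eta}]=\nabla^2 B(\boldsymbol{\eta}).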
3.1.1 Examples of exponential family distributions
- Discrete distributions
Let’s show that some of the most common distributions for random variables that take values on a finite or countably infinite set are part of the exponential family.
Poisson distribution
Given a random sample \boldsymbol{Y}=[Y_1 \ Y_2 \ \dots \ Y_N]^{\top} from a Poisson distribution, let’s show that p(\boldsymbol{y}\mid \lambda) is in the exponential family. \begin{align} p(\boldsymbol{y}\mid \lambda)&=\prod_{i=1}^N \frac{\lambda^{y_i} \exp(-\lambda)}{y_i!}\nonumber\\ &=\frac{\lambda^{\sum_{i=1}^N y_i}\exp(-N\lambda)}{\prod_{i=1}^N y_i!}\nonumber\\ &=\frac{\exp(-N\lambda)\exp(\sum_{i=1}^Ny_i\log(\lambda))}{\prod_{i=1}^N y_i!}\nonumber, \end{align}
then h(\boldsymbol{y})=\left[\prod_{i=1}^N y_i!\right]^{-1}, \eta(\lambda)=\log(\lambda), T(\boldsymbol{y})=\sum_{i=1}^N y_i (sufficient statistic) and C(\lambda)=\exp(-\lambda).
If we set \eta=\log(\lambda), then \begin{align} p(\boldsymbol{y}\mid \eta)&=\frac{\exp(\eta\sum_{i=1}^Ny_i-N\exp(\eta))}{\prod_{i=1}^N y_i!},\nonumber \end{align}
such that B(\eta)=N\exp(\eta), then \nabla(B(\eta))=N\exp(\eta)=N\lambda=\mathbb{E}\left[\sum_{i=1}^N y_i\biggr\rvert\lambda\right], that is, \mathbb{E}\left[\frac{\sum_{i=1}^N y_i}{N}\biggr\rvert\lambda\right]=\mathbb{E}[\bar{y}\mid \lambda]=\lambda, and \nabla^2(B(\eta))=N\exp(\eta)=N\lambda=Var\left[\sum_{i=1}^N y_i\biggr\rvert\lambda\right]=N^2 \times Var\left[\bar{y}\rvert\lambda\right], then Var\left[\bar{y}\rvert\lambda\right]=\frac{\lambda}{N}.
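These two identities can be checked by simulation; in the following Python sketch the values \lambda=3, N=50, and the number of replications are arbitrary illustrative choices.

```python
# Simulation check that E[y_bar | lambda] = lambda and Var[y_bar | lambda] = lambda / N
# for the Poisson example; lam, N, and reps are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(10)
lam, N, reps = 3.0, 50, 200_000

samples = rng.poisson(lam, size=(reps, N))   # reps independent samples of size N
ybar = samples.mean(axis=1)                  # sample mean within each replication

print(ybar.mean(), lam)        # simulated E[y_bar] vs lambda
print(ybar.var(), lam / N)     # simulated Var[y_bar] vs lambda / N
```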
Bernoulli distribution
Given a random sample \boldsymbol{Y}=[Y_1 \ Y_2 \ \dots \ Y_N]^{\top} from a Bernoulli distribution, let’s show that p(\boldsymbol{y}\mid \theta) is in the exponential family. \begin{align} p(\boldsymbol{y}\mid \theta)&=\prod_{i=1}^N \theta^{y_i}(1-\theta)^{1-y_i}\nonumber\\ &=\theta^{\sum_{i=1}^N y_i}(1-\theta)^{N-\sum_{i=1}^N y_i}\nonumber\\ &=(1-\theta)^N\exp\left\{\sum_{i=1}^N y_i\log\left(\frac{\theta}{1-\theta}\right)\right\}\nonumber, \end{align}
then h(\boldsymbol{y})=\prod_{i=1}^N\mathbb{1}[y_i\in\left\{0,1\right\}] (indicator function), \eta(\theta)=\log\left(\frac{\theta}{1-\theta}\right), T(\boldsymbol{y})=\sum_{i=1}^N y_i and C(\theta)=1-\theta.
Write this distribution in the canonical form, and find the mean and variance of the sufficient statistic (Exercise 1).
Multinomial distribution
Given a random sample \boldsymbol{Y}=[\boldsymbol{Y}_1 \ \boldsymbol{Y}_2 \ \dots \ \boldsymbol{Y}_N] from an m-dimensional multinomial distribution, where \boldsymbol{Y}_i=\left[Y_{i1} \ Y_{i2} \ \dots \ Y_{im}\right], \sum_{l=1}^m Y_{il}=n, records the outcomes of n independent trials, each of which leads to a success for exactly one of m categories with probabilities \boldsymbol{\theta}=[\theta_1 \ \theta_2 \ \dots \ \theta_m], \sum_{l=1}^m \theta_l=1, let’s show that p(\boldsymbol{y}\mid \boldsymbol{\theta}) is in the exponential family. \begin{align} p(\boldsymbol{y}\mid \boldsymbol{\theta})&=\prod_{i=1}^N \frac{n!}{\prod_{l=1}^m y_{il}!} \prod_{l=1}^m\theta_l^{y_{il}}\nonumber\\ &=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\exp\left\{\sum_{i=1}^N\sum_{l=1}^m y_{il}\log(\theta_l)\right\}\nonumber\\ &=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\exp\left\{\left(N\times n-\sum_{i=1}^N\sum_{l=1}^{m-1}y_{il}\right)\log(\theta_m)\nonumber\right. \\ &\left.+\sum_{i=1}^N\sum_{l=1}^{m-1}y_{il}\log(\theta_l)\right\}\nonumber\\ &=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\theta_m^{N\times n}\exp\left\{\sum_{i=1}^N\sum_{l=1}^{m-1}y_{il}\log(\theta_l/\theta_m)\right\}\nonumber, \end{align}
then h(\boldsymbol{y})=\frac{(n!)^N}{\prod_{i=1}^N\prod_{l=1}^m y_{il}!}, \eta(\boldsymbol{\theta})=\left[\log\left(\frac{\theta_1}{\theta_m}\right)\dots \log\left(\frac{\theta_{m-1}}{\theta_m}\right)\right], T(\boldsymbol{y})=\left[\sum_{i=1}^N y_{i1}\dots \sum_{i=1}^N y_{i,m-1}\right] and C(\boldsymbol{\theta})=\theta_m^n.
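As a sanity check, the following Python sketch compares the exponential-family representation above with the multinomial log-likelihood computed directly; the values of n, \boldsymbol{\theta}, and N are arbitrary illustrative choices.

```python
# Check that the exponential-family representation of the multinomial likelihood
# matches the log-likelihood computed directly; n, theta, and N are illustrative values.
import numpy as np
from scipy.stats import multinomial
from scipy.special import gammaln

rng = np.random.default_rng(1)
n, theta, N = 10, np.array([0.2, 0.3, 0.5]), 4
y = rng.multinomial(n, theta, size=N)        # N x m matrix of counts

# Direct evaluation: sum of the N multinomial log pmfs
direct = multinomial.logpmf(y, n, theta).sum()

# Exponential-family pieces:
# log h(y) + N * n * log(theta_m) + sum_i sum_{l<m} y_il * log(theta_l / theta_m)
log_h = N * gammaln(n + 1) - gammaln(y + 1).sum()
expfam = log_h + N * n * np.log(theta[-1]) + (y[:, :-1] * np.log(theta[:-1] / theta[-1])).sum()

print(direct, expfam)   # the two values should agree
```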
- Continuous distributions
Let’s show that some of the most common distributions for random variables that can take any value within a certain range or interval –often an infinite number of possible values– are part of the exponential family.
Normal distribution
Given a random sample \boldsymbol{Y}=[Y_1 \ Y_2 \ \dots \ Y_N]^{\top} from a normal distribution, let’s show that p(\boldsymbol{y}\mid \mu,\sigma^2) is in the exponential family. \begin{align} p(\boldsymbol{y}\mid \mu,\sigma^2)&=\prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}\left(y_i-\mu\right)^2\right\}\nonumber\\ &= (2\pi)^{-N/2}(\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N\left(y_i-\mu\right)^2\right\}\nonumber\\ &= (2\pi)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^Ny_i^2+\frac{\mu}{\sigma^2}\sum_{i=1}^N y_i\right.\nonumber\\ &-\left.N\frac{\mu^2}{2\sigma^2}-\frac{N}{2}\log(\sigma^2)\right\}\nonumber, \end{align}
then h(\boldsymbol{y})=(2\pi)^{-N/2}, \eta(\mu,\sigma^2)=\left[\frac{\mu}{\sigma^2} \ \frac{-1}{2\sigma^2}\right], T(\boldsymbol{y})=\left[\sum_{i=1}^N y_i \ \sum_{i=1}^N y_i^2\right] and C(\mu,\sigma^2)=\exp\left\{-\frac{\mu^2}{2\sigma^2}-\frac{\log(\sigma^2)}{2}\right\}.
Observe that \begin{align} p(\boldsymbol{y}\mid \mu,\sigma^2)&= (2\pi)^{-N/2}\exp\left\{\eta_1\sum_{i=1}^N y_i+\eta_2\sum_{i=1}^Ny_i^2+\frac{N}{2}\log(-2\eta_2)+\frac{N}{4}\frac{\eta_1^2}{\eta_2}\right\}\nonumber, \end{align}
where B(\boldsymbol{\eta})=-\frac{N}{2}\log(-2\eta_2)-\frac{N}{4}\frac{\eta_1^2}{\eta_2}. Then, \begin{align*} \nabla B(\boldsymbol{\eta}) & = \begin{bmatrix} -\frac{N}{2}\frac{\eta_1}{\eta_2}\\ -\frac{N}{2}\frac{1}{\eta_2}+\frac{N}{4}\frac{\eta_1^2}{\eta_2^2} \end{bmatrix} = \begin{bmatrix} N\times\mu\\ N\times(\mu^2+\sigma^2) \end{bmatrix} = \begin{bmatrix} \mathbb{E}\left[\sum_{i=1}^N y_i\bigr\rvert \mu,\sigma^2\right]\\ \mathbb{E}\left[\sum_{i=1}^N y_i^2\bigr\rvert \mu,\sigma^2\right] \end{bmatrix}. \end{align*}
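The gradient can also be verified numerically; the following Python sketch differentiates B(\boldsymbol{\eta}) by central finite differences, with \mu, \sigma^2, and N set to arbitrary illustrative values.

```python
# Numerical check that the gradient of B(eta) returns E[sum y_i] and E[sum y_i^2]
# for the normal example; mu, sigma2, and N are arbitrary illustrative values.
import numpy as np

mu, sigma2, N = 1.5, 2.0, 30
eta1, eta2 = mu / sigma2, -1.0 / (2.0 * sigma2)

def B(e1, e2):
    # log-normalizing term of the canonical form
    return -N / 2 * np.log(-2 * e2) - N / 4 * e1**2 / e2

h = 1e-6
dB_de1 = (B(eta1 + h, eta2) - B(eta1 - h, eta2)) / (2 * h)
dB_de2 = (B(eta1, eta2 + h) - B(eta1, eta2 - h)) / (2 * h)

print(dB_de1, N * mu)                  # E[sum y_i]   = N * mu
print(dB_de2, N * (mu**2 + sigma2))    # E[sum y_i^2] = N * (mu^2 + sigma^2)
```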
Multivariate normal distribution
Let \boldsymbol{Y}=[\boldsymbol{Y}_1 \ \boldsymbol{Y}_2 \ \dots \ \boldsymbol{Y}_N]^{\top} be an N\times p matrix such that \boldsymbol{Y}_i\sim N_p(\boldsymbol{\mu},\boldsymbol{\Sigma}), i=1,2,\dots,N, that is, each i-th row of \boldsymbol{Y} follows a multivariate normal distribution. Then, assuming independence between rows, let’s show that p(\boldsymbol{y}_1,\boldsymbol{y}_2,\dots,\boldsymbol{y}_N\mid \boldsymbol{\mu},\boldsymbol{\Sigma}) is in the exponential family.
\begin{align} p(\boldsymbol{y}_1,\dots,\boldsymbol{y}_N\mid \boldsymbol{\mu},\boldsymbol{\Sigma})&=\prod_{i=1}^N (2\pi)^{-p/2}| \Sigma|^{-1/2}\exp\left\{-\frac{1}{2}\left(\boldsymbol{y}_i-\boldsymbol{\mu}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{y}_i-\boldsymbol{\mu}\right)\right\}\nonumber\\ &= (2\pi)^{-pN/2}|\Sigma|^{-N/2}\exp\left\{-\frac{1}{2}tr\left[\sum_{i=1}^N\left(\boldsymbol{y}_i-\boldsymbol{\mu}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{y}_i-\boldsymbol{\mu}\right)\right]\right\}\nonumber\\ &= (2\pi)^{-p N/2}|\Sigma|^{-N/2}\exp\left\{-\frac{1}{2}tr\left[\left(\boldsymbol{S}+N\left(\boldsymbol{\mu}-\hat{\boldsymbol{\mu}}\right)\left(\boldsymbol{\mu}-\hat{\boldsymbol{\mu}}\right)^{\top}\right)\boldsymbol{\Sigma}^{-1}\right]\right\}\nonumber\\ &= (2\pi)^{-p N/2}\exp\left\{-\frac{1}{2}\left[\left(vec\left(\boldsymbol{S}\right)^{\top}+N vec\left(\hat{\boldsymbol{\mu}}\hat{\boldsymbol{\mu}}^{\top}\right)^{\top}\right)vec \left(\boldsymbol{\Sigma}^{-1}\right)\right.\right.\nonumber\\ &\left.\left.-2N\hat{\boldsymbol{\mu}}^{\top}vec\left(\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\right)+N tr\left(\boldsymbol{\mu}\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\right)+N\log (|\boldsymbol{\Sigma}|)\right]\right\}\nonumber, \end{align}
where the second line uses the trace operator (\text{tr}), and its invariance under cyclic permutation is applied in the third line. Additionally, we add and subtract \hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^N \boldsymbol{y}_i inside each parenthesis, resulting in \boldsymbol{S} = \sum_{i=1}^N \left(\boldsymbol{y}_i - \hat{\boldsymbol{\mu}}\right) \left(\boldsymbol{y}_i - \hat{\boldsymbol{\mu}}\right)^{\top}. The fourth line is obtained after collecting terms and using properties of the trace operator to introduce the vectorization operator (\text{vec}), specifically, \text{tr}\left(\boldsymbol{A}^{\top}\boldsymbol{B}\right)=\text{vec}\left(\boldsymbol{A}\right)^{\top}\text{vec}\left(\boldsymbol{B}\right), and \text{vec}\left(\boldsymbol{A}+\boldsymbol{B}\right)=\text{vec}\left(\boldsymbol{A}\right)+\text{vec}\left(\boldsymbol{B}\right).
Then h(\boldsymbol{y})=(2\pi)^{-pN/2}, \eta(\boldsymbol{\mu},\boldsymbol{\Sigma})^{\top}=\left[\left(vec\left(\boldsymbol{\Sigma}^{-1}\right)\right)^{\top} \ \ \left(vec\left(\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\right)\right)^{\top}\right], T(\boldsymbol{y})=\left[-\frac{1}{2}\left(vec\left(\boldsymbol{S}\right)^{\top}+N vec\left(\hat{\boldsymbol{\mu}}\hat{\boldsymbol{\mu}}^{\top}\right)^{\top}\right) \ \ N\hat{\boldsymbol{\mu}}^{\top}\right]^{\top} and C(\boldsymbol{\mu},\boldsymbol{\Sigma})=\exp\left\{-\frac{1}{2}\left(tr\left(\boldsymbol{\mu}\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\right)+\log(|\boldsymbol{\Sigma}|)\right)\right\}.
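Finally, this representation can also be checked numerically; the following Python sketch compares the exponential-family form with the joint log-density computed directly, with p, N, \boldsymbol{\mu}, and \boldsymbol{\Sigma} set to arbitrary illustrative values.

```python
# Check that the exponential-family representation of the multivariate normal
# matches the joint log-density computed directly; p, N, mu, and Sigma are illustrative values.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
p, N = 3, 20
mu = np.array([1.0, -0.5, 2.0])
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)                  # a positive definite covariance matrix

Y = rng.multivariate_normal(mu, Sigma, size=N)   # N x p data matrix
mu_hat = Y.mean(axis=0)
S = (Y - mu_hat).T @ (Y - mu_hat)                # sum of squared deviations around the sample mean

# Direct evaluation of the joint log-density
direct = multivariate_normal(mu, Sigma).logpdf(Y).sum()

# Exponential-family pieces: log h(y) + eta' T(y) + N * log C(mu, Sigma)
Sigma_inv = np.linalg.inv(Sigma)
log_h = -p * N / 2 * np.log(2 * np.pi)
eta = np.concatenate([Sigma_inv.ravel(), mu @ Sigma_inv])
T = np.concatenate([-0.5 * (S + N * np.outer(mu_hat, mu_hat)).ravel(), N * mu_hat])
log_C = -0.5 * (mu @ Sigma_inv @ mu + np.log(np.linalg.det(Sigma)))
expfam = log_h + eta @ T + N * log_C

print(direct, expfam)   # the two values should agree
```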