11.1 Mixture models

Mixture models naturally arise in situations where a sample consists of draws from different subpopulations (clusters) that cannot be easily distinguished based on observable characteristics. However, performing inference on specific identified subpopulations can be misleading if the assumed distribution for each cluster is misspecified.

Even when distinct subpopulations do not exist, finite and infinite mixture models provide a useful framework for semi-parametric inference. They effectively approximate distributions with skewness, excess kurtosis, and multimodality, making them useful for modeling stochastic errors.

In addition, mixture models help capture unobserved heterogeneity. That is, as data modelers, we may observe individuals with identical values of the observable variables but markedly different responses. These differences cannot be explained by sampling variability alone; rather, they suggest the presence of an unobserved underlying process, independent of the observable features, that accounts for this pattern.

11.1.1 Finite Gaussian mixtures

A finite Gaussian mixture model with H known components assumes that a sample
\boldsymbol{y}=\left[y_1 \ y_2 \ \dots \ y_N\right]^{\top} consists of observations y_i,
for i=1,2,\dots,N, where each y_i is generated from one of the H components,
h=1,2,\dots,H, conditional on the regressors \boldsymbol{x}_i. Specifically, we assume
y_i \mid \boldsymbol{x}_i \sim N(\boldsymbol{x}_i^{\top}\boldsymbol{\beta}_h, \sigma_h^2).
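To make the data-generating process concrete, the following Python sketch simulates a sample from this model; the number of components H = 2, the sample size, and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 500, 2                                # hypothetical sample size and components
lam = np.array([0.6, 0.4])                   # mixture weights, summing to one
beta = np.array([[1.0, -2.0],                # beta_1: intercept and slope
                 [-1.0, 3.0]])               # beta_2
sigma = np.array([0.5, 1.0])                 # component standard deviations sigma_h

X = np.column_stack([np.ones(N), rng.normal(size=N)])  # regressors x_i
z = rng.choice(H, size=N, p=lam)                       # latent component of each unit
y = rng.normal((X * beta[z]).sum(axis=1), sigma[z])    # y_i ~ N(x_i' beta_h, sigma_h^2)
```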

Thus, the sampling distribution of y_i is given by
p(y_i \mid \{\lambda_h, \boldsymbol{\beta}_h, \sigma_h^2\}_{h=1}^H, \boldsymbol{x}_i) = \sum_{h=1}^H \lambda_h \phi(y_i \mid \boldsymbol{x}_i^{\top}\boldsymbol{\beta}_h, \sigma_h^2),

where \phi(y_i \mid \boldsymbol{x}_i^{\top}\boldsymbol{\beta}_h, \sigma_h^2) is the Gaussian density with mean
\boldsymbol{x}_i^{\top}\boldsymbol{\beta}_h and variance \sigma_h^2, 0\leq \lambda_h\leq 1 represents
the proportion of the population belonging to subpopulation h, and the weights satisfy
\sum_{h=1}^H \lambda_h = 1.
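This density translates directly into code. A minimal sketch, reusing y, X, lam, beta, and sigma from the simulation above (note that SciPy's norm is parameterized by the standard deviation \sigma_h, not the variance \sigma_h^2):

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(y, X, lam, beta, sigma):
    """Log-likelihood: sum_i log sum_h lam_h * phi(y_i | x_i' beta_h, sigma_h^2)."""
    dens = np.stack([lam_h * norm.pdf(y, X @ beta_h, sigma_h)
                     for lam_h, beta_h, sigma_h in zip(lam, beta, sigma)], axis=1)
    return np.log(dens.sum(axis=1)).sum()

print(mixture_loglik(y, X, lam, beta, sigma))  # evaluate at the simulation's parameters
```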

In this way, we allow cross-sectional units to differ across unobserved clusters (subpopulations) that exhibit homogeneous behavior within each cluster.

To model a finite Gaussian mixture, we introduce an individual cluster indicator or latent class \psi_{ih} such that
\psi_{ih}= \begin{cases} 1, & \text{if the } i\text{-th unit is drawn from the } h\text{-th cluster}, \\ 0, & \text{otherwise}. \end{cases}

Thus, P(\psi_{ih}=1) = \lambda_h for all clusters h=1,2,\dots,H and units i=1,2,\dots,N. Note that a high posterior probability that two units belong to the same cluster suggests that they share a common source of unobserved heterogeneity.

This setting implies that
\boldsymbol{\psi}_i = [\psi_{i1} \ \psi_{i2} \ \dots \ \psi_{iH}]^{\top} \sim \text{Categorical}(\boldsymbol{\lambda}),

where \boldsymbol{\lambda} = [\lambda_1 \ \lambda_2 \ \dots \ \lambda_H]^{\top} represents the event probabilities.
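In code, each \boldsymbol{\psi}_i can be generated as a single-trial multinomial draw, which returns the one-hot vector directly; the weights below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([0.6, 0.4])             # event probabilities lambda
psi = rng.multinomial(1, lam, size=5)  # one categorical (one-hot) draw per unit
# e.g., psi[0] = [1, 0] indicates that the first unit belongs to cluster 1
```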

We know from Chapter 4 that the Dirichlet distribution is the conjugate prior for the multinomial distribution, of which the categorical distribution is the special case with a single trial. Thus, we assume that
\boldsymbol{\lambda} \sim \text{Dir}(\boldsymbol{\alpha}),

where \boldsymbol{\alpha} = [\alpha_1 \ \alpha_2 \ \dots \ \alpha_H]^{\top}.

Observe that we are using a hierarchical structure, as we specify a prior on \boldsymbol{\lambda}, which serves as the hyperparameter for the cluster indicators.
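By conjugacy, the conditional posterior of \boldsymbol{\lambda} given the indicators is again Dirichlet, with parameters \alpha_h + n_h, where n_h is the number of units assigned to cluster h. A small sketch with hypothetical indicator draws:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.ones(2)                              # Dirichlet hyperparameters alpha_h
psi = rng.multinomial(1, [0.6, 0.4], size=500)  # hypothetical indicator draws
n = psi.sum(axis=0)                             # n_h: units assigned to cluster h
lam_draw = rng.dirichlet(alpha + n)             # lambda | psi ~ Dir(alpha + n)
```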

The label switching identification problem: All finite mixture models are nonidentified because the likelihood is invariant under permutations of the component labels. This poses problems when performing inference on specific components of the mixture, but it is not a concern when the mixture is used only to model the stochastic errors. In the former case, there are post-processing strategies to mitigate the issue, such as random permutation of the latent classes (see Gelman et al. (2021) and Algorithm 3.5 in Frühwirth-Schnatter (2006)).
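A minimal sketch of the random-permutation idea, applied here to hypothetical stored draws of the weights and variances (in Algorithm 3.5 of Frühwirth-Schnatter (2006) the permutation is applied within each sweep of the sampler rather than afterwards):

```python
import numpy as np

rng = np.random.default_rng(3)
S, H = 1000, 2                                         # hypothetical draws and components
lam_draws = rng.dirichlet(np.ones(H), size=S)          # stand-ins for posterior draws
sigma2_draws = 1.0 / rng.gamma(2.0, 1.0, size=(S, H))  # of lambda and sigma_h^2

for s in range(S):
    perm = rng.permutation(H)                # random relabeling of the components
    lam_draws[s] = lam_draws[s, perm]        # the same permutation must be applied
    sigma2_draws[s] = sigma2_draws[s, perm]  # to every component-specific quantity
```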

If we do not want to fix the number of clusters in advance, we can follow the approach proposed by Gelman et al. (2021).

11.1.2 Dirichlet processes

References

Frühwirth-Schnatter, Sylvia. 2006. Finite Mixture and Markov Switching Models. Springer.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2021. Bayesian Data Analysis. Chapman & Hall/CRC.