4.2 Canonical Examples

In this section, we give some canonical examples of how the EM algorithm can be used to estimate model parameters. These examples are simple enough that they can be solved using more direct methods, but they are nevertheless useful for demonstrating how to set up the two-step EM algorithm in various scenarios.

4.2.1 Two-Part Normal Mixture Model

Suppose we have data \(y_1,\dots,y_n\) that are sampled independently from a two-part mixture of Normals model with density \[ f(y\mid\theta) = \lambda\varphi(y\mid\mu_1,\sigma_1^2) + (1-\lambda)\varphi(y\mid\mu_2,\sigma_2^2), \] where \(\varphi(y\mid\mu,\sigma^2)\) is the Normal density with mean \(\mu\) and variance \(\sigma^2\). The unknown parameter vector is \(\theta = (\mu_1,\mu_2,\sigma_1^2,\sigma_2^2, \lambda)\) and the log-likelihood is \[ \log f(y_1,\dots,y_n\mid\theta) = \sum_{i=1}^n \log\left[ \lambda\varphi(y_i\mid\mu_1,\sigma_1^2) + (1-\lambda)\varphi(y_i\mid\mu_2,\sigma_2^2) \right]. \] This problem is simple enough that it could be solved using a direct optimization method like Newton’s method, but the EM algorithm provides a stable approach to finding the optimum.
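As a concrete reference point, the observed-data log-likelihood can be computed directly. Below is a minimal sketch in Python (the function name `mixture_loglik` is ours, not from the text); this quantity is useful later for monitoring that each EM iteration does not decrease the likelihood.

```python
import math

def mixture_loglik(y, mu1, mu2, s1sq, s2sq, lam):
    """Observed-data log-likelihood of the two-part Normal mixture."""
    def phi(x, mu, s2):
        # Normal density with mean mu and variance s2
        return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return sum(
        math.log(lam * phi(yi, mu1, s1sq) + (1 - lam) * phi(yi, mu2, s2sq))
        for yi in y
    )
```

As a sanity check, with \(\lambda = 1\) the mixture collapses to a single Normal and the log-likelihood reduces to the usual Normal log-likelihood.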

The art of applying the EM algorithm is coming up with a useful complete data model. In this example, the approach is to hypothesize that each observation comes from one of two populations parameterized by \((\mu_1, \sigma_1^2)\) and \((\mu_2,\sigma^2_2)\), respectively. The “missing data” in this case are the labels identifying which observation came from which population. Therefore, we assert that there are missing data \(z_1,\dots,z_n\) such that \[ z_i\sim\text{Bernoulli}(\lambda). \] When \(z_i=1\), \(y_i\) comes from population 1 and when \(z_i=0\), \(y_i\) comes from population 2.

The idea is then that the data are sampled in two stages. First we sample \(z_i\) to see which population the data come from and then given \(z_i\), we can sample \(y_i\) from the appropriate Normal distribution. The joint density of the observed and missing data, i.e. the complete data density, is then \[ g(y,z\mid\theta) = \varphi(y\mid\mu_1,\sigma_1^2)^{z}\varphi(y\mid\mu_2,\sigma^2_2)^{1-z}\lambda^z(1-\lambda)^{1-z}. \] It’s easy to show that \[ \sum_{z=0}^1 g(y, z\mid\theta) = f(y\mid\theta) \] so that when we “integrate” out the missing data, we get the observed data density.
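The claim that summing the complete data density over \(z\in\{0,1\}\) recovers the observed data density can be verified numerically. The sketch below (function names `g` and `f` mirror the notation above; the specific parameter values are arbitrary) checks the identity at a single point:

```python
import math

def phi(y, mu, s2):
    # Normal density with mean mu and variance s2
    return math.exp(-(y - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def g(y, z, mu1, mu2, s1sq, s2sq, lam):
    # complete data density: phi1^z * phi2^(1-z) * lam^z * (1-lam)^(1-z)
    return (phi(y, mu1, s1sq) ** z * phi(y, mu2, s2sq) ** (1 - z)
            * lam ** z * (1 - lam) ** (1 - z))

def f(y, mu1, mu2, s1sq, s2sq, lam):
    # observed data (mixture) density
    return lam * phi(y, mu1, s1sq) + (1 - lam) * phi(y, mu2, s2sq)

# summing g over z in {0, 1} recovers f
y = 0.7
assert abs(sum(g(y, z, 0.0, 3.0, 1.0, 2.0, 0.4) for z in (0, 1))
           - f(y, 0.0, 3.0, 1.0, 2.0, 0.4)) < 1e-12
```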

The complete data log-likelihood is then \[ \log g(y, z\mid\theta) = \sum_{i=1}^n z_i\log\varphi(y_i\mid\mu_1,\sigma^2_1) + (1-z_i)\log\varphi(y_i\mid\mu_2,\sigma^2_2) + z_i\log\lambda + (1-z_i)\log(1-\lambda). \] Note that this function is linear in the missing data \(z_i\). To evaluate the \(Q(\theta\mid\theta_0)\) function we need to take the expectation of the above expression with respect to the missing data density \(h(z\mid y, \theta)\). But what is that? The missing data density is proportional to the complete data density, so that \[\begin{eqnarray*} h(z\mid y,\theta) & \propto & \varphi(y\mid\mu_1,\sigma_1^2)^z\varphi(y\mid\mu_2,\sigma_2^2)^{1-z}\lambda^z(1-\lambda)^{1-z}\\ & = & (\lambda \varphi(y\mid\mu_1,\sigma_1^2))^z((1-\lambda)\varphi(y\mid\mu_2,\sigma_2^2))^{1-z}, \end{eqnarray*}\] which is the kernel of a \[ \text{Bernoulli}\left( \frac{\lambda \varphi(y\mid\mu_1,\sigma_1^2)}{\lambda \varphi(y\mid\mu_1,\sigma_1^2) + (1-\lambda)\varphi(y\mid\mu_2,\sigma_2^2)} \right) \] distribution. From this, what we need to compute the \(Q()\) function is \(\pi_i = \mathbb{E}[z_i\mid y_i, \theta_0]\), the conditional probability that observation \(i\) comes from population 1. Given the \(\pi_i\), we then compute the \(Q()\) function in the E-step. \[\begin{eqnarray*} Q(\theta\mid\theta_0) & = & \mathbb{E}\left[\left. \sum_{i=1}^n z_i\log\varphi(y_i\mid\mu_1,\sigma_1^2) + (1-z_i)\log\varphi(y_i\mid\mu_2,\sigma_2^2) + z_i\log\lambda + (1-z_i)\log(1-\lambda) \right|\mathbf{y},\theta_0\right]\\ & = & \sum_{i=1}^n \pi_i\log\varphi(y_i\mid\mu_1,\sigma_1^2) + (1-\pi_i)\log\varphi(y_i\mid\mu_2,\sigma_2^2) + \pi_i\log\lambda + (1-\pi_i)\log(1-\lambda)\\ & = & \sum_{i=1}^n \pi_i\left[ -\frac{1}{2}\log 2\pi\sigma_1^2-\frac{1}{2\sigma_1^2}(y_i-\mu_1)^2 \right] + (1-\pi_i)\left[ -\frac{1}{2}\log 2\pi\sigma_2^2-\frac{1}{2\sigma_2^2}(y_i-\mu_2)^2 \right]\\ & & + \pi_i\log\lambda + (1-\pi_i)\log(1-\lambda) \end{eqnarray*}\] In order to compute \(\pi_i\), we will need to use the current estimates of \(\mu_1, \sigma_1^2, \mu_2, \sigma_2^2\), and \(\lambda\) (in addition to the data \(y_1,\dots, y_n\)). 
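The E-step therefore amounts to computing the \(\pi_i\) from the current parameter values. A minimal sketch in Python (the function name `e_step` is ours):

```python
import math

def e_step(y, mu1, mu2, s1sq, s2sq, lam):
    """Posterior membership probabilities pi_i = E[z_i | y_i, theta_0]."""
    def phi(x, mu, s2):
        # Normal density with mean mu and variance s2
        return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    pis = []
    for yi in y:
        p1 = lam * phi(yi, mu1, s1sq)
        p2 = (1 - lam) * phi(yi, mu2, s2sq)
        pis.append(p1 / (p1 + p2))  # Bernoulli success probability
    return pis
```

Note that when the two components are identical, the data carry no information about membership and \(\pi_i = \lambda\) for every observation.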
We can then compute the gradient of \(Q\) in order to maximize it for the current iteration. Setting the gradient to zero and solving gives the updated estimates \[\begin{eqnarray*} \hat{\mu}_1 & = & \frac{\sum \pi_i y_i}{\sum \pi_i}\\ \hat{\mu}_2 & = & \frac{\sum (1-\pi_i) y_i}{\sum (1-\pi_i)}\\ \hat{\sigma}_1^2 & = & \frac{\sum\pi_i(y_i-\hat{\mu}_1)^2}{\sum\pi_i}\\ \hat{\sigma}_2^2 & = & \frac{\sum(1-\pi_i)(y_i-\hat{\mu}_2)^2}{\sum(1-\pi_i)}\\ \hat{\lambda} & = & \frac{1}{n}\sum\pi_i \end{eqnarray*}\] Once we have these updated estimates, we can go back to the E-step and recompute our \(Q\) function.
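Putting the E-step and M-step together gives the full iteration. The sketch below is a minimal Python implementation under the assumptions above (the function name `em_normal_mixture` and the fixed iteration count are ours; a production version would check convergence of the log-likelihood instead):

```python
import math

def em_normal_mixture(y, theta0, n_iter=50):
    """EM for the two-part Normal mixture.
    theta0 = (mu1, mu2, s1sq, s2sq, lam) is the starting value."""
    mu1, mu2, s1sq, s2sq, lam = theta0
    n = len(y)

    def phi(x, mu, s2):
        return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

    for _ in range(n_iter):
        # E-step: pi_i = E[z_i | y_i, theta_0]
        pi = []
        for yi in y:
            p1 = lam * phi(yi, mu1, s1sq)
            p2 = (1 - lam) * phi(yi, mu2, s2sq)
            pi.append(p1 / (p1 + p2))
        # M-step: the closed-form updates derived above
        w1 = sum(pi)
        w2 = n - w1
        mu1 = sum(p * yi for p, yi in zip(pi, y)) / w1
        mu2 = sum((1 - p) * yi for p, yi in zip(pi, y)) / w2
        s1sq = sum(p * (yi - mu1) ** 2 for p, yi in zip(pi, y)) / w1
        s2sq = sum((1 - p) * (yi - mu2) ** 2 for p, yi in zip(pi, y)) / w2
        lam = w1 / n
    return mu1, mu2, s1sq, s2sq, lam
```

On well-separated data, e.g. points clustered near \(-5\) and \(+5\), the iteration quickly drives the component means toward the two cluster centers.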

4.2.2 Censored Exponential Data

Suppose we have survival times \(x_1,\dots,x_n\sim\text{Exponential}(\lambda)\). However, we do not observe all of these survival times because some of them are censored at known times \(c_1,\dots,c_n\). What we actually observe are the data \((y_1, \delta_1),\dots,(y_n,\delta_n)\), where \(y_i = \min(x_i, c_i)\) and \(\delta_i=1\) if \(x_i\leq c_i\) (the survival time is fully observed), while \(\delta_i=0\) if \(x_i\) is censored at time \(c_i\) (in which case \(y_i = c_i\)).

The complete data density is simply the product of exponential densities with mean \(\lambda\), \[ g(x_1,\dots,x_n\mid\lambda) = \prod_{i=1}^n\frac{1}{\lambda}\exp(-x_i/\lambda). \] To do the E-step, we need to compute \[ Q(\lambda\mid\lambda_0) = \mathbb{E}[\log g(x_1,\dots,x_n\mid\lambda)\mid \mathbf{y}, \lambda_0]. \] We can divide the data into the observations that we fully observe (\(\delta_i=1\)) and those that are censored (\(\delta_i=0\)). For the censored observations, the complete survival time is “missing”, so we denote it by \(z_i\). Given that, the \(Q(\lambda\mid\lambda_0)\) function is \[ Q(\lambda\mid\lambda_0) = \mathbb{E}\left\{\left.-n\log\lambda-\frac{1}{\lambda}\left[ \sum_{i=1}^n \delta_i y_i + (1-\delta_i) z_i \right] \right|\mathbf{y},\lambda_0 \right\}. \] But what is \(\mathbb{E}[z_i\mid y_i,\lambda_0]\)? Because we assume the underlying data are exponentially distributed, we can use the “memoryless” property of the exponential distribution: given that we have survived until the censoring time \(c_i\), our expected survival time beyond that is simply the mean \(\lambda\), so that \(\mathbb{E}[z_i\mid y_i,\lambda_0] = c_i + \lambda_0\). Because we don’t know \(\lambda\) yet, we plug in our current best estimate. Now, for the E-step we have \[ Q(\lambda\mid\lambda_0) = -n\log\lambda-\frac{1}{\lambda} \left[ \sum_{i=1}^n\delta_i y_i+(1-\delta_i)(c_i + \lambda_0) \right]. \] With the missing data replaced by their conditional expectations, we can execute the M-step and maximize the above function to get \[ \hat{\lambda} = \frac{1}{n}\left[ \sum_{i=1}^n\delta_iy_i +(1-\delta_i)(c_i+\lambda_0) \right]. \] We can then set \(\lambda_0=\hat{\lambda}\) and go back and repeat the E-step.
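Because both steps have closed forms, the whole algorithm is a one-line fixed-point iteration. A minimal sketch in Python (the function name `em_censored_exponential` and the fixed iteration count are ours). For censored observations \(y_i = c_i\), so the update adds \(\lambda_0\) to the observed \(y_i\):

```python
def em_censored_exponential(y, delta, lam0, n_iter=100):
    """EM for Exponential(mean lam) survival times with known censoring.
    y[i] = min(x_i, c_i); delta[i] = 1 if x_i was observed,
    0 if censored (in which case y[i] = c_i)."""
    n = len(y)
    lam = lam0
    for _ in range(n_iter):
        # E-step: E[z_i | censored at c_i, lam] = c_i + lam (memorylessness)
        # M-step: lam_hat = (1/n) * sum of observed and imputed times
        lam = sum(yi if d == 1 else yi + lam for yi, d in zip(y, delta)) / n
    return lam
```

The fixed point of this iteration is \(\hat{\lambda} = \sum_i y_i / \sum_i \delta_i\), the total observed time divided by the number of events, which is the classical MLE for censored exponential data.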