4.2 Canonical Examples

In this section, we give some canonical examples of how the EM algorithm can be used to estimate model parameters. These examples are simple enough that they can be solved using more direct methods, but they are nevertheless useful for demonstrating how to set up the two-step EM algorithm in various scenarios.

4.2.1 Two-Part Normal Mixture Model

Suppose we have data $y_1, \dots, y_n$ that are sampled independently from a two-part mixture of Normals model with density

$$f(y \mid \theta) = \lambda\,\varphi(y \mid \mu_1, \sigma_1^2) + (1-\lambda)\,\varphi(y \mid \mu_2, \sigma_2^2),$$

where $\varphi(y \mid \mu, \sigma^2)$ is the Normal density with mean $\mu$ and variance $\sigma^2$. The unknown parameter vector is $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \lambda)$ and the log-likelihood is

$$\log f(y_1, \dots, y_n \mid \theta) = \sum_{i=1}^n \log\left[\lambda\,\varphi(y_i \mid \mu_1, \sigma_1^2) + (1-\lambda)\,\varphi(y_i \mid \mu_2, \sigma_2^2)\right].$$

This problem is simple enough that it could be solved by a direct optimization method like Newton’s method, but the EM algorithm provides a nice, stable approach to finding the optimum.
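It is useful to have the observed data log-likelihood available as a standalone function, since it is the quantity that each EM iteration should increase. Below is a minimal sketch assuming NumPy/SciPy; the function name `mixture_loglik` and its argument names are illustrative choices, not anything defined in the text.

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(y, mu1, sigma1_sq, mu2, sigma2_sq, lam):
    """Observed-data log-likelihood of the two-part Normal mixture."""
    # Density of each observation under each mixture component.
    d1 = norm.pdf(y, loc=mu1, scale=np.sqrt(sigma1_sq))
    d2 = norm.pdf(y, loc=mu2, scale=np.sqrt(sigma2_sq))
    # Sum over i of log[lam * phi_1(y_i) + (1 - lam) * phi_2(y_i)].
    return np.sum(np.log(lam * d1 + (1 - lam) * d2))
```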

The art of applying the EM algorithm is coming up with a useful complete data model. In this example, the approach is to hypothesize that each observation comes from one of two populations parameterized by $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, respectively. The “missing data” in this case are the labels identifying which observation came from which population. Therefore, we assert that there are missing data $z_1, \dots, z_n$ such that $z_i \sim \text{Bernoulli}(\lambda)$. When $z_i = 1$, $y_i$ comes from population 1, and when $z_i = 0$, $y_i$ comes from population 2.

The idea is then that the data are sampled in two stages. First we sample $z_i$ to see which population the data come from, and then given $z_i$, we sample $y_i$ from the appropriate Normal distribution. The joint density of the observed and missing data, i.e. the complete data density, is then

$$g(y, z \mid \theta) = \varphi(y \mid \mu_1, \sigma_1^2)^{z}\,\varphi(y \mid \mu_2, \sigma_2^2)^{1-z}\,\lambda^{z}(1-\lambda)^{1-z}.$$

It’s easy to show that $\sum_{z=0}^{1} g(y, z \mid \theta) = f(y \mid \theta)$, so that when we “integrate” out the missing data, we get back the observed data density.
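To make the two-stage sampling story concrete, here is a small simulation sketch; the sample size and the parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, lam = 500, 0.3
mu1, sigma1 = 0.0, 1.0   # population 1 (z_i = 1)
mu2, sigma2 = 4.0, 1.5   # population 2 (z_i = 0)

# Stage 1: draw the (ordinarily unobserved) labels z_i ~ Bernoulli(lambda).
z = rng.binomial(1, lam, size=n)

# Stage 2: given z_i, draw y_i from the corresponding Normal population.
y = np.where(z == 1,
             rng.normal(mu1, sigma1, size=n),
             rng.normal(mu2, sigma2, size=n))
```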

The complete data log-likelihood is then

$$\log g(y, z \mid \theta) = \sum_{i=1}^n z_i \log\varphi(y_i \mid \mu_1, \sigma_1^2) + (1-z_i)\log\varphi(y_i \mid \mu_2, \sigma_2^2) + z_i\log\lambda + (1-z_i)\log(1-\lambda).$$

Note that this function is nice and linear in the missing data $z_i$. To evaluate the $Q(\theta \mid \theta_0)$ function we need to take the expectation of the above expression with respect to the missing data density $h(z \mid y, \theta)$. But what is that? The missing data density will be proportional to the complete data density, so that

$$\begin{aligned}
h(z \mid y, \theta) &\propto \varphi(y \mid \mu_1, \sigma_1^2)^{z}\,\varphi(y \mid \mu_2, \sigma_2^2)^{1-z}\,\lambda^{z}(1-\lambda)^{1-z}\\
&= \left(\lambda\,\varphi(y \mid \mu_1, \sigma_1^2)\right)^{z}\left((1-\lambda)\,\varphi(y \mid \mu_2, \sigma_2^2)\right)^{1-z},
\end{aligned}$$

which is a Bernoulli distribution with success probability

$$\frac{\lambda\,\varphi(y \mid \mu_1, \sigma_1^2)}{\lambda\,\varphi(y \mid \mu_1, \sigma_1^2) + (1-\lambda)\,\varphi(y \mid \mu_2, \sigma_2^2)}.$$

From this, what we need in order to compute the $Q(\cdot)$ function is $\pi_i = \mathbb{E}[z_i \mid y_i, \theta_0]$, which is exactly the success probability above evaluated at $\theta_0$. Given that, we then compute the $Q(\cdot)$ function in the E-step:

$$\begin{aligned}
Q(\theta \mid \theta_0) &= \mathbb{E}\left[\left.\sum_{i=1}^n z_i\log\varphi(y_i \mid \mu_1, \sigma_1^2) + (1-z_i)\log\varphi(y_i \mid \mu_2, \sigma_2^2) + z_i\log\lambda + (1-z_i)\log(1-\lambda)\,\right|\,\mathbf{y}, \theta_0\right]\\
&= \sum_{i=1}^n \pi_i\log\varphi(y_i \mid \mu_1, \sigma_1^2) + (1-\pi_i)\log\varphi(y_i \mid \mu_2, \sigma_2^2) + \pi_i\log\lambda + (1-\pi_i)\log(1-\lambda)\\
&= \sum_{i=1}^n \pi_i\left[-\frac{1}{2}\log 2\pi\sigma_1^2 - \frac{1}{2\sigma_1^2}(y_i - \mu_1)^2\right] + (1-\pi_i)\left[-\frac{1}{2}\log 2\pi\sigma_2^2 - \frac{1}{2\sigma_2^2}(y_i - \mu_2)^2\right] + \pi_i\log\lambda + (1-\pi_i)\log(1-\lambda)
\end{aligned}$$

In order to compute $\pi_i$, we need to use the current estimates of $\mu_1, \sigma_1^2, \mu_2$, and $\sigma_2^2$ (in addition to the data $y_1, \dots, y_n$). We can then compute the gradient of $Q$ in order to maximize it for the current iteration. Setting the gradient to zero gives the updated values

$$\begin{aligned}
\hat{\mu}_1 &= \frac{\sum_{i=1}^n \pi_i y_i}{\sum_{i=1}^n \pi_i} &
\hat{\mu}_2 &= \frac{\sum_{i=1}^n (1-\pi_i) y_i}{\sum_{i=1}^n (1-\pi_i)}\\
\hat{\sigma}_1^2 &= \frac{\sum_{i=1}^n \pi_i (y_i - \hat{\mu}_1)^2}{\sum_{i=1}^n \pi_i} &
\hat{\sigma}_2^2 &= \frac{\sum_{i=1}^n (1-\pi_i)(y_i - \hat{\mu}_2)^2}{\sum_{i=1}^n (1-\pi_i)}\\
\hat{\lambda} &= \frac{1}{n}\sum_{i=1}^n \pi_i
\end{aligned}$$

Once we have these updated estimates, we can go back to the E-step and recompute our $Q$ function.
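Putting the E-step and the M-step together, a minimal EM sketch for this mixture might look as follows. It assumes NumPy/SciPy; the function name `em_mixture`, the convergence tolerance, and the iteration cap are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np
from scipy.stats import norm

def em_mixture(y, mu1, mu2, sigma1_sq, sigma2_sq, lam,
               tol=1e-8, max_iter=500):
    """EM for the two-part Normal mixture, starting from the given values."""
    for _ in range(max_iter):
        # E-step: pi_i = P(z_i = 1 | y_i, theta_0) under the current estimates.
        d1 = lam * norm.pdf(y, mu1, np.sqrt(sigma1_sq))
        d2 = (1 - lam) * norm.pdf(y, mu2, np.sqrt(sigma2_sq))
        pi = d1 / (d1 + d2)

        # M-step: weighted means, weighted variances, and mixing proportion.
        mu1_new = np.sum(pi * y) / np.sum(pi)
        mu2_new = np.sum((1 - pi) * y) / np.sum(1 - pi)
        sigma1_sq_new = np.sum(pi * (y - mu1_new) ** 2) / np.sum(pi)
        sigma2_sq_new = np.sum((1 - pi) * (y - mu2_new) ** 2) / np.sum(1 - pi)
        lam_new = np.mean(pi)

        # Stop when the parameters have essentially stopped moving.
        change = max(abs(mu1_new - mu1), abs(mu2_new - mu2),
                     abs(sigma1_sq_new - sigma1_sq),
                     abs(sigma2_sq_new - sigma2_sq), abs(lam_new - lam))
        mu1, mu2, sigma1_sq, sigma2_sq, lam = (
            mu1_new, mu2_new, sigma1_sq_new, sigma2_sq_new, lam_new)
        if change < tol:
            break
    return mu1, mu2, sigma1_sq, sigma2_sq, lam
```

Called with the simulated `y` from the earlier sketch and rough starting values, e.g. `em_mixture(y, 0.0, 1.0, 1.0, 1.0, 0.5)`, it should settle near the parameter values used in the simulation, with the observed data log-likelihood increasing at every iteration.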

4.2.2 Censored Exponential Data

Suppose we have survival times $x_1, \dots, x_n \sim \text{Exponential}(\lambda)$. However, we do not observe these survival times because some of them are censored at times $c_1, \dots, c_n$. Because the censoring times are known, what we actually observe are the data $(y_i, \delta_i)$, where $y_i = \min(x_i, c_i)$, $\delta_i = 1$ if $x_i \leq c_i$ (the survival time is fully observed), and $\delta_i = 0$ if $x_i$ is censored at time $c_i$.
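As a concrete picture of the censoring mechanism, the observed data $(y_i, \delta_i)$ could be generated as in the sketch below; the true mean `lam_true` and the distribution of the censoring times are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam_true = 200, 2.0                        # lambda here is the mean survival time

x = rng.exponential(scale=lam_true, size=n)   # complete (partly unobserved) survival times
c = rng.uniform(0.5, 3.0, size=n)             # known censoring times

y = np.minimum(x, c)                          # observed time y_i = min(x_i, c_i)
delta = (x <= c).astype(int)                  # delta_i = 1 if fully observed, 0 if censored
```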

The complete data density is simply the Exponential density with mean $\lambda$,

$$g(x_1, \dots, x_n \mid \lambda) = \prod_{i=1}^n \frac{1}{\lambda}\exp(-x_i/\lambda).$$

To do the E-step, we need to compute

$$Q(\lambda \mid \lambda_0) = \mathbb{E}\left[\log g(x_1, \dots, x_n \mid \lambda) \mid \mathbf{y}, \lambda_0\right].$$

We can divide the data into the observations that we fully observe ($\delta_i = 1$) and those that are censored ($\delta_i = 0$). For the censored observations, the complete survival time is “missing”, so we can denote it by $z_i$. Given that, the $Q(\lambda \mid \lambda_0)$ function is

$$Q(\lambda \mid \lambda_0) = \mathbb{E}\left\{\left. -n\log\lambda - \frac{1}{\lambda}\left[\sum_{i=1}^n \delta_i y_i + (1-\delta_i) z_i\right] \right| \mathbf{y}, \lambda_0\right\}.$$

But what is $\mathbb{E}[z_i \mid y_i, \lambda_0]$? Because we assume the underlying survival times are exponentially distributed, we can use the “memoryless” property of the exponential distribution. That is, given that we have survived until the censoring time $c_i$, our expected survival time beyond that is simply the mean $\lambda$, so that $\mathbb{E}[z_i \mid y_i, \lambda_0] = c_i + \lambda_0$, where we plug in our current best estimate $\lambda_0$ because we don’t know $\lambda$ yet. Now, for the E-step we have

$$Q(\lambda \mid \lambda_0) = -n\log\lambda - \frac{1}{\lambda}\left[\sum_{i=1}^n \delta_i y_i + (1-\delta_i)(c_i + \lambda_0)\right].$$

With the missing data replaced by their expectations in the $Q$ function, we can execute the M-step and maximize the above function to get

$$\hat{\lambda} = \frac{1}{n}\left[\sum_{i=1}^n \delta_i y_i + (1-\delta_i)(c_i + \lambda_0)\right].$$

We can then update $\lambda_0 = \hat{\lambda}$ and go back and repeat the E-step.
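Because the M-step has a closed form, the entire EM iteration collapses to a one-line update. The sketch below just repeats that update until it stabilizes; the function name `em_censored_exponential` and its defaults are hypothetical choices for illustration.

```python
import numpy as np

def em_censored_exponential(y, delta, c, lam0=1.0, tol=1e-10, max_iter=1000):
    """EM estimate of the Exponential mean lambda with censored observations."""
    n = len(y)
    for _ in range(max_iter):
        # E-step: replace each censored survival time by its conditional
        # expectation c_i + lambda_0 (memoryless property of the exponential).
        completed = delta * y + (1 - delta) * (c + lam0)
        # M-step: lambda_hat is the average of the "completed" survival times.
        lam_hat = np.sum(completed) / n
        if abs(lam_hat - lam0) < tol:
            break
        lam0 = lam_hat
    return lam_hat
```

With the simulated `(y, delta, c)` from the previous sketch, `em_censored_exponential(y, delta, c)` should converge in a handful of iterations to a value near the true mean used in the simulation.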