4 The EM Algorithm

The EM algorithm is one of the most popular algorithms in all of statistics. A quick look at Google Scholar shows that the paper by Art Dempster, Nan Laird, and Don Rubin has been cited more than 50,000 times. “EM” stands for “Expectation-Maximization”, which indicates the two-step nature of the algorithm. At a high level, there are two steps: the “E-step” and the “M-step” (duh!).

The EM algorithm is not so much an algorithm as a methodology for creating a family of algorithms. We will get into how exactly it works a bit later, but suffice it to say that when someone says “We used the EM algorithm,” that probably isn’t enough information to understand exactly what they did. The devil is in the details, and most problems will need a bit of hand-crafting. That said, there are a number of canonical problems now where an EM-type algorithm is the standard approach.

The basic idea underlying the EM algorithm is as follows. We observe some data that we represent with \(Y\). However, there are also some missing data, which we represent with \(Z\), that make life difficult for us. Together, the observed data \(Y\) and the missing data \(Z\) make up the complete data \(X = (Y, Z)\).
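
To make this setup concrete, a canonical illustration (used throughout the sketches below, and chosen here purely for illustration) is a two-component normal mixture: each observation \(y_i\) is drawn from \(N(\mu_0, 1)\) with probability \(1-\lambda\) and from \(N(\mu_1, 1)\) with probability \(\lambda\). The observed data are \(Y = (y_1,\dots,y_n)\), the missing data are the unobserved component labels \(Z = (z_1,\dots,z_n)\) with \(z_i\in\{0,1\}\), and \(\theta = (\lambda, \mu_0, \mu_1)\).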

  1. We imagine the complete data have a density \(g(y, z\mid\theta)\) that is parametrized by the vector of parameters \(\theta\). Because of the missing data, we cannot evaluate \(g\).

  2. The observed data have the density \[ f(y\mid\theta) = \int g(y, z\mid\theta)\,dz \] and the observed data log-likelihood is \(\ell(\theta\mid y) = \log f(y\mid\theta)\).

  3. The problem now is that \(\ell(\theta\mid y)\) is difficult to evaluate or maximize because of the integral (for discrete problems this will be a sum). However, in order to estimate \(\theta\) via maximum likelihood using only the observed data, we need to be able to maximize \(\ell(\theta\mid y)\).

  4. The complete data density usually has some nice form (like being an exponential family member) so that if we had the missing data \(Z\), we could easily evaluate \(g(y,z\mid\theta)\) (the worked mixture display after this list makes this concrete).
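
For the mixture illustration above, these pieces work out as follows (unit variances assumed for simplicity):

\[ g(y_i, z_i\mid\theta) = \left[(1-\lambda)\,\varphi(y_i; \mu_0, 1)\right]^{1-z_i}\left[\lambda\,\varphi(y_i; \mu_1, 1)\right]^{z_i}, \qquad f(y_i\mid\theta) = (1-\lambda)\,\varphi(y_i; \mu_0, 1) + \lambda\,\varphi(y_i; \mu_1, 1), \]

where \(\varphi(\cdot\,;\mu,\sigma^2)\) denotes the normal density. Note that \(\log g(y_i, z_i\mid\theta)\) is a sum of simple terms and is linear in \(z_i\), whereas the observed data log-likelihood \(\ell(\theta\mid y) = \sum_{i=1}^n \log f(y_i\mid\theta)\) involves the log of a sum, which is what makes direct maximization awkward.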

Given this setup, the basic outline of the EM algorithm works as follows:

  1. E-step: Let \(\theta_0\) be the current estimate of \(\theta\). Define \[ Q(\theta\mid\theta_0) = \mathbb{E}\left[\log g(y,z\mid\theta)\mid y, \theta_0\right] \]

  2. M-step: Maximize \(Q(\theta\mid\theta_0)\) with respect to \(\theta\) to get the next value of \(\theta\).

  3. Goto 1 unless converged (a code sketch of this loop follows the list).
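
Below is a minimal sketch of this loop in Python for the two-component normal mixture described above. The function name, the unit-variance assumption, and the crude starting values are all illustrative choices, not a prescribed implementation.

```python
import numpy as np
from scipy.stats import norm

def em_mixture(y, lam=0.5, mu0=None, mu1=None, tol=1e-8, max_iter=500):
    """EM for a two-component normal mixture with unit variances (illustrative)."""
    y = np.asarray(y, dtype=float)
    # Crude starting values if none are supplied
    mu0 = np.min(y) if mu0 is None else mu0
    mu1 = np.max(y) if mu1 is None else mu1
    ll_old = -np.inf
    for _ in range(max_iter):
        # E-step: pi_i = E[z_i | y_i, theta_0] = P(z_i = 1 | y_i, theta_0)
        p0 = (1 - lam) * norm.pdf(y, mu0, 1.0)
        p1 = lam * norm.pdf(y, mu1, 1.0)
        pi = p1 / (p0 + p1)
        # M-step: maximize Q(theta | theta_0); closed form for this model
        lam = pi.mean()
        mu0 = np.sum((1 - pi) * y) / np.sum(1 - pi)
        mu1 = np.sum(pi * y) / np.sum(pi)
        # Observed data log-likelihood, used only to monitor convergence
        ll = np.sum(np.log((1 - lam) * norm.pdf(y, mu0, 1.0)
                           + lam * norm.pdf(y, mu1, 1.0)))
        if ll - ll_old < tol:
            break
        ll_old = ll
    return lam, mu0, mu1
```

For example, calling `em_mixture` on data simulated from a half-and-half mix of \(N(0,1)\) and \(N(4,1)\) should return \(\lambda\) near 0.5 and means near 0 and 4. The M-step here happens to have a closed form; in other problems it may itself require numerical optimization.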

In the E-step, the expectation is taken with respect to the missing data density, which is \[ h(z\mid y,\theta) = \frac{g(y,z\mid\theta)}{f(y\mid\theta)}. \] Because we do not know \(\theta\), we plug in the current estimate \(\theta_0\) to evaluate the missing data density. In particular, one can see that it is helpful if \(\log g(y, z \mid\theta)\) is linear in the missing data, so that taking the expectation is a simple operation.
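
In the mixture illustration, the missing data density at the current estimate gives the familiar “responsibilities”

\[ \pi_i = \mathbb{E}[z_i\mid y_i, \theta_0] = h(z_i = 1\mid y_i, \theta_0) = \frac{\lambda\,\varphi(y_i;\mu_1,1)}{(1-\lambda)\,\varphi(y_i;\mu_0,1) + \lambda\,\varphi(y_i;\mu_1,1)}, \]

with all parameters on the right-hand side evaluated at \(\theta_0\). Because \(\log g(y_i, z_i\mid\theta)\) is linear in \(z_i\), the E-step simply replaces each \(z_i\) with \(\pi_i\), giving

\[ Q(\theta\mid\theta_0) = \sum_{i=1}^n \left\{(1-\pi_i)\left[\log(1-\lambda) + \log\varphi(y_i;\mu_0,1)\right] + \pi_i\left[\log\lambda + \log\varphi(y_i;\mu_1,1)\right]\right\}, \]

which the M-step then maximizes in closed form.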