3.9 Maximum likelihood

Suppose we have a density function, giving us the distribution of some random variable. Let’s shake things up and take the Poisson as an example:

The Poisson distribution is an asymmetric (skewed) distribution with a long right tail that can only have non-negative values. It often gets used for counts, like “number of customers who will show up in the next hour.” I’m using it for this demo because it’s a relatively simple formula, and because it’s good to see a distribution that isn’t the Normal once in a while.

Yes, that is a factorial in the denominator. The Poisson is rad.

\[f(x) = \frac{e^{-\lambda} \lambda^x}{x!}\]

Note that this distribution has one parameter, \(\lambda\), which tells us exactly what it looks like. For a Poisson distribution, \(\lambda\) is both the mean and the variance, which is pretty cool.
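If you want to poke at the formula yourself, here's a minimal Python sketch (the choices \(\lambda = 3\) and a sample of 100,000 draws are arbitrary, just for the demo). It evaluates the density by hand, checks it against scipy's built-in Poisson, and confirms that the mean and variance of a big simulated sample both land near \(\lambda\).

```python
# A quick check of the Poisson pmf and the "mean = variance = lambda" claim.
# lambda = 3 and the sample size are arbitrary choices for this demo.
import math
import numpy as np
from scipy import stats

lam = 3.0

def poisson_pmf(x, lam):
    """The formula from the text: e^{-lambda} * lambda^x / x!"""
    return math.exp(-lam) * lam**x / math.factorial(x)

# The hand-rolled formula should match scipy's pmf
for x in range(6):
    print(x, poisson_pmf(x, lam), stats.poisson.pmf(x, lam))

# Mean and variance of a large simulated sample should both be close to lambda
rng = np.random.default_rng(1)
draws = rng.poisson(lam, size=100_000)
print(draws.mean(), draws.var())
```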

Now let’s think about a future sample of size \(n\) drawn from this distribution. That’s a collection of \(n\) random variables, \(x_1,\dots,x_n\). These random variables will be independent of each other (assuming you’re doing good sampling!), and they’ll all have the same distribution; so we call them independent and identically distributed, or iid for short.

If we want to think about the sample as a whole, we’ll need to consider the joint density of all these random variables. Well, the joint density is just the product of the individual densities (why, intuitively?).

A few quick probability facts, since it’s not actually all that intuitive:

  • By definition, two random variables are independent if and only if \(P(X<x, Y<y) = P(X<x)P(Y<y)\) for any values \(x\) and \(y\)
  • The joint CDF of two random variables is defined as \(F(x,y) = P(X<x, Y<y)\)
  • …so \(F(x,y) = P(X<x)P(Y<y)\) iff the variables are independent
  • You can show (but I’m not about to, beyond the quick sketch just below) that \(F(x,y) = F_X(x)F_Y(y)\) if and only if \(f(x,y) = f_X(x)f_Y(y)\)
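For the curious: in the continuous case, that last fact is essentially one mixed partial derivative. Differentiating the factored CDF once in each variable gives

\[f(x,y) = \frac{\partial^2 F(x,y)}{\partial x\,\partial y} = \frac{\partial^2}{\partial x\,\partial y}\left[F_X(x)\,F_Y(y)\right] = F_X'(x)\,F_Y'(y) = f_X(x)\,f_Y(y)\]

and going the other way, integrating \(f(x,y) = f_X(x)f_Y(y)\) over \((-\infty,x] \times (-\infty,y]\) splits the double integral into the product \(F_X(x)F_Y(y)\).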

So, fine. For random variables \(x_1,\dots,x_n\):

\[f(x_1,\dots,x_n~|~\theta) = \prod f(x_i~|~\theta)\]

What’s that new Greek letter? That’s theta. We usually use this letter to stand for the parameter(s) of a distribution in a generic way.

If our variables are Poisson:

\[f(x_1,\dots,x_n~|~\lambda) = \prod \frac{e^{-\lambda} \lambda^{x_i}}{x_i!}\]

In probability theory, we are usually given \(\lambda\). Then we consider \(f\) as a function of \(x_1,\dots,x_n\): for any possible set of data values, we can calculate the probability. The idea of maximum likelihood is to switch this around. Think of the data as given and look at \(f\) as a function of \(\lambda\).

Now, since this is related to the chance of seeing the data values that actually happened, we should choose \(\lambda\) to make it as big as possible (like choosing the pitcher to maximize the chance of the score I saw on TV).

To do that, we usually look at the log of the density (which we now call the likelihood function, because we’re now treating it as a function of \(\lambda\)) and maximize it (treating the \(x\)’s as fixed, looking for the best possible value of \(\lambda\)). Here’s how it goes…

The likelihood function is:

\[L(\lambda) = \prod \frac{e^{-\lambda} \lambda^{x_i}}{x_i!}\]
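Before doing any calculus, it helps to see this function with numbers in it. Here's a tiny Python sketch with a made-up data set (five counts, nothing special) that just evaluates \(L(\lambda)\) at a few candidate values of \(\lambda\):

```python
# What does "the likelihood as a function of lambda" look like?
# Toy data (made up for this demo) and a few candidate lambda values.
import math

data = [2, 3, 1, 4, 2]

def poisson_likelihood(lam, xs):
    """L(lambda) = product of e^{-lambda} * lambda^{x_i} / x_i! over the data."""
    out = 1.0
    for x in xs:
        out *= math.exp(-lam) * lam**x / math.factorial(x)
    return out

for lam in [1, 2, 3, 4, 5]:
    print(lam, poisson_likelihood(lam, data))
# The data never change; only lambda does. We want the lambda that
# makes this number as big as possible.
```

The data are fixed; only \(\lambda\) varies, and some values of \(\lambda\) make the observed data look much more probable than others.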

Take the log so it’s easier to work with (since if we maximize the log, we maximize the function):

\[l(\lambda) = \sum \left[x_i \log \lambda - \lambda - \log(x_i!)\right] = \log \lambda \sum x_i - n \lambda - \sum \log(x_i!)\]

Now we want to find the value of \(\lambda\) that maximizes this function, so we take the derivative of \(l\) with respect to \(\lambda\) and set it equal to zero:

\[l'(\lambda) = \frac{\sum x_i}{\lambda} - n = 0\]

Solving for \(\lambda\), we discover, somewhat to our own surprise, that \(\hat{\lambda} = \sum x_i/n\): the plain old sample mean.

Technically, we should check the second derivative to make sure this point is a maximum and not a minimum. I did. It’s negative. Yay!
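If you’d rather let the computer do the calculus, here’s a quick numerical check (simulated data with an arbitrary true \(\lambda = 3\); the log-likelihood is the formula above, using gammaln for \(\log(x!)\)). The numerical maximizer and the sample mean should agree to several decimal places.

```python
# Numerical sanity check: maximize the Poisson log-likelihood and
# compare the answer to the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln  # log(x!) = gammaln(x + 1)

rng = np.random.default_rng(7)
data = rng.poisson(3.0, size=200)   # arbitrary "true" lambda and sample size

def neg_log_likelihood(lam):
    # l(lambda) = log(lambda) * sum(x) - n * lambda - sum(log(x!))
    return -(np.log(lam) * data.sum() - len(data) * lam - gammaln(data + 1).sum())

result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 20), method="bounded")
print(result.x, data.mean())  # these should agree to several decimal places
```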

This is a maximum likelihood estimator (or MLE for short) of the parameter \(\lambda\). You’ve seen several estimators before, like using \(\frac{1}{n}\sum x_i\) to estimate \(\mu\), the population mean of a normal distribution. It turns out that most of the estimators you know are, in fact, maximum likelihood estimators, with one exception: the maximum likelihood estimator of \(\sigma^2\) is \(\hat{\sigma}^2 = \frac{\sum(y_i - \bar{y})^2}{n}\). Yes, the denominator is \(n\), not \(n-1\).

Alas, it turns out that the MLE for \(\sigma^2\) is biased (that is, \(E(\hat{\sigma}^2) \not= \sigma^2\); we’ll talk about this more in the future). Instead we usually use \[ s_y^2 = \frac{\sum(y_i - \bar{y})^2}{n-1},\] which is based on the MLE but is adjusted so that it’s unbiased.

This sort of thing drives Bayesians up the wall. They claim that you should have principles, not waffle between wanting MLEs and wanting unbiased estimators. It’s tough being a Bayesian sometimes.
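Back to the bias: you can see it directly by simulation. This sketch (Normal data with true \(\sigma^2 = 4\), many repeated samples of size 10, all arbitrary choices) averages both versions of the estimator. The \(n\)-denominator version comes in low, near \(\frac{n-1}{n}\sigma^2 = 3.6\), while the \(n-1\) version sits near 4.

```python
# Simulate the bias of the MLE of sigma^2 (denominator n) versus the
# usual unbiased estimator (denominator n - 1). Normal data, sigma^2 = 4.
import numpy as np

rng = np.random.default_rng(42)
sigma2 = 4.0            # true variance
n = 10                  # small samples make the bias easy to see
reps = 100_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
mle = samples.var(axis=1, ddof=0)        # divide by n (the MLE)
unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1

print(mle.mean())        # around 3.6, i.e. (n-1)/n * sigma^2: biased low
print(unbiased.mean())   # around 4.0
```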