## 3.9 Maximum likelihood

Suppose we have a density function, giving us the distribution of some random variable. Let’s shake things up and take the Poisson as an example:

The Poisson distribution is an asymmetric (skewed) distribution with a long right tail that takes only non-negative integer values. It often gets used for counts, like “number of customers who will show up in the next hour.” I’m using it for this demo because it has a relatively simple formula, and because it’s good to see a distribution that isn’t the Normal once in a while.

\[f(x) = \frac{e^{-\lambda} \lambda^x}{x!}\]

Yes, that is a factorial in the denominator. The Poisson is rad.

Note that this distribution has one *parameter*, \(\lambda\), that tells us what exactly it looks like. For a Poisson distribution, \(\lambda\) is both the mean *and* the variance, which is pretty cool.
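If you like poking at these things numerically, here’s a quick sketch using scipy (the value \(\lambda = 3\) is just for illustration):

```python
from scipy.stats import poisson

lam = 3.0            # an arbitrary illustrative value of lambda
dist = poisson(lam)  # a "frozen" Poisson distribution

print(dist.pmf(2))              # P(X = 2) = e^{-3} * 3^2 / 2!, about 0.224
print(dist.mean(), dist.var())  # both equal lambda: 3.0 3.0
```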

Now let’s think about some future sample of values drawn from this distribution, of size \(n\). That’s a collection of \(n\) random variables, \(x_1,\dots,x_n\). These random variables will be *independent* of each other (assuming you’re doing good sampling!), and they’ll all have the same distribution; so we call them *independent and identically distributed*, or *iid* for short.
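Drawing such a sample takes one line with numpy; here’s a quick sketch (the seed, \(\lambda\), and \(n\) are all made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)    # seeded so the demo is reproducible
x = rng.poisson(lam=3.0, size=50)  # an iid Poisson sample of size n = 50

print(x.mean(), x.var())  # both should be near 3, since mean = variance = lambda
```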

If we want to think about the sample as a whole, we’ll need to consider the *joint* density of all these random variables. Well, the joint density is just the product of the individual densities (why, intuitively?).

A few quick probability facts, since it’s not actually all that intuitive:

- By definition, two random variables are independent if and only if \(P(X \le x, Y \le y) = P(X \le x)P(Y \le y)\) for any values \(x\) and \(y\)
- The joint CDF of two random variables is defined as \(F(x,y) = P(X \le x, Y \le y)\)
- …so \(F(x,y) = F_X(x)F_Y(y)\) iff the variables are independent
- You can show (but I’m not about to) that \(F(x,y) = F_X(x)F_Y(y)\) if and only if \(f(x,y) = f_X(x)f_Y(y)\)
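None of that is a proof, but a quick simulation can make the factorization feel plausible. Here’s a sketch with two independent Poisson variables (all the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.poisson(2.0, size=n)  # two independent streams of draws
y = rng.poisson(5.0, size=n)

# empirical check of P(X = 2, Y = 4) = P(X = 2) * P(Y = 4)
joint = np.mean((x == 2) & (y == 4))
product = np.mean(x == 2) * np.mean(y == 4)
print(joint, product)  # these should be close to each other
```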

So, fine. For random variables \(x_1,\dots,x_n\):

\[f(x_1,\dots,x_n~|~\theta) = \prod f(x_i~|~\theta)\]

What’s that new Greek letter? That’s *theta*. We usually use this letter to stand for the parameter(s) of a distribution in a generic way.

If our variables are Poisson:

\[f(x_1,\dots,x_n~|~\lambda) = \prod \frac{e^{-\lambda} \lambda^{x_i}}{x_i!}\]
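Here’s what that product looks like in code, with a tiny made-up sample. One practical note: for any real amount of data you’d work on the log scale, since a product of many small densities underflows to zero:

```python
import numpy as np
from scipy.stats import poisson

x = np.array([2, 4, 1, 3, 3])  # a tiny made-up sample
lam = 3.0

joint = np.prod(poisson.pmf(x, lam))        # product of the individual densities
log_joint = np.sum(poisson.logpmf(x, lam))  # same quantity on the log scale
print(joint, np.exp(log_joint))             # these should agree
```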

In probability theory, we are usually *given* \(\lambda\). Then we consider \(f\) as a function of \(x_1,\dots,x_n\): for any possible set of data values, we can calculate the probability. The idea of maximum likelihood is to switch this around. Think of the data as *given* and look at \(f\) *as a function of* \(\lambda\).

Now, since this is related to the chance of seeing the data values that *actually happened*, we should choose \(\lambda\) to make it as big as possible (like choosing the pitcher to maximize the chance of the score I saw on TV).

To do that, we treat the density as a function of \(\lambda\), with the \(x\)’s held fixed (we now call it the *likelihood function*), and look for the value of \(\lambda\) that maximizes it; in practice we maximize its log, which is easier to work with. Here’s how it goes…

The likelihood function is:

\[L(\lambda) = \prod \frac{e^{-\lambda} \lambda^{x_i}}{x_i!}\]

Take the log so it’s easier to work with (the log is an increasing function, so maximizing the log maximizes the likelihood itself):

\[l(\lambda) = \sum [(x_i \log \lambda) - \lambda - \log(x_i!)]\]

\[= \log \lambda \sum x_i - n \lambda - \sum \log(x_i!)\]
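That simplified form translates directly into code. Here’s a hand-rolled version (the function name is mine), using scipy’s `gammaln` for the \(\log(x_i!)\) term, since \(\log(x!)\) is `gammaln(x + 1)`:

```python
import numpy as np
from scipy.special import gammaln  # log(x!) = gammaln(x + 1)

def poisson_loglik(lam, x):
    """Log-likelihood: log(lam) * sum(x) - n * lam - sum(log(x_i!))."""
    x = np.asarray(x)
    return np.log(lam) * x.sum() - len(x) * lam - gammaln(x + 1).sum()

print(poisson_loglik(3.0, [2, 4, 1, 3, 3]))  # same made-up sample as before
```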

Now we want to find the value of \(\lambda\) that maximizes this function, so we take the derivative of \(l\) with respect to \(\lambda\) and set it equal to zero:

\[l'(\lambda) = \frac{\sum x_i}{\lambda} - n = 0\]

Somewhat to our own surprise, we discover that \(\hat{\lambda} = \sum x_i/n\): the sample mean \(\bar{x}\).

Technically, we should check the second derivative to make sure this point is a maximum and not a minimum. I did. It’s negative. Yay!
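You can also sanity-check the calculus numerically: minimize the *negative* log-likelihood with scipy’s scalar optimizer and see where it lands (the bounds below are arbitrary):

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize_scalar

x = np.array([2, 4, 1, 3, 3])  # the same made-up sample

# minimize the negative log-likelihood over lambda
neg_loglik = lambda lam: -poisson.logpmf(x, lam).sum()
res = minimize_scalar(neg_loglik, bounds=(1e-6, 20), method="bounded")

print(res.x, x.mean())  # both about 2.6: the optimizer lands on the sample mean
```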

This is a *maximum likelihood estimator* (or *MLE* for short) of the parameter \(\lambda\). You’ve seen several estimators before, like using \(\frac{1}{n}\sum x_i\) to estimate \(\mu\), the population mean of a normal distribution. It turns out that most of the estimators you know are, in fact, MLEs, with one exception…the maximum likelihood estimator of \(\sigma^2\) is \(\hat{\sigma}^2 = \frac{\sum(y_i - \bar{y})^2}{n}\). Yes, the denominator is \(n\), not \(n-1\).

Alas, it turns out that the MLE for \(\sigma^2\) is *biased* (that is, \(E(\hat{\sigma}^2) \not= \sigma^2\); we’ll talk about this more in the future). Instead we usually use
\[ s_y^2 = \frac{\sum(y_i - \bar{y})^2}{n-1}\]

which is based on the MLE, but is adjusted so that it’s *unbiased*.

This sort of thing drives Bayesians up the wall. They claim that you should have *principles*, not waffle between wanting MLEs and wanting unbiased estimators. It’s tough being a Bayesian sometimes.
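If you want to see both estimators side by side, numpy exposes the two conventions through its `ddof` (“delta degrees of freedom”) argument; a quick sketch with made-up data:

```python
import numpy as np

y = np.array([4.0, 7.0, 5.0, 9.0, 5.0])  # made-up data, ybar = 6

print(np.var(y, ddof=0))  # MLE: divides by n, gives 3.2
print(np.var(y, ddof=1))  # unbiased: divides by n - 1, gives 4.0
```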