3.9 Maximum likelihood
Suppose we have a density function, giving us the distribution of some random variable. Let’s shake things up and take the Poisson as an example:

$$f(x) = \frac{e^{-\lambda}\lambda^x}{x!}$$

The Poisson distribution is an asymmetric (skewed) distribution with a long right tail that can only take non-negative values. It often gets used for counts, like “number of customers who will show up in the next hour.” I’m using it for this demo because it has a relatively simple formula, and because it’s good to see a distribution that isn’t the Normal once in a while.

Yes, that is a factorial in the denominator. The Poisson is rad.
Note that this distribution has one parameter, λ, that tells us exactly what it looks like. For a Poisson distribution, λ is both the mean and the variance, which is pretty cool.
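If you’d like to check that mean-equals-variance fact for yourself, here’s a quick simulation sketch (λ = 4, the seed, and the sample size are arbitrary choices for this demo):

```python
# Quick check that a Poisson sample's mean and variance are both close to lambda.
import numpy as np

rng = np.random.default_rng(seed=1)
lam = 4.0
draws = rng.poisson(lam, size=100_000)

print(draws.mean())  # ~4.0
print(draws.var())   # ~4.0
```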
Now let’s think about some future sample of values drawn from this distribution, of size n. That’s a collection of n random variables, $x_1, \ldots, x_n$. These random variables will be independent of each other (assuming you’re doing good sampling!), and they’ll all have the same distribution; so we call them independent and identically distributed, or iid for short.
If we want to think about the sample as a whole, we’ll need to consider the joint density of all these random variables. Well, the joint density is just the product of the individual densities (why, intuitively?).
A few quick probability facts, since it’s not actually all that intuitive:
- By definition, two random variables are independent if and only if $P(X \le x, Y \le y) = P(X \le x)\,P(Y \le y)$ for any values x and y
- The joint CDF of two random variables is defined as $F(x, y) = P(X \le x, Y \le y)$
- …so $F(x, y) = P(X \le x)\,P(Y \le y)$ iff the variables are independent
- You can show (but I’m not about to) that $F(x, y) = F_X(x) F_Y(y)$ if and only if $f(x, y) = f_X(x) f_Y(y)$
So, fine. For random variables $x_1, \ldots, x_n$:

$$f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta)$$
What’s that new Greek letter? That’s theta. We usually use this letter to stand for the parameter(s) of a distribution in a generic way.
If our variables are Poisson:
$$f(x_1, \ldots, x_n \mid \lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}$$
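As a sanity check, we can evaluate that joint density for a small sample by multiplying the individual Poisson densities together (the data values and λ here are made up for illustration):

```python
# The joint density of an iid sample is the product of the marginal densities.
import numpy as np
from scipy.stats import poisson

lam = 4.0
x = np.array([2, 5, 3, 4, 6])  # a hypothetical observed sample

joint = np.prod(poisson.pmf(x, mu=lam))
print(joint)  # the product of the five individual pmf values
```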
In probability theory, we are usually given λ. Then we consider f as a function of x1,…,xn: for any possible set of data values, we can calculate the probability. The idea of maximum likelihood is to switch this around. Think of the data as given and look at f as a function of λ.
Now, since this is related to the chance of seeing the data values that actually happened, we should choose λ to make it as big as possible (like choosing the pitcher to maximize the chance of the score I saw on TV).
To do that, we usually work with the log of the density (which, now that we’re treating it as a function of λ, we call the likelihood function) and maximize it (treating the x’s as fixed and looking for the best possible value of λ). Here’s how it goes…
The likelihood function is:
$$L(\lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}$$
Take the log so it’s easier to work with (since the log is an increasing function, maximizing the log maximizes the original function):
$$\ell(\lambda) = \sum_{i=1}^n \left[ x_i \log\lambda - \lambda - \log(x_i!) \right] = \log\lambda \sum x_i - n\lambda - \sum \log(x_i!)$$
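One way to convince yourself the algebra above is right is to evaluate both forms numerically and see that they agree (the sample and λ below are made up; log(x!) is computed via scipy’s gammaln):

```python
# Check the log-likelihood algebra: sum of log-pmfs == the simplified form.
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln  # gammaln(x + 1) == log(x!)

x = np.array([2, 5, 3, 4, 6])  # hypothetical data
lam = 3.5

direct = np.log(poisson.pmf(x, mu=lam)).sum()
closed = np.log(lam) * x.sum() - len(x) * lam - gammaln(x + 1).sum()
print(direct, closed)  # the two values agree
```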
Now we want to find the value of λ that maximizes this function, so we take the derivative of $\ell$ with respect to λ:
$$\ell'(\lambda) = \frac{\sum x_i}{\lambda} - n = 0$$
Somewhat to our own surprise, we discover that $\hat{\lambda} = \sum x_i / n$, which is just the sample mean.
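And if you don’t trust the calculus, a numerical optimizer lands on the same answer. A minimal sketch, with made-up data (the optimizer’s bounds are arbitrary):

```python
# Minimize the negative log-likelihood numerically; the optimum is the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln  # gammaln(x + 1) == log(x!)

x = np.array([2, 5, 3, 4, 6])  # hypothetical data

def neg_log_likelihood(lam):
    return -(np.log(lam) * x.sum() - len(x) * lam - gammaln(x + 1).sum())

result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 20), method="bounded")
print(result.x)   # ~4.0
print(x.mean())   # 4.0 -- the MLE is the sample mean
```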
Technically, we should check the second derivative to make sure this point is a maximum and not a minimum. I did. It’s negative. Yay!
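If you’d like to see that without grinding through the algebra by hand, a computer algebra system will confirm the sign (a sketch using sympy, where S stands in for $\sum x_i$):

```python
# Symbolically differentiate the log-likelihood twice; the result is negative.
import sympy as sp

lam, n, S = sp.symbols("lambda n S", positive=True)  # S = sum of the x_i
ell = S * sp.log(lam) - n * lam  # log-likelihood, constant term dropped
print(sp.diff(ell, lam, 2))      # -S/lambda**2, negative whenever S > 0
```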
This is a maximum likelihood estimator (or MLE for short) of the parameter λ. You’ve seen several estimators before, like using $\frac{1}{n}\sum x_i$ to estimate μ, the population mean of a normal distribution. It turns out that most of the estimators that you know are, in fact, maximum likelihood estimators (MLEs), with one exception…the maximum likelihood estimator of $\sigma^2$ is $s^2 = \frac{\sum (y_i - \bar{y})^2}{n}$. Yes, the denominator is n, not n − 1.
Alas, it turns out that the MLE for $\sigma^2$ is biased (that is, $E(s^2) \ne \sigma^2$; we’ll talk about this more in the future). Instead we usually use $s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$, which is based on the MLE, but is adjusted so that it’s unbiased.

This sort of thing drives Bayesians up the wall. They claim that you should have principles, not waffle between wanting MLEs and wanting unbiased estimators. It’s tough being a Bayesian sometimes.
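Here’s a quick simulation sketch of that bias, if you’d like to see it in action (the sample size, $\sigma^2$, the seed, and the repetition count are arbitrary choices):

```python
# Compare the divide-by-n MLE of sigma^2 with the divide-by-(n-1) estimator.
import numpy as np

rng = np.random.default_rng(seed=2)
n, sigma2, reps = 10, 4.0, 50_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
devs = samples - samples.mean(axis=1, keepdims=True)
ss = (devs ** 2).sum(axis=1)

print((ss / n).mean())        # ~3.6: systematically below sigma^2 = 4
print((ss / (n - 1)).mean())  # ~4.0: unbiased
```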