3.10 Maximum likelihood and least squares
Let’s look again at our linear model:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
Assume $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. That’s our usual assumption about the errors (independent, equal variance, normal distribution), but in slightly fancier notation.
Remember, to a frequentist, parameters aren’t random! They’re fixed, we just don’t happen to know what they are. It’s easy being a frequentist sometimes.
Now, looking at the right-hand side of the model, the only thing over there that’s actually random is $\varepsilon_i$. That’s because the $\beta$’s are parameters and the $x_i$’s are assumed to be fixed (given). So the $y_i$’s are also random, since they have a random component, but all their randomness “comes from” the $\varepsilon_i$’s. Adding in the fixed part, $\beta_0 + \beta_1 x_i$, shifts the mean but doesn’t change the variance or the shape of the distribution. Thus, given the $x_i$’s (or in other words, conditioned on the $x_i$’s), the density of each $y_i$ is Normal with mean $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$.
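If it helps to see that concretely, here’s a minimal simulation sketch in Python/numpy (the parameter values, seed, and $x$ grid are just made up for illustration, not anything from the class). All of the randomness in `y` comes from `eps`; the fixed part only shifts the mean.

```python
# Minimal sketch of the model y_i = beta0 + beta1*x_i + eps_i
# (illustration values only)
import numpy as np

rng = np.random.default_rng(42)

beta0, beta1, sigma = 2.0, 0.5, 1.0    # "true" parameters: fixed, just unknown in practice
x = np.linspace(0, 10, 50)             # the x_i's, treated as fixed/given
eps = rng.normal(0, sigma, size=x.size)  # the only random ingredient
y = beta0 + beta1 * x + eps            # so y_i ~ Normal(beta0 + beta1*x_i, sigma^2)
```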
The pdf for a normal RV is: $$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$ which I grant you looks pretty horrible, but is surprisingly well behaved in some interesting ways, some but not all of which you will encounter in this class.
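If you want to check that formula against something, here’s a quick sanity check against scipy’s built-in normal density (the values of $\mu$, $\sigma$, and the test point are arbitrary):

```python
# Compare the written-out normal pdf to scipy's implementation
import numpy as np
from scipy.stats import norm

mu, sigma, xval = 1.0, 2.0, 0.3
by_hand = 1 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(xval - mu)**2 / (2 * sigma**2))
print(np.isclose(by_hand, norm.pdf(xval, loc=mu, scale=sigma)))  # True
```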
So (remembering that all the data points are assumed to be independent!):
Note that I’m not using $\theta$ here to represent the parameters. I could write it that way, but since I know what the relevant parameters are, I might as well write them out by name.
$$f(y_1, \ldots, y_n \mid \beta_0, \beta_1, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right]$$
Considered as a function of the y’s, this is a density: given certain parameter values, here’s the relative chance of these y’s occurring. But viewed as a function of the parameters, this becomes the likelihood function of the intercept and slope. That’s what we want to do: we want to find estimated values of the parameters that maximize the chance of seeing the y’s that we observed.
For the moment, assume $\sigma$ is known and fixed. In that case, the likelihood function is a function of $b_0$ and $b_1$:
$$L(b_0, b_1) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(y_i - b_0 - b_1 x_i)^2}{2\sigma^2}\right]$$ Work out the product:
$$= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2\right]$$
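One way to convince yourself the product really does collapse like that is to compute the likelihood both ways on some simulated data and compare. This is just an illustrative sketch; the data and parameter values below are invented.

```python
# Likelihood computed term-by-term vs. the collapsed form
import numpy as np

rng = np.random.default_rng(1)
b0, b1, sigma = 2.0, 0.5, 1.0
x = np.linspace(0, 10, 30)
y = b0 + b1 * x + rng.normal(0, sigma, size=x.size)

resid = y - b0 - b1 * x
n = x.size

# Straight from the first line: multiply the n individual densities
term_by_term = np.prod(1 / (np.sqrt(2 * np.pi) * sigma)
                       * np.exp(-resid**2 / (2 * sigma**2)))

# Collapsed form from the second line
collapsed = (2 * np.pi * sigma**2)**(-n / 2) * np.exp(-np.sum(resid**2) / (2 * sigma**2))

print(term_by_term, collapsed)  # same number, computed two different ways
```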
Remember our goal is to maximize the function. Since the log is an increasing function, whatever maximizes the log also maximizes the original function. So it doesn’t hurt to think about the log instead :)
Take the log of this to simplify things:
$$\ell(b_0, b_1) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2$$
Notice that the first two terms are constants; we can’t do anything about those, no matter what we say about $b_0$ and $b_1$. To maximize the log likelihood, we want to focus on the part that does depend on those quantities (and since $\frac{1}{2\sigma^2}$ is just a positive constant multiplier, we can ignore it too; it doesn’t change where the maximum is): $-\sum_i (y_i - b_0 - b_1 x_i)^2$
In other words, we want to minimize: $$\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2$$
BUT WAIT.
That looks familiar.
That’s… $\sum_i (y_i - \hat{y}_i)^2$. That’s the sum of squared residuals!
We came at this problem from a whole new perspective – trying to find the b’s that maximized the likelihood of seeing the values we observed. And after all that hard work, we found ourselves doing the very same thing as we did in least squares estimation.
We used the independence assumption, too! Where did that come in?
When it comes to fitting linear regression coefficients, least squares and maximum likelihood agree… if you start from the assumption that the errors are normally distributed. That assumption was how we got that original density function. So this is one of the reasons we have that “normally distributed errors” condition in regression – it means that no matter which meaning of “best fit” we’re using, we’ll get the same answer.
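If you’d like to see that agreement numerically, here’s a small sketch (the simulated data, seed, and known $\sigma$ are all made up for illustration): we minimize the negative log likelihood from above with scipy’s general-purpose optimizer and compare against the ordinary least-squares fit from `np.polyfit`.

```python
# Maximum likelihood (normal errors) vs. least squares, side by side
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=x.size)
sigma = 1.0  # treated as known and fixed, as in the derivation above

def neg_log_lik(b):
    b0, b1 = b
    resid = y - b0 - b1 * x
    n = x.size
    return ((n / 2) * np.log(2 * np.pi) + (n / 2) * np.log(sigma**2)
            + np.sum(resid**2) / (2 * sigma**2))

# Maximize the likelihood by minimizing its negative log
mle_b0, mle_b1 = minimize(neg_log_lik, x0=[0.0, 0.0]).x

# Least-squares fit for comparison (polyfit returns slope first for degree 1)
ls_b1, ls_b0 = np.polyfit(x, y, deg=1)

print("MLE:          ", mle_b0, mle_b1)
print("Least squares:", ls_b0, ls_b1)  # same estimates, up to optimizer tolerance
```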