2.10 Maximum likelihood and least squares
Let’s look again at our linear model:
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]
Assume \(\varepsilon_i \stackrel{iid}{\sim} N(0,\sigma^2)\). That’s our usual assumption about the errors (independent, mean zero, equal variance, normal distribution), but in slightly fancier notation.
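If it helps to see what that assumption looks like in action, here’s a minimal simulation sketch in Python (the parameter values are made up purely for illustration) that generates data exactly the way the model describes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up "true" parameter values, just for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0

n = 100
x = rng.uniform(0, 10, size=n)        # the x_i's, treated as fixed/given
eps = rng.normal(0, sigma, size=n)    # iid N(0, sigma^2) errors
y = beta0 + beta1 * x + eps           # y_i = beta0 + beta1*x_i + eps_i
```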
Remember, to a frequentist, parameters aren’t random! They’re fixed, we just don’t happen to know what they are. It’s easy being a frequentist sometimes.
Now, looking at the right-hand side of the model, the only thing over there that’s actually random is \(\varepsilon_i\). That’s because the \(\beta\) are parameters and the \(x_i\) are assumed to be fixed (given). So the \(y_i\)’s are also random, since they have a random component, but all their randomness “comes from” the \(\varepsilon_i\)’s. Adding in the fixed part, \(\beta_0 + \beta_1 x_i\), shifts the mean but doesn’t change the variance or the shape of the distribution. Thus, given the \(x_i\)’s (or in other words, conditioned on the \(x_i\)’s), the density of each \(y_i\) is Normal with mean \(\beta_0 + \beta_1 x_i\) and variance \(\sigma^2\).
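In symbols, that last sentence says:
\[ y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i,\, \sigma^2) \]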
The pdf for a normal RV is: \[f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{\frac{-(x-\mu)^2}{2\sigma^2}}\] which I grant you looks pretty horrible, but is surprisingly well behaved in some interesting ways, some but not all of which you will encounter in this class.
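If you’d rather not take my word for that formula, here’s a quick sanity check in Python against scipy’s built-in normal pdf, at some arbitrary made-up values:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0   # arbitrary values, just for the check
x = 0.7

# The pdf written out by hand
by_hand = 1 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(x - mu)**2 / (2 * sigma**2))

print(by_hand, norm.pdf(x, loc=mu, scale=sigma))  # same number both ways
```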
So (remembering that all the data points are assumed to be independent!):
Note that I’m not using \(\theta\) here to represent the parameters. I could write it that way, but since I know what the relevant parameters are, I might as well write them out by name.
\[f(y_1,\dots,y_n|\beta_0,\beta_1,\sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\left[\frac{-(y_i-\beta_0 - \beta_1 x_i)^2}{2 \sigma^2}\right]\]
Considered as a function of the \(y\)’s, this is a density: given certain parameter values, here’s the relative chance of these \(y\)’s occurring. But viewed as a function of the parameters, this becomes the likelihood function of the intercept and slope. That’s what we want to do: we want to find estimated values of the parameters that maximize the chance of seeing the \(y\)’s that we observed.
For the moment, assume \(\sigma\) is known and fixed. In that case, the likelihood function is a function of \(b_0\) and \(b_1\) (I’m writing \(b\)’s here for candidate values of the coefficients, to distinguish them from the true, unknown \(\beta\)’s):
\[L(b_0,b_1) =\prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\left[\frac{-(y_i-b_0 - b_1 x_i)^2}{2 \sigma^2}\right] \] Work out the product:
\[= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum(y_i-b_0-b_1 x_i)^2\right]\]
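Here’s a little sketch in Python (with simulated data, and \(\sigma\) treated as known) confirming that the product of per-point normal densities and the worked-out expression give the same number for any candidate \(b_0\) and \(b_1\) you plug in:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulated data under made-up "true" parameters (sigma treated as known)
beta0, beta1, sigma = 2.0, 0.5, 1.0
n = 10
x = rng.uniform(0, 10, size=n)
y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)

def lik_product(b0, b1):
    """Likelihood as the product of per-point normal densities."""
    return np.prod(norm.pdf(y, loc=b0 + b1 * x, scale=sigma))

def lik_worked_out(b0, b1):
    """The same likelihood after working out the product."""
    ss = np.sum((y - b0 - b1 * x) ** 2)
    return (2 * np.pi * sigma**2) ** (-n / 2) * np.exp(-ss / (2 * sigma**2))

print(lik_product(1.8, 0.6), lik_worked_out(1.8, 0.6))  # identical, up to rounding
```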
Remember our goal is to maximize the function. If we maximize the log, we also maximize the original function. So it doesn’t hurt to think about the log instead :)
Take the log of this to simplify things:
\[l(b_0,b_1) = -\frac{n}{2} \log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2 \sigma^2} \sum(y_i -b_0 - b_1 x_i)^2\]
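In case the first two terms look mysterious: they’re just the log of the constant out front,
\[\log\left[\frac{1}{(2\pi\sigma^2)^{n/2}}\right] = -\frac{n}{2}\log(2\pi\sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2}\log(\sigma^2)\]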
Notice that the first two terms are constants; we can’t do anything about those, no matter what we say about \(b_0\) and \(b_1\). To maximize the log likelihood, we want to focus on the part that does depend on those quantities: \[-\sum(y_i -b_0 - b_1 x_i)^2\]
In other words we want to minimize: \[\sum(y_i -b_0 - b_1 x_i)^2\]
BUT WAIT.
That looks familiar.
That’s… \(\sum (y_i - \hat{y}_i)^2\), where \(\hat{y}_i = b_0 + b_1 x_i\). That’s the sum of squared residuals!
We came at this problem from a whole new perspective – trying to find the \(b\)’s that maximized the likelihood of seeing the values we observed. And after all that hard work, we found ourselves doing the very same thing as we did in least squares estimation.
We used the independence assumption, too! Where did that come in?
When it comes to fitting linear regression coefficients, least squares and maximum likelihood agree…if you start from the assumption that the errors are normally distributed. That assumption was how we got that original density function. So this is one of the reasons we have that “normally distributed errors” condition in regression – it means that no matter which meaning of “best fit” we’re using, we’ll get the same answer.
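To see that agreement numerically, here’s a sketch in Python (with simulated data, and \(\sigma\) treated as known): it fits the line once with the least squares closed-form formulas, and once by numerically minimizing the negative log-likelihood. Minimizing the negative log-likelihood is the same as maximizing the likelihood; it’s just the convention numerical optimizers expect.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulated data under made-up "true" parameters
beta0, beta1, sigma = 2.0, 0.5, 1.0
n = 200
x = rng.uniform(0, 10, size=n)
y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)

# Least squares: closed-form slope and intercept
b1_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_ls = y.mean() - b1_ls * x.mean()

# Maximum likelihood: numerically minimize the negative log-likelihood
# (sigma known and fixed, as in the derivation above)
def neg_log_lik(b):
    b0, b1 = b
    resid = y - b0 - b1 * x
    return n / 2 * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

b0_ml, b1_ml = minimize(neg_log_lik, x0=[0.0, 0.0]).x

print(b0_ls, b1_ls)  # least squares estimates
print(b0_ml, b1_ml)  # maximum likelihood estimates: the same, up to optimizer tolerance
```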