2.8 But really, why squared?

Why did our definition of “best fit” use the squares of the residuals, \(SSE = \sum{e_i^2} = \sum(y_i - \hat{y}_i)^2\)? We noted that this made sure everything was positive, so things didn’t cancel out, but there are other ways to make a value positive!

The first one you might think of is absolute value, since this simply takes any number and makes it non-negative. This turns out to be mathematically annoying, though, because the absolute value function isn’t differentiable at zero (so taking a derivative to minimize it is going to be gross). Still, there are lots of other non-negative functions that do have continuous derivatives (fourth powers, say), which leads us back to: why squares?

The reason is that we assumed the errors are normally distributed. The choice of sum of squares comes from combining this assumption with a process called maximizing the likelihood.
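
To preview where this is headed, here is a rough sketch of that combination. The notation is assumed here for illustration: the errors \(e_i = y_i - \hat{y}_i\) are taken to be independent and normal with mean 0 and a common variance \(\sigma^2\), there are \(n\) observations, and \(L\) stands for the likelihood of the whole sample.

\[
L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)
\]

\[
\log L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{SSE}{2\sigma^2}
\]

Under those assumptions, the first term doesn’t depend on the fitted line at all, and SSE enters with a minus sign, so whichever line makes \(\log L\) biggest is also the line that makes SSE smallest.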

Likelihood of what? Likelihood that you’d observe the value you did, given a value of the parameter. This is a bit trippy, so here’s a thought experiment I like to do:

Say that my favorite baseball team has three pitchers who could start a game. Let \(Y\) be the number of runs the pitcher allows in a game (the number of points the other team scores). I happen to know the probability, for each pitcher, of giving up a certain number of runs:

##     Joe Gio Max
## Y=0 0.2 0.6 0.9
## Y=1 0.3 0.3 0.1
## Y=4 0.5 0.1 0.0

I see the game score on TV, but it doesn’t say who the pitcher is. If the opposing team scored 0 runs, who do you think was pitching? What if they scored 4 runs? (What if I told you that, as a sweet promotion, the manager had chosen the pitcher randomly with the following probabilities: 0.9, 0.1, 0?)

Remember parameters and statistics from your previous stats work? A parameter is a characteristic of the population distribution, while a statistic is something you calculate from a sample. For example, a normal distribution has two parameters: the mean and the variance (or standard deviation, equivalently).

So the key here is that in probability land, if I know the pitcher (parameter), I have the probability of seeing each score (observed value/sample statistic). Likelihood is about knowing the score and trying to guess the pitcher: given the sample I observed, what distribution do I think it came from?
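
If it helps to see the pitcher example as a calculation, here is a minimal Python sketch. Everything in it beyond the table itself is made up for illustration: the names `run_probs`, `likelihoods`, `most_likely_pitcher`, `posterior`, and `promo_prior` are hypothetical, and the code assumes the promotion probabilities 0.9, 0.1, 0 are listed in the table’s column order (Joe, Gio, Max), which the text doesn’t actually say.

```python
# Probability of each run total, by pitcher (the columns of the table above).
run_probs = {
    "Joe": {0: 0.2, 1: 0.3, 4: 0.5},
    "Gio": {0: 0.6, 1: 0.3, 4: 0.1},
    "Max": {0: 0.9, 1: 0.1, 4: 0.0},
}


def likelihoods(score):
    """Read across one row of the table: P(score | pitcher) for each pitcher."""
    return {pitcher: probs[score] for pitcher, probs in run_probs.items()}


def most_likely_pitcher(score):
    """Pick the pitcher whose distribution makes the observed score most probable."""
    lik = likelihoods(score)
    return max(lik, key=lik.get)


def posterior(score, prior):
    """Weight each likelihood by how likely that pitcher was to start, then rescale to sum to 1."""
    weighted = {p: prior[p] * lik for p, lik in likelihoods(score).items()}
    total = sum(weighted.values())
    return {p: w / total for p, w in weighted.items()}


# ASSUMPTION: 0.9, 0.1, 0 are taken in the table's column order (Joe, Gio, Max).
promo_prior = {"Joe": 0.9, "Gio": 0.1, "Max": 0.0}

for score in (0, 4):
    print(score, most_likely_pitcher(score), posterior(score, promo_prior))
```

Going by the table alone, a 0-run game points to Max and a 4-run game points to Joe. Under the assumed promotion probabilities, Joe comes out ahead even for a 0-run game, because Max was never going to be picked in the first place.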