## 3.8 But really, why squared?

Why did our definition of “best fit” use the *squares* of the residuals, \(SSE = \sum e_i^2 = \sum(y_i - \hat{y}_i)^2\)? We noted that squaring made every term non-negative, so positive and negative residuals couldn’t cancel each other out. But there are other ways to make a value non-negative!
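
Just to pin down the computation, here's a quick sketch with made-up numbers (the \(y\) values and fitted values are invented for illustration):

```python
# Toy data: observed y values and fitted values y-hat from some line (made-up numbers)
y = [3.0, 5.0, 7.0, 9.5]
y_hat = [3.2, 4.8, 7.1, 9.0]

# Residuals e_i = y_i - y_hat_i; squaring makes every term non-negative
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]
sse = sum(e**2 for e in residuals)
print(round(sse, 2))  # 0.34
```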

The first one you might think of is absolute value, which simply takes any number and makes it non-negative. This turns out to be mathematically annoying, though, because absolute value is non-differentiable at zero (so taking a derivative to minimize it is going to be gross). Still, there are plenty of other functions that are non-negative *and* smooth everywhere, which leads us back to: why squares?

The reason is that we assumed the errors are *normal*. The choice of sum of squares comes from combining this assumption with a process called *maximizing the likelihood*.

Likelihood of what? Likelihood that you’d observe the value you did, given a value of the parameter. This is a bit trippy, so here’s a thought experiment I like to do:

Say that my favorite baseball team has three pitchers who could start a game. Let \(Y\) be the number of runs the pitcher allows in a game (the number of points the other team scores). I happen to know the probability, for each pitcher, of giving up a certain number of runs:

```
##      Joe  Gio  Max
## Y=0  0.2  0.6  0.9
## Y=1  0.3  0.3  0.1
## Y=4  0.5  0.1  0.0
```

I see the game score on TV, but it doesn’t say who the pitcher is. If the opposing team scored 0 runs, who do you think was pitching? What if they scored 4 runs? (What if I told you that, as a sweet promotion, the manager had chosen the pitcher randomly: Joe with probability 0.9, Gio with probability 0.1, and Max with probability 0?)
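
Here's a sketch of that guessing game in code, using the table above. (I'm assuming the manager's selection probabilities are listed in the same order as the table columns: Joe, Gio, Max.)

```python
# P(Y = y | pitcher), straight from the table above
probs = {
    "Joe": {0: 0.2, 1: 0.3, 4: 0.5},
    "Gio": {0: 0.6, 1: 0.3, 4: 0.1},
    "Max": {0: 0.9, 1: 0.1, 4: 0.0},
}

def most_likely_pitcher(y):
    # Likelihood reasoning: fix the observed score y, compare across pitchers
    return max(probs, key=lambda p: probs[p][y])

print(most_likely_pitcher(0))  # Max: highest P(Y=0)
print(most_likely_pitcher(4))  # Joe: highest P(Y=4)

# With the manager's selection probabilities (assumed order: Joe, Gio, Max),
# weight each pitcher's likelihood by how likely he was to start at all
prior = {"Joe": 0.9, "Gio": 0.1, "Max": 0.0}

def most_plausible_pitcher(y):
    return max(probs, key=lambda p: prior[p] * probs[p][y])

print(most_plausible_pitcher(0))  # Joe: the prior can overturn the likelihood
```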

Remember parameters and statistics from your previous stats work? The *parameter* is a characteristic of the *population* distribution. For example, a normal distribution has two parameters: the mean and the variance (or standard deviation, equivalently).

So the key here is that in probability land, if I know the pitcher (**p**arameter), I can work out the probability of seeing each score (observed value/**s**ample statistic). Likelihood reverses this: I know the *score* and I'm trying to guess the *pitcher*. Given the sample I observed, which distribution do I think it came from?
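
To connect this back to regression, here's the argument in sketch form, assuming each error is independent and normal, \(e_i \sim N(0, \sigma^2)\). The likelihood of observing our data is

\[
L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right),
\]

so the log-likelihood is

\[
\log L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2.
\]

The first term doesn't involve the fitted values at all, so maximizing the likelihood over the line's coefficients means making \(\sum(y_i - \hat{y}_i)^2\) as small as possible. In other words, with normal errors, the maximum likelihood fit is exactly the least-squares fit: the one that minimizes \(SSE\).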