3.2 Bias, variance, and estimators
Here in (frequentist) statistics, we spend a lot of time thinking about True Quantities that we can never observe, which we often refer to as parameters. We’d like to estimate those quantities as best we can, based on some observed data. A function or formula that does this estimation is called an estimator. For example, an estimator for the mean of a normal distribution is \(\sum x_i /n\), where the \(x_i\)’s are our observations drawn from that population.
Note the subtle difference between “estimator” and “estimate.” An estimator is a function. Once we actually plug numbers into it and get a result, that quantity is an estimate.
Honestly it’s not going to make a huge difference in your current life if you use one word or the other, but it’s nice to think about the distinction.
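If it helps to see that distinction concretely, here's a tiny sketch in Python (the function name, the seed, and the fake data are all just illustrative choices):

```python
import numpy as np

# The estimator: a function that turns a dataset into a number
def sample_mean(x):
    return np.sum(x) / len(x)

# The estimate: what we get once we plug in actual observations
rng = np.random.default_rng(1)
observations = rng.normal(loc=5, scale=2, size=30)  # a hypothetical sample
print(sample_mean(observations))  # one number; this is the estimate
```

The function `sample_mean` is the estimator; the number it prints for this particular sample is the estimate.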
There are two properties that we (in general) want estimators to have:
- They should be close (on average/in expectation) to the thing they’re trying to estimate, and
- they should vary as little as possible.
Ideally, they should, at least on average, be exactly equal to the thing they’re trying to estimate.
Let’s use some notation. Let \(\theta\) be any True Quantity or parameter, and let \(\hat{\theta}\) represent some estimator of it. We’d like \(E(\hat{\theta}) = \theta\) (in which case \(\hat{\theta}\) is called unbiased), and we’d like \(Var(\hat{\theta})\) to be small.
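To see both properties in action, here's a little simulation sketch (the normal population, the sample size, and the number of repetitions are all arbitrary choices, nothing canonical). It computes two competing estimators of a normal mean, the sample mean and the sample median, on many repeated samples:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 5                      # the True Quantity: a normal population mean
n, reps = 25, 20_000           # sample size and number of repeated samples

# Each row is one sample of size n drawn from the population
samples = rng.normal(loc=theta, scale=2, size=(reps, n))

mean_hat = samples.mean(axis=1)          # estimator 1: sample mean
median_hat = np.median(samples, axis=1)  # estimator 2: sample median

for name, est in [("sample mean", mean_hat), ("sample median", median_hat)]:
    print(f"{name}: E(theta-hat) ~ {est.mean():.3f}, Var(theta-hat) ~ {est.var():.4f}")
```

Both come out essentially unbiased, but the sample mean varies less; that's exactly the kind of comparison we'll formalize in a moment.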
But we might be willing to allow a little bias (\(Bias= E(\hat{\theta} - \theta) \ne 0\)) in exchange for some reduction in variance. Perhaps we could create a criterion that combines both of them?
Indeed we can: the MSE!
We’ve used “MSE” in a few different places now! It showed up in the regression context, to represent the observed sum of squared errors divided by \(n-k-1\). Then it appeared when we were talking about the average squared prediction error. Here, we think about MSE as a function or characteristic of an estimator. You’ll see that the underlying ideas are similar.
Here’s the definition of MSE in an estimator context:
\[ \begin{aligned} MSE(\hat{\theta}) &\equiv E[(\hat{\theta} - \theta)^2]\\ &= E[({\hat{\theta}} - E(\hat{\theta}) + E(\hat{\theta}) - \theta)^2]\\ &= E[({\hat{\theta}} - E(\hat{\theta}))^2] + 2 E[(\hat{\theta}- E(\hat{\theta}))(E(\hat{\theta})- \theta)] + (E(\hat{\theta})-\theta)^2\\ &= Var(\hat{\theta}) + Bias(\hat{\theta},\theta)^2 \end{aligned} \]
Check out that notation: \(Bias(\hat{\theta},\theta)\). This can be read as “the bias of \(\hat{\theta}\) as an estimator of \(\theta\).” Putting the \(\theta\) in there just reminds us what the actual goal was – the quantity we’re trying to estimate :)
Note that the bias is squared in this expression, so that the units are the same as for the variance.
This is a sweet result – the actual mean squared error of the estimator breaks down into bias and variance components. But where did that irritating middle term go?
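Before we chase down that middle term, here's a quick numerical sanity check of this decomposition. It's just a simulation sketch: the "shrunken mean" estimator and all the constants below are arbitrary choices, picked so the bias is actually non-zero.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 5
n, reps = 25, 50_000

samples = rng.normal(loc=theta, scale=2, size=(reps, n))
theta_hat = 0.8 * samples.mean(axis=1)   # a deliberately biased ("shrunken mean") estimator

mse = np.mean((theta_hat - theta) ** 2)    # E[(theta-hat - theta)^2], approximately
var = theta_hat.var()                      # Var(theta-hat), approximately
bias_sq = (theta_hat.mean() - theta) ** 2  # Bias(theta-hat, theta)^2, approximately

print(f"MSE          ~ {mse:.4f}")
print(f"Var + Bias^2 ~ {var + bias_sq:.4f}")  # should match, up to simulation noise
```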
3.2.1 Optional math moment: proof of the MSE decomposition
First, notice that that middle term is an expectation. In \(E[(\hat{\theta}- E(\hat{\theta}))(E(\hat{\theta})- \theta)]\), what’s random?
Answer: only \(\hat{\theta}\). The expected value, \(E(\hat{\theta})\), isn’t a random quantity: the expected value of the estimator is fixed; it’s just that we get different observed values of the estimator depending on the sample we use. So \(E(\hat{\theta}) - \theta\) is a non-random constant (!) and you can pull it right out of the expectation:
\[ \begin{aligned} E[(\hat{\theta}- E(\hat{\theta}))(E(\hat{\theta})- \theta)] &= (E(\hat{\theta})- \theta) \times E[\hat{\theta}- E(\hat{\theta})] \\ &= Bias(\hat{\theta},\theta)\times (E(\hat{\theta})- E(E(\hat{\theta})))\\ &=0 \end{aligned} \]
The bias may be non-zero, but \(E(\hat{\theta}) = E(E(\hat{\theta}))\) (because \(E(\hat{\theta})\) is a constant), so that second factor is 0 and the whole product vanishes.
This is a great general proof trick – check for secret constants that come from expected values!
3.2.2 Okay we’re back
So we’ve shown that any estimator’s MSE can be split into a bias piece and a variance piece. This means that, if we agree to deal only with unbiased estimators, we can compare them just by looking at their variances.
Here’s a handy example: regression coefficients. We’ve previously noted that the least squares slope estimator \(b_1\) is unbiased: \(E(b_1) = \beta_1\). This is true of all the least squares coefficient estimators (the \(b\)’s), even in multiple regression.
It turns out that, for the regression slope and intercept, the least squares estimators have the lowest possible variance of any linear unbiased estimators. This result is called the Gauss-Markov Theorem and I am not about to prove it here. It’s worth noting, though, that the proof relies on certain assumptions: \(E(\varepsilon_i)=0\), \(Var(\varepsilon_i) = \sigma_{\varepsilon}^2\) for all \(i\), and \(Cov(\varepsilon_i,\varepsilon_j)=0\) for \(i \ne j\).
The least squares estimators are therefore called BLUEs, for Best Linear Unbiased Estimators.
Later in life you may come across things like BLUPs and MVUEs, but BLUEs are good enough for now.
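If you’d rather see the Gauss-Markov result in action than prove it, here’s a simulation sketch (the true coefficients, the fixed \(x\) values, and the competing estimator are all just illustrative choices). The competitor, the slope of the line through the two endpoint observations, is linear in the \(y\)’s and unbiased, but it’s not least squares:

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1, sigma = 1.0, 2.0, 3.0    # made-up true coefficients and error SD
x = np.linspace(0, 9, 10)              # fixed predictor values
reps = 20_000

b1_ols, b1_endpoints = [], []
for _ in range(reps):
    y = beta0 + beta1 * x + rng.normal(scale=sigma, size=x.size)
    # Least squares slope
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    sxx = np.sum((x - x.mean()) ** 2)
    b1_ols.append(sxy / sxx)
    # A different linear, unbiased slope estimator: the line through the two endpoints
    b1_endpoints.append((y[-1] - y[0]) / (x[-1] - x[0]))

for name, est in [("least squares", np.array(b1_ols)), ("endpoint slope", np.array(b1_endpoints))]:
    print(f"{name}: E(b1) ~ {est.mean():.3f}, Var(b1) ~ {est.var():.4f}")
```

Both estimators average out to (roughly) the true slope, but the least squares version has the smaller variance, which is exactly what the theorem promises.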
3.2.3 Predictions
Another context in which we are trying to estimate True Quantities is prediction. In this situation, our estimator is…whatever our model predicts given some set of predictor values \(\boldsymbol{x}_{\nu}\), which we call \(\hat{y}_{\nu}\). But what is it that we’re trying to estimate?
From one perspective, we’re trying to estimate the true average response for given predictor value(s): \(\mu_{\nu}\), the mean response given \(\boldsymbol{x}_{\nu}\). (Look back at the notes on confidence intervals for predictions to see this notation!)
In that case, the difference between the True Quantity and our estimator is \((\mu_{\nu} - \hat{y}_{\nu})\). That’s the difference between the value of the True Line (or whatever) given that \(\boldsymbol{x}\), and our model’s prediction given that \(\boldsymbol{x}\). So the MSE is:
\[E[(\mu_{\nu} - \hat{y}_{\nu})^2] = Var(\hat{y}_{\nu}) + Bias(\hat{y}_{\nu}, \mu_{\nu})^2\]
But there’s another perspective we could take on this. Suppose the True Quantity we’re trying to estimate isn’t the average response \(\mu_{\nu}\), but a single, individual response value \(y_{\nu}\) (still with predictor values \(\boldsymbol{x}_{\nu}\)). Just as we saw with prediction intervals vs. confidence intervals, there’s now an “extra error” involved: the variation of individual points around \(\mu_{\nu}\). So the MSE gets an extra piece:
\[E[(y_{\nu} - \hat{y}_{\nu})^2] = Var(\hat{y}_{\nu}) + Bias(\hat{y}_{\nu}, \mu_{\nu})^2 + Var(\varepsilon)\]
or in other words
\[E[(y_{\nu} - \hat{y}_{\nu})^2] = Var(\hat{y}_{\nu}) + Bias(\hat{y}_{\nu}, \mu_{\nu})^2 + \sigma^2_{\varepsilon}\] This is the version the textbook uses. It’s practical because in the real world, we don’t have any actual data about \(\mu_{\nu}\); all we have is a sample of observed individual \(y\) values. So those are what we use to calculate, say, the test MSE.
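If you’re wondering where that extra piece comes from, it’s the same add-and-subtract trick from the optional math moment above, this time splitting on \(\mu_{\nu}\) (and using the fact that a brand-new observation’s error is independent of our fitted model and has mean zero):
\[ \begin{aligned} E[(y_{\nu} - \hat{y}_{\nu})^2] &= E[(y_{\nu} - \mu_{\nu} + \mu_{\nu} - \hat{y}_{\nu})^2]\\ &= E[(y_{\nu} - \mu_{\nu})^2] + 2 E[(y_{\nu} - \mu_{\nu})(\mu_{\nu} - \hat{y}_{\nu})] + E[(\mu_{\nu} - \hat{y}_{\nu})^2]\\ &= \sigma^2_{\varepsilon} + 0 + Var(\hat{y}_{\nu}) + Bias(\hat{y}_{\nu}, \mu_{\nu})^2 \end{aligned} \]
The cross term vanishes because \(y_{\nu} - \mu_{\nu} = \varepsilon_{\nu}\) has expectation zero and is independent of \(\hat{y}_{\nu}\), and the last term is just the earlier decomposition with \(\hat{y}_{\nu}\) playing the role of \(\hat{\theta}\) and \(\mu_{\nu}\) playing the role of \(\theta\).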
Of course, no matter what choices you make about modeling, you can never get rid of \(\sigma^2_{\varepsilon}\) (that’s why we call it irreducible error). Such is life.