5.1 Model selection criteria

This is a collection of ways to describe whether a model is “good” – though they don’t all have the same idea of what a “good model” actually means!

Several of these measures are specifically used to compare two candidate models. The value of the measure might not mean a lot in isolation, but given two different models, you can say which one scores better than the other. One thing to watch out for is whether there are restrictions on what kinds of models you can compare!

5.1.1 \(t\)-tests

We have spent a lot of time with one way to decide (or, you know, test) whether a single coefficient is useful, in the sense of “would we plausibly see an estimated \(b_j\) this far from 0, if the true \(\beta_j\) were zero?” We just do a \(t\)-test on that individual \(b_j\), using the variance/standard error of \(b_j\).

But you can also think about this as a way to evaluate or compare models. Doing such a \(t\)-test amounts to choosing between the model with predictor \(j\), and the model without it (but with all the other predictors still there). If the coefficient is significant, then the model with the predictor is considered better. If not, you might as well use the smaller, more parsimonious model.
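Here’s a minimal sketch of that idea in R, assuming a hypothetical data frame `dat` with a response `y` and predictors `x1` and `x2` (all names made up for illustration):

```r
# Hypothetical data frame `dat` with response y and predictors x1, x2
fit_with    <- lm(y ~ x1 + x2, data = dat)
fit_without <- lm(y ~ x1, data = dat)

# The t-test on the x2 coefficient in the larger model...
summary(fit_with)$coefficients["x2", ]

# ...answers the same question as comparing the two models directly:
# the F-statistic below is the square of that t-statistic, with the same p-value.
anova(fit_without, fit_with)
```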

5.1.2 Nested F-tests

We’ve previously looked at the overall \(F\) test for regression, but there’s a more general class of \(F\) tests you can do. The hypotheses here are \(H_0: \text{a particular subset of the }\beta_j{\text{'s}} = 0\) (reduced model) and \(H_A: \beta_j\not=0\) for at least one \(j\) in that subset. The test statistic compares the sums of squared errors of the full and reduced models: \[F = \frac{(SSE_{reduced} - SSE_{full})/(df_{reduced} - df_{full})}{SSE_{full}/df_{full}}\] where the \(df\)’s are the error degrees of freedom of each model. This is then compared to the appropriate \(F\) distribution to obtain a p-value.

So your two models being compared are: (with this bunch of predictors) and (without that bunch, but with all the other ones). We call these models nested. The reduced model (without that bunch of predictors) is nested inside the full model, which contains everything. Note that the reduced model is not allowed to have any predictor that isn’t in the full model.

One common situation where you might want to do a nested F-test is when you have a categorical predictor with several levels. Recall that in this situation, you incorporate the categorical predictor in the model by creating an indicator variable for each level of the predictor (except one, which is considered the baseline/default). The \(t\)-tests on these individual indicator variables can tell you whether each group is different from the baseline group. But you might want to ask “is this categorical variable useful in general?” In that case, what you’re testing is whether the model that includes all those indicator variables is better than the model that includes none of those indicator variables.
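As a sketch (again with made-up names: a data frame `dat` containing a response `y`, a numeric predictor `x`, and a factor `group` with several levels), the nested \(F\)-test for the categorical variable looks like this in R:

```r
# Full model: includes all the indicator variables for `group`
full <- lm(y ~ x + group, data = dat)

# Reduced model: drops the categorical predictor entirely
reduced <- lm(y ~ x, data = dat)

# Nested F-test: is `group` useful overall, given that x is already in the model?
anova(reduced, full)
```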

You may (or may not) recall this from previous stats work: you used to do ANOVA tests for whether there was, overall, a difference in mean response between a bunch of groups. Think of those groups as being defined by the levels of a categorical/grouping variable. Testing whether the variable is useful overall is equivalent to testing whether there’s a difference between the levels of the variable overall. And indeed, ANOVA tests are F-tests!

The \(F\) test that we’ve seen previously is a specific example of this: it compares the model with all your predictors (as the full model) vs. the constant model, using the mean but none of the predictors (as the reduced model). If it’s significant, that means that something in your model is useful; if not, none of these predictors are really helping to explain the response.
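Continuing the hypothetical example above, in R the overall \(F\) test is just the nested \(F\)-test with the intercept-only model as the reduced model:

```r
# Intercept-only ("constant") model as the reduced model
constant <- lm(y ~ 1, data = dat)

# This reproduces the overall F-statistic shown at the bottom of summary(full)
anova(constant, full)
```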

5.1.3 SSE

One way we think about modeling is with the goal of prediction – trying to produce \(\hat{y}\)’s that are close to the true \(y\)’s. So why not just look at how big the errors are that result from each model?

  • In this view, a “good model” is one that does not leave much error between the predictions and actual response values.
  • We want to decrease the sum of squared errors (SSE), or equivalently the MSE.
  • This approach isn’t restricted to comparing nested models (unlike \(F\) tests) – you can compare any two models.
  • Want low! The better model is the one that leaves less error.
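A quick sketch of comparing two (not necessarily nested) models by SSE, using the same hypothetical `dat` with an extra made-up predictor `x3`:

```r
# Two candidate models -- they don't have to be nested for this comparison
m1 <- lm(y ~ x1 + x2, data = dat)
m2 <- lm(y ~ x3, data = dat)

# Sum of squared errors for each; the smaller SSE "wins" by this criterion
sse <- function(fit) sum(resid(fit)^2)
sse(m1)
sse(m2)
```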

5.1.4 \(R^2\)

Recall from your intro days (or a couple weeks ago) that \(R^2\) can be described as “the fraction of the variation in the response that’s explained by the model.” That seems like a good thing!

Note the difference in mindset here! Looking for a small SSE is motivated by wanting a model that’s good at prediction. Looking for a large \(R^2\) is motivated by wanting a model that’s good at explaining the response.

  • A “good model” is one that explains a large proportion of the variation between response values.
  • Does not require comparing nested models.
  • Want high!
  • But be careful: adding a variable can never decrease \(R^2\) (and in practice it essentially always increases it)! If you do have two nested models, the larger one will inevitably have an \(R^2\) at least as high, even if the extra stuff isn’t really very useful.
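A sketch of that last point (same made-up `dat` as before; pretend `x2` and `x3` are pretty much useless):

```r
small <- lm(y ~ x1, data = dat)
big   <- lm(y ~ x1 + x2 + x3, data = dat)

# The bigger model's R^2 is at least as high, useful predictors or not
summary(small)$r.squared
summary(big)$r.squared
```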

5.1.5 Adjusted \(R^2\)

Okay, we need to do something about the “don’t include the useless stuff” part. We should have a penalty for including variables that don’t really help. So:

  • A “good model” is considered the same way as with \(R^2\)
  • But now, we take into account the number of variables: we’ll penalize making the model bigger. Hopefully this will prevent overfitting and having ridiculously bloated models.
  • Unlike regular \(R^2\), \(R^2_{adj}\) can decrease when adding a variable, if the SSE doesn’t decrease much.
  • Does not require comparing nested models.
  • Want high!
  • The penalty for adding a variable is not, relatively speaking, very heavy; so you’ll often wind up with a fairly large model.

The mathematical definition of adjusted \(R^2\) is: \[R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SSTO/(n-1)}\]

You can show that \[R^2_{adj} = R^2 - (1-R^2)\frac{k}{n-k-1}\] if you want to; it involves a bit of algebra and remembering that \(R^2\) is defined as \(1-SSE/SSTO\). (It’s easiest to start with the second \(R^2_a\) formula and work back to the first, if you ask me.)
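If you want a sketch of that algebra, starting from the second form:

\[\begin{aligned}
R^2_{adj} &= R^2 - (1-R^2)\frac{k}{n-k-1}\\
&= 1 - (1-R^2) - (1-R^2)\frac{k}{n-k-1}\\
&= 1 - (1-R^2)\,\frac{(n-k-1)+k}{n-k-1}\\
&= 1 - \frac{SSE}{SSTO}\cdot\frac{n-1}{n-k-1}\\
&= 1 - \frac{SSE/(n-k-1)}{SSTO/(n-1)}.
\end{aligned}\]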

Now this second definition here is in an interesting format.

  • There’s one piece that rewards us for adding things to the model. This piece, \(R^2\), measures goodness of fit, specifically in terms of proportion of response variation that’s explained by the model.
  • Then there’s another piece that penalizes us for adding things to the model: we subtract something involving \(k\).

So, if we add something to the model, we’ll get both a reward and a penalty. If the predictor is really useful, the reward will outweigh the penalty, and \(R^2_a\) will improve. But if the predictor barely helps with goodness of fit, the penalty will outweigh the reward.
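For instance (hypothetical names again; `junk` is a column of `dat` that has nothing to do with the response):

```r
base   <- lm(y ~ x1 + x2, data = dat)
bigger <- update(base, . ~ . + junk)   # add an unhelpful predictor

# Plain R^2 creeps up, but adjusted R^2 can actually drop
summary(base)$adj.r.squared
summary(bigger)$adj.r.squared
```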

5.1.6 AIC

This was developed by Hirotugu Akaike, “a most gentle person of great intellect, integrity and generosity.” He originally called it just “An Information Criterion” because apparently he was modest too.

The Akaike Information Criterion (almost always just called “AIC”) has this same structure: a reward for goodness of fit, and a penalty for the size/complexity of the model.

There’s some cool math you can use to explain why you apply this particular penalty for model size. I’m not including it here. It involves the Kullback-Leibler Divergence, which I mention primarily because it’s a great band name.

  • A “good model” is one under which our observations are plausible. Well, hold on, we’ve heard that before…
  • AIC’s “reward component” is based on the log-likelihood of the model:

\[AIC = -2 \log(L) + 2(k+1)\] So our idea of “goodness of fit” here is: observations have the greatest likelihood under this model.

  • Does not require comparing nested models. (But all the models must use the same data – no comparing models that use transformed and untransformed response vectors!)
  • Want low!
  • A limitation: this relies on having a decent number of observations. For small sample sizes, it’ll tend to overfit, including too many parameters. Instead, use AICc (corrected), which puts a heavier penalty on the number of parameters when \(n\) is small.

An interesting fact about AIC is this: assuming normality for the errors \(\boldsymbol{\varepsilon}\), AIC simplifies to approximately

\[ AIC = n\log SSE - n\log n + 2(k+1)\] So the “goodness of fit” part is related to the SSE. Remember, you want low values for the AIC, so a smaller SSE means a better model.

MATH! This simplification relies on using a specific equation for the log-likelihood. In this case, we have to use the multivariate normal. This, along with the expressions we have for maximum likelihood estimators when errors are normal (in particular, using \(SSE/n\) for \(\sigma^2\), as you may have explored in the practice problems!), gives us the simplified form.
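In R, `extractAIC()` uses essentially this simplified form, so you can check it by hand (made-up `dat` again):

```r
fit <- lm(y ~ x1 + x2, data = dat)

n   <- nobs(fit)
k   <- length(coef(fit)) - 1     # number of predictors (not counting the intercept)
sse <- sum(resid(fit)^2)

# Simplified AIC from above...
n * log(sse) - n * log(n) + 2 * (k + 1)

# ...matches the second element returned by extractAIC()
extractAIC(fit)
```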

5.1.7 BIC

SBC stands for Schwarz’s Bayesian Criterion, after Gideon Schwarz, who introduced it. And after Thomas Bayes, I guess, who came up with Bayes’ Rule and started this whole Bayesian mess in the first place.

The Bayesian Information Criterion (usually referred to as “BIC” or very occasionally “SBC”) is similar to the AIC in some ways. But oddly, it comes out of a pretty different philosophy.

  • A “good model” is (still) one with high likelihood, though the penalty is different (see definition).
  • The goodness-of-fit part is again based on the log-likelihood: \[BIC = -2 \log(L) + (k+1)\log(n)\] The original argument for using this is based on a Bayesian approach and we won’t talk about it here. But it’s cool that you wind up with something so similar to AIC, eh?
  • Does not require comparing nested models, but the response values have to be the same for all the models (again, no transformed vs. untransformed).
  • Has a larger penalty for larger models, so you’ll tend to get smaller models than with AIC (depends on \(n\) and \(k\) though).
  • Want low! (If you’re comparing two models, a difference of 2-6 in their BICs is considered indicative that one of them is somewhat “better”; more than 6 is pretty strong.)
  • Limitation: there is a nonzero probability that it will come up with a real clunker.

As you’ll see when you start trying to use this in practice, R actually considers BIC to be just a variation of the AIC!

Using the same mathematical approach as with AIC, the BIC simplifies to approximately: \[BIC = n\log SSE -n\log n + (k+1)\log n\] Compare AIC and BIC: the only difference is that the last term has \(k+1\) multiplied by 2 (for AIC) or \(\log n\) (for BIC). This makes it easy to see that the BIC’s penalty for model size is more severe, at least once \(n \ge 8\) (so that \(\log n > 2\))!
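One way to see R’s “BIC is a variation of AIC” attitude: `extractAIC()` with the penalty multiplier `k` set to `log(n)` instead of 2 gives the BIC-style value (hypothetical `dat` as before):

```r
fit <- lm(y ~ x1 + x2, data = dat)
n   <- nobs(fit)

# Note: the argument k here is the penalty per parameter, not the number of predictors
extractAIC(fit, k = 2)       # AIC: penalty of 2 per parameter
extractAIC(fit, k = log(n))  # BIC: penalty of log(n) per parameter
```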

5.1.8 Mallows’ \(C_p\)

This was developed by Colin Lingwood Mallows, with an s, not Mallow; have mercy on the poor man and put the apostrophe in the right place, after the s.

Mallows’ \(C_p\) (sometimes referred to as \(C_k\)) is another one that’s pretty similar to AIC – in fact it’s only off by a constant (in our normal-error linear regression case). And yet, it appears to think about model goodness from another different perspective!

  • A “good model” is one with smaller errors.
  • The goodness-of-fit part is defined in terms of sums of squares of the models being compared: \[C_p = SSE_{reduced}/MSE_{full} - (n-2(k+1))\] where \(k\) is the number of predictors in the reduced (candidate) model.
  • As you can see from the definition, you have to compare nested models for this one.
  • Want low!
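A sketch of computing it by hand, with the usual made-up `dat` (the candidate model here is `y ~ x1`, nested inside a full model with `x1`, `x2`, and `x3`):

```r
full      <- lm(y ~ x1 + x2 + x3, data = dat)
candidate <- lm(y ~ x1, data = dat)

n        <- nobs(full)
k        <- length(coef(candidate)) - 1               # predictors in the candidate model
sse_cand <- sum(resid(candidate)^2)
mse_full <- sum(resid(full)^2) / (n - length(coef(full)))

# Mallows' Cp; values near k + 1 suggest the candidate model is adequate
sse_cand / mse_full - (n - 2 * (k + 1))
```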