Chapter 10 Model Selection

In the field of applied statistics, model selection has seen perhaps the greatest growth over the last few decades. Advances in computing power and the proliferation of model selection routines in statistical software have combined to allow for this rapid development. But before we get into model selection, let’s think about why we might want to evaluate a number of models in the first place.

Ultimately we want the best model for our understanding or application. We live in an era with a lot of data, which means there can be a lot of predictors, and we need ways to identify the most important predictors and exclude those that are not contributing knowledge or information. Historically, model selection took place in an ad hoc way; today, however, model selection refers to a formal procedure that involves an actual analysis and comparison of different models. All of this hearkens back to Chamberlin’s idea of multiple working hypotheses. By developing multiple working hypotheses, it becomes harder to fall in love with a single hypothesis that we develop; similarly, by developing multiple models it becomes harder to fall in love with one model. And from those multiple models, model selection has become the routine that allows us to identify the best model.

10.1 Implicit and explicit model selection

Let’s first think about model selection writ large, that is, implicit and explicit model selection. (Note that these terms are introduced here and are not necessarily used in other references.) Explicit model selection is what modelers or statisticians most commonly mean when they talk about model selection; for example, AIC (Akaike Information Criterion) and the cross validation methods with which you may be familiar. Implicit model selection, however, often takes place before the explicit model selection, yet can be just as important. Implicit model selection refers to the things you do and decide on when designing a model. For example, implicit model selection includes deciding which inputs and predictors make sense to even consider including (based on a mechanism or hypothesis). Or, perhaps you’ve evaluated some predictors and looked at their effects, which may not be a great way to determine whether to include a predictor but may still produce some useful information. And, as covered in previous chapters, you will likely want to consider how correlated your predictors are when deciding whether they should all be included. The remainder of this chapter discusses explicit model selection; however, it remains critical to understand the power you have with implicit model selection and not to leave all the model selection to a mechanized routine.

10.2 Model Balance

Recall that when we develop a model we seek to balance generalizability and specificity. We want a model that is general enough to be widely used, but not so general that it has no power to aid understanding or prediction. Similarly, we want a model that is specific enough to be helpful, but not so specific that it is fine-tuned to one particular data set (overparameterized) and not useful in other applications. One thing to consider in model selection is whether you seek prediction or understanding. Although prediction and understanding are not mutually exclusive, and in fact most models include some amount of both, it can be important to ask yourself whether you are seeking to identify a model to more fully understand a complex process or perhaps just to make a simple prediction. Such a distinction could produce different outcomes with respect to the model selection routine you use. A model built for understanding may not predict well, but it can help explain the drivers of a system. On the other hand, a model built for prediction may not explain the system well, but still have value in performance and application. But again: prediction and understanding are not exclusive.

10.3 Information criteria: The ICs

Information criteria, or ICs, are model selection routines that seek to balance the reduction in the sum of squared errors against the cost of adding additional parameters. In other words, IC routines penalize models with more parameters unless those parameters help with the overall fit. A basic example is the adjusted \(R^2\); however, AIC is probably the most common information criterion, and several others exist. It remains important to remember that the same data need to be used for each model being compared. Obviously, the predictors can change, but the exact same observations need to go into each model.
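
For reference, one common form of the adjusted \(R^2\) makes this trade-off explicit, where \(n\) is the number of observations and \(p\) is the number of predictors; adding a predictor only pays off if it improves \(R^2\) enough to offset the larger \(p\):

\[ R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \]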

AIC, or the Akaike Information Criterion, is the most popular of the ICs. You can see from the formula below that AIC consists of two terms. The first term, \(-2\ln(\hat{L})\), is based on the likelihood of the data given the model and decreases as the fit improves. The second term, \(2k\), is the penalty term that increases with the number of inputs or estimated model parameters. We can think of the AIC formula as a measure of model fit plus a penalty for the number of estimated parameters. It’s important to remember that lower AIC values are better, but AIC is unitless, so it is not worth trying to interpret what an AIC value means on its own.

\[ AIC = -2\ln(\hat{L}) + 2k \]
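
As a minimal sketch (with made-up numbers, not from any data set in this chapter), AIC can be computed directly from a model’s maximized log-likelihood and its parameter count:

```python
def aic(log_likelihood, k):
    """AIC = -2*ln(L-hat) + 2k, where k counts all estimated parameters."""
    return -2 * log_likelihood + 2 * k

# Hypothetical model: maximized log-likelihood of -45.2 and
# 3 estimated parameters (e.g., 2 coefficients plus a variance term).
print(aic(-45.2, 3))  # 96.4
```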

Outside of very simple applications where you are just looking at a small number of AIC values (that may have clear differences), you are likely to evaluate a lot of models and therefore have a lot of AIC values to compare. In common situations such as these, there are different pieces of information derived from AIC that may help with the comparison. First, you may want to consider \(\Delta\)AIC. \(\Delta\)AIC is simply the number of AIC units any model is from the best model, such that the best model has a \(\Delta\)AIC of 0. From there, it is typical to retain only those models within a \(\Delta\)AIC cutoff, such as two or four (cite). Another tool for comparison is AIC weights, \(w\). AIC weights for a given set of models sum to one and represent how much weight of evidence is given to each model. The best model, the model with a \(\Delta\)AIC of 0, will have the greatest weight, and from there the weights decrease, often leaving several models with virtually no weight. There are no firm rules on AIC weights with regard to what specific weights mean or what cutoffs to use; however, weights can be important for simply understanding the overall distribution of support. Does one model have most of the weight? Or is most of the weight spread across a handful of models?
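
As a short sketch with hypothetical AIC values, \(\Delta\)AIC and Akaike weights can be computed directly; each weight is proportional to \(\exp(-\Delta/2)\), and the weights are rescaled so they sum to one:

```python
import numpy as np

# Hypothetical AIC values for four candidate models (not from real data).
aic_values = np.array([210.3, 211.1, 214.8, 223.6])

delta = aic_values - aic_values.min()   # delta-AIC: the best model has delta = 0
weights = np.exp(-0.5 * delta)
weights = weights / weights.sum()       # Akaike weights sum to 1

for d, w in zip(delta, weights):
    print(f"delta AIC = {d:4.1f}   weight = {w:.3f}")
```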

In addition to AIC you may want to consider other information criteria. Before leaving AIC completely, it is worth knowing that there is an AICc, a small-sample correction that is typically recommended when the sample size is small relative to the number of estimated parameters (a common rule of thumb is \(n/k < 40\)). Because AICc converges to AIC as the sample size grows, in theory AICc can be used all the time with little risk. Another option is BIC, or the Bayesian information criterion, which, despite its name, does not require Bayesian estimation. BIC imposes a greater penalty for additional terms in the model. DIC, the deviance information criterion, is an AIC-like criterion used with (typically Bayesian) hierarchical models. In general, different ICs will tend to give you similar answers, although they are not perfectly correlated. Also, ICs tend to be used more for model or system understanding and less for prediction, although that is certainly not a universal statement.
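
For reference, the commonly used forms of the small-sample correction and of BIC are, for \(n\) observations and \(k\) estimated parameters:

\[ AICc = AIC + \frac{2k(k + 1)}{n - k - 1}, \qquad BIC = -2\ln(\hat{L}) + k\ln(n) \]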

10.4 Cross validation

In many ways, cross validation can become more complex than using an IC, and in fact cross validation can be used for model validation and other applications beyond just picking a best model. Cross validation is built on the basic concept of resampling the data, much like bootstrapping. In other words, cross validation involves splitting a data set into a training data set and a test data set. The training data set is used to fit the model and estimate the coefficients, and then the test data set is used to evaluate how the model performs. As you can perhaps tell from this simple description, cross validation tends to be used more for model or system prediction and sometimes less for system understanding.

[Cross validation image]

Typically we hear about k-fold cross validation, where k is equal to the number of groups the data are split into. The smallest k-fold cross validation we can have is 2-fold cross validation, where half the data are used for training and half are used for testing. The largest k we can have is equal to n, the sample size of the data. When k = n, the procedure is often referred to as leave-one-out (or jackknife) cross validation. Between these two extremes, many applications of cross validation use k-folds between 3 and 5 (citation).
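
The following is a minimal sketch of k-fold cross validation written directly in Python with NumPy, using simulated data and a simple straight-line model (both are illustrative assumptions, not part of this chapter’s examples). Each fold serves once as the test set while the remaining folds are used to fit the model, and the test errors are averaged:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: one predictor with a linear effect plus noise.
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)

def kfold_mse(x, y, k=5):
    """Average test mean squared error from k-fold cross validation."""
    idx = rng.permutation(len(y))          # shuffle before splitting
    folds = np.array_split(idx, k)         # k roughly equal groups
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        # Fit y = b0 + b1*x on the training folds only.
        b1, b0 = np.polyfit(x[train_idx], y[train_idx], 1)
        pred = b0 + b1 * x[test_idx]
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return np.mean(errors)

print(kfold_mse(x, y, k=5))   # setting k = n would give leave-one-out CV
```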

10.5 Comparison of AIC and CV

Let’s consider Midway et al. (2013) as an example comparison of AIC and cross validation. In this study the researchers had eight different predictors available to predict the binary maturity status of a fish. After evaluating all subsets of models, AIC tended to select more complex models, that is, models with greater numbers of predictor terms. However, when looking at the cross validation scores of the same models, there was very little relationship between the AIC scores and the cross validation scores. Even though the top one or two models could be identified using AIC, models that were not competitive based on AIC (and would have been excluded as candidates for the best model) still performed very well under cross validation.

[Table 2]

The next thing the researchers did was evaluate the best performing model, based on cross validation, at each unique number of inputs. For example, what was the best model with only one predictor? What was the best model with only two predictors? And so forth. You can see in Table 3 that these models do not rank particularly well by AIC and would not have been selected using AIC; however, the more prediction-focused cross validation identified them as the best performing. This simple example makes it clear not only that AIC tends not to penalize parameters very heavily (ultimately leading to higher ranks for models with more terms), but also that similar predictive power was available with just one or two model terms compared to the much larger AIC-favored models. The takeaway from this example is that for understanding the process of fish maturation, a model with a lot of terms may be useful (perhaps because fish maturation is a complex process). However, if your goal is to predict whether a fish is or is not mature, you may only need one or two variables.
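
The following is only an illustrative sketch, not the Midway et al. (2013) analysis: it uses simulated data and hypothetical predictors to show how one might compute both an AIC and a cross-validated accuracy for every candidate model and then compare the two rankings. It assumes Python with scikit-learn, and the large C value approximately removes the default penalty so the fit is close to a maximum likelihood fit:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)

# Simulated stand-in for a binary maturity-type response: 4 candidate
# predictors, but only the first two actually drive the response.
n = 300
X = rng.normal(size=(n, 4))
prob = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.6 * X[:, 1])))
y = rng.binomial(1, prob)

for subset in combinations(range(4), 2):       # all two-predictor models
    Xs = X[:, list(subset)]
    model = LogisticRegression(C=1e6, max_iter=1000)  # large C: nearly unpenalized
    fit = model.fit(Xs, y)
    ll = -log_loss(y, fit.predict_proba(Xs), normalize=False)
    k = Xs.shape[1] + 1                        # coefficients plus intercept
    aic = -2 * ll + 2 * k
    cv_acc = cross_val_score(model, Xs, y, cv=5).mean()
    print(f"predictors {subset}: AIC = {aic:6.1f}, 5-fold accuracy = {cv_acc:.3f}")
```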

10.6 Which model selection should I use?

Ultimately, when it comes to the model selection routine you should use, there is no one-size-fits-all answer. It remains important to invest effort in implicit model selection. For example, there is no reason to continue to evaluate predictors if you don’t have mechanisms or hypotheses as to why they should even be considered in the first place. Explicit model selection should not become a crutch because the analyst or expert did not want to evaluate what is going into the model. After implicit model selection, you can consider whether you are primarily interested in understanding, in which case you may want something like AIC, or whether you are more heavily invested in prediction, in which case cross validation may be a better tool. Despite several critical differences among model selection tools, it remains important to understand that many model selection approaches will perform similarly on the same data set (Murtaugh citation).