11.4 Stepwise Selection

Stepwise selection was original developed as a feature selection technique for linear regression models. The forward stepwise regression approach uses a sequence of steps to allow features to enter or leave the regression model one-at-a-time. Often this procedure converges to a subset of features. The entry and exit criteria is commonly based on a p-value threshold. A typical entry criterion is that a p-value must be less than 0.15 for a feature to enter the model and must be greater than 0.15 for a feature to leave the model. The process begins by creating $p$ linear regression models, each of which uses exactly one of the features.⁸⁷ The importance of the features are then ranked by their individual ability to explain variation in the outcome. The amount of variation explained can be condensed into a p-value for convenience. If no features have a p-value less than 0.15, then the process stops. However, if one or more features have p-value(s) less than 0.15, then the one with the lowest value is retained. In the next step, $p-1$ linear regression models are built. These models consist of the feature selected in the first step as well as each of the other features individually. Each of the additional features are evaluated, and the best feature that meets the inclusion criteria is added to the selected feature set. Then the amount of variation explained by each feature in the presence of the other feature is computed and converted to a p-value. If the p-values do not exceed the exclusion criteria, then both are kept and the search procedure proceeds to search for a third feature. However, if a feature has a p-value that exceeds the exclusion criteria then it is removed from the current set of selected features. The removed feature can still enter the model at a later step. This process continues until it meets a convergence criteria.

This approach is problematic for several reasons, and a large literature exists that critiques this method (see Steyerberg, Eijkemans, and Habbema (1999), Whittingham et al. (2006), and Mundry and Nunn (2009)). Harrell (2015) provides a comprehensive indictment of the method that can be encapsulated by the statement:

“… if this procedure had just been proposed as a statistical method, it would most likely be rejected because it violates every principle of statistical estimation and hypothesis testing.”

Stepwise selection has two primary faults:

Inflation of false positive findings: Stepwise selection uses many repeated hypothesis tests to make decisions on the inclusion or exclusion of individual predictors. The corresponding p-values are unadjusted, leading to an over-selection of features (i.e., false positive findings). In addition, this problem is exacerbated when highly correlated predictors are present.
Model overfitting: The resulting model statistics, including the parameter estimates and associated uncertainty, are highly optimistic since they do not take the selection process into account.

It should be said that the second issue is also true of all of the search methods described in this chapter and the next. The individual model statistics cannot be taken as literal for this same reason. The one exception to this is the resampled estimates of performance. Suppose that a linear regression model was used inside RFE or a global search procedure. The internal estimates of adjusted $R^2$ , RMSE, and others will be optimistically estimated but the external resampling estimates should more accurately reflect the predictive performance of the model on new, independent data. That is why the external resampling estimates are used to guide many of the search methods described here and to measure overall effectiveness of the process that results in the final model.

One modification to the process that helps mitigate the first issue is to use a statistic other than p-values to select a feature. Akaike information criterion (AIC) is a better choice (Akaike 1974). The AIC statistic is tailored to models that use the likelihood as the objective (i.e., linear or logistic regression), and penalizes the likelihood by the number of parameters included in the model. Therefore models that optimize the likelihood and have fewer parameters are preferred. Operationally, after fitting an initial model, the AIC statistic can be computed for each submodel that includes a new feature or excludes an existing feature. The next model corresponds to the one with the best AIC statistic. The procedure repeats until the current model contains the best AIC statistic.

However, it is important to note that the AIC statistic is specifically tailored to models that are based on the likelihood. To demonstrate stepwise selection with the AIC statistic, a logistic regression model was built for the OkCupid data. For illustration purposes, we are beginning with a model that contained terms for age, essay length, and an indicator for being Caucasian. At the next step, three potential features will be evaluated for inclusion in the model. These features are indicators for the keywords of nerd, firefly, and im.

The first model containing the initial set of three predictors has an associated binomial log-likelihood value of -18525.3. There are 4 parameters in the model, one for each term and one for the intercept. The AIC statistic is computed using the log-likelihood (denoted as $\ell$ ) and the number of model parameters $p$ as follows:

$AIC = -2\ell + 2p = -2 (-18525.3) + 8 = 37059$

In this form, the goal is to minimize the AIC value.

On the first iteration of stepwise selection, six speculative models are created by dropping or adding single variables to the current set and computing their AIC values. The results are, in decreasing order:

term	AIC
`+` nerd	36,863
`+` firefly	36,994
`+` im	37,041
current model	37,059
`-` white	37,064
`-` age	37,080
`-` essay length	37,108

Based on these values, adding any of the three keywords would improve the model, with nerd yielding the best model. Stepwise is a less greedy method than other search methods since it does reconsider adding terms back into the model that have been removed (and vice versa). However, all of the choices are made on the basis of the current optimal step at any given time.

Our recommendation is to avoid this procedure altogether. Regularization methods, such as the previously discussed glmnet model, are far better at selecting appropriate subsets in linear models. If model inference is needed, there are a number of Bayesian methods that can be used (Mallick and Yi 2013; Piironen and Vehtari 2017 b, 2017 a).

$p$ , in this instance, is the number of predictors - not to be confused with p-values from statistical hypothesis tests.↩