11.2 Simple Filters
The most basic approach to feature selection is to screen the predictors to see if any have a relationship with the outcome prior to including them in a model. To do this, a numeric scoring technique is required to quantify the strength of the relationship. Using the scores, the predictors are ranked and filtered with either a threshold or by taking the top \(p\) predictors. Scoring the predictors can be done separately for each predictor, or simultaneously across all predictors (depending on the technique that is used). If the predictors are screened separately, there is a large variety of scoring methods. A summary of some popular and effective scoring techniques is provided below, organized by the type of predictor and outcome. The list is not meant to be exhaustive, but rather to be a starting place. A similar discussion is contained in Chapter 18 of Kuhn and Johnson (2013).
When screening individual categorical predictors, there are several options depending on the type of outcome data:
When the outcome is categorical, the relationship between the predictor and outcome forms a contingency table. When there are three or more levels for the predictor, the degree of association between predictor and outcome can be measured with statistics such as \(\chi^2\) (chi-squared) tests or exact methods (Agresti 2012). When there are exactly two classes for the predictor, the odds-ratio can be an effective choice (see Section 5.6).
When the outcome is numeric, and the categorical predictor has two levels, then a basic \(t\)-test can be used to generate a statistic. ROC curves and precision-recall curves can also be created for each predictor and the area under the curves can be calculated. When the predictor has more than two levels, the traditional ANOVA \(F\)-statistic can be calculated.
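As a rough illustration of these calculations for a categorical predictor, the sketch below uses Python's SciPy and scikit-learn (a convenience assumption, not the software used for the analyses in this chapter); the `predictor` and outcome arrays are hypothetical.

```python
# A sketch of screening a single two-level categorical predictor, assuming
# NumPy, SciPy, and scikit-learn; the arrays here are hypothetical.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
predictor = rng.choice(["a", "b"], size=100)            # two-level categorical predictor

# Categorical outcome: chi-squared (or exact) test on the contingency table,
# and an odds-ratio for the two-by-two case.
class_outcome = rng.choice(["event", "no_event"], size=100)
table = np.array([[np.sum((predictor == p) & (class_outcome == c))
                   for c in ("event", "no_event")]
                  for p in ("a", "b")])
chi2, chi2_pvalue, dof, expected = stats.chi2_contingency(table)
odds_ratio, exact_pvalue = stats.fisher_exact(table)

# Numeric outcome: a two-sample t-test, or an AUC that treats the numeric
# outcome as the "score" and the predictor levels as the "classes".
numeric_outcome = rng.normal(size=100)
t_stat, t_pvalue = stats.ttest_ind(numeric_outcome[predictor == "a"],
                                   numeric_outcome[predictor == "b"])
auc = roc_auc_score(predictor == "a", numeric_outcome)
```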
When the predictor is numeric, the following options exist:
When the outcome is categorical, the same tests can be used in the case above where the predictor is categorical and the outcome is numeric. The roles are simply reversed in the t-test, curve calculations and \(F\)-test. When there are a large number of tests or if the predictors have substantial multicollinearity, the correlation-adjusted t-scores of Opgen-Rhein and Strimmer (2007) and Zuber and Strimmer (2009) are a good alternative to simple ANOVA statistics.
When the outcome is numeric, a simple pairwise correlation (or rank correlation) statistic can be calculated. If the relationship is nonlinear, then the MIC values (Reshef et al. 2011) or \(A\) statistics (Murrell, Murrell, and Murrell 2016) can be used.
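A corresponding sketch for a numeric predictor, again assuming SciPy and scikit-learn with hypothetical arrays, is:

```python
# A sketch of screening a single numeric predictor, assuming NumPy, SciPy,
# and scikit-learn; the arrays here are hypothetical.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
predictor = rng.normal(size=100)

# Categorical outcome: the roles are reversed relative to the previous case --
# the AUC treats the predictor as the score, and an ANOVA F-test compares the
# predictor means across the classes.
class_outcome = rng.choice(["event", "no_event"], size=100)
auc = roc_auc_score(class_outcome == "event", predictor)
f_stat, f_pvalue = stats.f_oneway(*(predictor[class_outcome == lvl]
                                    for lvl in np.unique(class_outcome)))

# Numeric outcome: a simple correlation or rank correlation statistic.
numeric_outcome = rng.normal(size=100)
rho, rho_pvalue = stats.spearmanr(predictor, numeric_outcome)
```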
Alternatively, a generalized additive model (GAM) (Wood 2006) can fit nonlinear smooth terms to a set of predictors simultaneously and measure their importance using a p-value that tests the null hypothesis of no trend for each predictor. An example of such a model with a categorical outcome was shown in Figure 4.15(b).
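As a rough sketch of the GAM-based screen, the code below assumes the third-party pygam package (other GAM implementations would work equally well) and hypothetical data with a nonlinear signal in the first predictor only.

```python
# A sketch of simultaneous GAM-based screening, assuming the third-party
# pygam package; the data are hypothetical.
import numpy as np
from pygam import LogisticGAM, s

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = (X[:, 0] ** 2 + 0.5 * rng.normal(size=300) > 1).astype(int)

gam = LogisticGAM(s(0) + s(1) + s(2)).fit(X, y)
# Approximate p-values for each smooth term (the final entry is the intercept);
# small values suggest evidence against the "no trend" null for that predictor.
print(gam.statistics_["p_values"])
```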
A summary of these simple filters can be found in Figure 11.1.
Generating summaries of how effective individual predictors are at classifying or correlating with the outcome is straightforward when the predictors are all of the same type. But most data sets contain a mix of predictor types. In this setting it can be challenging to arrive at a ranking of the predictors since their screening statistics are on different scales. For example, an odds-ratio and a \(t\)-statistic are not directly comparable. In many cases, each statistic can be converted to a p-value so that there is a common scale across the screening statistics.
Recall that a p-value stems from the statistical framework of hypothesis testing. In this framework, we assume that there is no association between the predictor and the outcome. The data are then used to refute the assumption of no association. The p-value is the probability of observing a statistic at least as extreme as the one computed from the data if, in fact, no association exists between the predictor and outcome.
Each statistic can be converted to a p-value, but this conversion is easier for some statistics than others. For instance, converting a \(t\)-statistic to a p-value is a well-known process, provided that some basic assumptions are true. On the other hand, it is not easy to convert an AUC to a p-value. A solution to this problem is to use a permutation method (Good 2013; Berry, Mielke Jr, and Johnston 2016). This approach can be applied to any statistic to generate a p-value. Here is how a randomization approach works: for a selected predictor and corresponding outcome, the predictor is randomly permuted, but the outcome is not. The statistic of interest is then calculated on the permuted data. This process disconnects the observed predictor and outcome relationship, thus creating no association between the two. The same predictor is randomly permuted many times to generate a distribution of statistics. This distribution represents the distribution of no association (i.e., the null distribution). The statistic from the original data can then be compared to the distribution of no association to get a probability, or p-value, of coming from this distribution. An example of this is given in Kuhn and Johnson (2013) for Relief scores (Robnik-Sikonja and Kononenko 2003). As previously mentioned in Section 5.6, there are some cautions when using p-values, but this is a convenient quantity to use when dealing with different predictor types.
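To make the mechanics concrete, the sketch below builds a permutation null distribution for a single predictor's AUC; it assumes NumPy and scikit-learn, and the data are hypothetical.

```python
# A sketch of a permutation p-value for a statistic (here, the AUC) that has
# no convenient analytical reference distribution; the data are hypothetical.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
predictor = rng.normal(size=150)
outcome = rng.choice([0, 1], size=150)

observed = roc_auc_score(outcome, predictor)

# Permute the predictor (not the outcome) many times to sever any observed
# relationship and build the null distribution of "no association".
null_aucs = np.array([roc_auc_score(outcome, rng.permutation(predictor))
                      for _ in range(2000)])

# One-sided p-value: the probability of a null statistic at least as large as
# the observed one (the +1 keeps the estimate away from exactly zero).
p_value = (np.sum(null_aucs >= observed) + 1) / (len(null_aucs) + 1)
```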
Simple filters are effective at identifying individual predictors that are associated with the outcome. However, these filters are very susceptible to finding predictors that have strong associations in the available data but do not show any association with new data. In the statistical literature, these selected predictors are labeled as false positives. An entire sub-field of statistics has been devoted to developing approaches for minimizing the chance of false positive findings, especially in the context of hypothesis testing and p-values. One approach to reducing false positives is to adjust the p-values to effectively make them larger and thus less significant (as was shown in previous chapters).
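As a brief sketch of such an adjustment, the code below assumes the statsmodels package and uses hypothetical screening p-values; the Benjamini-Hochberg false discovery rate correction is shown, but other methods could be substituted.

```python
# A sketch of adjusting screening p-values to reduce false positive findings,
# assuming statsmodels; the raw p-values are hypothetical.
import numpy as np
from statsmodels.stats.multitest import multipletests

raw_p = np.array([0.001, 0.01, 0.04, 0.20, 0.65])
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
# The adjusted p-values are no smaller than the raw values, so fewer
# predictors clear a fixed significance threshold.
```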
In the context of predictive modeling, false positive findings can be minimized by using an independent set of data to evaluate the selected features. This context is exactly parallel to the context of identifying optimal model tuning parameters. Recall from Section 3.4 that cross-validation is used to identify an optimal set of tuning parameters such that the model does not overfit the available data. The model building process now needs to accomplish two objectives: to identify an effective subset of features, and to identify the appropriate tuning parameters such that the selected features and tuning parameters do not overfit the available data. When using simple screening filters, selecting both the subset of features and model tuning parameters cannot be done in the same layer of cross-validation, since the filtering must be done independently of the model tuning. Instead, we must incorporate another layer of cross-validation. The first layer, or external layer, is used to filter features. Then the second layer (the “internal layer”) is used to select tuning parameters. A diagram of this process is illustrated in Figure 11.2.
As one can see from this figure, conducting feature selection can be computationally costly. In general, the number of models constructed and evaluated is \(I\times E \times T\), where \(I\) is the number of internal resamples, \(E\) is the number of external resamples, and \(T\) is the total number of tuning parameter combinations.
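The skeleton below sketches this two-layer scheme; it assumes scikit-learn, and the filter threshold, placeholder classifier, tuning grid, and synthetic data are illustrative stand-ins rather than the analysis described later in this section.

```python
# A skeleton of the two-layer resampling scheme: the external layer applies
# the filter and measures performance, while the internal layer tunes the
# model. The filter, model, grid, and data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

def auc_filter(X, y, threshold=0.75):
    """Keep columns whose individual AUC (or its complement) clears the threshold."""
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    keep = np.maximum(aucs, 1.0 - aucs) >= threshold
    if not keep.any():                      # fall back to all predictors if nothing survives
        keep = np.ones(X.shape[1], dtype=bool)
    return keep

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

outer = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)  # E = 20 external resamples
grid = {"C": [0.01, 0.1, 1.0, 10.0]}                                       # T = 4 tuning candidates

outer_scores = []
for train, test in outer.split(X, y):
    keep = auc_filter(X[train], y[train])               # filter on the external analysis set
    inner = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                         cv=20, scoring="roc_auc")      # I = 20 internal resamples (here, folds)
    inner.fit(X[train][:, keep], y[train])
    prob = inner.predict_proba(X[test][:, keep])[:, 1]
    outer_scores.append(roc_auc_score(y[test], prob))   # external hold-out performance
# Roughly I x E x T = 20 x 20 x 4 model fits, plus one refit per external resample.
```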
11.2.1 Simple Filters Applied to the Parkinson’s Disease Data
The Parkinson’s disease data has several characteristics that could make modeling challenging. Specifically, the predictors have a high degree of multicollinearity, the sample size is small, and the outcome is imbalanced (74.6% of patients have the disease). Given these characteristics, partial least squares discriminant analysis (PLSDA, Barker and Rayens (2003)) would be a good first model to try. This model produces linear class boundaries, which may constrain it from overfitting to the majority class. The model tuning parameter for PLSDA is the number of components to retain.
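One common way to approximate PLSDA in code is to fit a PLS regression to a 0/1 class indicator and threshold the continuous prediction. The sketch below takes that route with scikit-learn and hypothetical data; it is not the exact implementation used for the analyses here.

```python
# A sketch of PLS discriminant analysis via a PLS regression on a 0/1 class
# indicator; the data are hypothetical and the number of components is the
# tuning parameter.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 10))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=150) > 0).astype(int)

plsda = PLSRegression(n_components=4).fit(X, y)
scores = plsda.predict(X).ravel()          # continuous scores; larger values favor class 1
predicted_class = (scores > 0.5).astype(int)
```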
The second choice we must make is the criterion for filtering features. For these data we will use the area under the ROC curve to determine if a feature should be included in the model. The initial analysis of the training set showed that there were 5 predictors with an area under the ROC curve of at least 0.80 and 21 with values between 0.75 and 0.80.
Within the external resampling process, an audio feature was selected if it had an area under the ROC curve of at least 0.75. The features selected within an external resample are then passed to the corresponding internal resampling layer, where the model tuning process is conducted.
Once the filtering criteria and model are selected, the internal and external resampling methods are chosen. In this example, we used 20 iterations of bootstrap resampling for internal resampling and 2 repeats of 10-fold cross-validation for external resampling. The final estimates of performance are based on the external resampling hold-out sets. To preprocess the data, a Yeo-Johnson transformation was applied to each predictor, then the values were centered and scaled. These operations were conducted within the resampling process.
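One way to ensure that these preprocessing steps are re-estimated within each resample is to bundle them with the model; the sketch below assumes scikit-learn, with a placeholder classifier and synthetic data.

```python
# A sketch of applying the Yeo-Johnson transformation plus centering and
# scaling inside each resample by placing them in a pipeline; the data and
# classifier are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(7)
X = rng.lognormal(size=(150, 6))           # skewed predictors benefit from Yeo-Johnson
y = rng.choice([0, 1], size=150)

# PowerTransformer fits the Yeo-Johnson transformation and, by default,
# centers and scales the transformed values.
model = make_pipeline(PowerTransformer(method="yeo-johnson", standardize=True),
                      LogisticRegression(max_iter=1000))

# Because the preprocessing lives inside the pipeline, it is re-estimated on
# the analysis portion of every resample.
auc_estimates = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
```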
During resampling the number of predictors selected ranged from 2 to 12 with an average of 5.7. Some predictors were selected regularly; 2 of them passed the ROC threshold in each resample. For the final model, a total of 5 predictors were selected. For this set of predictors, the PLSDA model was optimized with only 4 components. The corresponding estimated area under the ROC curve was 0.827.
Looking at the selected predictors more closely, there appears to be considerable redundancy. Figure 11.3 shows the rank correlation matrix of these predictors in a heatmap. There are very few pairs with absolute correlations less than 0.5 and many extreme values. As mentioned in Section 6.3, partial least squares models are good at feature extraction, creating new variables that are linear combinations of the original data. For this model, the PLS components are created in a way that summarizes the maximum amount of variation in the predictors while simultaneously minimizing the misclassification among groups. The final model used 4 components, each of which is a combination of the 5 original predictors that survived the filter. Looking at Figure 11.3, there is a relatively clean separation into distinct blocks/clusters. This indicates that there are probably only a handful of underlying pieces of information in the filtered predictors. A separate principal component analysis was used to gauge the magnitude of the between-predictor correlations. In this analysis, only 1 component was needed to capture 90% of the total variation. This reinforces that there is only a small number of underlying effects in these predictors, and it is the reason that such a small number of PLS components was found to be optimal.
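The sketch below illustrates this kind of principal component check, assuming scikit-learn; `X_selected` is hypothetical and is constructed so that its five columns share a single underlying source of variation.

```python
# A sketch of using PCA to gauge redundancy among a set of filtered
# predictors; X_selected is hypothetical.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
latent = rng.normal(size=(150, 1))
X_selected = latent + 0.1 * rng.normal(size=(150, 5))   # five highly correlated predictors

pca = PCA().fit(StandardScaler().fit_transform(X_selected))
cumulative = np.cumsum(pca.explained_variance_ratio_)
components_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)   # components needed for 90%
```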
Did this filtering process help the model? If the same resamples are used to fit the PLS model using the entire predictor set, the model still favored a small number of projections (4 components) and the area under the ROC curve was estimated to be a little worse than the filtered model (0.812 versus 0.827). The p-value from a paired t-test (\(p = 0.446\)) showed no evidence that the improvement is real. However, given that only 0.7% of the original variables were used in this analysis, the simplicity of the new model might be very attractive.
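The paired comparison itself is straightforward; the sketch below assumes SciPy and uses hypothetical resampled AUC values rather than the ones reported above.

```python
# A sketch of the paired comparison between two models' resampled AUC values;
# the numbers are hypothetical.
import numpy as np
from scipy import stats

filtered_auc = np.array([0.84, 0.81, 0.83, 0.80, 0.85, 0.82])
full_auc     = np.array([0.82, 0.80, 0.82, 0.79, 0.83, 0.81])

# Pairing by resample means each hold-out set contributes a single difference.
t_stat, p_value = stats.ttest_rel(filtered_auc, full_auc)
```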
One post hoc analysis that is advisable when conducting feature selection is to determine if the particular subset that was found is any better than a random subset of the same size. To do this, 100 random subsets of 5 predictors were created and PLS models were fit with the same external cross-validation routine. Where does our specific subset fall within the distribution of performance produced by the random subsets? The area under the ROC curve for the filtered approach (0.827) was better than nearly all of the values generated using random subsets of the same size; the largest AUC from a random subset was 0.832. This result gives confidence that the filtering and modeling approach found a subset with credible signal.
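A sketch of this random-subset baseline follows, assuming scikit-learn; the data, placeholder model, and observed AUC value are hypothetical.

```python
# A sketch of the random-subset baseline: repeatedly draw subsets of the same
# size, resample a model on each, and compare to the filtered subset's AUC.
# The data, model, and observed AUC are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
X = rng.normal(size=(150, 50))
y = (X[:, 0] + rng.normal(size=150) > 0).astype(int)
observed_auc = 0.83                        # hypothetical AUC for the filtered subset

random_aucs = []
for _ in range(100):
    cols = rng.choice(X.shape[1], size=5, replace=False)
    auc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[:, cols], y, cv=10, scoring="roc_auc").mean()
    random_aucs.append(auc)

# Fraction of random subsets that match or beat the filtered subset's AUC
fraction_as_good = np.mean(np.array(random_aucs) >= observed_auc)
```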
While partial least squares is very proficient at accommodating predictor sets with a high degree of collinearity, it does raise the question of what minimal variable set would achieve nearly optimal predictive performance. For this example, we could reduce the subset size by increasing the stringency of the filter, which was \(AUC \ge 0.75\). Changing the threshold will affect both the sparsity of the solution and the performance of the model. This process is akin to backwards selection, which will be discussed in the next section.
To summarize, the use of a simple screen prior to modeling can be effective and relatively efficient. The filters should be included in the resampling process to avoid optimistic assessments of performance. The drawback is that there may be some redundancy in the selected features and the subjectivity around the filtering threshold might leave the modeler wanting to understand how many features could be removed before performance is impacted.