10.3 Effect of Irrelevant Features

How much do extraneous predictors hurt a model? Predictably, that depends on the type of model, the nature of the predictors, and the ratio of the number of predictors to the size of the training set (a.k.a. the $$p:n$$ ratio). To investigate this, a simulation was used to emulate data with varying numbers of irrelevant predictors and to monitor performance. The simulation system is taken from Sapp, Laan, and Canny (2014) and consists of a nonlinear function of the 20 relevant predictors:

\begin{align} y = & x_1 + \sin(x_2) + \log(|x_3|) + x_4^2 + x_5x_6 + I(x_7x_8x_9 < 0) + I(x_{10} > 0) + \notag \\ & x_{11}I(x_{11} > 0) + \sqrt{|x_{12}|} + \cos(x_{13}) + 2x_{14} + |x_{15}| + I(x_{16} < -1) + \notag \\ & x_{17}I(x_{17} < -1) - 2 x_{18} - x_{19}x_{20} + \epsilon \end{align}

Each of these predictors was generated using independent standard normal random variables and the error was simulated as random normal with a zero mean and standard deviation of 3. To evaluate the effect of extra variables, varying numbers of random standard normal predictors (with no connection to the outcome) were added. Between 10 and 200 extra columns were appended to the original feature set. The training set size was either $$n = 500$$ or $$n = 1,000$$. The root mean squared error (RMSE) was used to measure the quality of the model using a large simulated test set.
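The data-generating process described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used for the published results; the function names `true_signal`, `simulate_row`, and `simulate_data` and the `seed` argument are assumptions made for this example.

```python
import math
import random

def true_signal(x20):
    """Noise-free outcome for one sample, given its 20 relevant predictors."""
    # Pad with a dummy entry so v[1] is x_1, matching the equation's indices.
    v = [None] + list(x20)
    return (
        v[1] + math.sin(v[2]) + math.log(abs(v[3])) + v[4] ** 2 + v[5] * v[6]
        + (v[7] * v[8] * v[9] < 0) + (v[10] > 0)
        + v[11] * (v[11] > 0) + math.sqrt(abs(v[12])) + math.cos(v[13])
        + 2 * v[14] + abs(v[15]) + (v[16] < -1)
        + v[17] * (v[17] < -1) - 2 * v[18] - v[19] * v[20]
    )

def simulate_row(rng, n_extra, sigma=3.0):
    """Return (predictors, outcome) for one sample."""
    # The first 20 columns are relevant; the remaining n_extra are pure noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(20 + n_extra)]
    y = true_signal(x[:20]) + rng.gauss(0.0, sigma)
    return x, y

def simulate_data(n, n_extra, seed=0):
    """Return (x, y): n rows of 20 + n_extra predictors and their outcomes."""
    rng = random.Random(seed)  # hypothetical seed argument, for reproducibility
    rows = [simulate_row(rng, n_extra) for _ in range(n)]
    return [r[0] for r in rows], [r[1] for r in rows]

# One training set with n = 500 and 100 irrelevant predictors.
x_train, y_train = simulate_data(n=500, n_extra=100)
print(len(x_train), len(x_train[0]), len(y_train))
```

Because the error standard deviation is 3, no model can be expected to achieve a test set RMSE below 3.0 on data generated this way.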

A number of models were tuned and trained for each of the simulated data sets, including linear, nonlinear, and tree/rule-based models (see note 1 at the end of this section). The details and code for each model can be found in a GitHub repo, and the results are summarized below.

Figure 10.2 shows the results. The linear models (ordinary linear regression and the glmnet) showed mediocre performance overall. Extra predictors eroded the performance of linear regression while the lasso penalty used by the glmnet resulted in stable performance as the number of irrelevant predictors increased. The effect on linear regression was smaller when the training set size increased from 500 to 1,000. This reinforces the notion that the $$p:n$$ ratio may be the underlying issue.
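The link between the $$p:n$$ ratio and the performance of ordinary linear regression can be illustrated with a deliberately simplified experiment. The sketch below is not the simulation used for Figure 10.2: it assumes a purely linear truth with five relevant predictors (each with a coefficient of one), smaller training sets so that it runs quickly without external libraries, and hypothetical helper names (`solve_linear_system`, `ols_test_rmse`). Only the error standard deviation of 3 is borrowed from the text.

```python
import math
import random

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols_test_rmse(n_train, n_extra, n_relevant=5, sigma=3.0, n_test=2000, seed=0):
    """Fit least squares (no intercept) and return the RMSE on a fresh test set."""
    rng = random.Random(seed)
    p = n_relevant + n_extra
    beta = [1.0] * n_relevant + [0.0] * n_extra  # noise columns have no effect

    def draw(n):
        x = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        y = [sum(b * v for b, v in zip(beta, row)) + rng.gauss(0.0, sigma)
             for row in x]
        return x, y

    x_tr, y_tr = draw(n_train)
    x_te, y_te = draw(n_test)
    # Normal equations: (X'X) coef = X'y.
    cols = list(zip(*x_tr))
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y_tr)) for ci in cols]
    coef = solve_linear_system(xtx, xty)
    sq_err = [(y - sum(c * v for c, v in zip(coef, row))) ** 2
              for row, y in zip(x_te, y_te)]
    return math.sqrt(sum(sq_err) / n_test)

rmse_clean = ols_test_rmse(n_train=100, n_extra=0)
rmse_n100 = ols_test_rmse(n_train=100, n_extra=70)
rmse_n200 = ols_test_rmse(n_train=200, n_extra=70)
print(f"n=100, no extras: {rmse_clean:.2f}")
print(f"n=100, 70 extras: {rmse_n100:.2f}")
print(f"n=200, 70 extras: {rmse_n200:.2f}")
```

In this toy setting, the same 70 noise columns should cost far more at $$n = 100$$ than at $$n = 200$$, since each estimated coefficient adds variance to the predictions and the smaller training set has less data with which to offset it.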

Nonlinear models had different trends. $$K$$-nearest neighbors had overall poor performance that was moderately impacted by the increase in $$p$$. MARS performance was good and showed good resistance to noise features due to the intrinsic feature selection used by that model. Again, the effect of extra predictors was smaller with the larger training set size. Single-layer neural networks and support vector machines showed the overall best performance (getting closest to the best possible RMSE of 3.0) when no extra predictors were added. However, both methods were drastically affected by the inclusion of noise predictors, to the point of having some of the worst performance seen in the simulation.

For trees, random forest and bagging showed similar but mediocre performance that was not affected by the increase in $$p$$. The rule-based ensemble Cubist and boosted trees fared better. However, both models showed a moderate decrease in performance as predictors were added; boosting was more susceptible to this issue than Cubist.

These results clearly show that there are a number of models that may require a reduction of predictors to avoid a decrease in performance. Also, for models such as random forest or the glmnet, it appears that feature selection may be useful to find a smaller subset of predictors without affecting the model’s efficacy.

Since this is a simulation, we can also assess how models with built-in feature selection performed at finding the correct predictor subset. There are 20 relevant predictors in the data. Based on which predictors were selected, a sensitivity-like proportion can be computed that describes the rate at which the true predictors were retained in the model. Similarly, the number of irrelevant predictors that were selected gives a sense of the false positive rate (i.e., one minus specificity) of the selection procedure. Figure 10.3 shows these results using a plot similar to how ROC curves are visualized.
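As a concrete illustration of these two proportions, the sketch below computes them from a list of selected column indices. It assumes, purely for this example, that the 20 relevant predictors occupy the first 20 column positions and that the extra predictors follow them; the function name `selection_rates` is hypothetical.

```python
def selection_rates(selected, n_relevant=20, n_extra=100):
    """Return (true positive rate, false positive rate) of a selected subset.

    Columns 0..n_relevant-1 are assumed to be the true predictors and the
    next n_extra columns are assumed to be the irrelevant ones.
    """
    selected = set(selected)
    relevant = set(range(n_relevant))
    irrelevant = set(range(n_relevant, n_relevant + n_extra))
    tpr = len(selected & relevant) / len(relevant)      # sensitivity-like rate
    fpr = len(selected & irrelevant) / len(irrelevant)  # one minus specificity
    return tpr, fpr

# A model that kept 15 of the 20 true predictors and 5 of 100 noise columns.
tpr, fpr = selection_rates(list(range(15)) + [20, 21, 22, 50, 99])
print(tpr, fpr)  # 0.75 0.05
```

Plotting the true positive rate against the false positive rate for each model and simulation setting gives an ROC-like display of the kind described above.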

The three tree ensembles (bagging, random forest, and boosting) have excellent true positive rates; the truly relevant predictors are almost always retained. However, they also show poor results for irrelevant predictors by selecting many of the noise variables. This is somewhat related to the nature of the simulation, where most of the predictors enter the system as smooth, continuous functions. This may be causing the tree-based models to try too hard to achieve performance by training deep trees or larger ensembles.

Cubist had a more balanced result where the true positive rate was greater than 70$$\%$$. The false positive rate was around 50$$\%$$ for the smaller training set size and increased when the training set was larger and 100 or fewer extra predictors were added. The overall difference between the Cubist and tree ensemble results can be explained by the fact that Cubist uses an ensemble of linear regression models that are fit on small subsets of the training set (defined by rules). The nature of the simulation system might be better served by this type of model.

Recall that the main random forest tuning parameter, $$m_{try}$$, is the number of randomly selected variables that should be evaluated at each split. While this encourages diverse trees (to improve performance), it can necessitate splitting on irrelevant predictors. Random forest often splits on the majority of the predictors (regardless of their importance) and will vastly over-select predictors. For this reason, an additional analysis might be needed to determine which predictors are truly important based on that model’s variable importance scores. Additionally, trees in general poorly estimate the importance of highly correlated predictors. For example, adding a duplicate column that is an exact copy of another predictor will numerically dilute the importance scores generated by these models. For this reason, important predictors can have low rankings when using these models.
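To see why $$m_{try}$$ can force splits on noise, consider the chance that every candidate predictor at a given split is irrelevant. The sketch below assumes that the candidates are drawn uniformly at random without replacement from all predictors, which makes this a hypergeometric probability; the function name and the choice of $$m_{try} = 5$$ are illustrative only and are not taken from the simulation.

```python
from math import comb

def prob_all_candidates_irrelevant(n_relevant, n_extra, mtry):
    """Chance that a split's mtry candidate predictors are all noise columns."""
    total = n_relevant + n_extra
    if not 1 <= mtry <= total:
        raise ValueError("mtry must be between 1 and the number of predictors")
    # Hypergeometric: all mtry candidates come from the n_extra noise columns.
    return comb(n_extra, mtry) / comb(total, mtry)

for n_extra in (10, 50, 100, 200):
    prob = prob_all_candidates_irrelevant(n_relevant=20, n_extra=n_extra, mtry=5)
    print(f"{n_extra:>3} extra predictors: {prob:.3f}")
```

Whenever this event occurs, a tree that splits at all must split on an irrelevant predictor, and the probability grows as more extra columns are appended.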

MARS and the glmnet also made trade-offs between sensitivity and specificity in these simulations that tended to favor specificity. These models had false positive rates under 25$$\%$$ and true positive rates that were usually between 40$$\%$$ and 60$$\%$$.

1. Clearly, this data set requires nonlinear models. However, the simulation is being used to look at the relative impact of irrelevant predictors (as opposed to absolute performance). For linear models, we assumed that the interaction terms $$x_5x_6$$ and $$x_{19}x_{20}$$ would be identified and included in the linear models.