7.4 Approaches when Complete Enumeration is Practically Impossible
As the number of predictors increases, the number of interaction terms to explore grows exponentially. Even a moderate number of predictors, say several hundred, which could be considered small for today’s data, generates many more interactions than can be evaluated from a practical or statistical perspective. On the practical side, the computational burden grows just as quickly. With enough computing resources it could still be possible to evaluate a vast number of interactions, but doing so is not statistically wise because of the chance of finding false-positive results. Because the number of truly important interactions is usually small, the more insignificant interactions that are screened, the greater the multiple-testing penalty that is incurred for evaluating them. As a consequence, truly important interactions may be missed when all possible interactions are computed and the appropriate adjustments are made.
When there are a large number of interaction terms to evaluate, other approaches are necessary to make the search more efficient, yet still effective, at finding important interactions. The remainder of this section will survey several approaches that have been developed to uncover interactions among predictors.
7.4.1 Guiding Principles and Two-stage Modeling
In a two-stage modeling approach, the ability of predictors to explain the variation in the response is evaluated using a model that does not directly account for interaction effects. Examples of such models are linear or logistic regression. Once the important predictors are identified, the residuals from this model are calculated. These residuals contain information that the individual predictors themselves cannot explain. The unexplained variation is due to random measurement error, and could also be due to predictor(s) that were not available or measured, or interactions among the observed predictors that were not present in the model. To be more concrete, suppose that the observed data was generated from the equation
\[ y = x_1 + x_2 + 10x_1x_2 + error\]
Suppose also that when the data from this system was collected only the response and \(x_1\) were observed. The best model we could estimate would be limited to
\[ y = \beta_1x_1 + error^*\] Since the true response depends on \(x_2\) and the interaction between \(x_1\) and \(x_2\), the error from the imperfect model (\(error^*\)) contains information about these important terms (that were absent from the model). If it was possible to access \(x_2\) and create the \(x_1 \times x_2\) interaction term, then it would be possible to separate out their contribution through a second model
\[ error^* = \beta_2x_2 + \beta_3x_1x_2 + error\] We cannot do anything to explain random measurement error–it is always inherent to some extent in data. And expert knowledge is necessary for determining if other predictors are relevant for explaining the response and should be included in the model. However, it is still possible to search for interaction terms that help to explain variation in the response. But the focus of this section is the setting where it is practically impossible to search all possible interaction terms.
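A minimal sketch of this two-stage recipe in R, using simulated data from the generating equation above, might look like the following (the variable names and the choice of lm() for both stages are illustrative, not a prescribed implementation):

```r
# Simulate data from y = x1 + x2 + 10 * x1 * x2 + error
set.seed(731)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- x1 + x2 + 10 * x1 * x2 + rnorm(n, sd = 0.5)

# Stage 1: a main-effects-only model; its residuals hold the unexplained variation
stage_one <- lm(y ~ x1 + x2)
resid_one <- residuals(stage_one)

# Stage 2: do candidate interaction terms explain the leftover variation?
stage_two <- lm(resid_one ~ x1:x2)
summary(stage_two)   # a large, significant coefficient flags the interaction
```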
When there are many predictors, the principles outlined in Section 7.1 can be used to narrow the search. The hierarchy principle suggests first looking at pairwise interactions for assistance in explaining the residual variation. Next, the sparsity principle indicates that if interactions are active, only a relatively small number of them are important. And last, the heredity principle would restrict the search to interactions among the predictors that were identified in the first stage. The type of heredity chosen at this point, weak or strong, determines the number of interaction terms that will be evaluated, as the short sketch below illustrates.
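To make the effect of the heredity choice concrete, the following sketch enumerates candidate pairwise interactions under each rule. The predictor names and the set of first-stage selections are hypothetical.

```r
all_predictors <- paste0("x", 1:10)
selected       <- c("x1", "x4", "x7")   # predictors chosen in the first stage

# Strong heredity: both parents of an interaction must have been selected
strong <- combn(selected, 2, simplify = FALSE)

# Weak heredity: at least one parent must have been selected
weak <- Filter(
  function(pair) any(pair %in% selected),
  combn(all_predictors, 2, simplify = FALSE)
)

length(strong)   # 3 candidate interactions
length(weak)     # 24 candidate interactions
```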
To follow the logic of this process, consider modeling the Ames data. In the first stage, the glmnet model using only the base predictor set (i.e., only main effects) selected 12 predictors.
Since 12 individual predictors were selected, there are 66 possible pairs of these predictors. However, once the appropriate dummy variables are created, the second-stage glmnet model contains 719 predictor columns as inputs, 660 of which are interactions. The same tuning process was followed and the resulting patterns are very similar to those seen in Figure 7.8(a). This two-stage model was also a pure lasso model and the best penalty was estimated to be 0.002, with an RMSE of 0.081 log units. Of the entire predictor set, this glmnet model selected 7 main effects and 128 interaction terms.
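A sketch of how such a second-stage fit could be set up is shown below. The object names (selected_vars for the first-stage selections, ames_train for the training set, and Sale_Price for the log-transformed outcome) are assumptions for illustration, and the tuning shown is the default cross-validation in cv.glmnet() rather than the exact process used in the text.

```r
library(glmnet)

# Design matrix with the selected main effects and all of their pairwise
# interactions; dummy variables are created automatically by model.matrix()
int_formula <- as.formula(
  paste("~ (", paste(selected_vars, collapse = " + "), ")^2")
)
x_stage_two <- model.matrix(int_formula, data = ames_train)[, -1]

# Pure lasso fit (alpha = 1) with a cross-validated penalty
cv_fit <- cv.glmnet(x_stage_two, ames_train$Sale_Price, alpha = 1)
coef(cv_fit, s = "lambda.min")   # nonzero rows are the retained terms
```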
It is important to note that the process for computing the model error differs between continuous and categorical outcomes. The error from the two-stage modeling approach is straightforward to compute when the response is numeric, as shown above. For a classification outcome, as is the case with the stroke data, the residuals must be computed differently. The Pearson residual should be used and is defined as
\[ \frac{y_{i} - p_{i}}{\sqrt{p_{i}\left(1 - p_{i}\right)}} \]
For categorical response data, \(y_i\) is a binary indicator of the response category of the \(i^{th}\) sample and \(p_i\) is the predicted probability for that sample.
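When the first-stage model is a logistic regression, these residuals are readily available; the sketch below uses hypothetical object names (stroke_train with a factor outcome stroke) and the built-in Pearson residuals from glm().

```r
# Stage 1: logistic regression with main effects only
stage_one <- glm(stroke ~ ., data = stroke_train, family = binomial)

# Pearson residuals computed from the definition above ...
p_hat <- fitted(stage_one)
y_num <- as.numeric(stroke_train$stroke == levels(stroke_train$stroke)[2])
pearson_resid <- (y_num - p_hat) / sqrt(p_hat * (1 - p_hat))

# ... or equivalently via the built-in extractor
pearson_resid <- residuals(stage_one, type = "pearson")
```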
7.4.2 Tree-based Methods
The methods discussed thus far can be effective at revealing interactions for data with a moderate number of predictors. However, these approaches lose their practical effectiveness as the number of predictors grows. Tree-based methods like recursive partitioning, or ensembles of recursive partitioning models, can efficiently model data with a large number of predictors. Moreover, tree-based methods have been shown to uncover potential interactions between variables (Breiman et al. 1984). In fact, using tree-based methods to identify important interactions among predictors has been a popular area of theoretical and empirical research, as we will see.
Let’s review the basic concepts of recursive partitioning to understand why this technique has the ability to uncover interactions among predictors. The objective of recursive partitioning is to find the optimal split point of a predictor that separates the samples into more homogeneous groups with respect to the response. Once a split is made, the same procedure is applied to the subset of samples at each of the subsequent nodes. This process continues recursively until a minimum number of samples in a node is reached or until the process meets an optimization criterion. When data are split recursively, each subsequent node can be thought of as representing a localized interaction between the previous node and the current node. In addition, the more frequently the same pair of predictors appears in consecutive splits, the more likely it is that the interaction between those predictors is a global interaction across their observed range.
The simple two-predictor example of a synergistic relationship presented in the introduction will be used to demonstrate how an interaction can be detected through recursive partitioning. The optimal tree for these data is presented in Figure 7.9. Tree-based models are usually thought of as pure interaction due to their prediction equation, which can be written as a set of multiplicative statements. For example, the path to terminal node 4 goes through three splits and can be written as a single rule:
node_4 <- I(x1 < 0.655) * I(x2 < 0.5) * I(x2 < 0.316) * 0.939
where 0.939 is the mean of the training set data falling into this node (I is a function that is zero or one based on the logic inside the function). The final prediction equation would be computed by adding similar terms from the other eight terminal nodes (all of which would be zero except a single rule when predicting). Notice that in Figure 7.9 the terminal nodes contain small groups of the data that are homogeneous and range from smaller response values (left) to larger response values (right) on each side of the first split. Achieving homogeneous terminal nodes requires information from both predictors throughout the tree.
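A tree such as the one in Figure 7.9 can be fit and its split structure inspected with a few lines of R. The sketch below assumes the two simulated predictors and the response are stored in a data frame called synergy_data; the complexity parameter value is illustrative.

```r
library(rpart)

# Fit a regression tree to the two-predictor synergistic data
tree_fit <- rpart(
  y ~ x1 + x2,
  data    = synergy_data,
  control = rpart.control(cp = 0.001)
)

# Each terminal node corresponds to a product-of-indicators rule such as
# the one written out for node 4 above
print(tree_fit)
rpart.plot::rpart.plot(tree_fit)   # optional visualization of the splits
```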
But why would a tree need to be this complicated to model the somewhat simple synergistic relationship between \(x_1\) and \(x_2\)? Recall from Figure 7.2 that the synergistic relationship induces curved contours for the response. Because the recursive partitioning model breaks the space into rectangular regions, the model needs more regions to adequately explain the response. In essence, the partitioning aspect of trees impedes their ability to represent smooth, global interactions.
To see this, let’s look at Figure 7.10. In this figure, the predictions from each terminal node are represented by rectangular regions where the color indicates the predicted response value for the region. Overlaid on this graph are the contour lines generated from the linear model where the interaction term has been estimated. The tree-based model does well at approximating the synergistic response. But, because the model breaks the space into rectangular regions, it requires many regions to explain the relationship.
Figure 7.10 illustrates that a tree pixelates the relationship between the predictors and the response. It is well known that the coarse nature of a tree’s predictions generally makes this model less accurate than other models. In addition, trees have some well-known potential liabilities. As one example, an individual tree can be highly variable (i.e., the model can change substantially with small changes to the data). Ensembles of trees such as bagging (Breiman 2001) and boosting (Friedman 2001) have been shown to mitigate these problems and improve predictive performance.
A brief explanation of these techniques is provided here; an in-depth treatment can be found in Kuhn and Johnson (2013). To build a bagged tree model, many bootstrap samples of the original data are independently generated (i.e., sampled with replacement), and a tree is built for each sample. To predict a new sample, the model averages the predicted response across all of the trees. Boosting also builds a collection of trees, but creates them sequentially and in a fundamentally different way. Instead of building trees to maximum depth, a boosted model restricts the tree depth. In addition, a boosted tree model uses the predictive performance of the previous tree to re-weight the poorly predicted samples. Finally, the contribution of each tree in the boosted model is weighted using statistics related to the individual model fit. The prediction for a new sample is a weighted combination across the entire set of trees.
Recent research has shown that these ensemble methods are able to identify important interactions among the predictors. García-Magariños et al. (2009) and Qi (2012) illustrate how random forests, a further modification of bagging, can be used to find interactions in biological data, while Basu et al. (2018) use a variation of the random forest algorithm to identify high-order interactions. For boosting, Lampa et al. (2014) and Elith, Leathwick, and Hastie (2008) found that this methodology could also be used to uncover important interactions. From a theoretical perspective, Friedman and Popescu (2008) proposed a new statistic for assessing the importance of an interaction in a tree-based model.
The reason tree ensembles are effective at identifying predictor-response relationships is that they aggregate many trees built from slightly different versions of the original data. The aggregation allows for many more possible predicted values and a better approximation of the predictor-response relationships. To see this, a random forest model and a gradient boosting machine will be fit to the two-predictor synergistic example, as sketched below.
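The fits could be produced with the randomForest and gbm packages; the tuning parameter values below are illustrative rather than the ones used to generate Figure 7.11.

```r
library(randomForest)
library(gbm)

set.seed(462)
rf_fit <- randomForest(
  y ~ x1 + x2, data = synergy_data,
  ntree = 1000, mtry = 2
)

gbm_fit <- gbm(
  y ~ x1 + x2, data = synergy_data,
  distribution = "gaussian",
  n.trees = 1000, interaction.depth = 2,
  shrinkage = 0.01, verbose = FALSE
)

# Predictions over a grid of (x1, x2) values can then be plotted to compare
# the ensembles' prediction surfaces with the single tree and the linear model
```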
Figure 7.11 shows that ensembles of trees are better able to approximate the synergistic relationship than an individual tree. In fact, the cross-validated root mean squared error (RMSE) for boosting is 0.27 and for bagged trees is 0.6, while the optimal single tree does worse with a cross-validated RMSE of 0.9. In this example, smoother models win; the linear regression model with the interaction term has an RMSE of 0.05.
Notice that an individual tree or an ensemble of trees is less effective, and is a much more complicated model, for uncovering predictive interactions that are globally true across all of the samples. This simple example highlights the main premise of feature engineering: if meaningful predictor relationships can be identified and placed in a more effective form, then a simpler and perhaps more appropriate model can be used to estimate the relationship between the predictors and the response. However, there is a scenario where tree-based methods may have better predictive ability. When important interactions occur within one or more subsets of the samples (i.e., locally), then tree-based methods have the right architecture to model the interactions. Creating features for localized interactions to include in simpler models is much more tedious. This process would require creating a term that is the product of an identifier of the local space (i.e., a dummy predictor) and the predictors that interact within that space, which is essentially an interaction among three terms.
Since trees and ensembles of trees can approximate interactions among predictors, can we harness information from the resulting models to point us to terms that should be included in a simpler linear model? The approach presented by Basu et al. (2018) uses a form of a random forest model (called feature weighted random forests) that randomly selects features based on weights that are determined by the features’ importance. After the ensemble is created, a metric for each co-occurring set of features is calculated which can then be used to identify the top interacting features.
Friedman and Popescu (2008) present an approach based on the theoretical concept of partial dependence (Friedman 2001) for identifying interactions using ensembles. In short, this approach compares the joint effect of two (or more) predictors with the individual effect of each predictor in a model. If the individual predictor does not interact with any of the other predictors, the difference between the joint effect and the individual effect will be close to zero. However, if there is an interaction between an individual predictor and at least one of the predictors, the difference will be greater than zero. This comparison is numerically summarized in what they term the \(H\) statistic. For the 18 predictors in the Ames data (including the random predictor), the \(H\) statistics were computed on the entire training set. One issue is that the \(H\) statistic, across predictors, can have a fairly bimodal distribution since many of the values are zero (when a predictor isn’t involved in any interactions). This data set is fairly small and, for these reasons, 20 bootstrap estimates were computed and summarized using the median \(H\) statistic. Both estimates are shown in Figure 7.12. The original statistics and their resampled counterparts agree fairly well. Some \(H\) values are fairly close to zero, so a cutoff of \(H \ge 0.001\) was applied to the bootstrap estimates to select variables. This would lead us to believe that 6 predictors are involved in at least one interaction. Following the previous two-stage approach, all corresponding two-way interactions of these predictors were added to the main effects in a glmnet model (mimicking the previous analysis). Using resampling, another pure lasso model was selected, this time with an optimal tuning parameter value of \(\lambda = 10^{-4}\). From this model, 48 main effects and 68 interactions were selected.
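One readily available implementation of this statistic is interact.gbm() in the gbm package, which computes the \(H\) statistic for a set of predictors from a fitted boosted tree model. The sketch below shows how pairwise values could be computed for the Ames predictors; the model settings and object names are assumptions, and this is not necessarily the computation used to create Figure 7.12.

```r
library(gbm)

gbm_ames <- gbm(
  Sale_Price ~ ., data = ames_train,
  distribution = "gaussian",
  n.trees = 1000, interaction.depth = 3,
  shrinkage = 0.01, verbose = FALSE
)

# Friedman and Popescu's H statistic for every pair of predictors
pred_names <- setdiff(names(ames_train), "Sale_Price")
pred_pairs <- combn(pred_names, 2, simplify = FALSE)
h_stats <- vapply(
  pred_pairs,
  function(pair) interact.gbm(gbm_ames, data = ames_train,
                              i.var = pair, n.trees = 1000),
  numeric(1)
)
names(h_stats) <- vapply(pred_pairs, paste, character(1), collapse = ":")
head(sort(h_stats, decreasing = TRUE))
```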
A third and practical way to identify interactions uses predictor importance information along with the principles of hierarchy, sparsity, and heredity. Simply put, if an interaction among predictors is important, then a tree-based method will use these predictors multiple times throughout multiple trees to uncover the relationship with the response. These predictors will then have non-zero, and perhaps high, importance values. Pairwise interactions among the most important predictors can then be created and evaluated for their relevance in predicting the response. This is very similar to the two-stage modeling approach, except that the base learner is a tree-based model.
Each tree-based method has its own specific algorithm for computing predictor importance. One of the most popular measures is generated by a random forest. Random forest uses bootstrap samples to create the many models in the ensemble. Since the bootstrap is being used, each tree has an associated assessment set (historically called the out-of-bag (OOB) samples) that was not used to fit the model. These samples can be predicted for each tree to understand the quality of the model. Suppose a random forest model contains 1,000 trees. If classification accuracy is computed on the out-of-bag samples for each tree, the average of these accuracy statistics is the OOB estimate. To measure importance for a specific predictor, the values of that predictor are shuffled, and the OOB estimate is recalculated. If the predictor were unimportant, there would not be much difference between the original OOB statistic and the one where the predictor has been made irrelevant. However, if a predictor were crucial, the permuted OOB statistic should be significantly worse. Random forest conducts this analysis for each predictor in the model, and a single score is created per predictor where larger values are associated with higher importance.
For the Ames data, the random forest model (using the default value of \(m_{try}\)) was fit and the importance scores were estimated. Figure 7.13 displays the top 10 predictors by importance, most of which have been identified as part of multiple pairwise interactions by the other methods. Of particular importance are the living area and the year built, which are the most important predictors and are combined in a scientifically meaningful interaction. This approach is similar in spirit to the two-stage approach discussed above. The primary difference is that the random forest model is simply used to identify predictors with the potential for being part of important interactions. Interactions among the top predictors identified here could then be included in a simpler model, such as linear or logistic regression, to evaluate their importance.
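The permutation-based scores can be obtained directly from the randomForest package; the data set and outcome names below are assumptions, and for regression the relevant column of the importance matrix is the permutation measure (%IncMSE).

```r
library(randomForest)

set.seed(955)
rf_ames <- randomForest(
  Sale_Price ~ ., data = ames_train,
  ntree = 1000, importance = TRUE
)

# type = 1 returns the permutation-based (OOB) importance scores
imp <- importance(rf_ames, type = 1)
head(imp[order(imp, decreasing = TRUE), , drop = FALSE], 10)
```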
7.4.3 The Feasible Solution Algorithm
As mentioned above, the problem of identifying important interactions can be thought of as identifying the optimal model that includes main effects and interactions. A complete enumeration and evaluation of all possible models is practically impossible for even a moderate number of predictors, since there are \(2^p\) possible subsets of the main effects alone before any interactions are considered. When using a linear model such as linear or logistic regression, methods such as forward, backward, and stepwise selection are commonly used to locate an optimal set of predictors (Neter et al. 1996). These techniques are as straightforward to understand as their names imply. The forward selection technique starts with no predictors and identifies the predictor that is most optimally related to the response. In the next step, the next predictor is selected such that it, in combination with the first predictor, is most optimally related to the response. This process continues until none of the remaining predictors improve the model based on the optimization criterion. The backward selection technique starts with all of the predictors and sequentially removes the predictor that has the least contribution to the optimization criterion. This process continues until the removal of a predictor significantly reduces model performance. The stepwise procedure adds or removes one predictor at a time at each step based on the optimization criterion until the addition or removal of any of the predictors does not significantly improve the model.
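In R, these classical searches are available through the built-in step() function (or MASS::stepAIC()); a minimal sketch, assuming a data frame training_data with outcome y, is shown below.

```r
full_model <- lm(y ~ ., data = training_data)
null_model <- lm(y ~ 1, data = training_data)

# Forward selection from the empty model toward the full model
forward_fit <- step(null_model, direction = "forward",
                    scope = formula(full_model), trace = 0)

# Backward elimination starting from the full model
backward_fit <- step(full_model, direction = "backward", trace = 0)

# Stepwise selection (both directions)
stepwise_fit <- step(null_model, direction = "both",
                     scope = formula(full_model), trace = 0)
```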
Miller (1984) provided an alternative search technique to the stepwise method for the linear model problem. Instead of removing a predictor, he recommended a substitution approach in which each selected predictor is systematically considered for substitution. For example, suppose that there are 10 predictors, \(x_1\)–\(x_{10}\), and we want to find the best model that contains three predictors. Miller’s process begins by randomly selecting three of the ten predictors, say \(x_2\), \(x_5\), and \(x_9\). Model performance is calculated for this trio of predictors. Then the first two predictors are fixed, and model performance is computed with each of the seven remaining predictors in the third position. If none of the other seven predictors gives better performance than \(x_9\), then this predictor is kept. If \(x_9\) is not optimal, it is exchanged with the best predictor. The process then fixes the first and third predictors and uses the same steps to select the optimal second predictor. Then, the second and third predictors are fixed and a search is initiated for the optimal first predictor. This process continues to iterate until it converges on an optimal solution. Because the algorithm is not guaranteed to find the globally optimal solution, different random starting sets of predictors should be selected. For each starting set, the converged solution and the frequency of each solution are determined. Hawkins (1994) extended this algorithm to a robust modeling approach and called the procedure the Feasible Solution Algorithm (FSA).
A primary advantage of this technique is that the search space of the algorithm, which is on the order of \(q \times m \times p\) where \(q\) is the number of random starts, \(m\) is the number of terms in the subset, and \(p\) is the number of predictors, is much smaller than the entire search space of \(\binom{p}{m}\) possible subsets. The FSA algorithm has been shown to converge to an optimal solution, although it is not guaranteed to be the global optimum.
The FSA algorithm focuses on finding an optimal subset of main effects; the original implementation does not consider searching for interactions among predictors. This can be a considerable drawback since, for many problems, important interactions do occur among predictors. Lambert et al. (2018) recognized this deficiency and provided a generalization of the algorithm that searches for interactions. Lambert’s approach begins with a base model, which typically includes predictors thought to be important for predicting the response. The order of the desired interaction is chosen (i.e., two-factor, three-factor, etc.). The process of identifying potentially relevant interactions follows the same type of logic as the original FSA algorithm. To illustrate the process, let’s return to the example from above. Suppose that the base model included three predictors (\(x_1\), \(x_5\), and \(x_9\)) and the goal is to identify important pairwise interactions. Lambert’s algorithm randomly selects two predictors, say \(x_3\) and \(x_9\), and computes the performance of the model that includes the base terms and this interaction term. The algorithm then fixes the first predictor, systematically replaces the second predictor with each of the remaining eight (or nine, if we want to study quadratic effects) predictors, and computes the corresponding model performance. If none of the other predictors gives better predictive performance, then \(x_9\) is kept; otherwise it is exchanged with the best predictor. The process then fixes the second predictor and exchanges the first predictor with each of the other predictors. This process continues to iterate until convergence. Then a different random pair of predictors is selected and the process is repeated. The converged interactions are tallied across the random starts to identify the potential feasible solutions. This algorithm can be easily generalized to any order of interaction.
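The swap-until-convergence logic can be written compactly, as in the simplified sketch below for a single random start and a single pairwise interaction. This toy version only conveys the idea (the rFSA package provides a full implementation); the data frame training_data, its outcome y, and the use of in-sample RMSE as the criterion are assumptions for illustration.

```r
# RMSE of the base model augmented with one candidate interaction
rmse_with_int <- function(pair, base_form, data) {
  f <- update(base_form, as.formula(paste("~ . +", pair[1], ":", pair[2])))
  sqrt(mean(residuals(lm(f, data = data))^2))
}

base_form  <- y ~ x1 + x5 + x9
predictors <- paste0("x", 1:10)

set.seed(101)
current <- sample(predictors, 2)                       # random starting pair
score   <- rmse_with_int(current, base_form, training_data)

repeat {
  improved <- FALSE
  for (slot in 1:2) {
    # Hold the other member fixed; try every alternative in this position
    for (p in setdiff(predictors, current)) {
      candidate  <- replace(current, slot, p)
      cand_score <- rmse_with_int(candidate, base_form, training_data)
      if (cand_score < score) {
        current  <- candidate
        score    <- cand_score
        improved <- TRUE
      }
    }
  }
  if (!improved) break   # converged: a "feasible solution"
}
current
```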
The FSA algorithm was used on the Ames data. Unlike the previous analyses, we chose to look at specific interaction terms using individual dummy variables to contrast the type of results that can be obtained with the previous results. For example, a randomly generated interaction might involve two dummy variables from different predictors. We constrained the selection so that dummy variables from the same predictor would never be paired. Fifty random starts were used and, during each random start, there were 60 potential swaps of predictors into the randomly generated interactions. Resampling was used to determine if each candidate set of interaction terms was associated with a smaller RMSE value. This determination was made using the base model and the candidate interaction. Table 7.2 shows the results for the top 15 interactions (indexed by their p-values). Many of the same variables occur in these lists (e.g., living area, neighborhood, etc.), along with some of the nonlinear terms (e.g., splines for the geocodes). The next step would be to add a selection of these terms to the base model after more exploratory work.
Predictor 1 | Predictor 2 | p-value | RMSE |
---|---|---|---|
Foundation: CBlock | Living Area | 0.00539 | 0.0513 |
Foundation: PConc | Living Area | 0.00828 | 0.0515 |
Bldg Type: Duplex | Living Area | 0.00178 | 0.0516 |
Living Area | MS SubClass: Duplex All Styles and Ages | 0.00102 | 0.0516 |
Roof Style: Hip | Year Built | 0.04945 | 0.0518 |
Roof Style: Gable | Year Built | 0.00032 | 0.0518 |
Longitude (spline) | Neighborhood: Northridge Heights | 0.00789 | 0.0518 |
Living Area | Neighborhood: Northridge Heights | 0.09640 | 0.0519 |
Full Bath | Land Contour: HLS | 0.07669 | 0.0519 |
Foundation: CBlock | Lot Area | 0.01698 | 0.0519 |
MS SubClass: One Story PUD 1946 and Newer | Year Built | 0.02508 | 0.0519 |
Latitude (spline) | Neighborhood: other | 0.03926 | 0.0520 |
Latitude (spline) | Neighborhood: College Creek | 0.00986 | 0.0520 |
Foundation: CBlock | MS SubClass: One Story PUD 1946 and Newer | 0.04577 | 0.0520 |
Longitude (spline) | Neighborhood: other | 0.03762 | 0.0520 |
For example, there appears to be a potential interaction between the building type and the living area. What is the nature of their relationship to the outcome? Figure 7.14 shows a scatter plot that is separated by building type, with a linear regression line superimposed in each subgroup. The figure shows very different slopes for several building types, with the largest contrast being between townhouse end units (TwnhsE) and two-family conversions (TwoFmCon, originally built as one-family dwellings). It might make sense to encode all five interaction terms or to group some terms with similar slopes together for a simpler encoding.
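A display like Figure 7.14 can be produced with a few lines of ggplot2. The column names (Gr_Liv_Area, Sale_Price, and Bldg_Type) and the use of log axes are assumptions about how the Ames training set is stored.

```r
library(ggplot2)

ggplot(ames_train, aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +   # a separate slope per building type
  facet_wrap(~ Bldg_Type) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Living Area", y = "Sale Price")
```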