7.3 The Brute-Force Approach to Identifying Predictive Interactions

For data sets that have a small to moderate number of predictors, all possible pairwise interactions for an association with the response can be evaluated. However, the more interaction terms that can be evaluated, the higher the probability that an interaction that is associated with the response due to random chance is found. More concretely, as more terms that are truly unrelated to the response are evaluated, the chance that at least one of them is associated with the response increases. Terms that have a statistically significant signal simply due to random chance and not due to a true relationship are called false positive findings. An entire sub-field of statistics is devoted to developing methodology for controlling the chance of false positive findings (Dickhaus 2014). False positive findings can substantially decrease a model’s predictive performance, and we need to protect against selecting these types of findings.

7.3.1 Simple Screening

The traditional approach to screening for important interaction terms is to use nested statistical models. For a linear regression model with two predictors, $x_1$ and $x_2$ , the main effects model is:

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + error$

The second model with main effects plus an interaction is:

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + error$

These two models are called “nested”" since the first model is a subset of the second. When models are nested, a statistical comparison can be made regarding the amount of additional information that is captured by the interaction term. For linear regression, the residual error is compared between these two models and the hypothesis test evaluates whether the improvement in error, adjusted for degrees of freedom, is sufficient to be considered real. The statistical test results in a p-value which reflects the probability that the additional information captured by the term is due to random chance. Small p-values, say less than 0.05, would indicate that there is less than a 5% chance that the additional information captured is due to randomness. It should be noted that the 5% is the rate of false positive findings, and is a historical rule-of-thumb. However, if one is willing to take on more risk of false positive findings for a specific problem, then the cut-off can be set to a higher value.

For linear regression, the objective function used to compare models is the statistical likelihood (the residual error, in this case). For other models, such as logistic regression, the objective function to compare nested models would be the binomial likelihood.

An alternatively methodology for protecting against false positive findings was presented in Section 2.3. This methodology took the same nested model approach as above, but with two important differences:

Resampling was used to create a number of nested model pairs for different versions of the training set. For each resample, the assessment set is used to evaluate the objective function. In the traditional approach, the same data used to create the model are re-predicted and used to evaluate the model (which is understandably problematic for some models). If the model overfits, then computing the objective function from a separate data set allows a potential dissenting opinion that will reflect the overfitting.
The other distinction between the resampling approach and the traditional approach is that any measure of performance can be used to compare the nested model; it need not be a statistically tractable metric. In Section 2.3, the area under the ROC curve was the objective function to make the comparison. We could have chosen to identify interactions that optimize sensitivity, specificity, accuracy, or any other appropriate metric. This provides great versatility and enables us to match the screening with an appropriate modeling technique and performance metric.

For both these reasons, but especially the second, the two approaches for evaluating nested models are not guaranteed to make consistent results. This is not problematic per se; different objective functions answer different questions and one cannot be said to be objectively better than the other without some context. Friedman (2001) illustrates the potential disconnect between metrics. In his case, different metrics, binomial likelihood and the error rate, were used to choose the optimal number of iterations for a boosted tree. Using the likelihood as the objective function, significantly fewer iterations were chosen when compared to using the misclassification rate. In his view:

“… the misclassification error rate continues to decrease well after the logistic likelihood has reached its optimum. This degrading the likelihood by overfitting actually improves misclassification error rate. Although perhaps counterintuitive, this is not a contradiction; likelihood and error rate measure different aspects of fit quality.”

Both the traditional approach and the resampling method from Section 2.3 generate p-values⁶⁷. The more comparisons, the higher the chance of finding a false-positive interaction. There are a range of methods for controlling for false positive findings. One extreme is to not control for false positive findings. We may choose to do this early in our explorations of potential interactions, keeping in mind that there is a good chance of finding one or more false positives. At the other extreme is the Bonferroni correction (Shaffer 1995) which uses a very strict exponential penalty in order to minimize any false positive findings. Because of the severe nature of this penalty, it is not good when we need to perform many statistical tests. The false discovery rate (FDR) procedure, previously mentioned in Section 5.6, falls along this correction spectrum and was developed to minimize false positive findings when there are many tests (Benjamini and Hochberg 1995). Because complete enumeration can lead to exploring many possible pairwise interactions, the FDR can be a pragmatic compromise between no adjustment and the Bonferroni correction.

One shortfall of this interaction detection approaches (previously used in Section 2.3) was that it was overly reductive. These candidate models only included the terms involved in the interaction. When the interactions that were discovered were included in a broader model that contains other (perhaps correlated) predictors, their importance to the model may be diminished. As this approach is used in this chapter, the interaction is added to a model that includes all of the potentially relevant predictors. This might reduce the number of predictors considered important (since the residual degrees of freedom are smaller) but the discovered interactions are likely to be more reliably important to a larger model.

Let’s return to the Ames data example to illustrate evaluating interaction terms. Here nested models will be compared using the traditional approach with the p-values derived through resampling. Recall that a number of predictors in the base model are qualitative. For this analysis, the main effects and interactions for these predictions are done at the group level. This means that a predictor like the type of garage type (with four possible values) has three main effects in the model. When this factor is crossed with a quantitative predictor, such as living area, an additional three terms are added. Alternatively, one could look at the effects of more specific interactions (such as a “finished” garage type interacting with another variable). This is very similar to the discussion of factors versus dummy variables for tree-based models in Section 5.7.

Within each of these types of p-value computations, the original p-values will be compared to the FDR and Bonferroni adjusted p-values. Figure 7.7 illustrates the distribution of p-values under each calculation method and adjustment type. For these data, the traditional approach identifies almost every interaction effect to be significant (even after an FDR correction or Bonferroni adjustment). However, the cross-validated approach has a lower rate of significance and is more affected by both types of corrections. Again, this is likely due to different estimation methods being used. Since the traditional approach simply re-predicts the original training set, it is plausible that many of these significant findings are the results of overfitting the interactions to the data.

A comparison of the the distribution of p-values and adjustment method for p-values between traditional and cross-validated estimates for interactions among predictors that produce a reduction in the RMSE.

Figure 7.7: A comparison of the the distribution of p-values and adjustment method for p-values between traditional and cross-validated estimates for interactions among predictors that produce a reduction in the RMSE.

As will be seen later, the list of significant interactions can be used as a starting point of an investigation using exploratory data analysis. For example, using resampling, the potential interaction with the smallest p-value was between longitude and latitude. From this, visualizations should be used to understand if the effect is practically significant and potentially meaningful. After these activities, the modeler can make a determination about the best approach for encoding the interactions and so on.

7.3.2 Penalized Regression

Section 7.3.1 presented a way to search through potential interaction terms when complete enumeration is practically possible. This involved evaluating the change in model performance between the model with only the main effects and the model with the main effects and interactions. This approach is simple to implement and can be effective at identifying interaction terms for further evaluation. However, evaluating interaction terms in a one-at-a-time fashion prevents the simplistic models from evaluating the importance of the interaction terms in the presence of the full model the individual predictors and pairwise interaction terms.

If complete enumeration is practically possible, then another approach to identifying important interactions would be to create and add all interaction terms to the data. By doing this, the data may contain more predictors (original predictors plus interaction terms) than samples. Some models such as trees (or ensembles of trees), neural networks, support vector machines, and K-nearest neighbors, are feasible when the data contains more predictors than samples. However, other techniques like linear and logistic regression cannot be directly used under these conditions. But these models are much more interpretable and could provide good predictive performance if they could be utilized. A goal of feature engineering is to identify features that improve model performance while improving interpretability. Therefore, having a methodology for being able to utilize linear and logistic regression in the scenario where we have more predictors than samples would be useful.

A family of modeling techniques, called penalized models, has been developed to handle the case where there are more predictors than samples yet an interpretable model is desired. To understand the basic premise of these tools, we need to briefly review the objective of linear regression. Suppose that the data consist of $p$ predictors, labeled as $x_1$ , $x_2$ , through $x_p$ . Linear regression seeks to find the coefficients $\beta_1$ , $\beta_2$ , through $\beta_p$ that minimize the sum of the squared distances (or error) from the estimated prediction of the samples to the observed values of the samples. The sum of squared errors can be written as:

$SSE = \sum_{i=1}^n (y_i-\widehat{y}_i)^2$ where

$\widehat{y}_i = \widehat{\beta}_1x_1 + \widehat{\beta}_2x_2 + \ldots + \widehat{\beta}_px_p$ and $\widehat{\beta}_j$ is the regression coefficient estimated from the available data that minimizes SSE. Calculus can be used to solve the minimization problem. But it turns out that the solution requires inverting the covariance matrix of the predictors. If there are more predictors than samples or if one predictor can be written as a combination of one or more of the other predictors, then it is not possible to make the inversion. As predictors become more correlated with each other, the estimated regression coefficients get larger (inflate) and become unstable. Under these conditions, another solution is needed. Hoerl (1970) recognized this challenge, and proposed altering the optimization problem to:

$SSE_{L_2} = \sum_{i=1}^n (y_i-\widehat{y}_i)^2 + \lambda_r \sum_{j=1}^P\beta_j^2$

Because the objective of the equation is to minimize the entire quantity, the $\lambda_r$ term is called a penalty, and the technique is called ridge regression. As the regression coefficients grow large, the penalty must also increase to enforce the minimization. In essence, the penalty causes the resulting regression coefficients to become smaller and shrink towards zero. Moreover, by adding this simple penalty we can obtain reasonable parameter estimates. However, many of the parameter estimates may not be truly zero, thus making the resulting model interpretation much more difficult. Tibshirani (1996) made a simple and notable modification to the ridge optimization criteria and proposed minimizing the following equation:

$SSE_{L_1} = \sum_{i=1}^n (y_i-\widehat{y}_i)^2 + \lambda_\ell \sum_{j=1}^P|\beta_j|$

This method was termed the least absolute shrinkage and selection operator (lasso). By modifying the penalization term $\lambda_\ell$ , the lasso method forces regression coefficients to zero. In doing so, the lasso practically selects model terms down to an optimal subset of predictors and, as such, this method could be used on an entire data set of predictors and interactions to select those that yield the best model performance. Friedman, Hastie, and Tibshirani (2010) extended this technique so that it could be applied to the classification setting.

Traditionally, these two different types of penalties have been associated with different tasks; the ridge penalty is mostly associated with combating collinearity between predictors while the lasso penalty is used for the elimination of predictors. In some data sets, both effects may be required to create an effective model. Zou and Hastie (2005) and Friedman, Hastie, and Tibshirani (2010) devised approaches for blending both types of penalties together. The glmnet model has tuning parameters for the total penalty $\lambda = \lambda_r + \lambda_\ell$ and the proportion of $\lambda$ that is associated with the lasso penalty (usually denoted as $\alpha$ ). Selecting $\alpha = 1$ would be a fully lasso penalty model whereas $\alpha = 0.5$ would be an even mix. For regression models, this approach optimizes

$SSE = \sum_{i=1}^n (y_i-\widehat{y}_i)^2 + (1- \alpha) \lambda_r \sum_{j=1}^P\beta_j^2 + \alpha\lambda_\ell \sum_{j=1}^P|\beta_j|$

Figure 7.8(a) shows the results of tuning the model using resampling on the Ames data using the base model plus all possible interactions. Each line is the relationship between the estimated RMSE and $\lambda$ for a given value of $\alpha$ . As the model becomes more lasso-like, the RMSE is reduced. In the end, a pure-lasso model using $\lambda = 0.003$ was best. The estimated RMSE from this model was 0.076 log units. Of the 1,033 predictors, the model selected a subset of 115 terms (only 3 of which were main effects). For these data, the lasso appears to prefer primarily interaction terms in the model; this technique does not enforce the hierarchy principle although Bien, Taylor, and Tibshirani (2013) developed a methodology which does for this model.

Figure 7.8: (a) The tuning parameter profile for the glmnet method applied to the Ames data. (b) The relationship of the top selected interaction term, Year Built x Living Area, to the outcome data.

The interaction effect with the largest coefficient crossed the year built and the living area. This predictor is shown in Figure 7.8(b) where there is a large log-linear effect. Table 7.1 shows the top 15 interaction terms. In this set, several model terms show up multiple times (i.e., neighborhood, living area, etc.).

Table 7.1: The top terms (predictors and pairwise interactions) selected by the glmnet procedure.
Predictor 1	Predictor 2	Coefficient
Living Area	Year Built	0.10143
Latitude	Year Built	0.02363
Central Air: Y	Lot Area	0.02281
Lot Frontage	Neighborhood: Northridge Heights	0.01144
Central Air: Y	Neighborhood: Gilbert	-0.01117
Lot Area	Year Built	0.00984
Foundation: PConc	Roof Style: Hip	0.00984
Lot Frontage	Neighborhood: Edwards	-0.00944
Foundation: PConc	Lot Area	0.00893
Alley: No Alley Access	MS SubClass: Two Story PUD 1946 and Newer	-0.00874
Alley: No Alley Access	MS SubClass: Two Story 1946 and Newer	-0.00624
Bldg Type: Duplex	Central Air: Y	-0.00477
Foundation: Slab	Living Area	-0.00454
Land Contour: HLS	Roof Style: Hip	0.00414
Lot Frontage	Pool Area	-0.00409

Although, as noted in Section 3.7, there are Bayesian methods that can be used to compare models without generating p-values.↩