5.2 Assessing interactions

5.2.1 Tree-based modeling

Formally evaluating interactions poses unique challenges. As already discussed, including several covariates and high-order interactions in a regression framework rapidly increases the number of parameters to be estimated, and the resulting model complexity makes classical regression techniques of little use. Summary and classification approaches like WQS cannot provide estimates of interaction effects and, as just discussed, BKMR can only provide a qualitative assessment of interactions, and only among those exposures that have passed the selection procedure.

To account for the complexity of joint effects and high-dimensional interactions, one should consider techniques that have been specifically developed to deal with complex and big data. One machine learning (ML) approach that can be useful for interaction analysis, particularly when evaluating environmental exposures, is boosted regression trees (BRT). BRT is a tree-based modeling technique that can be used to evaluate complex high-dimensional interactions among several variables, which can be continuous, categorical, or binary. Boosted trees are designed to improve the performance of classification and regression trees (CARTs), which partition the data into several disjoint regions and approximate the outcome as constant within each region. CARTs can account for complex interactions by conditioning subsequent splits on previous ones, a feature that is controlled by a “depth” option. Greater depths correspond to higher-order interactions: in practical terms, by increasing the depth option of the algorithm we allow the model to incorporate interactions of increasingly higher order. The order of interactions to evaluate, together with the other parameters of the model, is chosen through cross-validation.
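The link between tree depth and interaction order can be illustrated with a minimal sketch. The code below is not the BRT implementation referred to in this section: it is a hypothetical, pure-Python greedy regression tree (the names `fit_tree` and `predict` are our own), fitted to a toy outcome that equals 1 only when two binary exposures are both present. A tree of depth 1 cannot represent this joint effect, while a tree of depth 2, whose second split is conditional on the first, can.

```python
def sse(ys):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)


def fit_tree(X, y, depth):
    """Greedy regression tree: returns a nested dict, or a float for a leaf."""
    mean = sum(y) / len(y)
    if depth == 0 or sse(y) == 0.0:
        return mean
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every observed value except the largest.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:  # no variable can be split any further
        return mean
    _, j, t, left, right = best
    return {
        "var": j,
        "thr": t,
        "left": fit_tree([X[i] for i in left], [y[i] for i in left], depth - 1),
        "right": fit_tree([X[i] for i in right], [y[i] for i in right], depth - 1),
    }


def predict(tree, row):
    """Follow the splits until a leaf (a float) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["var"]] <= tree["thr"] else tree["right"]
    return tree


# Toy data (an assumption for illustration): two binary exposures,
# outcome present only when both exposures are present.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [a * b for a, b in X]

stump = fit_tree(X, y, depth=1)  # main effect of one exposure only
deep = fit_tree(X, y, depth=2)   # second split conditional on the first

err_stump = sum((predict(stump, r) - t) ** 2 for r, t in zip(X, y))
err_deep = sum((predict(deep, r) - t) ** 2 for r, t in zip(X, y))
print(err_stump, err_deep)
```

On this toy example the depth-1 tree leaves residual error, whereas the depth-2 tree reproduces the outcome, because its second split is made only within the group defined by the first.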

Boosted trees improve the predictive performance of a single CART by combining several weak learners to accurately identify a set of explanatory variables that are associated with the outcome. The improved predictive performance, however, comes at the expense of easy interpretation. Specifically, a BRT provides measures of variable importance, partial dependence plots, and a hierarchy of interactions, but does not provide effect estimates for each variable or interaction as in classical regression. A BRT model provides the following objects as output:

  • Variable importance: this is based on how many times each variable is involved in a split, capturing its independent predictive power with respect to the outcome. This measure has an interpretation similar to that of PIPs in BKMR.

  • Dependence plots: similarly to the univariate dose-responses in BKMR, these provide a graphical visualization of the fitted function that presents the associations between one or more predictors and the outcome. These plots are especially helpful with continuous predictors, but note that this technique can be used with any kind of exposure.

  • H-statistics: these are measures of interaction relevance that indicate, for any pair of predictors, the fraction of variance that is not captured by the sum of the two fitted response functions. Importantly, depending on the depth of the algorithm, H-statistics can be calculated for interactions of any order, 2-way and higher. These measures do not provide a summary of relative importance (i.e., they do not sum to 1) but rather a ranking of the importance of interactions.
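To make the definition of the H-statistic more concrete, here is a minimal sketch. It assumes that the statistic meant here is the pairwise H-statistic of Friedman and Popescu: the squared difference between the centered joint partial dependence of two predictors and the sum of their centered one-variable partial dependences, divided by the squared centered joint partial dependence. The functions `partial_dependence` and `h_squared` are hypothetical helpers written for this illustration, and the two toy prediction functions stand in for a fitted BRT.

```python
def partial_dependence(f, data, cols, values):
    """Average prediction over the data with columns `cols` fixed at `values`."""
    total = 0.0
    for row in data:
        row = list(row)
        for c, v in zip(cols, values):
            row[c] = v
        total += f(row)
    return total / len(data)


def h_squared(f, data, j, k):
    """Pairwise H^2: share of the variability of the joint partial dependence
    of (x_j, x_k) not captured by the sum of the two one-variable functions."""
    pd_j = [partial_dependence(f, data, [j], [r[j]]) for r in data]
    pd_k = [partial_dependence(f, data, [k], [r[k]]) for r in data]
    pd_jk = [partial_dependence(f, data, [j, k], [r[j], r[k]]) for r in data]

    def center(v):
        m = sum(v) / len(v)
        return [x - m for x in v]

    pd_j, pd_k, pd_jk = center(pd_j), center(pd_k), center(pd_jk)
    num = sum((a - b - c) ** 2 for a, b, c in zip(pd_jk, pd_j, pd_k))
    den = sum(a ** 2 for a in pd_jk)
    return num / den


# Toy data and toy "fitted" functions (assumptions for illustration only).
grid = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def additive(r):
    return r[0] + 2 * r[1]  # no interaction between the two predictors


def product(r):
    return r[0] * r[1]      # pure interaction, no main effects


h_add = h_squared(additive, grid, 0, 1)
h_prod = h_squared(product, grid, 0, 1)
print(h_add, h_prod)
```

In this toy setting the additive function gives a statistic of 0 and the pure product gives 1, the two extremes between which H-statistics from a real model would fall.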

For more details on boosted trees we refer the reader to previous publications (Lampa et al. 2014) and online documentation.
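The idea of combining several weak learners can be sketched in a few lines. The example below is a simplified illustration, not the algorithm of any specific BRT package: it assumes a squared-error loss, a single continuous predictor, and one-split trees (stumps) as weak learners. The function names and the values of `n_trees` and `learning_rate` are arbitrary choices made for this illustration.

```python
def fit_stump(x, r):
    """Best single split on a 1-D predictor for the target r.

    Returns (threshold, mean on the left, mean on the right)."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        score = (sum((ri - ml) ** 2 for ri in left)
                 + sum((ri - mr) ** 2 for ri in right))
        if best is None or score < best[0]:
            best = (score, t, ml, mr)
    return best[1:]


def boost(x, y, n_trees=300, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current
    residuals and add a shrunken version of it to the prediction."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        t, ml, mr = fit_stump(x, resid)
        stumps.append((t, ml, mr))
        pred = [pi + learning_rate * (ml if xi <= t else mr)
                for xi, pi in zip(x, pred)]
    return base, stumps, pred


# Toy data (an assumption for illustration): a smooth non-linear dose-response.
x = list(range(10))
y = [xi ** 2 for xi in x]

# A single stump fitted directly to the outcome.
t, ml, mr = fit_stump(x, y)
sse_stump = sum((yi - (ml if xi <= t else mr)) ** 2 for xi, yi in zip(x, y))

# Many shrunken stumps combined by boosting.
_, _, boosted = boost(x, y)
sse_boost = sum((yi - pi) ** 2 for yi, pi in zip(y, boosted))
print(sse_stump, sse_boost)
```

Each stump is a poor model on its own, but their shrunken sum tracks the curve far more closely than a single split can. In practice the number of trees and the learning rate would be tuned by cross-validation rather than fixed as here.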

5.2.2 Interaction screening and regression approaches

Neither BKMR, which provides a qualitative graphical assessment of interactions, nor BRT models, which allow estimating H-statistics to rank interactions of different orders, provide direct estimates or tests of interaction effects. For this reason, a recommended practice is to use these techniques as interaction screening procedures and employ a 2-step approach in which the selected interactions are then evaluated in a final regression model. As an illustrative example, the reader can refer to a recent publication in which we used this approach to identify 2-way interactions between occupational exposures and health factors, which we then included in a regression model evaluating the effect of this mixture on ALS risk (Bellavia et al. 2021).

Figure 5.10: H-statistics from Bellavia et al. 2021
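The second step of the 2-step approach can be sketched as follows. This is a simplified illustration under several assumptions: a continuous outcome and a linear model (the cited application concerned ALS risk and would require a different regression model), noise-free toy data, and a pair of exposures `x1` and `x2` assumed to have been selected at the screening step. The helpers `solve` and `ols` are hypothetical and written in pure Python so that the example is self-contained.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    return solve(xtx, xty)


# Toy data (an assumption): two exposures on a 3 x 3 grid, with a true
# interaction coefficient of 4 and no noise.
pairs = [(x1, x2) for x1 in (0, 1, 2) for x2 in (0, 1, 2)]
y = [1 + 2 * x1 + 3 * x2 + 4 * x1 * x2 for x1, x2 in pairs]

# Step 2: regression including the product term for the screened pair.
design = [[1, x1, x2, x1 * x2] for x1, x2 in pairs]
coef = ols(design, y)  # intercept, x1, x2, x1:x2
print([round(c, 6) for c in coef])
```

The last coefficient is the direct estimate of the interaction that the screening step alone does not provide. With real data it would come with a standard error and a test from the chosen regression software.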


Bellavia, Andrea, Aisha S Dickerson, Ran S Rotem, Johnni Hansen, Ole Gredal, and Marc G Weisskopf. 2021. “Joint and Interactive Effects Between Health Comorbidities and Environmental Exposures in Predicting Amyotrophic Lateral Sclerosis.” International Journal of Hygiene and Environmental Health 231: 113655.
Lampa, Erik, Lars Lind, P Monica Lind, and Anna Bornefalk-Hermansson. 2014. “The Identification of Complex Interactions in Epidemiology and Toxicology: A Simulation Study of Boosted Regression Trees.” Environmental Health 13 (1): 1–17.