6.2 Binary and zero-inflated exposures

The setting described so far implicitly assumes that we are dealing with multiple continuous exposures of joint interest (e.g. concentrations of chemicals or pollutants). One important caveat, however, is that continuous exposures evaluated in this context are strictly non-negative and usually highly right-skewed. Log-transformations are commonly used, but they are ineffective when many values are zero, since the logarithm of zero is undefined. Zero-inflated exposures are skewed covariates with a large proportion of zeros, typically occurring in environmental epidemiology when many individuals have values below the limit of detection. Removing those individuals from the study (that is, treating the information as missing) might reduce power and, most importantly, does not reflect real levels of exposure: it would silence all effects occurring at low exposure levels. Common alternatives include dichotomization of each exposure into detected/non-detected, the use of categorical exposures, or imputation of non-detected values. Even with imputation, however, a high number of zeros leaves a large proportion of individuals with the same exposure value, so that in practical terms it becomes hard to treat the exposure as continuous. If one wants to include zero-inflated covariates in the mixture without any transformation, available techniques include zero-inflated Poisson (ZIP) models, zero-inflated negative binomial (ZINB) models, or hurdle models.
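
To make the trade-off between these options concrete, the following is a minimal Python sketch, not a recommended pipeline. The function name `dichotomize_and_impute`, the toy concentrations, and the limit of detection of 0.1 are all hypothetical. Single-value substitution with LOD/√2 is only one commonly used convention, assumed here for illustration. The sketch shows that, after imputation and log-transformation, all non-detected individuals still share a single exposure value.

```python
import math

def dichotomize_and_impute(values, lod):
    """Return (detected indicator, log of imputed values, proportion non-detected).

    Assumption: values below `lod` are replaced by lod / sqrt(2), a simple
    single-value substitution chosen here only for illustration.
    """
    # 1 = detected (at or above the limit of detection), 0 = non-detected
    detected = [1 if v >= lod else 0 for v in values]
    fill = lod / math.sqrt(2)
    # Log-transform is now defined everywhere, because no value is zero
    log_imputed = [
        math.log(v) if d else math.log(fill)
        for v, d in zip(values, detected)
    ]
    prop_nd = 1 - sum(detected) / len(detected)
    return detected, log_imputed, prop_nd

# Hypothetical concentrations with many zeros (values below the LOD)
x = [0.0, 0.0, 0.0, 0.4, 1.2, 0.0, 3.5, 0.0]
detected, log_imputed, prop_nd = dichotomize_and_impute(x, lod=0.1)

print(detected)                # binary detected/non-detected exposure
print(prop_nd)                 # proportion of non-detected values
print(len(set(log_imputed)))   # number of distinct values after imputation
```

In this toy example more than half of the individuals end up with an identical imputed value, which is the situation in which treating the exposure as continuous becomes questionable.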

When exposures are instead dichotomized (or, in general, when the interest is in evaluating multiple binary exposures), some additional techniques can be considered:

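  • First of all, the crude association between binary exposures, which we presented earlier with the correlation matrix, can be evaluated using the \(\phi\) coefficient, with \(\phi=\sqrt{\chi^2/n}\), where \(\chi^2\) is the Pearson chi-square statistic of the 2×2 table of the two exposures and \(n\) is the number of observations. This expression gives the magnitude of the association; \(\phi\) is equivalent to the Pearson correlation computed on the two binary variables, which also carries the sign.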
  • Correspondence analysis: this graphically displays all covariates based on their proximity. We can think of this approach as an unsupervised method to investigate and depict patterns of exposures.
  • Hierarchical models and penalized methods can be used with binary exposures. If all covariates are binary, you may prefer not to standardize in order to improve interpretation.
  • For high dimensional data, extensions of the regression and classification tree approaches for binary data have been developed, both unsupervised and supervised (e.g. CART/MARS, logic regression). BRT can be used with binary exposures.
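
As an illustration of the first point in the list, the sketch below computes the \(\phi\) coefficient for a pair of binary exposures in plain Python and checks it against the Pearson chi-square statistic (without continuity correction). The function names `cell_counts`, `phi_coefficient` and `chi_square_2x2`, as well as the two toy exposure vectors, are hypothetical and introduced only for this example.

```python
import math

def cell_counts(a, b):
    """Counts of the 2x2 table for two 0/1 vectors of equal length."""
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return n11, n10, n01, n00

def phi_coefficient(a, b):
    """Signed phi: equals the Pearson correlation of the two binary vectors."""
    n11, n10, n01, n00 = cell_counts(a, b)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # one exposure is constant; phi is undefined
    return (n11 * n00 - n10 * n01) / denom

def chi_square_2x2(a, b):
    """Pearson chi-square of the 2x2 table, no continuity correction.

    Assumes neither exposure is constant (all margins are non-zero).
    """
    n11, n10, n01, n00 = cell_counts(a, b)
    n = n11 + n10 + n01 + n00
    margins = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / margins

# Two hypothetical detected/non-detected exposures on 8 individuals
exp_a = [1, 1, 1, 0, 0, 0, 1, 0]
exp_b = [1, 1, 0, 0, 0, 1, 1, 0]

phi = phi_coefficient(exp_a, exp_b)
phi_from_chi2 = math.sqrt(chi_square_2x2(exp_a, exp_b) / len(exp_a))
print(phi, phi_from_chi2)  # the two agree in absolute value
```

Applying `phi_coefficient` to every pair of binary exposures gives the analogue of the correlation matrix used for continuous exposures.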