4.1 Unsupervised summary scores
An intuitive way to address this question is to create one or more summary scores that capture each individual's level of exposure to the mixture, thereby reducing the number of covariates to be evaluated. A common example comes from research on phthalates. In this context, analyses are often hampered by extreme correlation between metabolites of Di(2-ethylhexyl)phthalate (DEHP), and researchers commonly summarize this information as a molar sum of DEHP metabolites. Li et al. (2019) write, for example, “we calculated the molar sum of DEHP metabolites (ΣDEHP) by dividing each metabolite concentration by its molecular weight and then summing: ΣDEHP = [MEHP (μg/L) × (1/278.34 (g/mol))] + [MEHHP (μg/L) × (1/294.34 (g/mol))] + [MEOHP (μg/L) × (1/292.33 (g/mol))] + [MECPP (μg/L) × (1/308.33 (g/mol))].” Note that, with this approach, the score targets a selected subset of exposures (the highly correlated cluster that causes problems), while the other phthalate metabolites enter the model without any transformation.
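As a minimal sketch of the molar sum quoted above, the calculation can be written in a few lines of Python. The molecular weights are those given in the quotation; the function name `molar_sum_dehp` and the example concentrations are hypothetical and serve only as illustration. Dividing a concentration in μg/L by a molecular weight in g/mol gives μmol/L, so the resulting score is expressed in μmol/L.

```python
# Molecular weights (g/mol) as quoted from Li et al. (2019) in the text.
MOLECULAR_WEIGHT = {
    "MEHP": 278.34,
    "MEHHP": 294.34,
    "MEOHP": 292.33,
    "MECPP": 308.33,
}


def molar_sum_dehp(conc_ug_per_l):
    """Return the molar sum of DEHP metabolites in umol/L.

    conc_ug_per_l maps each metabolite name to its concentration in ug/L.
    (Hypothetical helper name, not taken from the cited paper.)
    """
    return sum(conc_ug_per_l[name] / mw for name, mw in MOLECULAR_WEIGHT.items())


# Illustrative concentrations for one participant (made-up values).
example = {"MEHP": 2.0, "MEHHP": 10.0, "MEOHP": 7.0, "MECPP": 15.0}
sigma_dehp = molar_sum_dehp(example)
print(f"Sum DEHP = {sigma_dehp:.4f} umol/L")
```

In a regression model, the four DEHP metabolites would then be replaced by this single score, while the remaining phthalate metabolites stay in the model as separate covariates.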
Another common approach is to use components derived from PCA, as described in section 2. PCA identifies continuous covariates that summarize the variability of the mixture exposure. Including these derived components in a regression model has the great advantage of removing collinearity among the exposure covariates, since the components are uncorrelated by construction. On the other hand, the usefulness of this approach depends heavily on whether the resulting components have a clear biological interpretation. An example of this approach in environmental epidemiology can be found in Souter et al. (2020).
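The following sketch illustrates the idea on simulated data, assuming NumPy is available. It is not the analysis of any cited study: the data-generating model, the number of retained components (`k = 2`), and all variable names are assumptions made for illustration. PCA is computed here through a singular value decomposition of the standardized exposure matrix, and the component scores are then used as covariates in an ordinary least squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated exposures: four highly correlated variables driven by one
# shared latent factor plus independent noise (illustrative assumption).
latent = rng.normal(size=(n, 1))
X = latent @ np.ones((1, 4)) + 0.3 * rng.normal(size=(n, 4))
y = 0.5 * latent[:, 0] + rng.normal(size=n)  # simulated outcome

# Standardize each exposure before PCA.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via SVD: rows of Vt are the loadings, U * s are the component scores.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s
explained = s**2 / np.sum(s**2)  # proportion of variance per component

# The component scores are mutually uncorrelated.
score_corr = np.corrcoef(scores, rowvar=False)

# Regress the outcome on the first k components (plus an intercept).
k = 2
design = np.column_stack([np.ones(n), scores[:, :k]])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

print("Variance explained:", np.round(explained, 3))
print("Coefficients (intercept, PC1, PC2):", np.round(coef, 3))
```

The correlation matrix of the scores is the identity up to rounding error, which is why collinearity disappears. The loadings in `Vt` are what one would inspect to judge whether a component has a plausible biological interpretation.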