## 3.1 Ordinary Least Squares (OLS) regression

### 3.1.1 Chemical-specific regression (EWAS)

A simple way to assess the association between a set of $$p$$ environmental exposures ($$X_1, \dots, X_p$$) and a given outcome $$Y$$ is to build $$p$$ separate regression models, one for each exposure (the approach that we previously described as “one-at-a-time”). Each model can be further adjusted for potential confounders of each exposure-outcome association. For example, if $$Y$$ were a continuous outcome, we could fit a set of linear regression models such as $$E[Y|X_1,C]=\beta_0+\beta_1 \cdot X_1 + \beta\cdot C$$, one for each exposure. The implicit assumption of this modeling procedure is that, for each element of the mixture, the other components do not act as confounders of the exposure-outcome association, as depicted in Figure 3.1.

Figure 3.1: DAG for five mixture components independently associated with the outcome

When evaluating a set of environmental exposures, this procedure of fitting a set of independent regression models is usually referred to as environment-wide association study (EWAS, Patel et al. 2010). This approach usually requires correcting for multiple comparisons using either the Bonferroni approach or the false discovery rate (FDR).
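The two corrections mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the book's analyses: the function names `bonferroni` and `benjamini_hochberg` are our own, the FDR adjustment shown is the Benjamini-Hochberg step-up procedure (one common way to control the FDR), and the p-values are hypothetical.

```python
import numpy as np

def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    p = np.asarray(pvalues, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR-adjusted p-values (step-up procedure)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices, smallest p first
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(k) * m / k
    # Enforce monotonicity, working back from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# Hypothetical p-values from five single-exposure models (illustrative only)
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
p_bonf = bonferroni(pvals)
p_bh = benjamini_hochberg(pvals)
print("Bonferroni:", p_bonf)
print("BH (FDR):  ", p_bh)
```

With these illustrative inputs, the third and fourth p-values fall below 0.05 before correction but not after either adjustment. The Bonferroni-adjusted values are never smaller than the Benjamini-Hochberg ones, which is why Bonferroni is the more conservative of the two.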

Table 3.1 reports results from independent linear regression models (here without any adjustment for multiple comparisons) for selected exposures from our illustrative example.

Table 3.1: Single regressions for selected exposures in the simulated dataset

| Exposure | Estimate | p-value |
|---|---|---|
| x3 | 0.078 | 0.007 |
| x4 | 0.089 | 0.003 |
| x5 | 0.068 | 0.005 |
| x12 | 0.294 | <0.001 |
| x13 | 0.238 | <0.001 |

These results seem to indicate that all exposures are independently associated with the outcome.

### 3.1.2 Multiple regression

Results from independent linear regressions are hampered by the strong assumption that mixture components do not act as confounders of the association between each other component and the outcome of interest. This assumption is very seldom met in practice. A common situation, for example, is that two or more constituents of the mixture share one or more sources, which usually results in moderate to high levels of correlation between exposures. Using DAGs, we can depict this situation as in Figure 3.2.

Figure 3.2: DAG for two exposures that share a common source

In this situation, a statistical model evaluating the association between $$X_1$$ and $$Y$$ will need to adjust for $$X_2$$ to reduce bias due to confounding. In general, whenever two mixture components are correlated, we expect each to act as a confounder of the association between the other exposure and the outcome. This implies that results from independent linear regressions are likely biased due to uncontrolled confounding. In our illustrative example, for instance, we know that $$X_{12}$$ and $$X_{13}$$ are highly correlated; results from independent linear regressions indicated that both exposures are positively associated with the outcome (Table 3.1), but these coefficients are probably biased. Mutually adjusting for the two exposures in the same statistical model is therefore required to account for such confounding and possibly identify whether both exposures are really associated with the outcome, or whether the real driver of the association is just one of the two. Note that both situations are realistic: we might have settings where a specific exposure is biologically harmful (say $$X_{12}$$) and the association between the correlated one ($$X_{13}$$) and the outcome is a spurious result of this high correlation, as well as settings where both exposures are really associated with the outcome (perhaps because it is the shared source of exposure that has a direct effect). We need statistical methodologies that are able to detect and distinguish these possible scenarios.

The most intuitive way to account for co-confounding between mixture components is to mutually adjust for all exposures in the same regression model:

$$E[Y|X,C]=\beta_0+\sum_{i=1}^p\beta_i \cdot X_i + \beta \cdot C$$

Table 3.2 presents results from a multiple regression that includes the 14 exposures in our example, as well as results from the chemical-specific models.

Table 3.2: Multiple and single regression results from the simulated dataset

| Exposure | Estimate (multiple) | p-value (multiple) | Estimate (single) | p-value (single) |
|---|---|---|---|---|
| x1 | 0.058 | 0.080 | 0.106 | 0.001 |
| x2 | 0.018 | 0.554 | 0.073 | 0.012 |
| x3 | -0.030 | 0.774 | 0.078 | 0.007 |
| x4 | 0.053 | 0.644 | 0.089 | 0.003 |
| x5 | 0.004 | 0.923 | 0.068 | 0.005 |
| x6 | 0.060 | 0.047 | 0.120 | <0.001 |
| x7 | -0.031 | 0.620 | 0.153 | <0.001 |
| x8 | 0.017 | 0.679 | 0.137 | <0.001 |
| x9 | 0.025 | 0.673 | 0.160 | <0.001 |
| x10 | 0.052 | 0.260 | 0.125 | <0.001 |
| x11 | 0.049 | 0.341 | 0.149 | <0.001 |
| x12 | 0.222 | 0.138 | 0.294 | <0.001 |
| x13 | -0.083 | 0.586 | 0.238 | <0.001 |
| x14 | 0.054 | 0.293 | 0.185 | <0.001 |
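To see why single and mutually adjusted estimates can diverge, the following Python sketch simulates the shared-source scenario with two exposures. It is an illustration under assumed values, not a reproduction of the book's simulated dataset: the data-generating process, the effect size of 0.3, and the helper `ols` are all our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed data-generating process: a shared source drives both exposures,
# but only x1 has a direct effect on the outcome.
source = rng.normal(size=n)
x1 = source + rng.normal(scale=0.5, size=n)
x2 = source + rng.normal(scale=0.5, size=n)
y = 0.3 * x1 + rng.normal(size=n)

def ols(y, *columns):
    """Least-squares coefficients (intercept first) of y on the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_single_x1 = ols(y, x1)[1]      # x1 alone
b_single_x2 = ols(y, x2)[1]      # x2 alone: picks up the effect of x1
b_multi = ols(y, x1, x2)[1:]     # mutually adjusted coefficients

print(f"single:   x1={b_single_x1:.3f}  x2={b_single_x2:.3f}")
print(f"multiple: x1={b_multi[0]:.3f}  x2={b_multi[1]:.3f}")
```

In this setup the single-exposure model for `x2` returns a clearly positive coefficient even though `x2` has no direct effect, while the mutually adjusted model attributes the association to `x1`. The sample size is deliberately large so that the contrast is not obscured by sampling noise.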

### 3.1.3 The problem of multicollinearity

Results from the multiple regression are not consistent with those obtained from independent regression models, especially (and unsurprisingly) for those exposures that showed high levels of correlation. For example, within the exposure cluster $$X_{12}-X_{13}$$, the multiple regression model suggests that only $$X_{12}$$ remains positively associated with the outcome (although no longer at conventional significance levels), while the coefficient of $$X_{13}$$ is strongly reduced and changes sign. Something similar happens for the $$X_3-X_4-X_5$$ cluster, where only $$X_4$$ retains a sizeable positive coefficient. Can we safely conclude that $$X_{12}$$ and $$X_4$$ are associated with $$Y$$ and that the other results were biased due to uncontrolled confounding? Before addressing this question, let’s take a look at a published paper where we evaluated the performance of several statistical models in assessing the association between a mixture of 8 phthalate metabolites and birth weight in a pregnancy cohort (Chiu et al. 2018). Figure 3.3 presents results from the 8 independent regressions and a multiple regression model, while Figure 3.4 presents the correlation plot of the 8 metabolites.

Figure 3.3: Regression results from Chiu et al. 2018

Figure 3.4: Correlation plot from Chiu et al. 2018

While we were expecting results from the two approaches to differ in the presence of high correlations, the coefficients obtained from the multiple regression leave room for considerable skepticism. For example, the coefficients for MEOHP and MEHHP, when evaluated together, change respectively from -24 to 247, and from -28 to -127. Are these results reliable? Are we getting any improvement over the biased results that we obtained from independent linear regressions?

The most common problem that arises when using multiple regression to investigate mixture-outcome associations is multicollinearity (or simply collinearity). This occurs when independent variables in a regression model are correlated, with stronger consequences the higher the correlation. More specifically, a high correlation between two predictors simultaneously included in a regression model will decrease the precision of their estimates, that is, increase their standard errors. If the correlation between two covariates (say $$X_1$$ and $$X_2$$) is very high, then one is a pretty accurate linear predictor of the other. Collinearity does not influence the overall performance of the model, but has an important impact on individual predictors.

In general (as a rule of thumb), given two predictors $$X_1$$ and $$X_2$$ that are associated with the outcome ($$\beta=0.2$$ for both) when their correlation is equal to 0, the estimates in a linear model will be impacted by $$\rho(X_1, X_2)$$ as presented in Figure 3.5. This issue, usually referred to as the reversal paradox (the coefficients of two correlated covariates inflate in opposite directions), is clearly affecting results from the paper presented above (the coefficients of highly correlated phthalate metabolites are either extremely large or extremely small), and possibly also results from the illustrative example (coefficients from correlated variables have opposite signs). Nevertheless, it should be noted that high correlation does not automatically imply that coefficients will be inflated. In another example (Bellavia et al. 2019), for instance, we evaluated a mixture of three highly correlated paraben compounds, yet results from multiple regression were in line with those obtained from other mixture modeling techniques.
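The loss of precision described above can be explored with a small simulation. The sketch below is illustrative only and does not reproduce Figure 3.5: the sample size, the number of replications, and the function name `simulate_beta1` are assumptions of ours, while the true coefficients ($$\beta=0.2$$ for both predictors) follow the text.

```python
import numpy as np

def simulate_beta1(rho, n=200, n_sim=500, beta=0.2, seed=0):
    """Estimates of beta_1 across n_sim simulated datasets with corr(X1, X2) = rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = np.empty(n_sim)
    for s in range(n_sim):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        y = beta * X[:, 0] + beta * X[:, 1] + rng.normal(size=n)
        design = np.column_stack([np.ones(n), X])   # intercept, X1, X2
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates[s] = coef[1]                      # coefficient of X1
    return estimates

results = {}
for rho in (0.0, 0.5, 0.9, 0.99):
    est = simulate_beta1(rho)
    results[rho] = (est.mean(), est.std(ddof=1))
    print(f"rho={rho:4.2f}  mean(beta1)={est.mean():6.3f}  sd(beta1)={est.std(ddof=1):5.3f}")
```

Across replications the estimates remain centered near 0.2, but their spread grows sharply as the correlation approaches 1. In any single dataset with highly correlated predictors, this makes very large, very small, or sign-flipped coefficients far more likely.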

To quantify the severity of multicollinearity in a regression analysis one should calculate the Variance Inflation Factor (VIF). The VIF provides a measure of how much the variance of an estimated regression coefficient is increased because of collinearity. For example, if the VIF for a given predictor were 4, then the standard error of that predictor's coefficient would be 2 times larger (the square root of the VIF) than if that predictor had 0 correlation with the other variables. As a rule of thumb, VIFs above 4 should raise an alarm, as they indicate that those coefficients are likely affected by the high correlations between the corresponding predictor and other covariates in the model. Table 3.3 shows VIFs in our illustrative example, indicating that our results are deeply affected by multicollinearity. In this situation, alternative modeling options should be pursued.

Table 3.3: VIFs from multiple regression results presented in Table 3.2

| Variable | VIF |
|---|---|
| x1 | 1.235658 |
| x2 | 1.317951 |
| x3 | 49.479946 |
| x4 | 58.241935 |
| x5 | 11.256382 |
| x6 | 2.271043 |
| x7 | 2.722583 |
| x8 | 3.892965 |
| x9 | 2.553431 |
| x10 | 2.810535 |
| x11 | 3.694404 |
| x12 | 6.085748 |
| x13 | 6.557098 |
| x14 | 3.152092 |
| z1 | 1.139690 |
| z2 | 4.784064 |
| z3 | 1.135437 |
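For reference, the VIF of predictor $$j$$ is $$1/(1-R_j^2)$$, where $$R_j^2$$ is obtained by regressing predictor $$j$$ on all the other predictors. A minimal Python sketch of this calculation follows. The function `vif` and the toy data are our own and are not the code or data behind Table 3.3.

```python
import numpy as np

def vif(X):
    """VIF of each column of X: 1 / (1 - R^2_j), regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()   # R^2 of column j on the rest
        out[j] = 1.0 / (1.0 - r2)
    return out

# Toy data (hypothetical): a and b are highly correlated, c is independent
rng = np.random.default_rng(1)
n = 2000
a = rng.normal(size=n)
b = 0.95 * a + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # corr(a, b) about 0.95
c = rng.normal(size=n)
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

In this toy example the two correlated columns have VIFs well above the rule-of-thumb threshold of 4, while the independent column has a VIF close to 1. The square root of each VIF gives the factor by which the corresponding standard error is inflated.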

### References

Bellavia, Andrea, Yu-Han Chiu, Florence M Brown, Lidia Mínguez-Alarcón, Jennifer B Ford, Myra Keller, John Petrozza, et al. 2019. “Urinary Concentrations of Parabens Mixture and Pregnancy Glucose Levels Among Women from a Fertility Clinic.” Environmental Research 168: 389–96.
Chiu, Yu-Han, Andrea Bellavia, Tamarra James-Todd, Katharine F Correia, Linda Valeri, Carmen Messerlian, Jennifer B Ford, et al. 2018. “Evaluating Effects of Prenatal Exposure to Phthalate Mixtures on Birth Weight: A Comparison of Three Statistical Approaches.” Environment International 113: 231–39.