4.2 The Weighted Quantile Sum (WQS) and its extensions

Taking one step further, researchers may want to account for the relationship between the exposures and the outcome while summarizing the complex exposure to the mixture of interest. The weighted quantile sum (WQS), developed specifically for environmental mixtures analysis, is an increasingly common approach that evaluates a mixture-outcome association by creating a summary score of the mixture in a supervised fashion (Czarnota, Gennings, and Wheeler (2015); Carrico et al. (2015)). Specifically, WQS is a regression approach for high-dimensional datasets that operates in a supervised framework: it creates a single score (the weighted quantile sum) that summarizes the overall exposure to the mixture, and then includes this score in a regression model to evaluate the overall effect of the mixture on the outcome of interest. The score is calculated as a weighted sum of all exposures categorized into quartiles or other quantile groups. The weights are estimated so that exposures with weaker effects on the outcome receive a lower weight in the index, and the categorization into quantiles limits the impact of extreme values on the weight estimation.

4.2.1 Model definition and estimation

A WQS regression model takes the following form:

\[g(\mu) = \beta_0 + \beta_1\Bigg(\sum_{i=1}^{c}w_iq_i\Bigg) + \boldsymbol{z'\varphi}\]

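The \(\sum_{i=1}^{c}w_iq_i\) term represents the index that weights and sums the \(c\) components included in the mixture, where \(q_i\) is the quantile score of the \(i\)-th exposure and \(w_i\) is its weight; \(\boldsymbol{z}\) is the vector of covariates with coefficients \(\boldsymbol{\varphi}\), and \(g(\cdot)\) is the link function. As such, \(\beta_1\) is the parameter summarizing the overall effect of the (weighted) mixture. In addition, the model provides an estimate of the individual weights \(w_i\), which indicate the relative importance of each exposure in the mixture-outcome association.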
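As a minimal sketch of how the index is built, the following base R code scores three simulated exposures into quartiles and combines them with a set of weights. The data, the helper `quantile_score()`, the 0 to 3 coding of the quartiles, and the weights themselves are assumptions made for illustration only; in a real analysis the weights are estimated from the data.

```r
# Simulated exposures; names and sizes are illustrative only.
set.seed(1)
n <- 200
X <- matrix(rlnorm(n * 3), ncol = 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))

# Score one exposure into q quantile groups coded 0, 1, ..., q - 1.
quantile_score <- function(x, q = 4) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = q + 1))
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE)) - 1L
}
Q <- apply(X, 2, quantile_score)

# Hypothetical weights: non-negative and summing to one.
w <- c(x1 = 0.6, x2 = 0.3, x3 = 0.1)

# The index is the weighted sum of the quantile scores for each subject.
wqs_index <- as.vector(Q %*% w)
range(wqs_index)
```

Because the weights sum to one, the index lives on the same scale as the quantile scores, so \(\beta_1\) can be read as the change in the (transformed) outcome for a one-quantile increase in the weighted mixture.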

To estimate the model, the data may be split into a training and a validation dataset: the first is used to estimate the weights, the second to test the significance of the final WQS index. The weights are estimated through a bootstrap ensemble and are constrained to sum to one and to lie between zero and one: \(\sum_{i=1}^{c}w_i=1\) and \(0 \leq w_i \leq 1\). For each bootstrap sample (usually \(B=100\) samples in total), a dataset is created by sampling with replacement from the training dataset, and the parameters of the model are estimated through an optimization algorithm.

Within each ensemble step, once the weights are estimated, the model is fitted to obtain the regression coefficients. After the bootstrap ensemble is completed, the estimated weights are averaged across bootstrap samples to obtain the WQS index:

\[WQS = \sum_{i=1}^c \bar{w}_iq_i\]

Typically, weights are estimated in a training set and then used to construct a WQS index in a validation set, where the association between the mixture and the health outcome is tested in a standard generalized linear model:

\[g(\mu) = \beta_0 + \beta_1WQS + \boldsymbol{z'\varphi}\]

Once the final model is fitted, one can test the significance of \(\beta_1\) to assess whether there is an association between the WQS index and the outcome. If the coefficient is significantly different from 0, the weights can be interpreted: the components with the highest weights are the main contributors to the association. A selection threshold can be chosen a priori to identify the chemicals that carry a relevant weight in the index.
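The estimation steps described above can be sketched in base R for a continuous outcome. This is a simplified illustration under stated assumptions, not the algorithm used by any specific package: the data are simulated, the weight constraints are imposed through a softmax reparameterization, \(\beta_1\) is kept non-negative with a box constraint in `optim()`, and only 20 bootstrap samples are drawn to keep the example fast. All object names are hypothetical.

```r
set.seed(123)
n <- 400
c_mix <- 4
X <- matrix(rnorm(n * c_mix), ncol = c_mix,
            dimnames = list(NULL, paste0("x", 1:c_mix)))
# Simulated truth: only x1 and x2 affect the outcome.
y <- 1 + 0.8 * X[, 1] + 0.4 * X[, 2] + rnorm(n)

# Quartile scores coded 0..3
Q <- apply(X, 2, function(x)
  as.integer(cut(x, quantile(x, 0:4 / 4), include.lowest = TRUE)) - 1L)

# 40-60 training/validation split
train <- sample(n, size = 0.4 * n)
valid <- setdiff(seq_len(n), train)

# Softmax maps free parameters to weights in [0, 1] that sum to one.
softmax <- function(a) exp(a - max(a)) / sum(exp(a - max(a)))

# Residual sum of squares; par = (beta0, beta1, a_1, ..., a_c)
rss <- function(par, Q, y) {
  w <- softmax(par[-(1:2)])
  sum((y - par[1] - par[2] * as.vector(Q %*% w))^2)
}

# Weights from one bootstrap sample, with beta1 constrained to be >= 0
fit_weights <- function(idx) {
  fit <- optim(c(mean(y[idx]), 0, rep(0, c_mix)), rss,
               Q = Q[idx, , drop = FALSE], y = y[idx],
               method = "L-BFGS-B", lower = c(-Inf, 0, rep(-Inf, c_mix)))
  softmax(fit$par[-(1:2)])
}

B <- 20
boot_w <- t(replicate(B, fit_weights(sample(train, replace = TRUE))))
w_bar <- setNames(colMeans(boot_w), colnames(Q))

# Build the index in the validation set and test beta1 there.
wqs <- as.vector(Q[valid, ] %*% w_bar)
final <- lm(y[valid] ~ wqs)
round(w_bar, 3)
summary(final)$coefficients
```

Averaging rows that each sum to one guarantees that the final weights also sum to one. The lower bound on \(\beta_1\) encodes the positive direction; a negative-direction fit would use an upper bound of zero instead.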

4.2.2 The unidirectionality assumption

WQS makes the important assumption that all exposures act in the same direction (either positive or negative) with respect to the outcome. The model is inherently one-directional, in that it tests only for mixture effects positively or negatively associated with a given outcome. In practice, analyses should therefore be run twice to test for associations in either direction.

The one-directional index avoids the reversal paradox that can arise with highly correlated variables, thus improving the identification of bad actors.

4.2.3 Extensions of the original WQS regression

  • Dependent variables

The WQS regression can be generalized and applied to multiple types of dependent variables. In particular, WQS regression has been adapted to four different cases: logistic, multinomial, Poisson, and negative binomial regression. For the last two cases it is also possible to fit zero-inflated models: the objective function used to estimate the weights is the same as for Poisson and negative binomial regression, while the zero inflation is taken into account when fitting the final model.

  • Random selection

A novel implementation of WQS regression for high-dimensional mixtures with highly correlated components was proposed in Curtin et al. (2021). This approach replaces the bootstrap with a random selection of subsets of the variables included in the mixture for parameter estimation. This method generates more de-correlated subsets of variables and reduces the variance of the parameter estimates compared to a single analysis. The extension was shown to be more effective than standard WQS in modeling contexts with large predictor sets, complex correlation structures, or where the number of predictors exceeds the number of subjects.

  • Repeated holdout validation for WQS regression

One limitation of WQS is the reduced statistical power caused by the need to split the dataset into training and validation sets. This partition can also lead to unrepresentative sets of data and unstable parameter estimates. Tanner, Bornehag, and Gennings (2019) showed that conducting a WQS analysis on the full dataset, without splitting into training and validation sets, produces optimistic results, and proposed a repeated holdout validation that combines cross-validation and bootstrap resampling. They suggested repeatedly splitting the data 100 times with replacement and fitting a WQS regression on each partitioned dataset. This procedure yields an approximately normal distribution of the weights and of the regression parameters, whose mean or median can be used as the final estimates. A limitation of this approach is its higher computational cost.

  • Other extensions

To complete the set of currently available extensions of this approach, it is worth mentioning the Bayesian WQS (Colicino et al. (2020)), which also allows relaxing the unidirectionality assumption, and the lagged WQS (Gennings et al. (2020)), which deals with time-varying mixtures of exposures to understand the role of exposure timing.
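Among the extensions above, the repeated holdout procedure is easy to illustrate with a minimal base R sketch. The code below is a simplification under stated assumptions: the data are simulated, each repetition estimates the weights once on the training rows (without the inner bootstrap), the positive direction is imposed with a box constraint, and only 25 repetitions are run instead of the suggested 100. The helper `fit_once()` and all object names are hypothetical.

```r
set.seed(2019)
n <- 300
c_mix <- 3
X <- matrix(rnorm(n * c_mix), ncol = c_mix,
            dimnames = list(NULL, paste0("x", 1:c_mix)))
y <- 0.7 * X[, 1] + 0.3 * X[, 2] + rnorm(n)   # simulated truth
Q <- apply(X, 2, function(x)
  as.integer(cut(x, quantile(x, 0:4 / 4), include.lowest = TRUE)) - 1L)

softmax <- function(a) exp(a - max(a)) / sum(exp(a - max(a)))

# One simplified WQS fit: weights from the training rows,
# beta1 from a linear model on the remaining (validation) rows.
fit_once <- function(train) {
  rss <- function(par) {
    w <- softmax(par[-(1:2)])
    sum((y[train] - par[1] - par[2] * as.vector(Q[train, ] %*% w))^2)
  }
  opt <- optim(c(mean(y[train]), 0, rep(0, c_mix)), rss,
               method = "L-BFGS-B", lower = c(-Inf, 0, rep(-Inf, c_mix)))
  w <- softmax(opt$par[-(1:2)])
  wqs <- as.vector(Q[-train, ] %*% w)
  c(beta1 = unname(coef(lm(y[-train] ~ wqs))["wqs"]),
    setNames(w, colnames(Q)))
}

# Repeat the random 40-60 partition R times and collect the estimates.
R <- 25
reps <- t(replicate(R, fit_once(sample(n, size = 0.4 * n))))

apply(reps, 2, median)                      # final estimates
apply(reps, 2, quantile, c(0.025, 0.975))   # spread across holdouts
```

Each row of `reps` holds the estimates from one partition, so the columns approximate the sampling distributions from which the final mean or median is taken.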

4.2.4 Quantile G-computation

A recent paper by Keil et al. (2020) introduced an additional modeling technique for environmental mixtures that builds on WQS regression, integrating its estimation procedure with g-computation. This approach, called quantile-based g-computation, estimates the overall mixture effect with the same procedure used by WQS, but estimates the parameters of a marginal structural model rather than of a standard regression. In this way, under common assumptions in causal inference such as exchangeability, causal consistency, positivity, no interference, and correct model specification, the model also improves the causal interpretation of the overall effect. Importantly, the procedure also allegedly overcomes the unidirectionality assumption, and the flexibility of marginal structural models allows incorporating non-linearities in the contribution of each exposure to the score. Additional details on the models can be found in the original paper or in the accompanying R vignette.
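For intuition, the simplest case can be reproduced in base R. The sketch below assumes a linear model with additive terms for the quantized exposures, in which case the overall effect is the sum of the exposure coefficients and the weights are the coefficients rescaled within each direction. The simulated data and all object names are illustrative assumptions, and the code does not use the authors' package.

```r
set.seed(2020)
n <- 500
X <- matrix(rnorm(n * 3), ncol = 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))
y <- 0.5 * X[, 1] + 0.2 * X[, 2] - 0.3 * X[, 3] + rnorm(n)  # simulated truth

# Quartile scores coded 0..3
Xq <- as.data.frame(apply(X, 2, function(x)
  as.integer(cut(x, quantile(x, 0:4 / 4), include.lowest = TRUE)) - 1L))

fit <- lm(y ~ x1 + x2 + x3, data = Xq)
b <- coef(fit)[c("x1", "x2", "x3")]

# Overall mixture effect: expected change in y when every exposure
# increases by one quantile at the same time.
psi <- sum(b)

# Weights are defined separately within each direction.
pos_w <- b[b > 0] / sum(b[b > 0])
neg_w <- b[b < 0] / sum(b[b < 0])

# The same quantity by g-computation: predict with all exposures set to
# quantile 1 versus quantile 0 and take the difference.
d1 <- data.frame(x1 = 1, x2 = 1, x3 = 1)
d0 <- data.frame(x1 = 0, x2 = 0, x3 = 0)
psi_gcomp <- unname(predict(fit, d1) - predict(fit, d0))

c(psi = psi, psi_gcomp = psi_gcomp)
```

Because no sign constraint is placed on the coefficients, exposures can contribute in both directions; the positive and negative weights each sum to one but are not comparable across directions.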

4.2.5 WQS regression in R

WQS is available in the R package gWQS (standing for generalized WQS). Documentation and guidelines are provided with the package. The recently developed quantile G-computation approach is instead available in the qgcomp package.

Fitting WQS in R will require some additional data management. First of all, both gWQS and qgcomp will require an object with the names of the exposures, rather than a matrix with the exposures themselves.

exposure <- names(data2[, 3:16])

The following lines fit a WQS regression model for the positive direction, with a 40-60 training-validation split, and without adjusting for covariates. Only 10 bootstrap samples are used here to keep the example fast. The reader can refer to the package documentation for details on all available options.

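library(gWQS)
# b1_pos = TRUE: positive direction; b1_constr = FALSE: no constraint on beta1
results1 <- gwqs(y ~ wqs, mix_name = exposure, data = data2, q = 4,
                 validation = 0.6, b = 10, b1_pos = TRUE, b1_constr = FALSE,
                 family = "gaussian", seed = 123)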

After fitting the model, this line will produce a barplot of the weights; the summary of results (overall effect and weights estimation) is presented in Table 4.1 and Figure 4.1.

gwqs_barplot(results1, tau=NULL)

Figure 4.1: WQS: weights estimation in the simulated dataset

Table 4.1: WQS: weights estimation in the simulated dataset

mix_name    mean_weight
x12         0.2080444
x5          0.1990505
x4          0.1860921
x6          0.1190146
x8          0.1120772
x2          0.0586329
x1          0.0518465
x9          0.0295788
x14         0.0224343
x7          0.0074698
x13         0.0057590
x3          0.0000000
x10         0.0000000
x11         0.0000000
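As a short worked example of the a priori selection threshold, the weights reported in Table 4.1 can be compared with the reciprocal of the number of mixture components, that is, the weight each component would receive if all contributed equally. The choice of \(1/c\) as the cutoff is an assumption made here for illustration; other thresholds can be justified.

```r
# Mean weights copied from Table 4.1
weights <- c(x12 = 0.2080444, x5 = 0.1990505, x4 = 0.1860921,
             x6 = 0.1190146, x8 = 0.1120772, x2 = 0.0586329,
             x1 = 0.0518465, x9 = 0.0295788, x14 = 0.0224343,
             x7 = 0.0074698, x13 = 0.0057590, x3 = 0,
             x10 = 0, x11 = 0)

# Equal-weight benchmark: 1 / 14, about 0.071
tau <- 1 / length(weights)

# Components whose weight exceeds the benchmark
selected <- names(weights)[weights > tau]
selected
# "x12" "x5" "x4" "x6" "x8"
```

With this cutoff, five of the fourteen components are flagged as relevant contributors to the positive index.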

To estimate the negative index, still without a direct constraint on the actual \(\beta_1\), we change the b1_pos option to FALSE. In this situation, all bootstrap samples provide a positive coefficient. This suggests that all mixture components have a positive (or null) effect. Even constraining the coefficient would likely make no difference in this case: the coefficients would either all be close to 0, or the model would not converge.

To adjust the positive WQS regression model for confounders we can add them in the model as presented here:

results1_0_adj <- gwqs(y ~ wqs + z1 + z2 + z3, mix_name = exposure, data = data2,
                       q = 4, validation = 0.6, b = 10, b1_pos = TRUE,
                       b1_constr = FALSE, family = "gaussian", seed = 123)
gwqs_barplot(results1_0_adj, tau=NULL)

Figure 4.2: WQS: weights estimation with covariates adjustment in the simulated dataset

After adjustment, the association is largely attenuated, and the weights of the most important contributors change in both magnitude and ranking (Figure 4.2). This implies that the three confounders affect each of the components differently (e.g. the contribution of \(X_6\) was attenuated before adjusting, while the contribution of \(X_4\) was overestimated).

The following lines will instead fit a quantile G-computation model. The command requires the model formula listing the exposures, the name of the mixture object, the data, the type of outcome (continuous here), and the number of quantiles (quartiles here). Weight estimates are presented in Figure 4.3.

library(qgcomp)
qc <- qgcomp(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14,
             expnms = exposure, data = data2, family = gaussian(), q = 4)
plot(qc)

Figure 4.3: qgcomp: weights estimation with covariates adjustment in the simulated dataset

The authors also recommend fitting the model using the bootstrap, which can be achieved with the following command. Note that the number of iterations, here set to 10, should be at least 200. The plot from this model provides the estimate of the overall effect of the mixture (Figure 4.4).

qc.boot <- qgcomp.boot(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14,
                       expnms = exposure, data = data2, family = gaussian(),
                       q = 4, B = 10, seed = 123)
plot(qc.boot)

Figure 4.4: qgcomp: overall mixture effect in the simulated dataset

It is interesting to note that in this situation of high collinearity, qgcomp's results are still affected: we see a strikingly high (and, as we know since the data are simulated, wrong) negative weight for \(X_3\).

As a final note, please consider that both packages are very recent and are constantly updated and revised. One should always refer to the online vignettes and documentation for updates and possible changes in the syntax.

4.2.6 Example from the literature

Thanks to its easy implementation in statistical software and the development of the several extensions discussed above, WQS is rapidly becoming one of the most common techniques used by investigators to evaluate environmental mixtures.

As an illustrative example of how methods and results can be presented, the reader can refer to the paper by Deyssenroth et al. (2018), which evaluates the association between 16 trace metals, measured in post-partum maternal toenails in about 200 pregnant women from the Rhode Island Child Health Study, and small for gestational age (SGA) status. Before fitting WQS, the authors conduct a preliminary analysis using conditional logistic regression, which indicates that effects seem to operate in both directions (Figure 4.5).

Figure 4.5: Conditional logistic regression results from Deyssenroth et al. 2018

As a consequence, WQS results are presented for both the positive and the negative direction, summarizing both weight estimates and total effects in a clear and informative figure (Figure 4.6).

Figure 4.6: WQS results from Deyssenroth et al. 2018


Carrico, Caroline, Chris Gennings, David C Wheeler, and Pam Factor-Litvak. 2015. “Characterization of Weighted Quantile Sum Regression for Highly Correlated Data in a Risk Analysis Setting.” Journal of Agricultural, Biological, and Environmental Statistics 20 (1): 100–120.
Colicino, Elena, Nicolo Foppa Pedretti, Stefanie A Busgang, and Chris Gennings. 2020. “Per-and Poly-Fluoroalkyl Substances and Bone Mineral Density: Results from the Bayesian Weighted Quantile Sum Regression.” Environmental Epidemiology (Philadelphia, Pa.) 4 (3).
Curtin, Paul, Joshua Kellogg, Nadja Cech, and Chris Gennings. 2021. “A Random Subset Implementation of Weighted Quantile Sum (WQSRS) Regression for Analysis of High-Dimensional Mixtures.” Communications in Statistics-Simulation and Computation 50 (4): 1119–34.
Czarnota, Jenna, Chris Gennings, and David C Wheeler. 2015. “Assessment of Weighted Quantile Sum Regression for Modeling Chemical Mixtures and Cancer Risk.” Cancer Informatics 14: CIN–S17295.
Deyssenroth, Maya A, Chris Gennings, Shelley H Liu, Shouneng Peng, Ke Hao, Luca Lambertini, Brian P Jackson, Margaret R Karagas, Carmen J Marsit, and Jia Chen. 2018. “Intrauterine Multi-Metal Exposure Is Associated with Reduced Fetal Growth Through Modulation of the Placental Gene Network.” Environment International 120: 373–81.
Gennings, Chris, Paul Curtin, Ghalib Bello, Robert Wright, Manish Arora, and Christine Austin. 2020. “Lagged WQS Regression for Mixtures with Many Components.” Environmental Research 186: 109529.
Keil, Alexander P, Jessie P Buckley, Katie M O’Brien, Kelly K Ferguson, Shanshan Zhao, and Alexandra J White. 2020. “A Quantile-Based g-Computation Approach to Addressing the Effects of Exposure Mixtures.” Environmental Health Perspectives 128 (4): 047004.
Tanner, Eva M, Carl-Gustaf Bornehag, and Chris Gennings. 2019. “Repeated Holdout Validation for Weighted Quantile Sum Regression.” MethodsX 6: 2855–60.