Weighted Quantile Sum (WQS) regression is a statistical approach for multivariate regression in high-dimensional data with complex correlation patterns commonly encountered in environmental exposures, epi/genomics, and metabolomic studies, among others. The model constructs an empirically weighted index estimating the mixture effect of predictor (e.g., exposure) variables on an outcome, which may then be used in a regression model with relevant covariates to test the association of the index with a dependent variable or outcome. The contribution of each individual predictor to the overall index effect may is assessed by the relative strength of the estimated weights since the components are quantiled and are therefore on the same scale.
The gWQS package extends WQS regression to applications with continuous and categorical outcomes and implements the random subset WQS and the repeated holdout WQS. In practical terms, the primary outputs of an analysis are the parameter estimates and significance tests for the overall index effect of predictor variables, and the estimated weights assigned to each predictor, which identify the relevant contribution of each variable to the relationship between the WQS index and the outcome variable.
For additional theoretical background on WQS regression and its extensions, see the references provided below.
gWQS
packageThe main functions of the gWQS
package are
gwqs
and gwqs_multinom
. The first extends WQS
regression to applications with continuous, categorical and count
outcomes; the second extends WQS regression to applications with
categorical outcomes with more than 2 non-ordinal categories while both
functions include the option rs
that allows to apply a
random subset implementation of WQS and the argument rh
that if set greater than 1 implements a repeated holdout validation
procedure. In this vignette we will only show the application of WQS to
a continuous outcome. We created the wqs_data
dataset
(available once the package is installed and loaded) to show how to use
this function. These data reflect 59 exposure concentrations simulated
from a distribution of 34 PCB exposures and 25 phthalate biomarkers
measured in subjects participating in the NHANES study (2001-2002).
Additionally, 8 outcome measures were simulated applying different
distributions and fixed beta coefficients to the predictors. In
particular y
and yLBX
were simulated from a
normal distribution, ybin
and ybinLBX
from a
binomial distribution, ymultinom
and
ymultinomLBX
from a multinomial distribution and
ycount
and ycountLBX
from a Poisson
distribution. The sex
variable was also simulated to allow
to adjust for a covariate in the model (see the wqs_data
help page for more details). This dataset can thus be used to test the
gWQS
package by analyzing the mixture effect of the
simulated chemicals on the different outcomes, with adjustments for
covariates.
Following the algorithm proposed in Renzetti et al. (2023) we start
fitting a WQS regression with two indices, one exploring the positive
and the second exploring the negative direction specifying the terms
pwqs
and nwqs
, respectively. We also opted for
the repeated holdout validation procedure to get more stable results.
The following script calls a WQS model for a continuous outcome using
the function gwqs
that returns an object of class
gwqs
; the three functions gwqs_barplot
,
gwqs_scatterplot
and gwqs_fitted_vs_resid
allows to plot the figures shown in figure @ref(fig:model1):
## Welcome to Weighted Quantile Sum (WQS) Regression.
## If you are using a Mac you have to install XQuartz.
## You can download it from: https://www.xquartz.org/
library(ggplot2)
library(knitr)
library(kableExtra)
library(reshape2)
# we save the names of the mixture variables in the variable "PCBs"
PCBs <- names(wqs_data)[1:34]
# we run the model and save the results in the variable "results2i"
results2i <- gwqs(yLBX ~ pwqs + nwqs, mix_name = PCBs, data = wqs_data,
q = 10, validation = 0.6, b = 5, b1_pos = TRUE, rh = 5,
family = "gaussian", seed = 2016)
This WQS model tests the relationship between our dependent variable,
y
, and a WQS index estimated from ranking exposure
concentrations in deciles (q = 10
); in the
gwqs
formula the wqs
(when we want to build a
single index) or the pwqs
and nwqs
(when we
want to build a double index) terms must be included as if they were
present in the dataset. The data were divided in 40% of the dataset for
training and 60% for validation (validation = 0.6
) and we
repeated this split procedure 5 times (rh = 5
), and 5
bootstrap samples (b = 5
) for parameter estimation were
assigned (in practical applications we suggest at least 100 repeated
holdout validation and 100 bootstrap samples to be used). We first
examined a bidirectional association including both the positive
(pwqs
) and negative (nwqs
) indices. We linked
our model to a gaussian distribution to test for relationships between
the continuous outcome and exposures (family = "gaussian"
,
other families available within the gwqs
function are
"binomial"
, "poisson"
,
"quasipoisson"
and "negbin"
while the function
gwqs_multinom
allows to fit a WQS regression for
multinomial outcomes), and fixed the seed to 2016 for reproducible
results (seed = 2016
).
To test the statistical significance of the association between the
variables in the model, the following code has to be run as for a
classical R
regression function:
##
## Call:
## gwqs(formula = yLBX ~ pwqs + nwqs, data = wqs_data, mix_name = PCBs,
## rh = 5, b = 5, b1_pos = TRUE, q = 10, validation = 0.6, family = "gaussian",
## seed = 2016)
##
##
## Coefficients:
## Estimate Std. Error 2.5 % 97.5 %
## (Intercept) -4.60584 0.34268 -5.27749 -3.934
## pwqs 1.12430 0.09264 0.94273 1.306
## nwqs -0.08110 0.04300 -0.16539 0.003
##
## (Mean dispersion parameter for gaussian family taken to be 1.211663)
##
## Mean null deviance: 633.40 on 301 degrees of freedom
## Mean residual deviance: 357.82 on 299 degrees of freedom
##
## Mean AIC: 907.27
From this output we can observe a statistically significant
association in the positive (β
= 1.108 95%CI 0.903, 1.314) but not in the negative direction (β = -0.069 95%CI -0.170, 0.031). We
can also draw some useful plots using the following secondary functions
built in the gWQS
package:
# bar plot
gwqs_barplot(results2i)
# scatter plot y vs wqs
gwqs_scatterplot(results2i)
# scatter plot residuals vs fitted values
gwqs_fitted_vs_resid(results2i)
# boxplot of the weights estimated at each repeated holdout step
gwqs_boxplot(results2i)
Figure @ref(fig:model1) A is a barplot showing the weights assigned
to each variable ordered from the highest weight to the lowest of the
positive index (by default). These results indicate that the variables
LBXF07LA
, LBX138LA
and LBXD02LA
are the largest contributors to this mixture effect in the positive
direction while we have different contributors in the negative direction
but there was not a significant association. The dashed red line
represents the cutoff τ (by
default equal to the inverse of the number of elements in the mixture as
suggested in Carrico et al. 2014) to discriminate which elements are of
most importance.
In plot B of figure @ref(fig:model1) we have a representation of the
wqs index vs the outcome (adjusted for the model residual when
covariates are included in the model) that shows the direction and the
shape of the association between the exposure and the outcome. For
example, in this case we observe a linear and positive relationship
between the mixture and the yLBX
variable in a positive
direction and an almost horizontal line in the negative direction
(meaning there is no negative effect of the mixture on the outcome).
In plot C a diagnostic graph of the residuals vs the fitted values is
shown to check if they are randomly spread around zero or if there is a
trend. All these plots are built using the ggplot2
package.
In plot D the boxplots of the distribution of the weights estimated for each repeated holdout is displayed.
To have the exact values of the estimated weights we can apply the
command results2i$final_weights
. The following code shows
the first six highest weights; the full list of weights can be called by
omitting the head function:
## mix_name Estimate pos 2.5% pos 97.5% pos Estimate neg 2.5% neg
## LBXF07LA LBXF07LA 0.15807190 0.10884708 0.19881882 0.022000584 0.009982359
## LBX138LA LBX138LA 0.14395982 0.10345902 0.17900197 0.017483450 0.005799736
## LBXD02LA LBXD02LA 0.12454117 0.09010640 0.14339100 0.023137081 0.012062192
## LBX105LA LBX105LA 0.07604367 0.05422905 0.10173885 0.016357239 0.008697873
## LBXF06LA LBXF06LA 0.06071122 0.03708558 0.07796695 0.009400526 0.002745705
## LBX157LA LBX157LA 0.05662742 0.01624598 0.09803891 0.015256792 0.004842040
## 97.5% neg
## LBXF07LA 0.04126325
## LBX138LA 0.03304435
## LBXD02LA 0.05531871
## LBX105LA 0.03170679
## LBXF06LA 0.01647274
## LBX157LA 0.02962416
These same tables are also shown in the Viewer window through the
functions gwqs_summary_tab
and
gwqs_weights_tab
respectively. Both these two functions use
the package kableExtra
to produce the output. The output
(table @ref(tab:sum1) and @ref(tab:w1)) and respective code is shown
below:
Estimate | Std. Error | 2.5 % | 97.5 % | |
---|---|---|---|---|
(Intercept) | -4.6100 | 0.3430 | -5.280 | -3.93000 |
pwqs | 1.1200 | 0.0926 | 0.943 | 1.31000 |
nwqs | -0.0811 | 0.0430 | -0.165 | 0.00319 |
mf_df <- as.data.frame(signif(coef(summary(results2i)), 3))
kable_styling(kable(mf_df, row.names = TRUE))
mix_name | Estimate pos | 2.5% pos | 97.5% pos | Estimate neg | 2.5% neg | 97.5% neg |
---|---|---|---|---|---|---|
LBXF07LA | 0.15800 | 0.109000 | 0.19900 | 0.0220 | 0.00998 | 0.0413 |
LBX138LA | 0.14400 | 0.103000 | 0.17900 | 0.0175 | 0.00580 | 0.0330 |
LBXD02LA | 0.12500 | 0.090100 | 0.14300 | 0.0231 | 0.01210 | 0.0553 |
LBX105LA | 0.07600 | 0.054200 | 0.10200 | 0.0164 | 0.00870 | 0.0317 |
LBXF06LA | 0.06070 | 0.037100 | 0.07800 | 0.0094 | 0.00275 | 0.0165 |
LBX157LA | 0.05660 | 0.016200 | 0.09800 | 0.0153 | 0.00484 | 0.0296 |
LBXF05LA | 0.02620 | 0.010700 | 0.04000 | 0.0308 | 0.00749 | 0.0740 |
LBX170LA | 0.02260 | 0.012600 | 0.03270 | 0.0190 | 0.00286 | 0.0372 |
LBX187LA | 0.02240 | 0.003810 | 0.04280 | 0.0250 | 0.01310 | 0.0411 |
LBX167LA | 0.02190 | 0.008670 | 0.03310 | 0.0131 | 0.00342 | 0.0260 |
LBXD04LA | 0.02180 | 0.006820 | 0.02980 | 0.0187 | 0.00583 | 0.0369 |
LBX074LA | 0.02040 | 0.003140 | 0.02840 | 0.0171 | 0.01110 | 0.0213 |
LBX153LA | 0.01960 | 0.006700 | 0.03610 | 0.0206 | 0.00715 | 0.0474 |
LBXF04LA | 0.01960 | 0.003870 | 0.04000 | 0.0389 | 0.00524 | 0.0608 |
LBX099LA | 0.01850 | 0.008270 | 0.02680 | 0.0202 | 0.00608 | 0.0380 |
LBXF08LA | 0.01610 | 0.006010 | 0.02710 | 0.0296 | 0.01200 | 0.0416 |
LBX194LA | 0.01600 | 0.003260 | 0.03460 | 0.0209 | 0.00695 | 0.0445 |
LBX180LA | 0.01530 | 0.004500 | 0.03330 | 0.0285 | 0.00224 | 0.0755 |
LBXD05LA | 0.01530 | 0.005750 | 0.02850 | 0.0292 | 0.01240 | 0.0518 |
LBXD01LA | 0.01390 | 0.006260 | 0.03060 | 0.0314 | 0.00367 | 0.0916 |
LBX118LA | 0.01160 | 0.001680 | 0.03020 | 0.0511 | 0.01900 | 0.1410 |
LBXD03LA | 0.01150 | 0.003240 | 0.01810 | 0.0324 | 0.01400 | 0.0845 |
LBXPCBLA | 0.01140 | 0.004070 | 0.02890 | 0.0259 | 0.00408 | 0.0421 |
LBXHXCLA | 0.01020 | 0.002240 | 0.02100 | 0.0327 | 0.00175 | 0.0712 |
LBXF01LA | 0.00868 | 0.000698 | 0.01590 | 0.0372 | 0.00195 | 0.0785 |
LBX199LA | 0.00768 | 0.001160 | 0.01160 | 0.0418 | 0.02940 | 0.0624 |
LBX156LA | 0.00739 | 0.001230 | 0.01460 | 0.0321 | 0.00865 | 0.0493 |
LBX189LA | 0.00715 | 0.003970 | 0.01010 | 0.0223 | 0.00882 | 0.0536 |
LBX196LA | 0.00685 | 0.000644 | 0.01380 | 0.0691 | 0.01460 | 0.1470 |
LBXD07LA | 0.00684 | 0.001390 | 0.01800 | 0.0309 | 0.00706 | 0.0547 |
LBXF09LA | 0.00675 | 0.000416 | 0.01350 | 0.0459 | 0.00409 | 0.1010 |
LBXF03LA | 0.00650 | 0.000810 | 0.01250 | 0.0405 | 0.00984 | 0.0679 |
LBXF02LA | 0.00565 | 0.000386 | 0.01250 | 0.0464 | 0.02810 | 0.0698 |
LBXTCDLA | 0.00226 | 0.000661 | 0.00383 | 0.0451 | 0.02300 | 0.0710 |
final_weight <- results2i$final_weights
final_weight[, -1] <- signif(final_weight[, -1], 3)
kable_styling(kable(final_weight, row.names = FALSE))
The gwqs
function gives back other outputs. If a
repeated holdout was performed, we obtain the dataset
y_wqs_df
that contains the dependent variable (adjusted for
other covariates if present in the model) and the wqs
or
the pwqs
and nwqs
indices depending on how the
model was specified; the dataset used in the analysis that includes the
WQS indices specified in the model (results2i$data
); the
list of vectors containing the cutoffs used to determine the quantiles
of each variable in the mixture (results2i$qi
); a matrix or
a list of two matrices (if two indices were included in the model) with
all the estimated weights for each repeated holdout; the list of the
output from the repeated WQS regressions
(results2i$gwqslist
). Each element of the list contains the
information that we would obtain from a WQS regression with single
split: the vector of the values that indicate whether the solver
converged (0) or not (1) (results2i$gwqslist[[1]]$conv
, the
specified value [[1]]
allows to access the information of
the first model); the matrix or the list of two matrices (if two indices
were included in the model) with all the estimated weights and the
associated β1,
standard errors, statistics and p-values for each bootstrap sample
(results2i$gwqslist[[1]]$bres
); the list of vectors
containing the rows of the subjects included in each bootstrap dataset
(results2i$gwqslist[[1]]$bindex
); a logical vector that
identifies the rows used to estimate the parameters of the final model
(results2i$gwqslist[[1]]$vindex
); the vector of the values
of the objective function at the optimal parameter estimates obtained at
each bootstrap step (results2i$gwqslist[[1]]$objfn_values
)
and any messages from the optim
function
(results2i$gwqslist[[1]]$optim_messages
).
The following script allows to reproduce the figures that are automatically generated using the plots functions:
# bar plot
w_ord <- order(results2i$final_weights$`Estimate pos`)
mean_weight_pos <- results2i$final_weights$`Estimate pos`[w_ord]
mean_weight_neg <- results2i$final_weights$`Estimate neg`[w_ord]
mix_name <- factor(results2i$final_weights$mix_name[w_ord],
levels = results2i$final_weights$mix_name[w_ord])
data_plot <- data.frame(mean_weight = c(mean_weight_pos, mean_weight_neg),
mix_name = rep(mix_name, 2),
index = factor(rep(c("pwqs", "nwqs"), each = length(w_ord)),
levels = c("pwqs", "nwqs")))
ggplot(data_plot, aes(x = mix_name, y = mean_weight)) +
geom_bar(stat = "identity", color = "black") + theme_bw() +
theme(axis.ticks = element_blank(),
axis.title = element_blank(),
axis.text.x = element_text(color='black'),
legend.position = "none") + coord_flip() +
geom_hline(yintercept = 1/length(PCBs), linetype="dashed", color = "red") +
facet_wrap(~ index)
#
# scatter plot y vs wqs
ggplot(melt(results2i$y_wqs_df, measure.vars = c("pwqs", "nwqs")), aes(value, y_adj)) +
geom_point() + facet_wrap(~ variable) + xlab("wqs") +
stat_smooth(method = "loess", se = FALSE, linewidth = 1.5) + theme_bw()
#
# scatter plot residuals vs fitted values
fit_df <- data.frame(fitted = fitted(results2i),
resid = residuals(results2i, type = "response"))
res_vs_fitted <- ggplot(fit_df, aes(x = fitted, y = resid)) + geom_point() +
theme_bw() + xlab("Fitted values") + ylab("Residuals")
We then run three WQS regressions setting three different shrinkage
parameter values: 1. one equal to the magnitude of the AIC of the
regression fitted at step 1 2. one equal to a lower order of magnitude
3. one equal to a greater order of magnitude Additional models can be
fitted following a bisection algorithm setting lambda
between the values with the two smallest AICs.
# we run the model setting the penalization term equal to 90
results2i_l90 <- gwqs(yLBX ~ pwqs + nwqs, mix_name = PCBs, data = wqs_data,
q = 10, validation = 0.6, b = 5, b1_pos = TRUE, rh = 5,
family = "gaussian", seed = 2016, lambda = 90)
# we run the model setting the penalization term equal to 900
results2i_l900 <- gwqs(yLBX ~ pwqs + nwqs, mix_name = PCBs, data = wqs_data,
q = 10, validation = 0.6, b = 5, b1_pos = TRUE, rh = 5,
family = "gaussian", seed = 2016, lambda = 900)
# we run the model setting the penalization term equal to 900
results2i_l9000 <- gwqs(yLBX ~ pwqs + nwqs, mix_name = PCBs, data = wqs_data,
q = 10, validation = 0.6, b = 5, b1_pos = TRUE, rh = 5,
family = "gaussian", seed = 2016, lambda = 9000)
Based on the results obtained from this analysis we can choose the
value of lambda
that minimizes the AIC. In our case,
lambda = 90
is the optimal value as shown by the table
below:
lambda_AIC_2i <- data.frame(lambda = c(0, 90, 900, 9000),
AIC = c(results2i$fit$aic, results2i_l90$fit$aic,
results2i_l900$fit$aic, results2i_l9000$fit$aic))
kable(lambda_AIC_2i) %>% kable_styling()
lambda | AIC |
---|---|
0 | 907.2686 |
90 | 893.9336 |
900 | 902.0220 |
9000 | 968.1171 |
The results still show a significant positive association and a non-significant negative effect of the mixture on the outcome:
##
## Call:
## gwqs(formula = yLBX ~ pwqs + nwqs, data = wqs_data, mix_name = PCBs,
## rh = 5, b = 5, b1_pos = TRUE, q = 10, validation = 0.6, family = "gaussian",
## seed = 2016, lambda = 90)
##
##
## Coefficients:
## Estimate Std. Error 2.5 % 97.5 %
## (Intercept) -4.104784 0.395893 -4.880735 -3.329
## pwqs 0.939062 0.078724 0.784762 1.093
## nwqs -0.007621 0.047690 -0.101094 0.086
##
## (Mean dispersion parameter for gaussian family taken to be 1.160095)
##
## Mean null deviance: 633.40 on 301 degrees of freedom
## Mean residual deviance: 341.86 on 299 degrees of freedom
##
## Mean AIC: 893.93
The identification and the ranking of the signnificant weights did not changed compared to the non-penalized WQS but we can appreciate how the non relevant weights have shrunk towards zero
Since the mixture did not show a negative association with the outcome we can fit a final model with a single positive index and follow the same procedure:
results1i <- gwqs(yLBX ~ wqs, mix_name = PCBs, data = wqs_data,
q = 10, validation = 0.6, b = 5, b1_pos = TRUE, rh = 5,
family = "gaussian", seed = 2016)
# we run the model setting the penalization term equal to 90
results1i_l90 <- gwqs(yLBX ~ wqs, mix_name = PCBs, data = wqs_data,
q = 10, validation = 0.6, b = 5, b1_pos = TRUE, rh = 5,
family = "gaussian", seed = 2016, lambda = 90)
# we run the model setting the penalization term equal to 900
results1i_l900 <- gwqs(yLBX ~ wqs, mix_name = PCBs, data = wqs_data,
q = 10, validation = 0.6, b = 5, b1_pos = TRUE, rh = 5,
family = "gaussian", seed = 2016, lambda = 900)
# we run the model setting the penalization term equal to 900
results1i_l9000 <- gwqs(yLBX ~ wqs, mix_name = PCBs, data = wqs_data,
q = 10, validation = 0.6, b = 5, b1_pos = TRUE, rh = 5,
family = "gaussian", seed = 2016, lambda = 9000)
Based on the results obtained also in this case we can observe that
when lambda = 90
we have the lower AIC:
lambda_AIC_1i <- data.frame(lambda = c(0, 90, 900, 9000),
AIC = c(results1i$fit$aic, results1i_l90$fit$aic,
results1i_l900$fit$aic, results1i_l9000$fit$aic))
kable(lambda_AIC_1i) %>% kable_styling()
lambda | AIC |
---|---|
0 | 900.7376 |
90 | 889.1634 |
900 | 897.3578 |
9000 | 959.9152 |
We can then choose the WQS regression with single index and a
penalization term set to lambda = 90
with results displayed
below and confirm a positive significant association of the mixture with
the outcome and LBXF07LA
, LBX138LA
and
LBXD02LA
as the elements that contribute the most in this
association:
##
## Call:
## gwqs(formula = yLBX ~ wqs, data = wqs_data, mix_name = PCBs,
## rh = 5, b = 5, b1_pos = TRUE, q = 10, validation = 0.6, family = "gaussian",
## seed = 2016, lambda = 90)
##
##
## Coefficients:
## Estimate Std. Error 2.5 % 97.5 %
## (Intercept) -3.99505 0.28251 -4.54878 -3.441
## wqs 0.90659 0.06598 0.77728 1.036
##
## (Mean dispersion parameter for gaussian family taken to be 1.14506)
##
## Mean null deviance: 633.40 on 301 degrees of freedom
## Mean residual deviance: 338.46 on 300 degrees of freedom
##
## Mean AIC: 889.16
Carrico C, Gennings C, Wheeler D, Factor-Litvak P. Characterization of a weighted quantile sum regression for highly correlated data in a risk analysis setting. J Agricul Biol Environ Stat. 2014:1-21. ISSN: 1085-7117. DOI: 10.1007/ s13253-014-0180-3. http://dx.doi.org/10.1007/s13253-014-0180-3.
Czarnota J, Gennings C, Colt JS, De Roos AJ, Cerhan JR, Severson RK, Hartge P, Ward MH, Wheeler D. 2015. Analysis of environmental chemical mixtures and non-Hodgkin lymphoma risk in the NCI-SEER NHL study. Environmental Health Perspectives.
Czarnota J, Gennings C, Wheeler D. 2015. Assessment of weighted quantile sum regression for modeling chemical mixtures and cancer risk. Cancer Informatics, 2015:14(S2) 159-171.
Curtin P, Kellogg J, Cech N, and Gennings C. A random subset implementation of weighted quantile sum (wqsrs) regression for analysis of high-dimensional mixtures. Communications in Statistics - Simulation and Computation, 0(0):1–16, 2019. doi: 10.1080/03610918.2019.1577971.
Tanner EM, Bornehag CG, and Gennings C. Repeated holdout validation for weighted quantile sum regression. MethodsX, 6:2855 – 2860, 2019. doi: https://doi.org/10.1016/j.mex.2019.11.008.
Renzetti S, Gennings C and Calza S (2023) A weighted quantile sum regression with penalized weights and two indices. Front Public Health 11:1151821. doi: 10.3389/fpubh.2023.1151821.
This package was developed at the CHEAR Data Center (Dept. of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai) with funding and support from NIEHS (U2C ES026555-01) with additional support from the Empire State Development Corporation.