8.3 Weighted descriptive statistics

There are a number of survey functions for computing weighted descriptive statistics, as well as a gtsummary (Sjoberg et al. 2021, 2023) function to conveniently create a “Table 1”. We will compute these statistics overall and by exposure or outcome.

8.3.1 Overall

Example 8.1 (continued): Compute weighted descriptive statistics that estimate values for the population represented by NHANES 2017-2018 participants with non-missing values on all our analysis variables. Since the outcome variable is fasting glucose, use the design corresponding to the fasting subsample with complete data and positive weights (design.FST.nomiss).

Examples are shown first for two variables, one continuous and one categorical. Later, we will use gtsummary to create a Table 1 of weighted descriptive statistics for all the variables.

For continuous variables, use svymean(), svyvar(), and svyquantile() to compute the weighted mean, standard deviation, median, and interquartile range, and svyhist() and svyboxplot() to plot weighted histograms and boxplots (Figures 8.1 and 8.2, respectively). When using confint() to get a 95% confidence interval, add df = degf(design) to use the design DF.

NOTE: If not using a complete case analysis and a variable has missing values, add na.rm = T to each svymean or svyvar call to avoid returning a missing value. When using confint(svymean()), the na.rm=T must go in the svymean() call, not the confint() call (e.g., confint(svymean(~X, design, na.rm=T), df = degf(design))).

# Weighted mean, standard deviation, and 95% CI
cbind(
  "wMEAN" = svymean(    ~LBDGLUSI, design.FST.nomiss),
  "wSD"   = sqrt(svyvar(~LBDGLUSI, design.FST.nomiss)[1]),
        confint(svymean(~LBDGLUSI, design.FST.nomiss), df = degf(design.FST.nomiss))
)
##          wMEAN   wSD 2.5 % 97.5 %
## LBDGLUSI 6.106 1.763 5.976  6.236
# Weighted median, 95% CI, and standard error
# df = degf(design) is already the default for svyquantile
svyquantile(~LBDGLUSI, design.FST.nomiss, 0.50)[[1]]
##     quantile ci.2.5 ci.97.5      se
## 0.5     5.72   5.66    5.83 0.03988
# Weighted interquartile range
c("wIQR" = 
    svyquantile(~LBDGLUSI, design.FST.nomiss, 0.75)[[1]][1, "quantile"] - 
    svyquantile(~LBDGLUSI, design.FST.nomiss, 0.25)[[1]][1, "quantile"])
## wIQR 
## 0.83
par(mfrow=c(1,2))
# Unweighted probability histogram
hist(nhanes.mod$LBDGLUSI[nhanes.mod$nomiss], probability = T,
     xlab = "Fasting Glucose (mmol/L)", main = "Unweighted",
     ylim = c(0,0.35))
# Weighted probability histogram
svyhist(   ~LBDGLUSI, design.FST.nomiss,
     xlab = "Fasting Glucose (mmol/L)", main = "Weighted",
     ylim = c(0,0.35))
Unweighted vs. weighted histograms

Figure 8.1: Unweighted vs. weighted histograms

par(mfrow=c(1,2))
# Unweighted boxplot
boxplot(nhanes.mod$LBDGLUSI[nhanes.mod$nomiss],
     ylab = "Fasting Glucose (mmol/L)",
     main = "Unweighted")
# Weighted boxplot
svyboxplot(~LBDGLUSI~1, design.FST.nomiss, all.outliers = T,
     ylab = "Fasting Glucose (mmol/L)",
     main = "Weighted")
Unweighted vs. weighted boxplots

Figure 8.2: Unweighted vs. weighted boxplots

For categorical variables, use svytotal() or svytable() to estimate population totals. In a sample we compute sample frequencies but the corresponding population quantity is a population total. Use svymean(), svytable() or svyciprop() to estimate population proportions. Use barplot() on the output of svytable() to plot a weighted bar chart (Figure 8.3).

NOTE: If not using a complete case analysis and a variable has missing values, add na.rm = T to each svytotal() or svymean() call to avoid returning a missing value.

# Weighted total with SE of total
svytotal(~race_eth, design.FST.nomiss)
##                                total       SE
## race_ethHispanic            32712402  4552659
## race_ethNon-Hispanic White 137704704 10603571
## race_ethNon-Hispanic Black  23063852  3388626
## race_ethNon-Hispanic Other  21072543  2998573
# Weighted total (no SE)
svytable(~race_eth, design.FST.nomiss)
## race_eth
##           Hispanic Non-Hispanic White Non-Hispanic Black Non-Hispanic Other 
##           32712402          137704704           23063852           21072543
# Weighted proportion using svymean()
svymean(~race_eth, design.FST.nomiss)
##                              mean   SE
## race_ethHispanic           0.1525 0.02
## race_ethNon-Hispanic White 0.6418 0.03
## race_ethNon-Hispanic Black 0.1075 0.02
## race_ethNon-Hispanic Other 0.0982 0.01
# Weighted proportion using svytable by normalizing the total to 1
svytable(~race_eth, design.FST.nomiss, Ntotal=1)
## race_eth
##           Hispanic Non-Hispanic White Non-Hispanic Black Non-Hispanic Other 
##            0.15247            0.64182            0.10750            0.09822
# Weighted proportion with 95% CI
# df = degf(design) is already the default for svyciprop
svyciprop(~I(race_eth=="Hispanic"), design.FST.nomiss)
##                                  2.5% 97.5%
## I(race_eth == "Hispanic") 0.152 0.113   0.2
# (results for other 3 levels not shown)
svyciprop(~I(race_eth=="Non-Hispanic White"), design.FST.nomiss)
svyciprop(~I(race_eth=="Non-Hispanic Black"), design.FST.nomiss)
svyciprop(~I(race_eth=="Non-Hispanic Other"), design.FST.nomiss)
par(mfrow=c(2,1))
# Unweighted barplot
barplot(prop.table(table(nhanes.mod$race_eth[nhanes.mod$nomiss])),
        ylab = "Proportion", xlab = "Race/Ethnicity",
        main = "Unweighted", cex.names = 0.65)
# Weighted barplot
mybar <- svytable(~race_eth, design.FST.nomiss, Ntotal=1)
barplot(mybar,
        ylab = "Proportion", xlab = "Race/Ethnicity",
        main = "Weighted", cex.names = 0.65)
Unweighted vs. weighted bar charts

Figure 8.3: Unweighted vs. weighted bar charts

NHANES over-sampled minority groups in order to have sufficient sample sizes for subgroup analyses. Thus, there are large differences between the unweighted and weighted distributions of race/ethnicity. Unweighted proportions reflect the composition of the sample, whereas weighted proportions estimate the population distribution.

Finally, use tbl_svysummary() from the gtsummary library to produce a Table 1 of weighted descriptive statistics. Instead of starting with a dataset as we did in unweighted analyses, start with the design object (design.FST.nomiss). For categorical variables, \(N\) and \(n\) values in the table are estimated population totals – when we did unweighted analyses, these were sample sizes.

The results are shown in Table 8.1.

library(gtsummary)
design.FST.nomiss %>% 
  tbl_svysummary(
    # Use include to select variables
    include = c(LBDGLUSI, BMXWAIST, RIDAGEYR,
                smoker, RIAGENDR, race_eth, income),
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(LBDGLUSI ~ c(2, 2),
                  BMXWAIST ~ c(1, 1),
                  RIDAGEYR ~ c(1, 1),
                  all_categorical() ~ c(0, 1))
  ) %>%
  modify_header(label = "**Variable**") %>%
  modify_caption("Weighted descriptive statistics") %>%
  bold_labels()
Table 8.1: Weighted descriptive statistics
Variable N = 214,553,5011
LBDGLUSI 6.11 (1.76)
BMXWAIST 100.2 (17.1)
RIDAGEYR 47.0 (17.5)
smoker
    Never 123,589,210 (57.6%)
    Past 55,919,877 (26.1%)
    Current 35,044,414 (16.3%)
RIAGENDR
    Male 104,286,253 (48.6%)
    Female 110,267,248 (51.4%)
race_eth
    Hispanic 32,712,402 (15.2%)
    Non-Hispanic White 137,704,704 (64.2%)
    Non-Hispanic Black 23,063,852 (10.7%)
    Non-Hispanic Other 21,072,543 (9.8%)
income
    < $25,000 38,141,114 (17.8%)
    $25,000 to < $55,000 57,342,659 (26.7%)
    $55,000+ 119,069,729 (55.5%)
1 Mean (SD); n (%)

8.3.2 By exposure or outcome

Example 8.1 (continued): Compute weighted descriptive statistics by smoking status.

For svymean(), svyvar(), svyquantile(), svytotal(), and svyciprop(), use the wrapper function svyby() to compute statistics at each level of another variable. For svytable(), however, add the by variable to the formula.

# Continuous variables, by smoking status
# Weighted mean, standard deviation, and 95% CI
WTD <- cbind(
  svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svymean)[, 1:2],
  svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyvar)[ ,   2],
  confint(svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svymean),
          df = degf(design.FST.nomiss))
)
WTD[, 3] <- sqrt(WTD[, 3])
names(WTD)[3] <- paste(names(WTD)[2], "wSD")
names(WTD)[2] <- paste(names(WTD)[2], "wMEAN")
WTD
##          smoker LBDGLUSI wMEAN LBDGLUSI wSD 2.5 % 97.5 %
## Never     Never          5.965        1.543 5.796  6.135
## Past       Past          6.457        2.066 6.297  6.617
## Current Current          6.041        1.894 5.798  6.285
# Weighted median, standard error, and 95% CI
# df = degf(design) is already the default for svyquantile
#      but not for confint
cbind(
  svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.50),
  confint(svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.50),
          df = degf(design.FST.nomiss))
)
##          smoker LBDGLUSI se.LBDGLUSI 2.5 % 97.5 %
## Never     Never     5.66     0.03753 5.580  5.740
## Past       Past     5.83     0.03988 5.745  5.915
## Current Current     5.66     0.06568 5.520  5.800
# Weighted interquartile range
Q75 <- as.data.frame(
  svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.75)
  )[,1:2]
Q25 <- as.data.frame(
  svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.25)
  )[,1:2]
names(Q75)[2] <- paste(names(Q75)[2], "wQ75")
names(Q25)[2] <- paste(names(Q25)[2], "wQ25")
WIQR <- merge(Q75, Q25)
WIQR$wIQR <- WIQR[,2] - WIQR[,3]
WIQR
##    smoker LBDGLUSI wQ75 LBDGLUSI wQ25 wIQR
## 1 Current          6.11          5.27 0.84
## 2   Never          6.11          5.27 0.84
## 3    Past          6.49          5.50 0.99
# Categorical variables, by smoking status
# Weighted total
svytable(~race_eth + smoker, design.FST.nomiss)
##                     smoker
## race_eth                Never     Past  Current
##   Hispanic           20524606  7861212  4326583
##   Non-Hispanic White 75120526 40106079 22478099
##   Non-Hispanic Black 15137153  3487312  4439386
##   Non-Hispanic Other 12806924  4465273  3800346
# Weighted total with SE of total
svyby(~race_eth, ~smoker, design.FST.nomiss, svytotal)
# (results not shown)

Be careful when normalizing svytable() to get proportions. Using Ntotal=1 as we did previously normalizes the frequencies to the overall total. Rather, we want to normalize to the column totals to get proportions within each level of smoking status. This is accomplished using prop.table(, margin = 2). Optionally, use addmargins(, 1) to confirm that each column sums to 1.

# Weighted proportion
addmargins(
  prop.table(svytable(~race_eth + smoker, design.FST.nomiss), margin = 2)
  , 1)
##                     smoker
## race_eth               Never    Past Current
##   Hispanic           0.16607 0.14058 0.12346
##   Non-Hispanic White 0.60782 0.71721 0.64142
##   Non-Hispanic Black 0.12248 0.06236 0.12668
##   Non-Hispanic Other 0.10362 0.07985 0.10844
##   Sum                1.00000 1.00000 1.00000
# Weighted proportion with SE
svyby(~I(race_eth=="Hispanic"), ~smoker, design.FST.nomiss, svyciprop)
##          smoker I(race_eth == "Hispanic") se.as.numeric(I(race_eth == "Hispanic"))
## Never     Never                    0.1661                                  0.02650
## Past       Past                    0.1406                                  0.01976
## Current Current                    0.1235                                  0.02512
# (results for other levels not shown)
svyby(~I(race_eth=="Non-Hispanic White"), ~smoker, design.FST.nomiss, svyciprop)
svyby(~I(race_eth=="Non-Hispanic Black"), ~smoker, design.FST.nomiss, svyciprop)
svyby(~I(race_eth=="Non-Hispanic Other"), ~smoker, design.FST.nomiss, svyciprop)

Finally, use tbl_svysummary() from the gtsummary library to produce a Table 1 of weighted descriptive statistics by smoking status. Be careful, however, to use a character version of the smoking status variable; using the factor smoker will lead to an error. When we created this dataset in Section 8.2, we created smoker_char, a character formatted version of smoker.

The results are shown in Table 8.2.

library(gtsummary)
design.FST.nomiss %>% 
  tbl_svysummary(
    # Use a character variable here. A factor leads to an error
    by = smoker_char,
    # Use include to select variables
    include = c(LBDGLUSI, BMXWAIST, RIDAGEYR,
                RIAGENDR, race_eth, income),
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(LBDGLUSI ~ c(2, 2),
                  BMXWAIST ~ c(1, 1),
                  RIDAGEYR ~ c(1, 1),
                  all_categorical() ~ c(0, 1))
  ) %>%
  modify_header(label = "**Variable**",
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)") %>%
  modify_caption("Weighted descriptive statistics, by smoking status") %>%
  bold_labels()
Table 8.2: Weighted descriptive statistics, by smoking status
Variable 1 Never N = 123589210 (57.6%)1 2 Past N = 55919877 (26.1%)1 3 Current N = 35044414 (16.3%)1
LBDGLUSI 5.97 (1.54) 6.46 (2.07) 6.04 (1.89)
BMXWAIST 98.6 (17.0) 104.0 (16.6) 99.9 (17.4)
RIDAGEYR 45.4 (17.7) 51.5 (18.0) 45.4 (14.4)
RIAGENDR


    Male 51,169,518 (41.4%) 34,805,126 (62.2%) 18,311,610 (52.3%)
    Female 72,419,692 (58.6%) 21,114,752 (37.8%) 16,732,805 (47.7%)
race_eth


    Hispanic 20,524,606 (16.6%) 7,861,212 (14.1%) 4,326,583 (12.3%)
    Non-Hispanic White 75,120,526 (60.8%) 40,106,079 (71.7%) 22,478,099 (64.1%)
    Non-Hispanic Black 15,137,153 (12.2%) 3,487,312 (6.2%) 4,439,386 (12.7%)
    Non-Hispanic Other 12,806,924 (10.4%) 4,465,273 (8.0%) 3,800,346 (10.8%)
income


    < $25,000 18,081,806 (14.6%) 9,269,756 (16.6%) 10,789,552 (30.8%)
    $25,000 to < $55,000 31,806,525 (25.7%) 15,168,200 (27.1%) 10,367,933 (29.6%)
    $55,000+ 73,700,879 (59.6%) 31,481,921 (56.3%) 13,886,929 (39.6%)
1 Mean (SD); n (%)

References

Sjoberg, Daniel D., Joseph Larmarange, Michael Curry, Jessica Lavery, Karissa Whiting, and Emily C. Zabor. 2023. Gtsummary: Presentation-Ready Data Summary and Analytic Result Tables. https://github.com/ddsjoberg/gtsummary.
Sjoberg, Daniel D., Karissa Whiting, Michael Curry, Jessica A. Lavery, and Joseph Larmarange. 2021. “Reproducible Summary Tables with the Gtsummary Package.” The R Journal 13: 570–80. https://doi.org/10.32614/RJ-2021-053.