8.5 Domain (subgroup) analysis

Sometimes we are interested in estimating quantities among subgroups of the population, such as “females age 45 years and older.” This is called a domain analysis or subpopulation analysis. In an unweighted analysis, one would subset the dataset to filter out those who do not meet the criteria and then run the analysis on the resulting subset of observations. In a weighted analysis, however, removing observations from the dataset results in incorrect standard errors, confidence intervals, and p-values. The information about the sample design found in the excluded cases is needed to incorporate the sampling design correctly. Instead, use the subset() function to subset the design, that is, to specify the domain of interest in the design object. Rather than actually removing observations, this retains all observations and lets survey know what subset you are interested in. This is exactly what we did earlier, using a complete case indicator variable (nomiss) to subset on those with non-missing values. However, here, we show how to apply this method more generally to any subpopulation.

Example 8.1 (continued): Repeat the creation of a Table 1 and the linear regression from Example 8.1 among females age 45 years and older. Due to the inclusion criteria, gender must be removed from the analysis since it no longer varies between individuals.

Previously, we used subset() to specify a domain with no missing values and positive weights. Here, we add additional conditions to limit to females age 45 years and older.

design.FST.domain <- subset(design.FST,
                            nomiss & RIAGENDR == "Female" & RIDAGEYR >= 45)

NOTE: If the process of specifying your domain of interest involves creating new variables, you must re-create your design object to make those variables available to the design before using subset() on the survey.design object.

Use tbl_svysummary() to create a table of descriptive statistics. The results are shown in Table 8.4.

library(gtsummary)
design.FST.domain %>% 
  tbl_svysummary(
    # Use a character variable here. A factor leads to an error
    by = smoker_char,
    # Use include to select variables
    include = c(LBDGLUSI, BMXWAIST, RIDAGEYR, race_eth, income),
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(LBDGLUSI ~ c(2, 2),
                  BMXWAIST ~ c(1, 1),
                  RIDAGEYR ~ c(1, 1),
                  all_categorical() ~ c(0, 1))
  ) %>%
  modify_header(label = "**Variable**",
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)") %>%
  modify_caption("Weighted descriptive statistics, by smoking status 
                 (Females age 45y and older)") %>%
  bold_labels()
Table 8.4: Weighted descriptive statistics, by smoking status (Females age 45y and older)
Variable 1 Never N = 40923599 (65.8%)1 2 Past N = 13054149 (21.0%)1 3 Current N = 8207032 (13.2%)1
LBDGLUSI 6.12 (1.55) 6.46 (2.38) 6.20 (1.42)
BMXWAIST 98.1 (15.1) 104.4 (18.0) 98.6 (17.1)
RIDAGEYR 60.9 (10.5) 62.0 (10.3) 57.0 (7.9)
race_eth


    Hispanic 4,701,252 (11.5%) 1,219,766 (9.3%) 841,551 (10.3%)
    Non-Hispanic White 28,446,162 (69.5%) 9,779,559 (74.9%) 5,853,934 (71.3%)
    Non-Hispanic Black 4,125,440 (10.1%) 1,217,709 (9.3%) 829,841 (10.1%)
    Non-Hispanic Other 3,650,745 (8.9%) 837,114 (6.4%) 681,706 (8.3%)
income


    < $25,000 5,158,320 (12.6%) 3,323,765 (25.5%) 3,168,536 (38.6%)
    $25,000 to < $55,000 10,585,563 (25.9%) 3,199,300 (24.5%) 1,753,277 (21.4%)
    $55,000+ 25,179,717 (61.5%) 6,531,083 (50.0%) 3,285,219 (40.0%)
1 Mean (SD); n (%)

Fit the model using the subsetted domain design.

fit.ex8.1.domain <- svyglm(LBDGLUSI ~ BMXWAIST + smoker + RIDAGEYR +
                      race_eth + income,
                 family=gaussian(), design=design.FST.domain)

Finally, use tbl_regression() to create a table of regression coefficients. The results are shown in Table 8.5.

fit.ex8.1.domain %>%
  tbl_regression(intercept = T,
                 estimate_fun = function(x) style_sigfig(x, digits = 3),
                 pvalue_fun   = function(x) style_pvalue(x, digits = 3),
                 label  = list(BMXWAIST ~ "Waist Circumference (cm)",
                               smoker   ~ "Smoking Status",
                               RIDAGEYR ~ "Age (years)",
                               race_eth ~ "Race/ethnicity",
                               income   ~ "Annual Income")) %>%
  add_global_p(keep = T, test.statistic = "F"
  # Uncomment to use design df for Type III test p-values
  # (but then they will not match the regression term p-values for
  #  binary predictors)
  #             , error.df = degf(fit.ex8.1$survey.design)
                                 ) %>%
  modify_caption("Weighted linear regression results for fasting glucose (mmol/L) 
                 (Females age 45y and older)")
Table 8.5: Weighted linear regression results for fasting glucose (mmol/L) (Females age 45y and older)
Characteristic Beta 95% CI1 p-value
(Intercept) 3.15 1.85, 4.46 0.001
Waist Circumference (cm) 0.031 0.015, 0.047 0.003
Smoking Status

0.731
    Never
    Past 0.118 -0.367, 0.604 0.573
    Current 0.037 -0.559, 0.632 0.885
Age (years) 0.008 -0.012, 0.028 0.362
Race/ethnicity

0.017
    Hispanic
    Non-Hispanic White -0.480 -0.924, -0.036 0.038
    Non-Hispanic Black -0.391 -1.02, 0.234 0.177
    Non-Hispanic Other 0.071 -0.476, 0.618 0.762
Annual Income

0.436
    < $25,000
    $25,000 to < $55,000 -0.124 -0.501, 0.252 0.450
    $55,000+ -0.255 -0.706, 0.196 0.216
1 CI = Confidence Interval