8.2 Specifying the survey design

The first step when using the survey package is to specify the variables in the dataset that define the components of the complex survey design (e.g., strata, PSUs, sampling weights). This information is needed by all the other survey analysis functions, and is stored in a survey.design object which is a required argument in all the survey functions.

Example 8.1: Specify the survey design for the full NHANES 2017-2018 dataset (nhanes1718_rmph.Rdata). This dataset has all the participants, not just a random subsample as was used in earlier chapters, although it still has just a subset of all the NHANES variables. Ultimately, our goal with this example is, as we did in Chapter 5, to carry out a linear regression analysis for the outcome fasting glucose (LBDGLUSI) with the predictors waist circumference (BMXWAIST), smoking status (smoker) , age (RIDAGEYR), gender (RIAGENDR), race/ethnicity (RIDRETH3), and income (income). Unlike in Chapter 5, this analysis will use the entire NHANES 2017-2018 dataset and incorporate the complex survey design in order to produce unbiased population estimates with appropriate estimates of variation.

load("Data/nhanes1718_rmph.Rdata")
nrow(nhanes)
## [1] 9254

When incorporating a complex survey design, always use the full dataset. Do not exclude any observations, not even to apply inclusion/exclusion criteria or to remove cases with missing data. The survey procedures need to have access to the survey design variable values for all observations in order to obtain valid estimates, even estimates in subgroups. However, we can tell R that we want the final analysis to apply only to a subgroup by creating a subpopulation indicator variable and subsetting the design (not the dataset) based on that variable at a later step. In this example, we will subset on those with non-missing data and positive weights. More generally, you can subset to any subpopulation, as discussed in Section 8.5.

For this example, create a complete case indicator variable called nomiss that also conditions on sampling weight > 0. This makes sure descriptive statistics generalize to the same subpopulation as the regression analysis, as well as avoids an error when running some survey functions which do not accept zero weights. If doing a regression without a Table 1 of descriptive statistics, you only need to condition on sampling weight > 0. If using a design in which all sampling weights are positive, however, there is no need to condition on sampling weight > 0.

Also, carry out the following data management steps:

  • Collapse the race/ethnicity variable as we have done with other analyses.
  • Center waist circumference at 100 cm.
  • Create a character version of smoker for use as a by variable in our Table 1 (the function we will use to create this table will expect a character variable, not a factor – see Section 8.3.2).
  • Set missing fasting subsample weights to 0 (svydesign returns an error if there are any missing weights).
nhanes.mod <- nhanes %>% 
  mutate(# Collapse race/ethnicity variable
         race_eth = fct_collapse(RIDRETH3,
          "Hispanic" = c("Mexican American", "Other Hispanic"),
          "Non-Hispanic Other" = c("Non-Hispanic Asian", "Other/Multi")),
         
         # Create a centered waist circumference variable we will use later
         cBMXWAIST = BMXWAIST - 100,
         
         # Create a character version for use as a by variable in tbl_svysummary,
         # and change the values so they are in the correct order when alphabetized
         smoker_char = case_when(smoker == "Never"   ~ "1 Never",
                                 smoker == "Past"    ~ "2 Past",
                                 smoker == "Current" ~ "3 Current"),
         
         # This step is only needed when the variable is labeled.
         # Without this, replace_na() in the next step results in an
         # error for a labeled variable.
         WTSAF2YR = as.numeric(WTSAF2YR),
         
         # Set missing fasting subsample weights to 0 since
         # svydesign returns an error if there are any missing weights.
         WTSAF2YR = replace_na(WTSAF2YR, 0),

         # Complete case and positive weight indicator
         # (a vector of TRUE and FALSE values)
         nomiss = !is.na(LBDGLUSI) & !is.na(BMXWAIST) & !is.na(smoker) &
                  !is.na(RIDAGEYR) & !is.na(RIAGENDR) & !is.na(race_eth) &
                  !is.na(income)   & WTSAF2YR > 0)
# Verify the derivations
table(nhanes$smoker, nhanes.mod$smoker_char, exclude = NULL)

# Subset on missing weights
SUB <- is.na(nhanes$WTSAF2YR)
# Old NAs
summary(nhanes$WTSAF2YR[SUB])
# are now 0 in the new dataset
summary(nhanes.mod$WTSAF2YR[SUB])
# For those with non-missing weights, new - old = 0 (no change)
summary(nhanes$WTSAF2YR[!SUB] - nhanes.mod$WTSAF2YR[!SUB])
# (results not shown)

Next, use svydesign to create the survey.design object. For each of the strata, id, and weights arguments, input a formula (e.g., ~X) specifying the variable that defines the survey strata, PSUs, and sampling weights, respectively.

For NHANES 2017-2018, specify the following three designs: (1) Interview, (2) Examination, and (3) Fasting subsample. Each design has the same strata and PSU variable names, but a different sampling weight variable name. See Section 8.1.1 for definitions of the variables used. As mentioned in that Section, only use the interview weights if all the variables in the analysis were included in the interview; if any were from the examination, use the examination weights unless any were only measured in the fasting subsample in which case use the fasting subsample weights.

library(survey)
options(survey.lonely.psu = "adjust")
options(survey.adjust.domain.lonely = TRUE)

design.INT <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTINT2YR,
                        nest=TRUE, data=nhanes.mod)

design.MEC <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTMEC2YR,
                        nest=TRUE, data=nhanes.mod)

design.FST <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTSAF2YR,
                        nest=TRUE, data=nhanes.mod)

# View basic information about the design
design.FST
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(strata = ~SDMVSTRA, id = ~SDMVPSU, weights = ~WTSAF2YR, 
##     nest = TRUE, data = nhanes.mod)
# View more information about the design
summary(design.FST)
# (results not shown)

Finally, for Example 8.1, use the subset() function to create a design object to be used with analyses of cases with nomiss = TRUE, those with complete data and positive weights.

design.FST.nomiss <- subset(design.FST, nomiss)

Filtering out cases with missing values is not actually necessary for the regression analysis itself, but will ensure that the descriptive statistics and the regression analysis generalize to the same subpopulation. Had we used design.FST instead of design.FST.nomiss, the subpopulation represented by the descriptive statistics for a variable would be those who would have had non-missing values for that variable had they been sampled, regardless of whether or not other variables were missing. With design.FST.nomiss, descriptive statistics for all variables and the regression analysis all generalize to those in the population who would have had non-missing values for all the variables had they been sampled.

NOTES:

  • The global options, survey.lonely.psu = "adjust" and survey.adjust.domain.lonely = TRUE specify how variance estimation will be handled in cases where a stratum contains only one PSU. These options “center the stratum at the population mean rather than the stratum mean” and make the same adjustment within subgroup analyses (see ?surveyoptions). These options only need to be set once per R session.
  • The survey package assumes the id variable has a unique value for every unique PSU. However, in many datasets, including NHANES and NSDUH, the id variable values are nested within strata, duplicated across strata, requiring the use of the the nest=TRUE option in svydesign(). If you leave this option out but the id values are actually nested, R will return the warning Clusters not nested in strata at top level; you may want nest=TRUE.
  • svydesign() does not accept missing weights. That is why we set missing weights to 0 before specifying the design for the fasting subsample.
  • Some functions (e.g., svycoxph()) return an error if there are any weights of 0. That is why we conditioned on sampling weight > 0 in the code above. See also Section 8.7.3.2.
  • lm(), glm(), and coxph() all have a weights argument that can be used to specify a weighted regression model. However, passing sampling weights to that argument will not produce the correct results for a complex survey design since weights in the context of those functions refers to a different kind of weight. Instead, use the methods described in this Chapter, using functions from the survey library, to correctly incorporate the survey design into your analysis.

8.2.1 Design degrees of freedom

Estimated confidence intervals and p-values depend on the degrees of freedom. Both NHANES and NSDUH documentation recommended using the design degrees of freedom (DF), the number of PSUs less the number of strata, to obtain correct measures of variation (NHANES Tutorial: Variance Estimation, accessed February 7, 2023; 2019 NSDUH Public Use File Codebook, page H-2, accessed February 7, 2023).

Where does this quantity come from? “Degrees of freedom” are related to the number of independent pieces of information in a sample. For a simple random sample, the DF is the sample size minus the number of population quantities being estimated. For example, the sample mean has DF equal to the sample size minus 1. However, in many complex survey designs, there are fewer independent pieces of information. Where does the “design DF” come from? Here is one way to think about it. After splitting the population up into strata, NHANES and NSDUH sampled PSUs within strata, that is, they conducted a simple random sample of PSUs within each stratum. Thus, the sample mean of a PSU-level variable within a single stratum has DF = (number of PSUs within the stratum) - 1. Summing that up over strata results in DF = number of PSUs less the number of strata.

For the NHANES 2017-2018 data, the design DF are 15 because there are 30 PSUs and 15 strata. For the 2019 NSDUH data, the design DF are 50 because there are 100 PSUs and 50 strata.

Different statistical software may use different default DF. In R, most survey functions do not use the design DF by default but do have an option to change the DF. The code degf(design) computes the design DF for a given design. In a subgroup analysis (see Section 8.5) the correct design DF is based on the numbers of PSUs and strata in the subsample representing the subgroup of interest, either of which might be smaller than in the full sample. Fortunately, degf() modifies the DF appropriately for an analysis on a subgroup with fewer PSUs and/or strata. See the “Details” section in ?summary.svyglm for more information.