Chapter 13 9. Survey specific functions
As James showed you yesterday if we are using survey data that comes with weights, we need to account for those.
The ‘survey’ package has adapted functions for the different things that we did today. Let’s first create the weights for the data at hand, specify the sample design and store it in an R object that we can then call in our tests.
# add weighting variable
nhanes <- nhanes %>% dplyr::mutate(WTMEC4YR = 0.5 * WTMEC2YR)
# Create weighted R data object that holds the information about the sampling specifications
nhanes_design <- survey::svydesign(data = nhanes,
strata = ~SDMVSTRA,
id = ~SDMVPSU,
nest = TRUE,
weights = ~WTMEC4YR)
Now we can seemlessly work with the weighted data. Let’s create a frequency table that accounts for sample weights.
table_gender_smoke_w <- survey::svytable(~Gender + SmokeNow, design = nhanes_design) # contains estimated frequencies based on survey weights
table_gender_smoke_w
## SmokeNow
## Gender No Yes
## female 24074780 19887290
## male 29578392 24544477
For reference, there is some sample code for the other main analyses using the survey package.
13.0.0.1 Survey-weighted histogram
# BMI
ggplot(data = nhanes,
mapping = aes(BMI, weight = WTMEC4YR)) +
geom_histogram(binwidth = 2,
color = "white") +
labs(x = "BMI")
## Warning: Removed 2279 rows containing non-finite values (stat_bin).
13.0.0.2 Survey-weighted bar plot
Bar plot of SleepHrsNight by gender
# Compute the survey-weighted mean by Gender
Sleep_mean_gender <- survey::svyby(formula = ~SleepHrsNight,
by = ~Gender,
design = nhanes_design,
FUN = svymean,
na.rm = TRUE,
keep.names = FALSE)
# Construct a bar plot of average SleepHrsNight by gender
ggplot(data = Sleep_mean_gender, mapping = aes(x = Gender, y = SleepHrsNight)) +
geom_col() +
labs("Average Hours of Sleep")
13.0.0.3 Survey-weighted Chi2 test
Next, let’s conduct a Chi2 test with survey weights
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: survey::svychisq(~Gender + SmokeNow, design = nhanes_design, statistic = "Chisq")
## X-squared = 0.025482, df = 1, p-value = 0.9573
13.0.0.4 Survey-weighted t-test
# do mean sleeping hours differ between men and women
svyttest(formula = SleepHrsNight~Gender,
design = nhanes_design)
## Warning in summary.glm(g): observations with zero weight not used
## for calculating dispersion
## Warning in summary.glm(glm.object): observations with zero weight
## not used for calculating dispersion
##
## Design-based t-test
##
## data: SleepHrsNight ~ Gender
## t = -3.4077, df = 32, p-value = 0.001785
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
## -0.15287218 -0.04123256
## sample estimates:
## difference in mean
## -0.09705237
# Let's get the mean values printed out (created in step prior to t-test) to aid interpretation of the t-test
Sleep_mean_gender
## Gender SleepHrsNight se
## 1 female 6.976103 0.02374684
## 2 male 6.879050 0.01953263
13.0.0.5 Survey-weighted scatter plot
Let’s look at the relationship between height and weight in 8 year old children. You can set the transparency parameter alpha to be the sample weight (WTMEC4YR). Observations with a bigger weight will show as darker points.
ggplot(data = nhanes %>% filter(Age == 8),
mapping = aes(x = Height, y = Weight, alpha = WTMEC4YR)) +
geom_point() +
guides(alpha = FALSE)
## Warning: Removed 20 rows containing missing values (geom_point).
13.0.0.6 Survey-weighted regression
Let’s model Weight based on Height for 8 year olds
model <- svyglm(Weight ~ Height, data = nhanes %>% filter(Age == 8),
design = nhanes_design)
summary(model)
##
## Call:
## svyglm(formula = Weight ~ Height, design = nhanes_design, data = nhanes %>%
## filter(Age == 8))
##
## Survey design:
## survey::svydesign(data = nhanes, strata = ~SDMVSTRA, id = ~SDMVPSU,
## nest = TRUE, weights = ~WTMEC4YR)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -92.1619 1.5032 -61.31 <2e-16 ***
## Height 1.0181 0.0101 100.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 362.9893)
##
## Number of Fisher Scoring iterations: 2