Survey Design and Analysis
2023-03-05
Preface
This is a compilation of notes from my study of survey design and analysis. I completed Data Camp courses Survey and Measurement Development in R (Mount, n.d.) and Analyzing Survey Data in R (McConville, n.d.), then moved on to Thomas Lumley’s Complex Surveys: a guide to analysis using R (Lumley 2010). The following resources are also helpful.
- UCLA IDRE seminar (“Survey Data Analysis with r. UCLA: Statistical Consulting Group,” n.d.)
The survey Package
Only simple random sample survey designs can be analyzed with with normal statistical test functions - complex survey designs require special treatment. The survey package (Lumley 2021) handles both simple and complex survey designs. It contains a few data sets that will serve as examples in these notes. There is also a helper package named jtools (Long 2022) that will come in handy for regression.
library(tidyverse)
library(survey)
library(jtools)
data(api, package = "survey")
Common Terms
Complex survey designs scale the n data observations up to the N population values. Each observation has an associated sampling weight. A sampling weight is a probability weight with on or more adjustments for non-response, non-coverage, calibration, and trimming.
For a simple random sample, the sampling weights equal the probability weights, N/n, meaning each observation represents N/n people.
In stratified sampling, the population is initially segmented by interesting factors (e.g. gender, race/ethnicity, or SES), then sampled with simple random sampling. This ensures a specified number of observations for each stratum. The sample is less variable because the stratification process eliminates one source of variability. The sampling weights equal the probability weights again, but now there are separate weights for each stratum, \(N_A / n_A\), \(N_B / n_B\), etc.
Cluster sampling segments a population (usually geographically) into clusters, then randomly selects clusters to survey. Within a cluster, all members may be sampled (single stage design), or a random sample is taken (multi-stage design). Cluster sampling is almost universal in large-scale surveys involving in-person interviews because it dramatically reduces cost. Unfortunately, it also increases variance because observations within a cluster tend to be similar.
Two other terms used throughout the notes are primary sampling unit (PSU) and finite population correction. Primary sampling unit is the initial sampling factor. For cluster sampling, the PSU is the top level cluster; for stratified and simple random samples, it is the elementary unit. Finite population corrections is a multiplier, \(\sqrt{(N-n)/(N-1)}\), used to scale standard errors when the sample size is large relative to the population size.1