Preface

This is a compilation of notes from my study of survey design and analysis. I completed the DataCamp courses Survey and Measurement Development in R (Mount, n.d.) and Analyzing Survey Data in R (McConville, n.d.), then moved on to Thomas Lumley’s Complex Surveys: A Guide to Analysis Using R (Lumley 2010). The UCLA Statistical Consulting Group seminar Survey Data Analysis with R (“Survey Data Analysis with R,” n.d.) is also a helpful resource.

The survey Package

Only simple random sample survey designs can be analyzed with the usual statistical test functions; complex survey designs require special treatment. The survey package (Lumley 2021) handles both simple and complex survey designs. It contains a few data sets that will serve as examples in these notes. There is also a helper package named jtools (Long 2021) that will come in handy for regression.

library(tidyverse)
library(survey)
library(jtools)

data(api, package = "survey")
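As a quick orientation, the sketch below lists what `data(api)` makes available. It loads the data into a separate environment (the name `api_env` is only an illustration) so the loaded objects can be listed without relying on whatever else is in the workspace.

```r
library(survey)

# Load the api example data into its own environment so we can list it.
# `api_env` is a hypothetical name used only for this illustration.
api_env <- new.env()
data(api, package = "survey", envir = api_env)

# Names of the data frames that were loaded
loaded <- sort(ls(api_env))
print(loaded)

# Number of rows in each data frame: the population and the samples drawn from it
sapply(mget(loaded, envir = api_env), nrow)
```

The population data frame (`apipop`) sits alongside samples drawn from it under different designs, which is what makes this data set convenient for comparing design-based estimates against known population values.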

Common Terms

Complex survey analysis scales the n sample observations up to the N population units. Each observation has an associated sampling weight. A probability weight is the inverse of an observation's probability of being selected into the sample. A sampling weight is a probability weight with one or more adjustments for non-response, non-coverage, calibration, and trimming.

For a simple random sample, the sampling weights equal the probability weights, N/n, meaning each observation represents N/n people.
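A minimal sketch of the simple random sample case, using the `apisrs` data frame from the survey package. It assumes, as the package documentation describes, that the `pw` column holds the sampling weight and the `fpc` column holds the population size; the object name `srs_design` is just an illustration.

```r
library(survey)
data(api, package = "survey")

# apisrs is a simple random sample of schools from the apipop population.
n <- nrow(apisrs)
N <- unique(apisrs$fpc)   # population size, stored in the fpc column

# Every observation carries the same weight, N / n
all.equal(unique(apisrs$pw), N / n)

# Declare the design: no clustering (ids = ~1), weights and population size
# taken from columns of the data
srs_design <- svydesign(ids = ~1, weights = ~pw, fpc = ~fpc, data = apisrs)

# The weights sum to the population size
sum(weights(srs_design))

# Design-based estimate of the mean 2000 API score and its standard error
svymean(~api00, srs_design)
```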

In stratified sampling, the population is first segmented into strata by factors of interest (e.g., gender, race/ethnicity, or SES), then a simple random sample is drawn within each stratum. This ensures a specified number of observations for each stratum. Estimates are less variable because stratification removes the between-stratum source of variability. The sampling weights again equal the probability weights, but now each stratum has its own weight, \(N_A / n_A\), \(N_B / n_B\), etc.
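The per-stratum weights can be checked against the `apistrat` data frame, which is stratified by school type (`stype`). This is a sketch under the assumption that `fpc` holds the stratum population size and `pw` the sampling weight, as in the package documentation; `N_h`, `n_h`, and `strat_design` are names chosen here for illustration.

```r
library(survey)
data(api, package = "survey")

# Stratum population sizes (N_h) and stratum sample sizes (n_h)
N_h <- tapply(apistrat$fpc, apistrat$stype, unique)
n_h <- tapply(apistrat$pw, apistrat$stype, length)

# One weight per stratum: N_h / n_h
stratum_weights <- N_h / n_h
stratum_weights

# Declare the stratified design
strat_design <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                          fpc = ~fpc, data = apistrat)

# Estimated mean API score, and estimated number of schools of each type
svymean(~api00, strat_design)
svytotal(~stype, strat_design)
```

Because the stratum sizes are known, the estimated totals by school type reproduce the stratum population sizes.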

Cluster sampling segments a population (usually geographically) into clusters, then randomly selects clusters to survey. Within a selected cluster, either all members are surveyed (single-stage design) or a random sample of members is taken (multi-stage design). Cluster sampling is almost universal in large-scale surveys involving in-person interviews because it dramatically reduces cost. Unfortunately, it also increases variance because observations within a cluster tend to be similar.
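The two cluster variants can be sketched with the `apiclus1` (one-stage) and `apiclus2` (two-stage) data frames. The design calls follow the examples in the survey package documentation; the object names are illustrative. The design effect requested with `deff = TRUE` compares the variance under this design with the variance of a simple random sample of the same size, so it gives a rough sense of the variance penalty described above.

```r
library(survey)
data(api, package = "survey")

# One-stage cluster sample: school districts (dnum) are the clusters,
# and every school in a selected district is included.
clus1_design <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc,
                          data = apiclus1)

# Two-stage cluster sample: districts are sampled first, then schools (snum)
# within each selected district. Weights are derived from the two fpc columns.
clus2_design <- svydesign(ids = ~dnum + snum, fpc = ~fpc1 + fpc2,
                          data = apiclus2)

# Number of clusters in the one-stage sample
n_clusters <- length(unique(apiclus1$dnum))
n_clusters

# Mean API score with the design effect for the one-stage design
est_clus1 <- svymean(~api00, clus1_design, deff = TRUE)
est_clus1

# Mean API score under the two-stage design
svymean(~api00, clus2_design)
```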

You will often encounter the term primary sampling unit (PSU). It is the unit selected at the first stage of sampling. For cluster sampling, it is the cluster; for stratified and simple random samples, it is the elementary unit.

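A second term you will encounter is the finite population correction (FPC). It is a factor that shrinks the standard error, and it becomes relevant when the sample is a large fraction of the population. It equals \(\sqrt{(N-n)/(N-1)}\), so it is close to 1 when n is small relative to N and falls to 0 when the entire population is sampled.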
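To get a feel for the size of the correction, the sketch below evaluates the formula at a few sampling fractions. The helper `fpc_factor()` is a hypothetical function written for this illustration, not part of the survey package, and the population size is borrowed from the api example.

```r
# Hypothetical helper: finite population correction, sqrt((N - n) / (N - 1))
fpc_factor <- function(N, n) sqrt((N - n) / (N - 1))

N <- 6194                        # population size from the api example
n <- c(1, 200, 620, 3097, 6194)  # a range of sample sizes up to a full census

fpc_table <- data.frame(
  n        = n,
  fraction = n / N,              # sampling fraction
  fpc      = fpc_factor(N, n)    # multiplier applied to the standard error
)
fpc_table
```

The factor stays close to 1 for small sampling fractions and only starts to matter once the sample covers a sizable share of the population.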

References

Long, Jacob A. 2021. jtools: Analysis and Presentation of Social Scientific Data. https://jtools.jacob-long.com.
Lumley, Thomas. 2010. Complex Surveys: A Guide to Analysis Using R. Wiley. http://r-survey.r-forge.r-project.org/svybook/.
———. 2021. survey: Analysis of Complex Survey Samples. http://r-survey.r-forge.r-project.org/survey/.
McConville, Kelly. n.d. “Analyzing Survey Data in R.” https://app.datacamp.com/learn/courses/analyzing-survey-data-in-r.
Mount, George. n.d. “Survey and Measurement Development in R.” https://app.datacamp.com/learn/courses/survey-and-measurement-development-in-r.
“Survey Data Analysis with R. UCLA: Statistical Consulting Group.” n.d. https://stats.idre.ucla.edu/r/seminars/survey-data-analysis-with-r/.