Chapter 4 Loading Survey Data
4.1 Analysing a survey in R
Further reading: R for Data Science, by Hadley Wickham and Garrett Grolemund.
4.2 Loading the data
As an example dataset we’ll use the CDC National Health and Nutrition Examination Survey (NHANES). It’s American, but it’s easier to access than the Health Survey for England.
In RStudio, create a new project, start a new script, and create a data/ folder.
Download the demographic data file (DEMO_J.XPT) and the Body Measures data file (BMX_J.XPT) to your data folder.
We’ll load some libraries and the demographic data:
library(tidyverse)
library(haven)
library(janitor)
# Load demographic data
nhanes <- read_xpt("data/DEMO_J.XPT")
And look at the first few rows:
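A minimal sketch of one way to do this, using a small stand-in tibble so that it runs without the download. The values are invented for illustration, though the column names match real NHANES variables. With the real data, the `nhanes` object returned by `read_xpt()` takes the place of `nhanes_demo`.

```r
library(dplyr)

# Stand-in for the demographic table: invented values, real NHANES column names
nhanes_demo <- tibble(
  SEQN     = c(93703, 93704, 93705, 93706),  # participant ID
  RIAGENDR = c(2, 1, 2, 1),                  # coded gender
  RIDAGEYR = c(2, 2, 66, 18),                # age in years
  DMDEDUC2 = c(NA, NA, 2, NA)                # coded adult education level
)

# First few rows in the console
head(nhanes_demo, n = 3)

# Compact column-by-column overview: names, types and leading values
glimpse(nhanes_demo)

# In RStudio, View(nhanes_demo) opens the same table in a spreadsheet-style viewer
```

The coded columns (numbers rather than labels) are what make the raw table hard to read on its own.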
We need the data dictionary to make sense of this.
4.3 Cleaning the data
Cleaning data is long and repetitive.
- best practice: clean it once, share the clean data.
- good-enough practice: keep the columns you’re interested in, clean those.
For one-off analysis, good-enough practice is fair and proportionate. For weekly or monthly stats, best practice is better: talk to the Data Science team about RAP (Reproducible Analytical Pipelines).
4.4 Exploring the data
We’ve already explored the data a little with View. This is perfectly valid.
Hypothetical scenario: a stakeholder wants to know whether targeting weight management services at demographics with lower education levels might reduce health inequalities.
Education level is in the demographics table, BMI is in the examination table. We want education & participant ID from demographics, to join it with BMI & participant ID from examinations.
(Adult) education level is held in column DMDEDUC2.
# recode Adult education
nhanes <- nhanes %>%
  mutate(Education = case_when(
    DMDEDUC2 == 1 ~ "Less than 9th grade",
    DMDEDUC2 == 2 ~ "9-11th grade (Includes 12th grade with no diploma)",
    DMDEDUC2 == 3 ~ "High school graduate/GED or equivalent",
    DMDEDUC2 == 4 ~ "Some college or AA degree",
    DMDEDUC2 == 5 ~ "College graduate or above",
    DMDEDUC2 == 7 ~ "Refused",
    DMDEDUC2 == 9 ~ "Don't Know"
  )) %>%
  select(ID = SEQN, Education)
nhanes %>%
  slice_head(n = 10) %>%
  DT::datatable()
See the data dictionary for the examination dataset.
# Load examination data
exam <- read_xpt("data/BMX_J.XPT") %>%
  select(ID = SEQN, BMI = BMXBMI)
exam %>%
  slice_head(n = 10) %>%
  DT::datatable()
Joining them on ID:
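The join itself can be sketched as below. This assumes a `left_join()` that keeps every row of the demographic table and stores the result back in `nhanes`, which is what the later code expects (it reads `BMI` from `nhanes`). The two small tibbles, `demo_toy` and `exam_toy`, are invented stand-ins so the example runs on its own.

```r
library(dplyr)

# Invented stand-ins for the recoded demographic table and the examination table
demo_toy <- tibble(
  ID        = c(1, 2, 3),
  Education = c("College graduate or above", "Less than 9th grade", NA)
)
exam_toy <- tibble(
  ID  = c(1, 2, 4),
  BMI = c(24.1, 31.5, 27.0)
)

# Keep every participant in the demographic table; attach BMI where the IDs match
joined_toy <- demo_toy %>%
  left_join(exam_toy, by = "ID")

joined_toy

# With the real tables, the equivalent step would be:
# nhanes <- nhanes %>% left_join(exam, by = "ID")
```

Participant 3 has no examination record, so their `BMI` is `NA`. An `inner_join()` would drop such rows at this step rather than leaving them for a later filter.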
Keeping only people with a non-missing education level and BMI:
nhanes %>%
  filter(!is.na(Education), !is.na(BMI)) %>%
  select(-ID) %>%
  group_by(Education) %>%
  summarise(average_BMI = mean(BMI)) %>%
  knitr::kable()
Education | average_BMI |
---|---|
9-11th grade (Includes 12th grade with no diploma) | 29.27825 |
College graduate or above | 28.50249 |
Don’t Know | 31.11250 |
High school graduate/GED or equivalent | 30.13217 |
Less than 9th grade | 29.93982 |
Refused | 30.20000 |
Some college or AA degree | 30.82326 |
Refresh on grouping & summarising
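A minimal sketch on invented values: `group_by()` splits the rows by the grouping column, and `summarise()` collapses each group to a single row. The tibble `toy` and its contents are hypothetical.

```r
library(dplyr)

# Invented data: five respondents in two education groups
toy <- tibble(
  Education = c("High school", "High school", "College", "College", "College"),
  BMI       = c(30, 32, 26, 28, 27)
)

toy_summary <- toy %>%
  group_by(Education) %>%
  summarise(
    n           = n(),        # rows in each group
    average_BMI = mean(BMI)   # unweighted mean within each group
  )

toy_summary
```

Including `n()` alongside the mean shows how many respondents sit behind each average, which is worth checking for groups that may be small.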
No obvious relationship there, but I didn’t apply the survey weighting.
4.5 Applying survey weighting for exploratory stats
In reality, someone has already tidied the NHANES data and published it as an R package (NHANES), so I’ll load that instead.
rm(exam, nhanes) # We're not using these data any more, we can remove them from memory.
nhanes <- NHANES::NHANESraw
nhanes %>%
  slice_head(n = 10) %>%
  DT::datatable()
The survey weight is in column WTMEC2YR, and we can summarise with weighted.mean:
nhanes %>%
  filter(!is.na(Education), !is.na(BMI)) %>%
  group_by(Education) %>%
  summarise(average_BMI = weighted.mean(BMI, WTMEC2YR)) %>%
  knitr::kable()
Education | average_BMI |
---|---|
8th Grade | 29.22906 |
9 - 11th Grade | 29.20260 |
High School | 29.40650 |
Some College | 29.17616 |
College Grad | 27.50059 |
The manual page for weighted.mean can be viewed with ?weighted.mean or by pressing F1 when the cursor is inside weighted.mean.
With the weighting applied, it looks like there’s a distinction between college grads (average BMI about 27.5) and non-college grads (about 29.2 to 29.4).
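To see what `weighted.mean` is doing behind a table like the one above: a weighted mean is the sum of each value times its weight, divided by the sum of the weights. A small sketch with invented numbers (the vectors `bmi` and `weight` are hypothetical), checking the function against that formula:

```r
bmi    <- c(22, 30, 35)
weight <- c(5000, 1000, 1000)  # invented: number of people each respondent represents

# Unweighted: every respondent counts equally
mean(bmi)                      # 29

# Weighted: the first respondent stands in for five times as many people
weighted.mean(bmi, weight)     # 25

# The same calculation written out by hand
sum(bmi * weight) / sum(weight)
```

The weighted result is pulled towards the first respondent’s value because that respondent represents most of the population.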