Chapter 4 Loading Survey Data

4.1 Analysing a survey in R

Further reading: R for Data Science, by Hadley Wickham and Garrett Grolemund.

4.2 Loading the data

As an example dataset we’ll use the CDC National Health and Nutrition Examination Survey (NHANES). It’s American, but it’s easier to access than the Health Survey for England.

In RStudio, create a new project, start a new script, and create a data/ folder.

Download the demographic data file (DEMO_J.XPT) and the Body Measures data file (BMX_J.XPT) to your data/ folder.

We’ll load some libraries and the demographic data:

library(tidyverse)
library(haven)
library(janitor)

# Load demographic data
nhanes <- read_xpt("data/DEMO_J.XPT")

And look at the first few rows:

slice_head(nhanes, n=10) %>% 
  View()
slice_head(nhanes, n=10) %>% 
  DT::datatable()

The column names are codes (for example SEQN and DMDEDUC2), so we need the data dictionary to make sense of them.
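
For a quick first look before opening the data dictionary, it can help to list any variable labels stored with the data. This is a minimal sketch, assuming the file was read with haven, which keeps a variable's description in each column's "label" attribute when the file provides one. It uses a small made-up data frame with labels set by hand, and the helper name column_labels is hypothetical.

```r
# Toy stand-in for the loaded data; values and labels are set by hand for illustration.
toy <- data.frame(SEQN = c(93703, 93704), DMDEDUC2 = c(3, 5))
attr(toy$SEQN, "label") <- "Respondent sequence number"
attr(toy$DMDEDUC2, "label") <- "Education level - Adults 20+"

# Hypothetical helper: collect each column's "label" attribute (NA if there is none).
column_labels <- function(df) {
  vapply(df, function(col) {
    lbl <- attr(col, "label", exact = TRUE)
    if (is.null(lbl)) NA_character_ else as.character(lbl)
  }, character(1))
}

column_labels(toy)
```

On the real data the same helper could be called as column_labels(nhanes). The data dictionary is still needed for the meaning of the coded values.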

4.3 Cleaning the data

Cleaning data is slow and repetitive. There are two approaches:

  1. best practice: clean it once, share the clean data.
  2. good-enough practice: keep the columns you’re interested in, clean those.

For a one-off analysis, (2) is fair and proportionate. For weekly or monthly statistics, (1) is better; talk to the Data Science team about RAP (Reproducible Analytical Pipelines).
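
Option (1) can be as simple as saving the cleaned data to a file that later analyses read from. The sketch below is illustrative only: the data frame is made up, and the file name nhanes_clean.rds and the temporary directory are assumptions chosen to keep the example self-contained.

```r
# Stand-in for a cleaned dataset; in practice this would be the cleaned survey data.
clean <- data.frame(ID = 1:3, Education = c("High school", "College", NA))

# Hypothetical output location; a shared project folder would be used in practice.
out_path <- file.path(tempdir(), "nhanes_clean.rds")

# Clean once and save...
saveRDS(clean, out_path)

# ...then each later analysis starts from the saved file instead of re-cleaning.
reloaded <- readRDS(out_path)
all.equal(clean, reloaded)
```

An .rds file keeps column types (including factors) intact, which a CSV file does not.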

4.4 Exploring the data

We’ve already explored the data a little with View. This is perfectly valid.

Hypothetical scenario: a stakeholder wants to know whether targeting weight management services at demographics with lower education levels might reduce health inequalities.

Education level is in the demographics table, and BMI is in the examination table. We want education level and participant ID from demographics, joined to BMI and participant ID from the examination data.

(Adult) education level is held in column DMDEDUC2.

# recode Adult education

nhanes <- nhanes %>% 
  mutate(Education = case_when(
    DMDEDUC2 == 1 ~ "Less than 9th grade",
    DMDEDUC2 == 2 ~ "9-11th grade (Includes 12th grade with no diploma)",
    DMDEDUC2 == 3 ~ "High school graduate/GED or equivalent",
    DMDEDUC2 == 4 ~ "Some college or AA degree",
    DMDEDUC2 == 5 ~ "College graduate or above",
    DMDEDUC2 == 7 ~ "Refused",
    DMDEDUC2 == 9 ~ "Don't Know"
  )) %>% 
  select(ID = SEQN, Education)

nhanes %>% 
  slice_head(n=10) %>% 
  DT::datatable()
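
One design choice worth considering: the recode above keeps "Refused" and "Don't Know" as education levels, and character values sort alphabetically in tables. A minimal sketch of an alternative, on made-up data, treats codes 7 and 9 as missing and stores the levels as an ordered factor so summaries appear in a meaningful order. The object names toy and education_levels are illustrative, not part of the original analysis.

```r
library(dplyr)

education_levels <- c(
  "Less than 9th grade",
  "9-11th grade (Includes 12th grade with no diploma)",
  "High school graduate/GED or equivalent",
  "Some college or AA degree",
  "College graduate or above"
)

# Made-up rows covering every code, plus a missing value.
toy <- tibble(SEQN = 1:8, DMDEDUC2 = c(1, 2, 3, 4, 5, 7, 9, NA))

toy <- toy %>%
  mutate(
    # Codes outside 1-5 (7 = Refused, 9 = Don't Know) become NA.
    Education = factor(DMDEDUC2, levels = 1:5,
                       labels = education_levels, ordered = TRUE)
  ) %>%
  select(ID = SEQN, Education)

count(toy, Education)
```

Whether to drop refusals or report them as their own category is an analysis decision; the factor approach simply makes that choice explicit.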

According to the data dictionary for the examination dataset, BMI is held in column BMXBMI.

# Load examination data

exam <- read_xpt("data/BMX_J.XPT") %>% 
  select(ID = SEQN, BMI = BMXBMI)

exam %>% 
  slice_head(n = 10) %>% 
  DT::datatable()

Joining them on ID:

Refresher on joins

nhanes <- left_join(nhanes, exam, by="ID")

nhanes %>% 
  slice_head(n=10) %>% 
  DT::datatable()
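
To see what left_join does with participants who have no examination record, here is a small sketch on made-up data (the IDs and values are invented). Every row of the left table is kept, unmatched rows get NA for BMI, and rows that appear only in the right table are dropped.

```r
library(dplyr)

demo_toy <- tibble(ID = c(1, 2, 3),
                   Education = c("High school", "College", "High school"))
exam_toy <- tibble(ID = c(1, 3, 4),
                   BMI = c(24.5, 31.2, 27.0))

# Keep every row of demo_toy; ID 2 has no exam record, so its BMI is NA.
joined <- left_join(demo_toy, exam_toy, by = "ID")
joined

# anti_join lists the rows of the left table with no match in the right table.
unmatched <- anti_join(demo_toy, exam_toy, by = "ID")
nrow(unmatched)
```

Counting unmatched rows this way is a cheap check on how many participants will later be lost when filtering on valid BMI.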

Keep only people with a recorded education level and a valid BMI, then average BMI by education level:

nhanes %>% 
  filter(!is.na(Education), !is.na(BMI)) %>% 
  select(-ID) %>% 
  group_by(Education) %>% 
  summarise(average_BMI = mean(BMI)) %>% 
  knitr::kable()

|Education                                          | average_BMI|
|:--------------------------------------------------|-----------:|
|9-11th grade (Includes 12th grade with no diploma) |    29.27825|
|College graduate or above                          |    28.50249|
|Don’t Know                                         |    31.11250|
|High school graduate/GED or equivalent             |    30.13217|
|Less than 9th grade                                |    29.93982|
|Refused                                            |    30.20000|
|Some college or AA degree                          |    30.82326|

Refresh on filter

Refresh on select

Refresh on grouping & summarising
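
As a compact refresher, the sketch below runs the same filter, group and summarise steps on made-up data, and adds n() so the size of each group is visible next to its mean. Group sizes are worth checking because a mean based on only a handful of respondents is unreliable. All values here are invented.

```r
library(dplyr)

toy <- tibble(
  Education = c("A", "A", "B", "B", "B", NA),
  BMI       = c(22, 26, 30, NA, 32, 28)
)

summary_toy <- toy %>%
  filter(!is.na(Education), !is.na(BMI)) %>%  # drop rows missing either value
  group_by(Education) %>%                     # one group per education level
  summarise(
    n = n(),                                  # respondents in the group
    average_BMI = mean(BMI)                   # unweighted mean BMI
  )

summary_toy
```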

There is no obvious relationship here, but these are unweighted means: I didn’t apply the survey weighting.

4.5 Applying survey weighting for exploratory stats

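In practice, someone has already tidied the NHANES data for R (the NHANES package), so I’ll load that instead. Note that NHANESraw covers the 2009 to 2012 survey cycles, so its figures will not match the 2017-18 files loaded above.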

rm(exam, nhanes) # We're not using these data any more, we can remove them from memory.

nhanes <- NHANES::NHANESraw

nhanes %>% 
  slice_head(n=10) %>% 
  DT::datatable()

The survey weight is held in column WTMEC2YR, and we can use it to summarise with weighted.mean:

nhanes %>% 
  filter(!is.na(Education), !is.na(BMI)) %>% 
  group_by(Education) %>% 
  summarise(average_BMI = weighted.mean(BMI, WTMEC2YR)) %>% 
  knitr::kable()

|Education      | average_BMI|
|:--------------|-----------:|
|8th Grade      |    29.22906|
|9 - 11th Grade |    29.20260|
|High School    |    29.40650|
|Some College   |    29.17616|
|College Grad   |    27.50059|

The manual page for weighted.mean can be viewed with ?weighted.mean or F1 when the cursor is inside weighted.mean.
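
As a small worked example of what weighted.mean computes, take two made-up respondents with BMI 20 and 30 and weights 1 and 3. The weighted mean is the sum of value times weight divided by the sum of the weights: (20 × 1 + 30 × 3) / (1 + 3) = 27.5, compared with an unweighted mean of 25.

```r
bmi <- c(20, 30)
w   <- c(1, 3)   # made-up survey weights

mean(bmi)               # unweighted: 25
weighted.mean(bmi, w)   # weighted: 27.5
sum(bmi * w) / sum(w)   # the same calculation written out: 27.5
```

Note that na.rm = TRUE in weighted.mean only removes missing values in the first argument; a missing weight still gives NA, so rows with missing weights need filtering out beforehand.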

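With weighting applied, college graduates have a lower average BMI (about 27.5) than every other group (all between about 29.2 and 29.4), so the main distinction appears to be between college graduates and everyone else.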