Chapter 4 Loading Survey Data

4.1 Analysing a survey in R

Further reading: R for Data Science, by Hadley Wickham and Garrett Grolemund.

4.2 Loading the data

As an example dataset we’ll use the CDC’s National Health and Nutrition Examination Survey (NHANES). It’s American, but it’s easier to access than the Health Survey for England.

In RStudio create a new project, start a new script, and create a data/ folder.

Download the demographic data file (DEMO_J.XPT) and the Body Measures data file (BMX_J.XPT) to your data/ folder.
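Before loading anything, it can help to confirm that both downloads landed where the script expects them. The following is a minimal sketch, assuming the files keep the names used in the read_xpt() calls in this chapter; missing_files() is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: return the expected files that are not yet in `dir`.
missing_files <- function(dir, expected) {
  paths <- file.path(dir, expected)
  expected[!file.exists(paths)]
}

# File names assumed from the read_xpt() calls in this chapter.
expected <- c("DEMO_J.XPT", "BMX_J.XPT")

# Report anything that still needs downloading into data/.
missing <- missing_files("data", expected)
if (length(missing) > 0) {
  message("Still to download: ", paste(missing, collapse = ", "))
}
```

If nothing is printed, both files are in place and the loading code should find them.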

We’ll load some libraries and the demographic data:

library(tidyverse)
library(haven)
library(janitor)

# Load demographic data
nhanes <- read_xpt("data/DEMO_J.XPT")

And look at the first few rows:

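# In an interactive session, open the first 10 rows in the RStudio viewer
slice_head(nhanes, n = 10) %>% 
  View()

# In a rendered document, show them as an interactive table instead
slice_head(nhanes, n = 10) %>% 
  DT::datatable()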

	SEQN	SDDSRVYR	RIDSTATR	RIAGENDR	RIDAGEYR	RIDAGEMN	RIDRETH1	RIDRETH3	RIDEXMON	RIDEXAGM	DMQMILIZ	DMDBORN4	DMDCITZN	DMDYRSUS	DMDEDUC3	DMDEDUC2	DMDMARTL	SIALANG	SIAPROXY	SIAINTRP	FIALANG	FIAPROXY	FIAINTRP	MIALANG	MIAPROXY	MIAINTRP	AIALANGA	DMDHHSIZ	DMDFMSIZ	DMDHHSZA	DMDHHSZB	DMDHHSZE	DMDHRGND	DMDHRAGZ	DMDHREDZ	DMDHRMAZ	DMDHSEDZ	WTINT2YR	WTMEC2YR	SDMVPSU	SDMVSTRA	INDHHIN2	INDFMIN2	INDFMPIR
1	93703	10	2	2	2		5	6	2	27		1	1					1	1	2	1	2	2					5	5	3	0	0	1	2	3	1	3	9246.49186482052	8539.7313482887	2	145	15	15	5
2	93704	10	2	1	2		3	3	1	33		1	1					1	1	2	1	2	2					4	4	2	0	0	1	2	3	1	2	37338.7683431381	42566.6147498209	1	143	15	15	5
3	93705	10	2	2	66		4	4	2		2	1	1			2	3	1	2	2	1	2	2	1	2	2	1	1	1	0	0	1	2	4	1	2		8614.57117241211	8338.41978618326	2	145	3	3	0.82
4	93706	10	2	1	18		5	6	2	222	2	1	1		15			1	2	2				1	2	2	1	5	5	0	0	1	1	4	3	1	2	8548.63261927425	8723.43981388878	2	134
5	93707	10	2	1	13		5	7	2	158		1	1		6			1	1	2	1	2	2	1	2	2	1	7	7	0	3	0	1	3	2	1	3	6769.34456666972	7064.60972999734	1	138	10	10	1.88
6	93708	10	2	2	66		5	6	2		2	2	1	7		1	1	1	2	1	1	2	2	1	2	1	3	2	2	0	0	2	1	4	1	1	1	13329.4505888369	14372.4887646224	2	138	6	6	1.63
7	93709	10	2	2	75		4	4	1		2	1	1			4	2	1	2	2	1	2	2					1	1	0	0	1	2	4	2	2		12043.3882714762	12277.5566617585	1	136	2	2	0.41
8	93710	10	2	2	0	11	3	3	2	13		1	1					1	1	2	1	2	2					3	3	1	0	0	1	2	3	1	3	16418.2984160994	16848.0201168608	1	134	15	15	4.9
9	93711	10	2	1	56		5	6	2		2	2	1	6		5	1	1	2	2	1	2	2	1	2	2	1	3	3	0	0	0	1	3	3	1	3	11178.2601064187	12390.9197244453	2	134	15	15	5
10	93712	10	2	1	18		1	1	2	227	2	2	2	5	12			1	2	2	2	2	2	1	2	2	1	4	4	0	2	0	2	3	1	2		29040.4965580623	30336.6543246729	2	147	4	4	0.76

We need the data dictionary to make sense of this.
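The data dictionary is the authoritative source, but XPT files can also carry a short label for each variable, and haven keeps these in each column’s "label" attribute. The sketch below collects such labels into a lookup table. The toy data frame and its label text are assumptions standing in for nhanes, and get_labels() is a hypothetical helper.

```r
# Toy stand-in for a data frame read with haven::read_xpt().
# The label text here is illustrative, not copied from the real file.
toy <- data.frame(SEQN = c(93703, 93704), RIAGENDR = c(2, 1))
attr(toy$SEQN, "label") <- "Respondent sequence number"
attr(toy$RIAGENDR, "label") <- "Gender"

# Hypothetical helper: one row per column, with its label (NA if none).
get_labels <- function(df) {
  labels <- vapply(
    df,
    function(col) {
      lab <- attr(col, "label", exact = TRUE)
      if (is.null(lab)) NA_character_ else as.character(lab)
    },
    character(1)
  )
  data.frame(column = names(df), label = unname(labels), stringsAsFactors = FALSE)
}

get_labels(toy)
```

Calling get_labels(nhanes) on the real data should give a quick first pass at what each column holds, assuming the file carries labels; the coded values still need the data dictionary.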

4.3 Cleaning the data

Cleaning data is long and repetitive.

1. Best practice: clean it once and share the clean data (example).
2. Good-enough practice: keep the columns you’re interested in and clean those.

For a one-off analysis, (2) is fair and proportionate. For weekly or monthly statistics, (1) is better: talk to the Data Science team about RAP (Reproducible Analytical Pipelines).
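The two approaches can be sketched in miniature as follows. The data here is made up, the column name education_code is an illustrative choice, and the temporary file stands in for whatever shared location a team would really use.

```r
library(dplyr)

# Toy stand-in for the raw demographic table (values are made up).
raw <- tibble(SEQN = c(1, 2, 3), DMDEDUC2 = c(3, NA, 5), SIALANG = c(1, 1, 2))

# Good-enough practice: keep and rename only the columns we need.
clean <- raw %>%
  select(ID = SEQN, education_code = DMDEDUC2)

# Best practice, in miniature: save the cleaned table once so others can reuse it.
# A temporary file is used here; a shared location would be the real target.
clean_path <- tempfile(fileext = ".rds")
saveRDS(clean, clean_path)

# Anyone else can then start from the clean copy.
shared <- readRDS(clean_path)
```

Saving as .rds keeps column types intact, which a CSV round trip does not guarantee.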

4.4 Exploring the data

We’ve already explored the data a little with View. This is perfectly valid.

Hypothetical scenario: a stakeholder wants to know whether targeting weight management services at demographics with lower education levels might reduce health inequalities.

Education level is in the demographics table; BMI is in the examination table. We want education level and participant ID from demographics, joined with BMI and participant ID from the examination data.

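(Adult) education level is held in column DMDEDUC2. It is only recorded for participants aged 20 and over, so it is missing (NA) for younger participants.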

# recode Adult education

nhanes <- nhanes %>% 
  mutate(Education = case_when(
    DMDEDUC2 == 1 ~ "Less than 9th grade",
    DMDEDUC2 == 2 ~ "9-11th grade (Includes 12th grade with no diploma)",
    DMDEDUC2 == 3 ~ "High school graduate/GED or equivalent",
    DMDEDUC2 == 4 ~ "Some college or AA degree",
    DMDEDUC2 == 5 ~ "College graduate or above",
    DMDEDUC2 == 7 ~ "Refused",
    DMDEDUC2 == 9 ~ "Don't Know"
  )) %>% 
  select(ID = SEQN, Education)

nhanes %>% 
  slice_head(n=10) %>% 
  DT::datatable()

	ID	Education
1	93703
2	93704
3	93705	9-11th grade (Includes 12th grade with no diploma)
4	93706
5	93707
6	93708	Less than 9th grade
7	93709	Some college or AA degree
8	93710
9	93711	College graduate or above
10	93712
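The case_when() recode above produces a character column, so later tables list the education levels alphabetically. One alternative, sketched here on a few made-up codes, is a named lookup vector turned into a factor whose levels follow the order of the codes. The names education_labels and codes are illustrative.

```r
# Codes and labels as used in the case_when() recode above.
education_labels <- c(
  "1" = "Less than 9th grade",
  "2" = "9-11th grade (Includes 12th grade with no diploma)",
  "3" = "High school graduate/GED or equivalent",
  "4" = "Some college or AA degree",
  "5" = "College graduate or above",
  "7" = "Refused",
  "9" = "Don't Know"
)

# Made-up codes, including a missing value.
codes <- c(3, 1, NA, 5, 9)

# Look each code up by name, then fix the level order to match the codes.
education <- factor(
  unname(education_labels[as.character(codes)]),
  levels = unname(education_labels)
)

levels(education)
```

Grouped summaries and plots then follow the level order instead of alphabetical order, and missing codes stay NA.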

See the data dictionary for the examination dataset.

# Load examination data

exam <- read_xpt("data/BMX_J.XPT") %>% 
  select(ID = SEQN, BMI = BMXBMI)

exam %>% 
  slice_head(n = 10) %>% 
  DT::datatable()

	ID	BMI
1	93703	17.5
2	93704	15.7
3	93705	31.7
4	93706	21.5
5	93707	18.1
6	93708	23.7
7	93709	38.9
8	93710
9	93711	21.3
10	93712	19.7

Joining them on ID:

Refresher on joins

nhanes <- left_join(nhanes, exam, by="ID")

nhanes %>% 
  slice_head(n=10) %>% 
  DT::datatable()

	ID	Education	BMI
1	93703		17.5
2	93704		15.7
3	93705	9-11th grade (Includes 12th grade with no diploma)	31.7
4	93706		21.5
5	93707		18.1
6	93708	Less than 9th grade	23.7
7	93709	Some college or AA degree	38.9
8	93710
9	93711	College graduate or above	21.3
10	93712		19.7
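A left join keeps every row of the first table and fills in NA where the second table has no match, which is why participant 93710 appears above with an empty BMI. A small sketch on made-up tables (toy_demo and toy_exam are illustrative names) shows how that differs from an inner join:

```r
library(dplyr)

# Made-up tables: ID 2 has no exam record, ID 4 has no demographic record.
toy_demo <- tibble(ID = c(1, 2, 3), Education = c("High school", NA, "College"))
toy_exam <- tibble(ID = c(1, 3, 4), BMI = c(27.1, 22.4, 30.0))

# Keeps IDs 1, 2 and 3; BMI is NA for ID 2.
left <- left_join(toy_demo, toy_exam, by = "ID")

# Keeps only IDs present in both tables: 1 and 3.
inner <- inner_join(toy_demo, toy_exam, by = "ID")

# Rows of toy_demo with no match in toy_exam: ID 2.
unmatched <- anti_join(toy_demo, toy_exam, by = "ID")
```

Checking the anti_join() result is a quick way to see how many participants would be lost before filtering on BMI.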

Keep people with a recorded education level and a valid BMI, then average BMI by education level:

nhanes %>% 
  filter(!is.na(Education), !is.na(BMI)) %>% 
  select(-ID) %>% 
  group_by(Education) %>% 
  summarise(average_BMI = mean(BMI)) %>% 
  knitr::kable()

Education	average_BMI
9-11th grade (Includes 12th grade with no diploma)	29.27825
College graduate or above	28.50249
Don’t Know	31.11250
High school graduate/GED or equivalent	30.13217
Less than 9th grade	29.93982
Refused	30.20000
Some college or AA degree	30.82326

Refresh on filter

Refresh on select

Refresh on grouping & summarising
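A group mean says little if the group is tiny, and categories such as "Refused" may hold only a few people. One option is to report the group size alongside the mean, sketched here on made-up data using the same filter, group and summarise steps:

```r
library(dplyr)

# Made-up data: one row has no BMI, one has no education level.
toy <- tibble(
  Education = c("College", "College", "College", "Refused", NA, "College"),
  BMI       = c(20, 30, NA, 25, 22, 40)
)

summary_tbl <- toy %>%
  filter(!is.na(Education), !is.na(BMI)) %>%  # drop incomplete rows
  group_by(Education) %>%
  summarise(
    n = n(),                  # how many people the mean is based on
    average_BMI = mean(BMI)
  )

summary_tbl
```

Here "Refused" rests on a single person, which is a reason to treat its average with caution.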

No obvious relationship there, but I haven’t applied the survey weighting yet, so these averages describe the sample, not the population.

4.5 Applying survey weighting for exploratory stats

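In reality, someone has already tidied the NHANES data for R (the NHANES package), so I’ll load that. Note that it covers earlier survey years (2009-10 onwards) than the files we downloaded.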

rm(exam, nhanes) # We're not using these data any more, we can remove them from memory.

nhanes <- NHANES::NHANESraw

nhanes %>% 
  slice_head(n=10) %>% 
  DT::datatable()

	ID	SurveyYr	Gender	Age	AgeMonths	Race1	Education	MaritalStatus	HHIncome	HHIncomeMid	Poverty	HomeRooms	HomeOwn	Work	Weight	Length	Height	BMI	BMI_WHO	Pulse	BPSysAve	BPDiaAve	BPSys1	BPDia1	BPSys2	BPDia2	BPSys3	BPDia3	DirectChol	TotChol	UrineVol1	UrineFlow1	UrineVol2	UrineFlow2	Diabetes	DiabetesAge	HealthGen	DaysPhysHlthBad	DaysMentHlthBad	LittleInterest	Depressed	nPregnancies	nBabies	Age1stBaby	SleepHrsNight	SleepTrouble	PhysActive	PhysActiveDays	TVHrsDayChild	CompHrsDayChild	Alcohol12PlusYr	AlcoholDay	AlcoholYear	SmokeNow	Smoke100	SmokeAge	Marijuana	AgeFirstMarij	RegularMarij	AgeRegMarij	HardDrugs	SexEver	SexAge	SexNumPartnLife	SexNumPartYear	SameSex	SexOrientation	WTINT2YR	WTMEC2YR	SDMVPSU	SDMVSTRA
1	51624	2009_10	male	34	409	White	High School	Married	25000-34999	30000	1.36	6	Own	NotWorking	87.4		164.7	32.22	30.0_plus	70	113	85	114	88	114	88	112	82	1.29	3.49	352				No		Good	0	15	Most	Several				4	Yes	No				Yes		0	No	Yes	18	Yes	17	No		Yes	Yes	16	8	1	No	Heterosexual	80100.54351	81528.77201	1	83
2	51625	2009_10	male	4	49	Other			20000-24999	22500	1.07	9	Own		17		105.4	15.3	12.0_18.5																No														4	1																		53901.10429	56995.03543	2	79
3	51626	2009_10	male	16	202	Black			45000-54999	50000	2.27	5	Own	NotWorking	72.3		181.3	22	18.5_to_24.9	68	109	59	112	62	114	60	104	58	1.55	4.97	281	0.415			No		Vgood	2	0						8	No	Yes	5																				13953.07834	14509.27886	1	84
4	51627	2009_10	male	10	131	Black			20000-24999	22500	0.81	6	Rent		39.8		147.8	18.22	12.0_18.5	68	93	41	92	36	94	44	92	38	1.89	4.16	139	1.078			No														1	1																		11664.8994	12041.63537	2	86
5	51628	2009_10	female	60	722	Black	High School	Widowed	10000-14999	12500	0.69	6	Rent	NotWorking	116.8		166	42.39	30.0_plus	72	150	68	154	70	150	68	150	68	1.16	5.22	30	0.476	246	2.51	Yes	56	Fair	20	25	Most	Most	1	1		4	No	No				No		0	Yes	Yes	16					No	Yes	15	4		No		20090.33926	21000.33872	2	75
6	51629	2009_10	male	26	313	Mexican	9 - 11th Grade	Married	25000-34999	30000	1.01	4	Rent	Working	97.6		173	32.61	30.0_plus	72	104	49	102	50	104	48	104	50	1.16	4.14	202	0.563			No		Good	2	14	None	Most				4	No	Yes	2			Yes	19	48	No	Yes	15	Yes	10	Yes	12	Yes	Yes	9	10	1	No	Heterosexual	22537.827	22633.58187	1	88
7	51630	2009_10	female	49	596	White	Some College	LivePartner	35000-44999	40000	1.91	5	Rent	NotWorking	86.7		168.4	30.57	30.0_plus	86	112	75	118	82	108	74	116	76	1.16	6.7	77	0.094			No		Good	0	10	Several	Several	2	2	27	8	Yes	No				Yes	2	20	Yes	Yes	38	Yes	18	No		Yes	Yes	12	10	1	Yes	Heterosexual	74212.26999	74112.48684	2	85
8	51631	2009_10	female	1	12	White			35000-44999	40000	1.36	5	Rent		9.4	75.7																			No																																	23306.39774	24776.49196	2	86
9	51632	2009_10	male	10	124	Hispanic			65000-74999	70000	2.68	7	Own		26		140.3	13.21	12.0_18.5	70	108	53	106	60	106	50	110	56	1.58	4.14	39	0.3			No														1	0																		8056.943427	8175.945593	2	88
10	51633	2009_10	male	80		White	Some College	Married	15000-19999	17500	1.27	4	Own	NotWorking	79.1		174.3	26.04	25.0_to_29.9	88	139	43	142	62	140	46	138	40	1.94	4.71	128	1.208			No		Excellent	0	0	None	None				6	No	Yes	4			Yes	1	52	No	Yes	16												11998.4012	12381.11532	1	77

The survey weight is in the WTMEC2YR column, and we can summarise with weighted.mean:

nhanes %>% 
  filter(!is.na(Education), !is.na(BMI)) %>% 
  group_by(Education) %>% 
  summarise(average_BMI = weighted.mean(BMI, WTMEC2YR)) %>% 
  knitr::kable()

Education	average_BMI
8th Grade	29.22906
9 - 11th Grade	29.20260
High School	29.40650
Some College	29.17616
College Grad	27.50059

The manual page for weighted.mean can be viewed with ?weighted.mean or F1 when the cursor is inside weighted.mean.
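To see what weighted.mean is doing, here is a small worked example on made-up numbers. Each value counts in proportion to its weight, so the result is sum(x * w) / sum(w).

```r
# Made-up BMI values and weights: the third person "represents" twice as many people.
bmi <- c(20, 30, 40)
wt  <- c(1, 1, 2)

unweighted <- mean(bmi)                 # (20 + 30 + 40) / 3 = 30
weighted   <- weighted.mean(bmi, wt)    # (20*1 + 30*1 + 40*2) / 4 = 32.5
by_hand    <- sum(bmi * wt) / sum(wt)   # same calculation written out

# Rescaling all weights by a constant leaves the weighted mean unchanged.
rescaled <- weighted.mean(bmi, wt * 1000)
```

The weighted result moves towards 40 because that observation carries the largest weight. Only the relative sizes of the weights matter for a mean, which is why the rescaled version gives the same answer.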

It looks like there’s a distinction between college grads and non-college grads.