14.6 Case Study: The NHANES data

In Sects. 12.10 and 13.7, the NHANES data were introduced (Centers for Disease Control and Prevention (CDC) 1988–1994; Center for Disease Control and Prevention 1996; Pruim 2015), and graphs and numerical summaries were used to understand the data relevant to answering this RQ:

Among Americans, is the mean direct HDL cholesterol different for those who smoke now, and those who do not smoke now?

The data can be summarised numerically: the response variable (HDL cholesterol), the explanatory variable (current smoking status), and potential extraneous and confounding variables. Different summaries are needed for quantitative (means and standard deviations; medians and IQR) and qualitative (percentages; odds) variables (Table 14.8).

TABLE 14.8: A summary of some variables in the NHANES data set, according to current smoking status (current smoking status was not reported for 6789 respondents, and some other variables were not reported for all respondents). Quantitative variables are summarised using either the mean and standard deviation, or median and IQR; qualitative variables using percentages. There are many missing values.
Quantity	Statistic	Overall	Non-smokers	Smokers
Sample size		10000	1745	1466
Direct HDL (mmol/L)	n	8474	1668	1388
	Mean	1.36	1.39	1.31
	Std. dev.	0.4	0.43	0.42
Gender	n	10000	1745	1466
	% Female	50.2	43.8	43.5
Age (years)	n	10000	1745	1466
	Mean	36.74	54.28	42.68
	Std. dev.	22.4	16.64	14.79
Height (cm)	n	9647	1726	1459
	Mean	161.88	170.06	170.43
	Std. dev.	20.19	9.75	9.27
Weight (kg)	n	9922	1727	1458
	Mean	70.98	84.5	80.54
	Std. dev.	29.13	20.73	19.72
BMI (kg/m²)	n	9634	1726	1458
	Mean	26.66	29.09	27.7
	Std. dev.	7.38	6.19	6.42
Diabetes	n	9858	1743	1466
	% Yes	7.7	15.3	7.2
Urine volume (mL)	n	9013	1723	1447
	Median	94	97	102
	IQR	114	118.5	104

A number of interesting questions emerge from Table 14.8:

How can the mean age of all respondents be 36.7 years, but the mean age for non-smokers and smokers both be much larger than this (54.3 and 42.7 years respectively)?
Similarly, how can the percentage of females in the whole sample be 50.2%, but be less than this for both non-smokers and smokers (43.8% and 43.5% respectively)?
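
One way to explore these two questions is to recall that current smoking status was not reported for 6789 respondents, so the overall figures also include people who appear in neither the non-smoker nor the smoker column. The sketch below back-calculates the mean age and percentage of females that this remaining group would need for the overall values in Table 14.8 to hold. The function name `implied_remainder` is hypothetical, and the results are only approximate because the table values are rounded.

```python
# Values read from Table 14.8: (sample size, mean age in years, % female).
overall = (10000, 36.74, 50.2)
non_smokers = (1745, 54.28, 43.8)
smokers = (1466, 42.68, 43.5)


def implied_remainder(overall, *groups):
    """Back-calculate n, mean age and % female for respondents in no listed group."""
    n_all, mean_all, pct_all = overall
    n_rest = n_all - sum(n for n, _, _ in groups)
    # Total of all ages, minus the age totals of the listed groups.
    age_total_rest = n_all * mean_all - sum(n * mean for n, mean, _ in groups)
    # Approximate number of females overall, minus those in the listed groups.
    females_rest = n_all * pct_all / 100 - sum(n * pct / 100 for n, _, pct in groups)
    return n_rest, age_total_rest / n_rest, 100 * females_rest / n_rest


n_rest, mean_age_rest, pct_female_rest = implied_remainder(overall, non_smokers, smokers)
print(f"Smoking status not reported: n = {n_rest}")
print(f"Implied mean age: {mean_age_rest:.1f} years")
print(f"Implied percentage female: {pct_female_rest:.1f}%")
```

Under these assumptions, the respondents with no reported smoking status are younger on average, and include a higher percentage of females, than either reported group. The table alone does not say who these respondents are.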

Table 14.9 summarises the relationship between current smoking status and having a diabetes diagnosis. Again, questions emerge:

For current non-smokers, the percentage of diabetics is \(15.32\)%.
For current smokers, the percentage of diabetics is \(7.23\)%.

The percentage of diabetics in the sample is different for non-smokers and smokers. Why? Similarly,

For current non-smokers, the odds of finding a diabetic are \(0.181\).
For current smokers, the odds of finding a diabetic are \(0.078\).

The odds of finding a diabetic in the sample are different for non-smokers and smokers. Why?
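
The percentages and odds quoted above can be reproduced from the counts in Table 14.9. The following is a minimal sketch; the dictionary layout and the function names `percent_diabetic` and `odds_diabetic` are illustrative choices, not part of the NHANES data set.

```python
# Counts from Table 14.9 (diabetes diagnosis by current smoking status).
counts = {
    "Doesn't smoke now": {"Not diabetic": 1476, "Diabetic": 267},
    "Smokes now": {"Not diabetic": 1360, "Diabetic": 106},
}


def percent_diabetic(group):
    """Percentage: diabetics divided by everyone in the group, times 100."""
    total = group["Diabetic"] + group["Not diabetic"]
    return 100 * group["Diabetic"] / total


def odds_diabetic(group):
    """Odds: diabetics divided by non-diabetics in the group."""
    return group["Diabetic"] / group["Not diabetic"]


summary = {
    name: (percent_diabetic(group), odds_diabetic(group))
    for name, group in counts.items()
}
for name, (pct, odds) in summary.items():
    print(f"{name}: {pct:.2f}% diabetic, odds {odds:.3f}")
```

The percentage divides by the whole group, while the odds divide by the non-diabetics only, which is why the two summaries give different numbers from the same counts.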

As noted before (Sect. 14.4), two possible reasons could explain this difference in percentages and odds in the sample:

Sampling variation: The percentages (and odds) are the same in the population, but differences in the sample occur because of the people who happened to end up in the sample. Sampling variation explains the difference in the sample percentages (and odds).
The percentages (and odds) are different in the population for non-smokers and smokers, and the difference in the sample percentages (and odds) simply reflects a difference between non-smokers and smokers in the population.
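
The first explanation can be illustrated with a small simulation. The sketch below assumes a hypothetical population in which the proportion of diabetics is the same (0.116) for non-smokers and smokers; this value is chosen to be close to the pooled sample proportion in Table 14.9 (373 of 3209) and is an assumption, not a known population value. The function name `simulate_differences`, the number of simulated samples, and the seed are also illustrative.

```python
import random


def simulate_differences(n_non=1743, n_smoke=1466, p=0.116, n_sims=200, seed=1):
    """Simulate differences in sample percentages when the population percentages are equal.

    p is an ASSUMED common population proportion of diabetics (hypothetical).
    Group sizes match the column totals of Table 14.9.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        # Each simulated person is diabetic with the same probability p.
        pct_non = 100 * sum(rng.random() < p for _ in range(n_non)) / n_non
        pct_smoke = 100 * sum(rng.random() < p for _ in range(n_smoke)) / n_smoke
        diffs.append(pct_non - pct_smoke)
    return diffs


diffs = simulate_differences()
largest_gap = max(abs(d) for d in diffs)
print(f"Largest simulated gap: {largest_gap:.2f} percentage points")
```

Every simulated sample is drawn from the same assumed population, yet the sample percentages for the two groups rarely match exactly: the gaps are produced by sampling variation alone.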

In the next chapters, tools for deciding which of these two explanations is more likely are discussed.

TABLE 14.9: The two-way table of diabetes diagnosis against current smoking status
	Doesn’t smoke now	Smokes now
Not diabetic	1476	1360
Diabetic	267	106

References

Center for Disease Control and Prevention. National Center for Health Statistics. Third National Health and Nutrition Examination Survey, 1988–1994, NHANES III Laboratory Data File [Internet]. Hyattsville, MD: Public Use Data File Documentation Number 76200; U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 1996. Available from: https://wwwn.cdc.gov/nchs/data/nhanes3/1a/readme.txt.

Centers for Disease Control and Prevention (CDC). National Center for Health Statistics. National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 1988–1994.

Pruim R. NHANES: Data from the US National Health and Nutrition Examination Study [Internet]. 2015. Available from: https://CRAN.R-project.org/package=NHANES.