14.6 Case Study: The NHANES data

In Sects. 12.10 and 13.7, the NHANES data were introduced (Centers for Disease Control and Prevention (CDC) 1988–1994; Center for Disease Control and Prevention 1996; Pruim 2015), and graphs and numerical summaries were used to understand the data relevant to answering this RQ:

Among Americans, is the mean direct HDL cholesterol different for those who smoke now, and those who do not smoke now?

The data can be summarised numerically: the response variable (HDL cholesterol), the explanatory variable (current smoking status), and potential extraneous and confounding variables. Different summaries are needed for quantitative (means and standard deviations; medians and IQR) and qualitative (percentages; odds) variables (Table 14.8).

TABLE 14.8: A summary of some variables in the NHANES data set, according to current smoking status (current smoking status was not reported for 6789 respondents, and some other variables were not reported for all respondents). Quantitative variables are summarised using either the mean and standard deviation, or median and IQR; qualitative variables using percentages. There are many missing values.
| Quantity | Statistic | Overall | Non-smokers | Smokers |
|---|---|---|---|---|
| Sample size | | 10000 | 1745 | 1466 |
| Direct HDL (mmol/L) | n | 8474 | 1668 | 1388 |
| | Mean | 1.36 | 1.39 | 1.31 |
| | Std. dev. | 0.4 | 0.43 | 0.42 |
| Gender | n | 10000 | 1745 | 1466 |
| | % Female | 50.2 | 43.8 | 43.5 |
| Age (years) | n | 10000 | 1745 | 1466 |
| | Mean | 36.74 | 54.28 | 42.68 |
| | Std. dev. | 22.4 | 16.64 | 14.79 |
| Height (cm) | n | 9647 | 1726 | 1459 |
| | Mean | 161.88 | 170.06 | 170.43 |
| | Std. dev. | 20.19 | 9.75 | 9.27 |
| Weight (kg) | n | 9922 | 1727 | 1458 |
| | Mean | 70.98 | 84.5 | 80.54 |
| | Std. dev. | 29.13 | 20.73 | 19.72 |
| BMI (kg/m²) | n | 9634 | 1726 | 1458 |
| | Mean | 26.66 | 29.09 | 27.7 |
| | Std. dev. | 7.38 | 6.19 | 6.42 |
| Diabetes | n | 9858 | 1743 | 1466 |
| | % Yes | 7.7 | 15.3 | 7.2 |
| Urine volume (mL) | n | 9013 | 1723 | 1447 |
| | Median | 94 | 97 | 102 |
| | IQR | 114 | 118.5 | 104 |

A number of interesting questions emerge from Table 14.8:

  • How can the mean age of all respondents be 36.7 years, but the mean age for non-smokers and smokers both be much larger than this (54.3 and 42.7 years respectively)?
  • Similarly, how can the percentage of females in the whole sample be 50.2%, when the percentage of females is less than this for both non-smokers and smokers (43.8% and 43.5% respectively)?
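One way to explore these two questions is to work backwards from Table 14.8. The overall mean is a weighted average over non-smokers, smokers, and the 6789 respondents whose smoking status was not reported, so the summaries for that third group can be recovered by arithmetic. The following Python sketch does this; the function name `implied_remaining_mean` is illustrative only, and the results are approximate because the table values are rounded.

```python
# Values copied from Table 14.8 (rounded as published).
N_ALL, N_NON, N_SMK = 10000, 1745, 1466
N_REST = N_ALL - N_NON - N_SMK  # smoking status not reported

def implied_remaining_mean(overall, non_smokers, smokers):
    """Mean (or percentage) implied for respondents with no reported smoking status.

    Uses: overall total = sum of the group totals, where total = n * mean.
    Applicable here because age and gender were recorded for all 10000 respondents.
    """
    total_rest = N_ALL * overall - N_NON * non_smokers - N_SMK * smokers
    return total_rest / N_REST

mean_age_rest = implied_remaining_mean(36.74, 54.28, 42.68)
pct_female_rest = implied_remaining_mean(50.2, 43.8, 43.5)

print(f"Respondents without a reported smoking status: {N_REST}")
print(f"Implied mean age: {mean_age_rest:.1f} years")
print(f"Implied percentage female: {pct_female_rest:.1f}%")
```

By this arithmetic, the respondents without a reported smoking status are younger on average (about 31 years) and include a higher percentage of females (about 53%) than either reported group. This is how the overall figures can fall outside the range spanned by non-smokers and smokers.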

Table 14.9 summarises the relationship between current smoking status and having a diabetes diagnosis. Again, questions emerge:

  • For current non-smokers, the percentage of diabetics is \(15.32\)%.
  • For current smokers, the percentage of diabetics is \(7.23\)%.

The percentage of diabetics in the sample is different for non-smokers and smokers. Why? Similarly,

  • For current non-smokers, the odds of finding a diabetic are \(0.181\).
  • For current smokers, the odds of finding a diabetic are \(0.078\).

The odds of finding a diabetic in the sample are different for non-smokers and smokers. Why?
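The percentages and odds quoted above follow directly from the counts in Table 14.9. As a check, this short Python sketch recomputes them; the dictionary layout and the helper names `percent_diabetic` and `odds_diabetic` are illustrative choices, not part of the NHANES data.

```python
# Counts from Table 14.9 for each current smoking status.
counts = {
    "Doesn't smoke now": {"not_diabetic": 1476, "diabetic": 267},
    "Smokes now": {"not_diabetic": 1360, "diabetic": 106},
}

def percent_diabetic(group):
    """Diabetics as a percentage of everyone in the group."""
    total = group["diabetic"] + group["not_diabetic"]
    return 100 * group["diabetic"] / total

def odds_diabetic(group):
    """Number of diabetics per non-diabetic in the group."""
    return group["diabetic"] / group["not_diabetic"]

for status, group in counts.items():
    print(f"{status}: {percent_diabetic(group):.2f}% diabetic, "
          f"odds {odds_diabetic(group):.3f}")
```

The only difference between the two calculations is the denominator: the percentage divides by the group total, while the odds divide by the number of non-diabetics.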

As noted before (Sect. 14.4), two possible reasons could explain this difference in percentages and odds in the sample:

  • Sampling variation: The percentages (and odds) are the same in the population, but differences in the sample occur because of the people who happened to end up in the sample. Sampling variation explains the difference in the sample percentages (and odds).

  • A difference in the population: The percentages (and odds) are different in the population for non-smokers and smokers, and the difference in the sample percentages (and odds) simply reflects a difference between non-smokers and smokers in the population.
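To get a feel for the first explanation, sampling variation can be imitated with a small simulation. The Python sketch below assumes, purely for illustration, that the population percentage of diabetics is the same for non-smokers and smokers (set to the pooled sample value from Table 14.9), and then repeatedly draws samples of the observed sizes. The function name, the number of repetitions and the seed are arbitrary choices, and this is not a formal method for choosing between the two explanations.

```python
import random

# Sample sizes from Table 14.9. The common population proportion is an
# ASSUMPTION (the pooled sample proportion), used only for illustration.
N_NON, N_SMK = 1476 + 267, 1360 + 106
P_COMMON = (267 + 106) / (N_NON + N_SMK)

def sample_percent(n, p, rng):
    """Percentage of 'diabetics' in one simulated sample of size n."""
    hits = sum(rng.random() < p for _ in range(n))
    return 100 * hits / n

rng = random.Random(1)  # fixed seed so the run is repeatable
diffs = [
    sample_percent(N_NON, P_COMMON, rng) - sample_percent(N_SMK, P_COMMON, rng)
    for _ in range(500)
]

# Even with identical population percentages, the sample differences vary.
print(f"Smallest simulated difference: {min(diffs):.2f} percentage points")
print(f"Largest simulated difference:  {max(diffs):.2f} percentage points")
```

Each repetition gives a different sample difference even though the population percentages are identical by construction; that spread is what sampling variation refers to.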

In the next chapters, tools are discussed for deciding which of these two explanations is more likely.

TABLE 14.9: The two-way table of diabetes diagnosis against current smoking status
| | Doesn’t smoke now | Smokes now |
|---|---|---|
| Not diabetic | 1476 | 1360 |
| Diabetic | 267 | 106 |

References

Center for Disease Control and Prevention. National Center for Health Statistics. Third National Health and Nutrition Examination Survey, 1988–1994, NHANES III Laboratory Data File [Internet]. Hyattsville, MD: Public Use Data File Documentation Number 76200; U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 1996. Available from: https://wwwn.cdc.gov/nchs/data/nhanes3/1a/readme.txt.
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics. National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 1988–1994.
Pruim R. NHANES: Data from the US National Health and Nutrition Examination Study [Internet]. 2015. Available from: https://CRAN.R-project.org/package=NHANES.