1.20 Datasets used in this book

Each dataset used in this book is described below, along with links to the original data sources. In some cases the datasets have been modified from the original to illustrate various R programming points and, as such, are not suitable for research purposes. The datasets as used in this text are NOT being made publicly available. If you are using this material as part of a Wright State University course or tutorial, datasets will be provided to you by the instructor or in your learning management system.

1.20.1 Arthritis Treatment

“Rheumatoid arthritis (RA) patients in two age ranges who were receiving care at a clinic in Philadelphia are included. Variables include age and sex, several indicators of disease activity and whether or not patients were administered selected common treatments for RA.” (Edward Gracely, “Arthritis Treatment Dataset”, TSHS Resources Portal (2020). Available at Arthritis Treatment.)

Sample size: 530

Documentation & Codebook:

1.20.2 2013-2014 National Health and Nutrition Examination Survey (NHANES)

“The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.” (About the National Health and Nutrition Examination Survey, accessed 7/9/2020).

What data were collected? Demographics, information about chronic conditions, risk factors, lifestyle and behavioral factors. More details, as well as a description of the 2013-2014 survey cycle target population, objectives, and data collection procedures can be found at NHANES 2013-2014 Overview.

A table showing how the variables collected have changed over time can be found in the National Health and Nutrition Examination Survey 1999–2016 Survey Content Brochure.

Sample size: “In 2013-2014, 14,332 persons were selected for NHANES from 30 different survey locations. Of those selected, 10,175 completed the interview and 9,813 were examined.” (NHANES 2013-2014 Overview) Thus, for examination variables, the sample size will be smaller than for interview variables. The variable income is a recoding of the NHANES variable INDHHIN2, and the variable smoke is a recoding of multiple smoking status variables. SEQN is the unique respondent identifier.
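To see why examination variables have fewer usable observations, it can help to look at what happens when an interview file and an examination file are joined on SEQN. The following is a minimal sketch using two tiny made-up data frames; the values are invented for illustration, and the column names RIDAGEYR and BMXBMI are used only as plausible examples of an interview variable and an examination variable.

```r
# Toy data frames standing in for an interview file and an examination file.
# All values are made up for illustration; they are not NHANES data.
demo <- data.frame(
  SEQN     = c(1, 2, 3, 4, 5),
  RIDAGEYR = c(34, 51, 8, 67, 22)   # age in years (interview)
)
exam <- data.frame(
  SEQN   = c(1, 2, 4),              # respondents 3 and 5 were not examined
  BMXBMI = c(24.1, 31.7, 27.3)      # body mass index (examination)
)

# Left join on the respondent identifier: keep every interviewed respondent,
# filling examination variables with NA for those who were not examined
combined <- merge(demo, exam, by = "SEQN", all.x = TRUE)

# Number of non-missing values in each column
n_available <- colSums(!is.na(combined))
print(n_available)
```

In this toy example all five respondents have an age, but only three have a BMI, so any analysis involving BMXBMI is based on three observations rather than five.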

Documentation & Codebook: Go to NHANES 2013-2014 Data, Documentation, Codebooks for links to documentation for all variables. If you are not sure where to look for a specific variable, you can search online and usually the link you need will appear near the top of your search results. For example, the variable INDHHIN2 (Annual Household Income) has values from 1 to 15, as well as 77 and 99, but no labels to tell you what these mean (and it turns out they are not in order from lowest to highest!). See Demographic Variables for documentation.
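As an illustration of why the codebook matters, here is a sketch of how the raw INDHHIN2 codes might be turned into a labeled factor. The labels below are an assumption based on a reading of the 2013-2014 demographics codebook and should be verified against the Demographic Variables documentation before use. The input vector is hypothetical, and this is not necessarily how the income variable in this text was created.

```r
# Lookup from INDHHIN2 codes to labels.
# ASSUMPTION: labels transcribed from the codebook; verify before relying on them.
# Codes 12 and 13 are broad categories, which is why the codes are not in
# order from lowest to highest income.
income_labels <- c(
  "1"  = "$0 to $4,999",
  "2"  = "$5,000 to $9,999",
  "3"  = "$10,000 to $14,999",
  "4"  = "$15,000 to $19,999",
  "5"  = "$20,000 to $24,999",
  "6"  = "$25,000 to $34,999",
  "7"  = "$35,000 to $44,999",
  "8"  = "$45,000 to $54,999",
  "9"  = "$55,000 to $64,999",
  "10" = "$65,000 to $74,999",
  "12" = "$20,000 and over",
  "13" = "Under $20,000",
  "14" = "$75,000 to $99,999",
  "15" = "$100,000 and over"
)

# Hypothetical raw codes (not real respondents)
INDHHIN2 <- c(1, 15, 12, 77, 6, 99, NA, 13)

# Treat 77 (refused) and 99 (don't know) as missing
codes <- ifelse(INDHHIN2 %in% c(77, 99), NA, INDHHIN2)

# Unordered factor, since the categories overlap and cannot be strictly ranked
income_f <- factor(codes,
                   levels = as.numeric(names(income_labels)),
                   labels = unname(income_labels))

table(income_f, useNA = "ifany")
```

Because codes 12 and 13 overlap with the narrower categories, an analysis that needs an ordered income variable would have to collapse categories first; how best to do that depends on the question being asked.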

1.20.3 2017 Youth Risk Behavior Surveillance System (YRBSS)

“The Youth Risk Behavior Surveillance System (YRBSS) was developed in 1990 to monitor health behaviors that contribute markedly to the leading causes of death, disability, and social problems among youth and adults in the United States. These behaviors, often established during childhood and early adolescence, include: Behaviors that contribute to unintentional injuries and violence; Sexual behaviors related to unintended pregnancy and sexually transmitted infections, including HIV infection; Alcohol and other drug use; Tobacco use; Unhealthy dietary behaviors; Inadequate physical activity.

“In addition, the YRBSS monitors the prevalence of obesity and asthma and other health-related behaviors plus sexual identity and sex of sexual contacts.

“From 1991 through 2017, the YRBSS has collected data from more than 4.4 million high school students in more than 1,900 separate surveys.” (Youth Risk Behavior Surveillance System (YRBSS) Overview, accessed 8/24/2018)

The target population for the YRBSS is all public, Catholic, and other private school students in grades 9 through 12 in the United States.

Sample size: 203,663

Documentation & Codebook: See the YRBS Data and Documentation page, in particular the 2017 YRBS Data User’s Guide. This is the national-level data, so the “sitecode” variable is always “United States”. State-level datasets are also available at the YRBSS website.

1.20.4 Caveat regarding survey weights

Surveys (e.g., NHANES, YRBSS) are often conducted using methods more complex than simple random sampling and therefore require, for many types of analyses, the incorporation of survey weights. Using survey weights is beyond the scope of this book. However, see “Analyzing Complex Survey Data” in “Introduction to Regression Methods for Public Health Using R” for more information.

For your future work, refer to the documentation for each dataset to determine if survey weights are available and how to use them. Since we are not using them here, any computations based on these datasets in this text should not be considered representative of the target population of that survey.
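To give a sense of how much weights can matter, the following sketch compares an unweighted and a weighted proportion using base R. The data frame, its column names, and all values are hypothetical and chosen only to make the difference obvious; they do not come from NHANES or YRBSS.

```r
# Hypothetical sample of six respondents (values invented for illustration).
# Each weight is the number of people in the target population that the
# respondent is taken to represent.
toy <- data.frame(
  smoker = c(1, 1, 1, 0, 0, 0),
  weight = c(100, 100, 100, 500, 500, 700)
)

# Unweighted proportion: every respondent counts equally
unweighted <- mean(toy$smoker)

# Weighted proportion: respondents count in proportion to their weight
weighted <- weighted.mean(toy$smoker, w = toy$weight)

print(c(unweighted = unweighted, weighted = weighted))
```

Here the unweighted proportion is 0.5, while the weighted proportion is 0.15, because the smokers in this toy sample each represent far fewer people than the non-smokers. Note that weights alone only adjust the point estimate; valid standard errors for a complex survey also require the design information (such as strata and clusters), which is handled by dedicated tools such as the survey package.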