3.1 Examining the data

Before fitting a regression model, it is a good idea to examine each of the variables individually. One reason to do this is to look for anomalous values that may require you to look more closely at your data source. Additionally, a common element in the presentation of regression analysis results is a table containing descriptive statistics for each analysis variable.

How a variable is examined depends on whether it is continuous or categorical (as defined in Chapter 2).

  • For continuous variables, create a numerical summary using summary() and plot a histogram using hist().
  • For categorical variables, create a frequency table using table() and plot a bar chart using barplot().

NOTE: Throughout this text, we assume that categorical variables are coded in R as factor variables (for more information, see “Factors” in R for Data Science (H. Wickham, Çetinkaya-Rundel, and Grolemund 2017)).
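
As a brief illustration of the factor coding assumed here, the sketch below converts a character vector to a factor with an explicit level order. The vector income_chr and its values are made up for illustration and are not taken from the NHANES data.

```r
# Hypothetical raw values (not from the NHANES dataset)
income_chr <- c("Low", "High", "Medium", NA, "High", "Low")

# Convert to a factor, specifying the order of the levels explicitly;
# without the levels argument, R would sort the levels alphabetically
income_fct <- factor(income_chr, levels = c("Low", "Medium", "High"))

# Confirm the variable is a factor and inspect its levels
is.factor(income_fct)
levels(income_fct)

# Frequency table, including the missing value
table(income_fct, useNA = "ifany")
```

Setting the levels explicitly matters mainly for ordered categories such as income ranges, since tables and bar charts display the levels in the order in which they are stored.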

Example 3.1: Using data from a random subset of 1,000 adults from the NHANES 2017-2018 examination teaching dataset (see Appendix A.1), summarize the continuous variables systolic blood pressure (sbp) and age (RIDAGEYR) and the categorical variables gender (RIAGENDR) and annual household income category (income).

First, load the NHANES examination teaching dataset using load().

load("Data/nhanes1718_adult_exam_sub_rmph.Rdata")
# For convenience, give the dataset a shorter name
nhanes <- nhanes_adult_exam_sub

Next, use summary() to look at some basic descriptive statistics for the continuous variables. None of these values seem out of the ordinary, although note that there are 42 missing values for SBP.

# Continuous variables
summary(nhanes$sbp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      83     111     121     124     134     234      42
summary(nhanes$RIDAGEYR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    32.0    47.0    47.7    61.0    80.0

Use hist() to create a visualization of the entire distribution for each variable. As shown in Figure 3.1, SBP is a bit skewed to the right, which is typical for many health-related measures, and there are more younger than older individuals. There is a spike at the upper end of the age distribution because NHANES, for privacy reasons, reports ages ≥ 80 years as exactly 80 years (see the NHANES documentation for RIDAGEYR, accessed May 20, 2022).

par(mfrow=c(1,2))
hist(nhanes$sbp,
     xlab = "", main = "Systolic Blood Pressure (mmHg)")
hist(nhanes$RIDAGEYR,
     xlab = "", main = "Age (years)")

Figure 3.1: Histograms of continuous variables
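
To see how reporting all ages of 80 or more as exactly 80 (top-coding) produces a spike like the one in the age histogram, consider the following sketch. It uses simulated ages, not NHANES values, and the names age_true and age_reported are hypothetical.

```r
# Simulated ages (not NHANES values), drawn uniformly from 20 to 95
set.seed(42)
age_true <- sample(20:95, size = 1000, replace = TRUE)

# Top-code: report any age of 80 or more as exactly 80
age_reported <- pmin(age_true, 80)

# The maximum reported age is 80, and the count at 80 is inflated
# relative to the number of individuals who are truly 80 years old
max(age_reported)
sum(age_reported == 80)
sum(age_true == 80)

# The histogram of reported ages shows a spike in the last bin
hist(age_reported, xlab = "", main = "Top-coded age (simulated)")
```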

Computations of common descriptive statistics are demonstrated below. The option na.rm = TRUE is needed if there are any missing values; otherwise, the functions return NA (indicating “missing” or “unknown”).

# Mean
mean(nhanes$sbp, na.rm = TRUE)
## [1] 123.5
# Standard deviation
sd(nhanes$sbp, na.rm = TRUE)
## [1] 17.65
# Median
median(nhanes$sbp, na.rm = TRUE)
## [1] 121
# Interquartile range
IQR(nhanes$sbp, na.rm = TRUE)
## [1] 23
# 25th and 75th percentiles
# (sometimes also referred to as the IQR)
quantile(nhanes$sbp,
         probs = c(0.25, 0.75),
         na.rm = TRUE)
## 25% 75% 
## 111 134
# Minimum
min(nhanes$sbp, na.rm = TRUE)
## [1] 83
# Maximum
max(nhanes$sbp, na.rm = TRUE)
## [1] 234
# Number of missing values
sum(is.na(nhanes$sbp))
## [1] 42
# Number of non-missing values
sum(!is.na(nhanes$sbp))
## [1] 958
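
The effect of the na.rm option is easiest to see on a small vector. The following sketch uses made-up values (the vector x is hypothetical) to show how a single missing value changes the result.

```r
# Hypothetical measurements with one missing value
x <- c(110, 125, NA, 140)

# Without na.rm = TRUE, the missing value propagates to the result
mean(x)
## [1] NA

# With na.rm = TRUE, the mean is computed from the 3 non-missing values
mean(x, na.rm = TRUE)
## [1] 125
```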

For categorical variables, use table() and prop.table() to examine the frequency and proportion of observations at each level (each possible value of the variable). The exclude = NULL option tells table() to include the number of missing values, if there are any, in the frequency table. In the table() call passed to prop.table(), exclude = NULL is omitted here, so the proportions are computed among non-missing cases only.

# Categorical variables
table(nhanes$income, exclude = NULL)
## 
##            < $25,000 $25,000 to < $55,000             $55,000+                 <NA> 
##                  156                  254                  480                  110
prop.table(table(nhanes$income))
## 
##            < $25,000 $25,000 to < $55,000             $55,000+ 
##               0.1753               0.2854               0.5393
table(nhanes$RIAGENDR, exclude = NULL)
## 
##   Male Female 
##    482    518
prop.table(table(nhanes$RIAGENDR))
## 
##   Male Female 
##  0.482  0.518
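
If proportions that treat missing values as their own category are wanted instead, one option is to keep the missing values in the table before calling prop.table(). The sketch below illustrates the difference using a small made-up factor (grp is a hypothetical name) rather than the NHANES data.

```r
# Hypothetical categorical variable with two missing values
grp <- factor(c("A", "B", "B", NA, "A", "B", NA, "B"))

# Proportions among non-missing cases only (denominator = 6)
p_nonmissing <- prop.table(table(grp))
round(p_nonmissing, 3)

# Proportions among all cases, with NA as its own category (denominator = 8)
p_all <- prop.table(table(grp, useNA = "ifany"))
round(p_all, 3)
```

Which denominator is appropriate depends on the purpose: the first describes the distribution among those with known values, while the second also shows how much of the sample is missing.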

The upper income group is most common and there are 110 individuals with missing income values. Missing income values are common in survey data as some individuals are reluctant to disclose their income, even when the response options are ranges of values. Also, NHANES has gender response options “Male” and “Female” and, in this subset of the data, there are more females than males.

Options for visualizing the distribution of categorical variables include vertical and horizontal bar charts created with barplot(), as shown in Figure 3.2.

par(mfrow = c(1, 2))
# barplot() expects frequencies, not the raw data, so use table() inside barplot()
barplot(table(nhanes$RIAGENDR),
        ylab = "Frequency", xlab = "Gender")
# horiz = TRUE draws horizontal bars; cex.names shrinks the category labels
barplot(table(nhanes$income), horiz = TRUE, cex.names = 0.65,
        xlab = "Frequency", ylab = "Annual Household Income")

Figure 3.2: Barcharts of categorical variable frequencies

Substituting prop.table(table()) for table() results in plotting proportions instead of frequencies, as shown in Figure 3.3.

barplot(prop.table(table(nhanes$income)),
        ylab = "Proportion", xlab = "Annual Household Income")

Figure 3.3: Barchart of categorical variable proportions

3.1.1 Detailed description of all variables in a dataset

summary() can also be applied to several variables at once by passing it a data frame containing the variables of interest.

summary(nhanes[, c("sbp", "RIDAGEYR", "RIAGENDR", "income")])
##       sbp         RIDAGEYR      RIAGENDR                    income   
##  Min.   : 83   Min.   :20.0   Male  :482   < $25,000           :156  
##  1st Qu.:111   1st Qu.:32.0   Female:518   $25,000 to < $55,000:254  
##  Median :121   Median :47.0                $55,000+            :480  
##  Mean   :124   Mean   :47.7                NA's                :110  
##  3rd Qu.:134   3rd Qu.:61.0                                          
##  Max.   :234   Max.   :80.0                                          
##  NA's   :42
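
To assemble the kind of descriptive statistics table mentioned at the start of this section, the individual statistics can also be collected into a single table. The following is only a sketch on simulated data; the data frame dat and the helper describe_num() are hypothetical names introduced here for illustration.

```r
# Simulated data standing in for two continuous variables (not NHANES values)
set.seed(1)
dat <- data.frame(x1 = rnorm(50, mean = 120, sd = 15),
                  x2 = runif(50, min = 20, max = 80))
dat$x1[c(3, 10)] <- NA  # introduce two missing values

# Hypothetical helper: common descriptive statistics for one numeric vector
describe_num <- function(x) {
  c(n       = sum(!is.na(x)),
    missing = sum(is.na(x)),
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE),
    min     = min(x, na.rm = TRUE),
    max     = max(x, na.rm = TRUE))
}

# Apply the helper to each column; transpose so that each row is a variable
desc_table <- t(sapply(dat, describe_num))
round(desc_table, 1)
```

The result is a matrix with one row per variable and one column per statistic, which can be formatted further for presentation.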

The describe() function in the Hmisc package (Harrell 2023) also summarizes multiple variables at once and provides more detail than summary().

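# %>% and select() come from dplyr
# (loaded here in case it is not already attached)
library(dplyr)
# Access the describe function without loading
# the entire Hmisc package using the :: syntax
nhanes %>% 
  select(sbp, RIDAGEYR, RIAGENDR, income) %>% 
  Hmisc::describe()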
# (results not shown)

If desired, use write() to export the results to an external file.

write(Hmisc::html(Hmisc::describe(nhanes)), "nhanes_description.html")
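
If HTML output is not needed, the printed output of summary() can be saved as plain text using base R alone. This is a sketch of one alternative, not the approach used above; the data frame dat and the temporary output file are placeholders.

```r
# Placeholder data frame (not the NHANES data)
dat <- data.frame(x = c(1.5, 2.0, NA, 4.5),
                  g = factor(c("a", "b", "b", "a")))

# Write to a temporary file; replace with a path of your choice
out_file <- tempfile(fileext = ".txt")

# capture.output() records what would be printed and writes it to the file
capture.output(summary(dat), file = out_file)

# Read the file back to confirm the contents
readLines(out_file)
```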

References

Harrell, F. E., Jr. 2023. Hmisc: Harrell Miscellaneous. https://hbiostat.org/R/Hmisc/.
Wickham, H., M. Çetinkaya-Rundel, and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. Sebastopol, CA: O’Reilly Media.