Chapter 3 Data summarization

3.1 Examining the data

Before fitting a regression model, it is a good idea to look at all the variables individually. Based on what you see, you might modify some of the variables. How you decide what to do depends on the type of regression model you are fitting. For example, we might be looking to see if a continuous variable is symmetric vs. skewed, or if a categorical variable has levels that are very sparse. We will discuss what to look for when we discuss each model in future chapters.

  • Continuous variables: Create a numerical summary and plot a histogram using summary() and hist()
  • Categorical variables: Create a frequency table and plot a bar chart using table() and barplot()

For example, using data from a 20% subset of the NHANES 2017-2018 cycle, let’s summarize the data for systolic blood pressure (BPXSY1), age (RIDAGEYR), smoking status (smoker), gender (RIAGENDR), race/Hispanic origin (RIDRETH3), education (adults age 20y+) (DMDEDUC2), and annual household income (income). Notice the use of exclude = NULL in table() which tells R to include the number of missing values in the frequency table.

load("Data/NHANES-2017-2018-sub20.Rdata")
# Continuous variables
summary(nhanes$BPXSY1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      88     112     124     127     136     216 
##    NA's 
##     182
summary(nhanes$RIDAGEYR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    34.0    52.0    50.5    66.0    80.0
hist(nhanes$BPXSY1,   xlab = "", main = "Systolic Blood Pressure (mmHg)")

hist(nhanes$RIDAGEYR, xlab = "", main = "Age (years)")

# Categorical variables
table(nhanes$smoker,   exclude = NULL)
## 
##   Never    Past Current 
##     680     282     209
table(nhanes$RIAGENDR, exclude = NULL)
## 
##   Male Female 
##    578    593
table(nhanes$RIDRETH3, exclude = NULL)
## 
##   Mexican American     Other Hispanic 
##                173                105 
## Non-Hispanic White Non-Hispanic Black 
##                396                257 
## Non-Hispanic Asian        Other/Multi 
##                168                 72
table(nhanes$DMDEDUC2, exclude = NULL)
## 
##             < 9            9-11          HS/GED 
##              98             133             273 
## Some college/AA    College grad            <NA> 
##             352             252              63
table(nhanes$income,   exclude = NULL)
## 
##            < $25,000 $25,000 to < $55,000 
##                  271                  319 
##             $55,000+                 <NA> 
##                  408                  173
# barplot expects the frequencies, not the raw data, so use table inside barplot
barplot(table(nhanes$smoker),   ylab = "Frequency", xlab = "Smoking Status")

barplot(table(nhanes$RIAGENDR), ylab = "Frequency", xlab = "Gender")

barplot(table(nhanes$RIDRETH3), ylab = "Frequency", xlab = "Race/Hispanic Origin")

barplot(table(nhanes$DMDEDUC2), ylab = "Frequency", xlab = "Education Level (Adults 20y+)")

barplot(table(nhanes$income),   ylab = "Frequency", xlab = "Annual Household Income")

For categorical variables, you can also compute the sample proportions (instead of frequencies) using prop.table() and create a barplot with those proportions.

# Proportion of all cases
prop.table(table(nhanes$income, exclude = NULL))
## 
##            < $25,000 $25,000 to < $55,000 
##               0.2314               0.2724 
##             $55,000+                 <NA> 
##               0.3484               0.1477
# Proportion of non-missing cases
prop.table(table(nhanes$income))
## 
##            < $25,000 $25,000 to < $55,000 
##               0.2715               0.3196 
##             $55,000+ 
##               0.4088
barplot(prop.table(table(nhanes$income)), ylab = "Proportion", xlab = "Annual Household Income")

You can also use summary() to summarize multiple variables all at once:

library(tidyverse)
nhanes %>% 
  select(BPXSY1, smoker, RIDAGEYR, RIAGENDR, RIDRETH3, DMDEDUC2, income) %>% 
  summary()
##      BPXSY1        smoker       RIDAGEYR   
##  Min.   : 88   Never  :680   Min.   :18.0  
##  1st Qu.:112   Past   :282   1st Qu.:34.0  
##  Median :124   Current:209   Median :52.0  
##  Mean   :127                 Mean   :50.5  
##  3rd Qu.:136                 3rd Qu.:66.0  
##  Max.   :216                 Max.   :80.0  
##  NA's   :182                               
##    RIAGENDR                 RIDRETH3  
##  Male  :578   Mexican American  :173  
##  Female:593   Other Hispanic    :105  
##               Non-Hispanic White:396  
##               Non-Hispanic Black:257  
##               Non-Hispanic Asian:168  
##               Other/Multi       : 72  
##                                       
##             DMDEDUC2                    income   
##  < 9            : 98   < $25,000           :271  
##  9-11           :133   $25,000 to < $55,000:319  
##  HS/GED         :273   $55,000+            :408  
##  Some college/AA:352   NA's                :173  
##  College grad   :252                             
##  NA's           : 63                             
## 

3.2 Missing data options

If, when examining the data, you find there are any variables with missing data, then you either need to handle the missing data in some way (such as multiple imputation - see Chapter 11) or do a complete case analysis.

3.2.1 Complete case analysis

A complete case analysis is one in which all cases with missing data on any variable are removed. In R, it is easy to specify a complete case analysis within a regression function call. However it is a good idea to remove the cases with missing values from the dataset yourself so that you can create descriptive statistics to go along with your regression analysis that have the same sample size as your regression.

Using the same NHANES data and variables we summarized above, the following code demonstrates how to count the number of cases in a dataset (the sample size) and remove cases with missing data on a set of variables. The first method shown will remove cases but retain all the variables in a dataset. The second method will remove cases but only retain the variables of interest.

Here is a summary of the variables of interest prior to removing any missing data:

# This method removes cases with missing data but keeps all variables in the dataset
SUB <- !is.na(nhanes$BPXSY1)   &
       !is.na(nhanes$smoker)   &
       !is.na(nhanes$RIDAGEYR) &
       !is.na(nhanes$RIAGENDR) &
       !is.na(nhanes$RIDRETH3) &
       !is.na(nhanes$DMDEDUC2) &
       !is.na(nhanes$income) 
complete.dat <- subset(nhanes, subset = SUB)

# This method will remove cases with missing data but also remove other variables
library(tidyverse)
complete.dat <- nhanes %>% 
  select(BPXSY1, smoker, RIDAGEYR, RIAGENDR, RIDRETH3, DMDEDUC2, income) %>% 
  drop_na()

nrow(nhanes)
## [1] 1171
nrow(complete.dat)
## [1] 814
summary(complete.dat)
##      BPXSY1        smoker       RIDAGEYR   
##  Min.   : 88   Never  :442   Min.   :20.0  
##  1st Qu.:114   Past   :219   1st Qu.:37.0  
##  Median :124   Current:153   Median :53.0  
##  Mean   :127                 Mean   :52.0  
##  3rd Qu.:138                 3rd Qu.:66.8  
##  Max.   :216                 Max.   :80.0  
##    RIAGENDR                 RIDRETH3  
##  Male  :409   Mexican American  :118  
##  Female:405   Other Hispanic    : 64  
##               Non-Hispanic White:297  
##               Non-Hispanic Black:174  
##               Non-Hispanic Asian:112  
##               Other/Multi       : 49  
##             DMDEDUC2                    income   
##  < 9            : 58   < $25,000           :211  
##  9-11           : 93   $25,000 to < $55,000:259  
##  HS/GED         :205   $55,000+            :344  
##  Some college/AA:273                             
##  College grad   :185                             
## 

3.2.2 Multiple imputation

Multiple imputation (MI) of missing data is a method of filling in non-missing values for missing values based on the associations between variables. The filling in (imputation) is done randomly and multiple times since we do not want to pretend we know for sure what the missing values are. You then fit your regression model on each complete dataset and then combine the results according to certain mathematical rules.

Suppose your original dataset has 250 cases, of which only 200 have complete data on all the variables you are interested in. A complete case analysis would use just those 200 cases. An MI analysis, however, would use all 250 cases. However, we do not actually have a sample size of 250. The mathematical rules MI uses when combining the results appropriately account for the fact that we have more information than just the 200 complete cases, but less information than a dataset with 250 complete cases.

This topic will be discussed in more detail in Chapter 11.

3.3 Creating a “Table 1”

In most published articles, there is a “Table 1” which contains descriptive statistics for the sample. This may include, for example, the mean and standard deviation for continuous variables, the frequency and proportion for categorical variables, and perhaps also the number of missing values.

The brute force method of creating such a table would be to compute each statistic for each variable of interest and then copy and paste the results into the table. Having done this even once you will wish for an easier method! There are many possible solutions, but one is demonstrated here - create a function that produces the set of summaries you want, run the function on each variable, and bind the results together.

3.3.1 Function for summarizing a continuous variable

Here is a function for summarizing a continuous variable. The function reads in a variable (x) and computes certain descriptive statistics for that variable. There is also an option (SKEW) with a default value of FALSE that allows you to choose to a different set of statistics for the measures of center and spread (median and inter-quartile range (IQR)) if you think the variable is skewed.

summarize_cont <- function(x, LABEL, SKEW = F) {
  # Count the number of missing values
  NMISS <- sum(is.na(x))
  # Min and max
  MIN <- min(x, na.rm = T)
  MAX <- max(x, na.rm = T)
  if (!SKEW) {
    # Symmetric x
    # Mean
    CENTER <- mean(x, na.rm = T)
    # Standard deviation
    SPREAD <- sd(  x, na.rm = T)
    STATS <- "mean/sd"
  } else if (SKEW) {
    # Skewed x
    # Median
    CENTER <- median(x, na.rm = T)
    # Inter-quartile range
    SPREAD <- IQR(   x, na.rm = T)
    STATS <- "median/IQR"
  }
  # Putting it together
  OUT <- data.frame(LABEL, CENTER, SPREAD, MIN, MAX, NMISS, STATS)
  names(OUT) <- c("x", "center", "spread", "min", "max", "nmiss", "stats")
  return(OUT)
}

# Example
summarize_cont(nhanes$BPXSY1, "SBP")
##     x center spread min max nmiss   stats
## 1 SBP  126.8  19.86  88 216   182 mean/sd

3.3.2 Function for summarizing a categorical variable

Here is a function for summarizing a categorical variable.

summarize_cat <- function(x, LABEL) {
  if(!is.factor(x)) {
    return("Error: x must be a factor")
  } else {
  # Count the number of missing values
  NMISS <- sum(is.na(x))
  # Frequency
  N <- table(x)
  # Proportion
  P <- prop.table(N)
  # Pad NMISS to have the right number of rows
  NMISS <- c(NMISS, rep(NA, length(N)-1))
  OUT   <- data.frame(LABEL, names(N), as.numeric(N), as.numeric(P), NMISS)
  names(OUT) <- c("x", "level", "n", "p", "nmiss")
  return(OUT)
  }
}

# Example
summarize_cat(nhanes$income, "Income")
##        x                level   n      p nmiss
## 1 Income            < $25,000 271 0.2715   173
## 2 Income $25,000 to < $55,000 319 0.3196    NA
## 3 Income             $55,000+ 408 0.4088    NA

3.3.3 Overall

Let’s use the functions we just created to create summary statistics for all the variables we have been summarizing in this chapter, for the entire sample (overall).

Which dataset should we use - the dataset that includes missing values or the dataset that has missing values removed? If doing a complete case analysis, one would typically just summarize the dataset with no missing values as that represents the data to be used in your regression analysis. Depending on the extent of the missing data, one may also include information on the numbers of cases with missing values for each variable. If doing an MI analysis, one would typically summarize the dataset with missing values, including the number of missing values. Both ways are demonstrated here.

# Complete case analysis - Use complete.dat, not nhanes

TABLE1_CONT <- rbind(summarize_cont(complete.dat$BPXSY1,   "SBP"),
                     summarize_cont(complete.dat$RIDAGEYR, "Age"))

TABLE1_CAT <- rbind(summarize_cat(complete.dat$smoker,   "Smoker"),
                    summarize_cat(complete.dat$RIAGENDR, "Gender"),
                    summarize_cat(complete.dat$RIDRETH3, "Race"),
                    summarize_cat(complete.dat$DMDEDUC2, "Education"),
                    summarize_cat(complete.dat$income,   "Income"))

TABLE1_CONT
##     x center spread min max nmiss   stats
## 1 SBP 127.35  19.97  88 216     0 mean/sd
## 2 Age  51.99  17.88  20  80     0 mean/sd
TABLE1_CAT
##            x                level   n       p nmiss
## 1     Smoker                Never 442 0.54300     0
## 2     Smoker                 Past 219 0.26904    NA
## 3     Smoker              Current 153 0.18796    NA
## 4     Gender                 Male 409 0.50246     0
## 5     Gender               Female 405 0.49754    NA
## 6       Race     Mexican American 118 0.14496     0
## 7       Race       Other Hispanic  64 0.07862    NA
## 8       Race   Non-Hispanic White 297 0.36486    NA
## 9       Race   Non-Hispanic Black 174 0.21376    NA
## 10      Race   Non-Hispanic Asian 112 0.13759    NA
## 11      Race          Other/Multi  49 0.06020    NA
## 12 Education                  < 9  58 0.07125     0
## 13 Education                 9-11  93 0.11425    NA
## 14 Education               HS/GED 205 0.25184    NA
## 15 Education      Some college/AA 273 0.33538    NA
## 16 Education         College grad 185 0.22727    NA
## 17    Income            < $25,000 211 0.25921     0
## 18    Income $25,000 to < $55,000 259 0.31818    NA
## 19    Income             $55,000+ 344 0.42260    NA
# MI analysis - Use original dataset with all cases

TABLE1_CONT_MI <- rbind(summarize_cont(nhanes$BPXSY1,   "SBP"),
                        summarize_cont(nhanes$RIDAGEYR, "Age"))

TABLE1_CAT_MI <- rbind(summarize_cat(nhanes$smoker,   "Smoker"),
                       summarize_cat(nhanes$RIAGENDR, "Gender"),
                       summarize_cat(nhanes$RIDRETH3, "Race"),
                       summarize_cat(nhanes$DMDEDUC2, "Education"),
                       summarize_cat(nhanes$income,   "Income"))

TABLE1_CONT_MI
##     x center spread min max nmiss   stats
## 1 SBP 126.76  19.86  88 216   182 mean/sd
## 2 Age  50.49  18.94  18  80     0 mean/sd
TABLE1_CAT_MI
##            x                level   n       p nmiss
## 1     Smoker                Never 680 0.58070     0
## 2     Smoker                 Past 282 0.24082    NA
## 3     Smoker              Current 209 0.17848    NA
## 4     Gender                 Male 578 0.49360     0
## 5     Gender               Female 593 0.50640    NA
## 6       Race     Mexican American 173 0.14774     0
## 7       Race       Other Hispanic 105 0.08967    NA
## 8       Race   Non-Hispanic White 396 0.33817    NA
## 9       Race   Non-Hispanic Black 257 0.21947    NA
## 10      Race   Non-Hispanic Asian 168 0.14347    NA
## 11      Race          Other/Multi  72 0.06149    NA
## 12 Education                  < 9  98 0.08845    63
## 13 Education                 9-11 133 0.12004    NA
## 14 Education               HS/GED 273 0.24639    NA
## 15 Education      Some college/AA 352 0.31769    NA
## 16 Education         College grad 252 0.22744    NA
## 17    Income            < $25,000 271 0.27154   173
## 18    Income $25,000 to < $55,000 319 0.31964    NA
## 19    Income             $55,000+ 408 0.40882    NA

3.3.4 By outcome or exposure

In many published articles, the descriptives are presented not only “overall” (over the entire sample) but also by the outcome or exposure. If the “by” variable is continuous, then for the purpose of the descriptive table, it is converted to a categorical variable so there can be a column of descriptive statistics for each level of the by variable. A common method, which is demonstrated here, is to use a median split in which a binary variable is created based on whether the value of the continuous variable is below or at least as large as the median value.

Let’s summarize the variables by systolic blood pressure, using a median split to create a binary “by” variable. For simplicity, this will just be demonstrated for the complete data. Essentially, you create a dataset for each level of the by variable (complete.dat0 and complete.dat1), and compute descriptive statistics for each dataset.

MEDIAN <- median(complete.dat$BPXSY1)
MEDIAN
## [1] 124
require(tidyverse)
complete.dat0 <- complete.dat %>% filter(BPXSY1 <  MEDIAN)
complete.dat1 <- complete.dat %>% filter(BPXSY1 >= MEDIAN)
nrow(complete.dat)
## [1] 814
nrow(complete.dat0)
## [1] 376
nrow(complete.dat1)
## [1] 438
TABLE1_CONT0 <- rbind(summarize_cont(complete.dat0$BPXSY1,   "SBP"),
                      summarize_cont(complete.dat0$RIDAGEYR, "Age"))
TABLE1_CAT0 <- rbind(summarize_cat(complete.dat0$smoker,   "Smoker"),
                     summarize_cat(complete.dat0$RIAGENDR, "Gender"),
                     summarize_cat(complete.dat0$RIDRETH3, "Race"),
                     summarize_cat(complete.dat0$DMDEDUC2, "Education"),
                     summarize_cat(complete.dat0$income,   "Income"))
TABLE1_CONT1 <- rbind(summarize_cont(complete.dat1$BPXSY1,   "SBP"),
                      summarize_cont(complete.dat1$RIDAGEYR, "Age"))
TABLE1_CAT1 <- rbind(summarize_cat(complete.dat1$smoker,   "Smoker"),
                     summarize_cat(complete.dat1$RIAGENDR, "Gender"),
                     summarize_cat(complete.dat1$RIDRETH3, "Race"),
                     summarize_cat(complete.dat1$DMDEDUC2, "Education"),
                     summarize_cat(complete.dat1$income,   "Income"))
# Merge - but do not use a join function or merge() as that might put things in the wrong order
# Assuming you created all the above in the same row-ordering, the following will work
# The minus signs tell R to drop the columns that are in common from the latter 2 tables
TABLE1_CONT_ALL <- cbind(TABLE1_CONT, TABLE1_CONT0[,-1],    TABLE1_CONT1[,-1])
TABLE1_CAT_ALL  <- cbind(TABLE1_CAT,  TABLE1_CAT0[,-(1:2)], TABLE1_CAT1[,-(1:2)])

TABLE1_CONT_ALL
##     x center spread min max nmiss   stats center
## 1 SBP 127.35  19.97  88 216     0 mean/sd  111.2
## 2 Age  51.99  17.88  20  80     0 mean/sd   44.0
##   spread min max nmiss   stats center spread min max
## 1  7.605  88 122     0 mean/sd 141.21  16.58 124 216
## 2 15.989  20  80     0 mean/sd  58.84  16.55  20  80
##   nmiss   stats
## 1     0 mean/sd
## 2     0 mean/sd
TABLE1_CAT_ALL
##            x                level   n       p nmiss
## 1     Smoker                Never 442 0.54300     0
## 2     Smoker                 Past 219 0.26904    NA
## 3     Smoker              Current 153 0.18796    NA
## 4     Gender                 Male 409 0.50246     0
## 5     Gender               Female 405 0.49754    NA
## 6       Race     Mexican American 118 0.14496     0
## 7       Race       Other Hispanic  64 0.07862    NA
## 8       Race   Non-Hispanic White 297 0.36486    NA
## 9       Race   Non-Hispanic Black 174 0.21376    NA
## 10      Race   Non-Hispanic Asian 112 0.13759    NA
## 11      Race          Other/Multi  49 0.06020    NA
## 12 Education                  < 9  58 0.07125     0
## 13 Education                 9-11  93 0.11425    NA
## 14 Education               HS/GED 205 0.25184    NA
## 15 Education      Some college/AA 273 0.33538    NA
## 16 Education         College grad 185 0.22727    NA
## 17    Income            < $25,000 211 0.25921     0
## 18    Income $25,000 to < $55,000 259 0.31818    NA
## 19    Income             $55,000+ 344 0.42260    NA
##      n       p nmiss   n       p nmiss
## 1  207 0.55053     0 235 0.53653     0
## 2   87 0.23138    NA 132 0.30137    NA
## 3   82 0.21809    NA  71 0.16210    NA
## 4  175 0.46543     0 234 0.53425     0
## 5  201 0.53457    NA 204 0.46575    NA
## 6   63 0.16755     0  55 0.12557     0
## 7   28 0.07447    NA  36 0.08219    NA
## 8  137 0.36436    NA 160 0.36530    NA
## 9   63 0.16755    NA 111 0.25342    NA
## 10  63 0.16755    NA  49 0.11187    NA
## 11  22 0.05851    NA  27 0.06164    NA
## 12  24 0.06383     0  34 0.07763     0
## 13  33 0.08777    NA  60 0.13699    NA
## 14  92 0.24468    NA 113 0.25799    NA
## 15 134 0.35638    NA 139 0.31735    NA
## 16  93 0.24734    NA  92 0.21005    NA
## 17  83 0.22074     0 128 0.29224     0
## 18 110 0.29255    NA 149 0.34018    NA
## 19 183 0.48670    NA 161 0.36758    NA

3.3.5 Formatting

While these tables have all the information you want, they are not formatted nicely for inclusion in a publication. There are many options for how to get from an R table to a publication-ready table. A method that the author has used often is to write the R table to a CSV file, use a spreadsheet program to format the results (e.g., rounding, concatenating), paste the formatted table into a word processing document, and format the table (headers, lines, etc.) in the word processing program. The code below assumes you have a folder called “Results” in the same location as your R project.

write.csv(TABLE1_CONT_ALL, "Results/table1_cont_all.csv")
write.csv(TABLE1_CAT_ALL,  "Results/table1_cat_all.csv")

Tables 3.1 and 3.2 are nicely formatted “Table 1”s (for categorical and continuous variables, respectively) after splitting by systolic blood pressure.

Table 3.1: Descriptive statistics for categorical variables, overall and by systolic blood pressure (based on median split)
Systolic Blood Pressure (mmHg)
Overall
N = 814
< 124
N = 376 (46.2%)
>= 124
N = 438 (53.8%)
Variable Level N % N % N %
Smoking status Never 442 54.3 207 55.1 235 53.7
Past 219 26.9 87 23.1 132 30.1
Current 153 18.8 82 21.8 71 16.2
Gender Male 409 50.2 175 46.5 234 53.4
Female 405 49.8 201 53.5 204 46.6
Race/ethnicity Mexican American 118 14.5 63 16.8 55 12.6
Other Hispanic 64 7.9 28 7.4 36 8.2
Non-Hispanic White 297 36.5 137 36.4 160 36.5
Non-Hispanic Black 174 21.4 63 16.8 111 25.3
Non-Hispanic Asian 112 13.8 63 16.8 49 11.2
Other/Multi 49 6.0 22 5.9 27 6.2
Education < 9 58 7.1 24 6.4 34 7.8
9-11 93 11.4 33 8.8 60 13.7
HS/GED 205 25.2 92 24.5 113 25.8
Some college/AA 273 33.5 134 35.6 139 31.7
College grad 185 22.7 93 24.7 92 21.0
Income < $25,000 211 25.9 83 22.1 128 29.2
$25,000 to < $55,000 259 31.8 110 29.3 149 34.0
$55,000+ 344 42.3 183 48.7 161 36.8
Table 3.2: Descriptive statistics for continuous variables, overall and by systolic blood pressure (based on median split)
Systolic Blood Pressure (mmHg)
Overall
N = 814
< 124
N = 376 (46.2%)
>= 124
N = 438 (53.8%)
Variable Mean SD Mean SD Mean SD
SBP (mmHg) 127.3 20.0 111.2 7.6 141.2 16.6
Age (years) 52.0 17.9 44.0 16.0 58.8 16.5

3.4 To Do

  • Not all levels are showing in the x-axis for barplots