# Chapter 3 Data summarization

## 3.1 Examining the data

Before fitting a regression model, it is a good idea to look at all the variables individually. Based on what you see, you might modify some of the variables. How you decide what to do depends on the type of regression model you are fitting. For example, we might be looking to see if a continuous variable is symmetric vs. skewed, or if a categorical variable has levels that are very sparse. We will discuss what to look for when we discuss each model in future chapters.

• Continuous variables: Create a numerical summary and plot a histogram using summary() and hist()
• Categorical variables: Create a frequency table and plot a bar chart using table() and barplot()

For example, using data from a 20% subset of the NHANES 2017-2018 cycle, let’s summarize the data for systolic blood pressure (BPXSY1), age (RIDAGEYR), smoking status (smoker), gender (RIAGENDR), race/Hispanic origin (RIDRETH3), education (adults age 20y+) (DMDEDUC2), and annual household income (income). Notice the use of exclude = NULL in table() which tells R to include the number of missing values in the frequency table.

load("Data/NHANES-2017-2018-sub20.Rdata")
# Continuous variables
summary(nhanes$BPXSY1) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 88 112 124 127 136 216 ## NA's ## 182 summary(nhanes$RIDAGEYR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    18.0    34.0    52.0    50.5    66.0    80.0
hist(nhanes$BPXSY1, xlab = "", main = "Systolic Blood Pressure (mmHg)") hist(nhanes$RIDAGEYR, xlab = "", main = "Age (years)") # Categorical variables
table(nhanes$smoker, exclude = NULL) ## ## Never Past Current ## 680 282 209 table(nhanes$RIAGENDR, exclude = NULL)
##
##   Male Female
##    578    593
table(nhanes$RIDRETH3, exclude = NULL) ## ## Mexican American Other Hispanic ## 173 105 ## Non-Hispanic White Non-Hispanic Black ## 396 257 ## Non-Hispanic Asian Other/Multi ## 168 72 table(nhanes$DMDEDUC2, exclude = NULL)
##
##             < 9            9-11          HS/GED
##              98             133             273
## Some college/AA    College grad            <NA>
##             352             252              63
table(nhanes$income, exclude = NULL) ## ## <$25,000 $25,000 to <$55,000
##                  271                  319
##             $55,000+ <NA> ## 408 173 # barplot expects the frequencies, not the raw data, so use table inside barplot barplot(table(nhanes$smoker),   ylab = "Frequency", xlab = "Smoking Status") barplot(table(nhanes$RIAGENDR), ylab = "Frequency", xlab = "Gender") barplot(table(nhanes$RIDRETH3), ylab = "Frequency", xlab = "Race/Hispanic Origin") barplot(table(nhanes$DMDEDUC2), ylab = "Frequency", xlab = "Education Level (Adults 20y+)") barplot(table(nhanes$income),   ylab = "Frequency", xlab = "Annual Household Income") For categorical variables, you can also compute the sample proportions (instead of frequencies) using prop.table() and create a barplot with those proportions.

# Proportion of all cases
prop.table(table(nhanes$income, exclude = NULL)) ## ## <$25,000 $25,000 to <$55,000
##               0.2314               0.2724
##             $55,000+ <NA> ## 0.3484 0.1477 # Proportion of non-missing cases prop.table(table(nhanes$income))
##
##            < $25,000$25,000 to < $55,000 ## 0.2715 0.3196 ##$55,000+
##               0.4088
barplot(prop.table(table(nhanes$income)), ylab = "Proportion", xlab = "Annual Household Income") You can also use summary() to summarize multiple variables all at once: library(tidyverse) nhanes %>% select(BPXSY1, smoker, RIDAGEYR, RIAGENDR, RIDRETH3, DMDEDUC2, income) %>% summary() ## BPXSY1 smoker RIDAGEYR ## Min. : 88 Never :680 Min. :18.0 ## 1st Qu.:112 Past :282 1st Qu.:34.0 ## Median :124 Current:209 Median :52.0 ## Mean :127 Mean :50.5 ## 3rd Qu.:136 3rd Qu.:66.0 ## Max. :216 Max. :80.0 ## NA's :182 ## RIAGENDR RIDRETH3 ## Male :578 Mexican American :173 ## Female:593 Other Hispanic :105 ## Non-Hispanic White:396 ## Non-Hispanic Black:257 ## Non-Hispanic Asian:168 ## Other/Multi : 72 ## ## DMDEDUC2 income ## < 9 : 98 <$25,000           :271
##  9-11           :133   $25,000 to <$55,000:319
##  HS/GED         :273   $55,000+ :408 ## Some college/AA:352 NA's :173 ## College grad :252 ## NA's : 63 ##  ## 3.2 Missing data options If, when examining the data, you find there are any variables with missing data, then you either need to handle the missing data in some way (such as multiple imputation - see Chapter 11) or do a complete case analysis. ### 3.2.1 Complete case analysis A complete case analysis is one in which all cases with missing data on any variable are removed. In R, it is easy to specify a complete case analysis within a regression function call. However it is a good idea to remove the cases with missing values from the dataset yourself so that you can create descriptive statistics to go along with your regression analysis that have the same sample size as your regression. Using the same NHANES data and variables we summarized above, the following code demonstrates how to count the number of cases in a dataset (the sample size) and remove cases with missing data on a set of variables. The first method shown will remove cases but retain all the variables in a dataset. The second method will remove cases but only retain the variables of interest. Here is a summary of the variables of interest prior to removing any missing data: # This method removes cases with missing data but keeps all variables in the dataset SUB <- !is.na(nhanes$BPXSY1)   &
!is.na(nhanes$smoker) & !is.na(nhanes$RIDAGEYR) &
!is.na(nhanes$RIAGENDR) & !is.na(nhanes$RIDRETH3) &
!is.na(nhanes$DMDEDUC2) & !is.na(nhanes$income)
complete.dat <- subset(nhanes, subset = SUB)

# This method will remove cases with missing data but also remove other variables
library(tidyverse)
complete.dat <- nhanes %>%
select(BPXSY1, smoker, RIDAGEYR, RIAGENDR, RIDRETH3, DMDEDUC2, income) %>%
drop_na()

nrow(nhanes)
##  1171
nrow(complete.dat)
##  814
summary(complete.dat)
##      BPXSY1        smoker       RIDAGEYR
##  Min.   : 88   Never  :442   Min.   :20.0
##  1st Qu.:114   Past   :219   1st Qu.:37.0
##  Median :124   Current:153   Median :53.0
##  Mean   :127                 Mean   :52.0
##  3rd Qu.:138                 3rd Qu.:66.8
##  Max.   :216                 Max.   :80.0
##    RIAGENDR                 RIDRETH3
##  Male  :409   Mexican American  :118
##  Female:405   Other Hispanic    : 64
##               Non-Hispanic White:297
##               Non-Hispanic Black:174
##               Non-Hispanic Asian:112
##               Other/Multi       : 49
##             DMDEDUC2                    income
##  < 9            : 58   < $25,000 :211 ## 9-11 : 93$25,000 to < $55,000:259 ## HS/GED :205$55,000+            :344
##  Some college/AA:273
## 

### 3.2.2 Multiple imputation

Multiple imputation (MI) of missing data is a method of filling in non-missing values for missing values based on the associations between variables. The filling in (imputation) is done randomly and multiple times since we do not want to pretend we know for sure what the missing values are. You then fit your regression model on each complete dataset and then combine the results according to certain mathematical rules.

Suppose your original dataset has 250 cases, of which only 200 have complete data on all the variables you are interested in. A complete case analysis would use just those 200 cases. An MI analysis, however, would use all 250 cases. However, we do not actually have a sample size of 250. The mathematical rules MI uses when combining the results appropriately account for the fact that we have more information than just the 200 complete cases, but less information than a dataset with 250 complete cases.

This topic will be discussed in more detail in Chapter 11.

## 3.3 Creating a “Table 1”

In most published articles, there is a “Table 1” which contains descriptive statistics for the sample. This may include, for example, the mean and standard deviation for continuous variables, the frequency and proportion for categorical variables, and perhaps also the number of missing values.

The brute force method of creating such a table would be to compute each statistic for each variable of interest and then copy and paste the results into the table. Having done this even once you will wish for an easier method! There are many possible solutions, but one is demonstrated here - create a function that produces the set of summaries you want, run the function on each variable, and bind the results together.

### 3.3.1 Function for summarizing a continuous variable

Here is a function for summarizing a continuous variable. The function reads in a variable (x) and computes certain descriptive statistics for that variable. There is also an option (SKEW) with a default value of FALSE that allows you to choose to a different set of statistics for the measures of center and spread (median and inter-quartile range (IQR)) if you think the variable is skewed.

summarize_cont <- function(x, LABEL, SKEW = F) {
# Count the number of missing values
NMISS <- sum(is.na(x))
# Min and max
MIN <- min(x, na.rm = T)
MAX <- max(x, na.rm = T)
if (!SKEW) {
# Symmetric x
# Mean
CENTER <- mean(x, na.rm = T)
# Standard deviation
SPREAD <- sd(  x, na.rm = T)
STATS <- "mean/sd"
} else if (SKEW) {
# Skewed x
# Median
CENTER <- median(x, na.rm = T)
# Inter-quartile range
SPREAD <- IQR(   x, na.rm = T)
STATS <- "median/IQR"
}
# Putting it together
OUT <- data.frame(LABEL, CENTER, SPREAD, MIN, MAX, NMISS, STATS)
names(OUT) <- c("x", "center", "spread", "min", "max", "nmiss", "stats")
return(OUT)
}

# Example
summarize_cont(nhanes$BPXSY1, "SBP") ## x center spread min max nmiss stats ## 1 SBP 126.8 19.86 88 216 182 mean/sd ### 3.3.2 Function for summarizing a categorical variable Here is a function for summarizing a categorical variable. summarize_cat <- function(x, LABEL) { if(!is.factor(x)) { return("Error: x must be a factor") } else { # Count the number of missing values NMISS <- sum(is.na(x)) # Frequency N <- table(x) # Proportion P <- prop.table(N) # Pad NMISS to have the right number of rows NMISS <- c(NMISS, rep(NA, length(N)-1)) OUT <- data.frame(LABEL, names(N), as.numeric(N), as.numeric(P), NMISS) names(OUT) <- c("x", "level", "n", "p", "nmiss") return(OUT) } } # Example summarize_cat(nhanes$income, "Income")
##        x                level   n      p nmiss
## 1 Income            < $25,000 271 0.2715 173 ## 2 Income$25,000 to < $55,000 319 0.3196 NA ## 3 Income$55,000+ 408 0.4088    NA

### 3.3.3 Overall

Let’s use the functions we just created to create summary statistics for all the variables we have been summarizing in this chapter, for the entire sample (overall).

Which dataset should we use - the dataset that includes missing values or the dataset that has missing values removed? If doing a complete case analysis, one would typically just summarize the dataset with no missing values as that represents the data to be used in your regression analysis. Depending on the extent of the missing data, one may also include information on the numbers of cases with missing values for each variable. If doing an MI analysis, one would typically summarize the dataset with missing values, including the number of missing values. Both ways are demonstrated here.

# Complete case analysis - Use complete.dat, not nhanes

TABLE1_CONT <- rbind(summarize_cont(complete.dat$BPXSY1, "SBP"), summarize_cont(complete.dat$RIDAGEYR, "Age"))

TABLE1_CAT <- rbind(summarize_cat(complete.dat$smoker, "Smoker"), summarize_cat(complete.dat$RIAGENDR, "Gender"),
summarize_cat(complete.dat$RIDRETH3, "Race"), summarize_cat(complete.dat$DMDEDUC2, "Education"),
summarize_cat(complete.dat$income, "Income")) TABLE1_CONT ## x center spread min max nmiss stats ## 1 SBP 127.35 19.97 88 216 0 mean/sd ## 2 Age 51.99 17.88 20 80 0 mean/sd TABLE1_CAT ## x level n p nmiss ## 1 Smoker Never 442 0.54300 0 ## 2 Smoker Past 219 0.26904 NA ## 3 Smoker Current 153 0.18796 NA ## 4 Gender Male 409 0.50246 0 ## 5 Gender Female 405 0.49754 NA ## 6 Race Mexican American 118 0.14496 0 ## 7 Race Other Hispanic 64 0.07862 NA ## 8 Race Non-Hispanic White 297 0.36486 NA ## 9 Race Non-Hispanic Black 174 0.21376 NA ## 10 Race Non-Hispanic Asian 112 0.13759 NA ## 11 Race Other/Multi 49 0.06020 NA ## 12 Education < 9 58 0.07125 0 ## 13 Education 9-11 93 0.11425 NA ## 14 Education HS/GED 205 0.25184 NA ## 15 Education Some college/AA 273 0.33538 NA ## 16 Education College grad 185 0.22727 NA ## 17 Income <$25,000 211 0.25921     0
## 18    Income $25,000 to <$55,000 259 0.31818    NA
## 19    Income             $55,000+ 344 0.42260 NA # MI analysis - Use original dataset with all cases TABLE1_CONT_MI <- rbind(summarize_cont(nhanes$BPXSY1,   "SBP"),
summarize_cont(nhanes$RIDAGEYR, "Age")) TABLE1_CAT_MI <- rbind(summarize_cat(nhanes$smoker,   "Smoker"),
summarize_cat(nhanes$RIAGENDR, "Gender"), summarize_cat(nhanes$RIDRETH3, "Race"),
summarize_cat(nhanes$DMDEDUC2, "Education"), summarize_cat(nhanes$income,   "Income"))

TABLE1_CONT_MI
##     x center spread min max nmiss   stats
## 1 SBP 126.76  19.86  88 216   182 mean/sd
## 2 Age  50.49  18.94  18  80     0 mean/sd
TABLE1_CAT_MI
##            x                level   n       p nmiss
## 1     Smoker                Never 680 0.58070     0
## 2     Smoker                 Past 282 0.24082    NA
## 3     Smoker              Current 209 0.17848    NA
## 4     Gender                 Male 578 0.49360     0
## 5     Gender               Female 593 0.50640    NA
## 6       Race     Mexican American 173 0.14774     0
## 7       Race       Other Hispanic 105 0.08967    NA
## 8       Race   Non-Hispanic White 396 0.33817    NA
## 9       Race   Non-Hispanic Black 257 0.21947    NA
## 10      Race   Non-Hispanic Asian 168 0.14347    NA
## 11      Race          Other/Multi  72 0.06149    NA
## 12 Education                  < 9  98 0.08845    63
## 13 Education                 9-11 133 0.12004    NA
## 14 Education               HS/GED 273 0.24639    NA
## 15 Education      Some college/AA 352 0.31769    NA
## 16 Education         College grad 252 0.22744    NA
## 17    Income            < $25,000 271 0.27154 173 ## 18 Income$25,000 to < $55,000 319 0.31964 NA ## 19 Income$55,000+ 408 0.40882    NA

### 3.3.4 By outcome or exposure

In many published articles, the descriptives are presented not only “overall” (over the entire sample) but also by the outcome or exposure. If the “by” variable is continuous, then for the purpose of the descriptive table, it is converted to a categorical variable so there can be a column of descriptive statistics for each level of the by variable. A common method, which is demonstrated here, is to use a median split in which a binary variable is created based on whether the value of the continuous variable is below or at least as large as the median value.

Let’s summarize the variables by systolic blood pressure, using a median split to create a binary “by” variable. For simplicity, this will just be demonstrated for the complete data. Essentially, you create a dataset for each level of the by variable (complete.dat0 and complete.dat1), and compute descriptive statistics for each dataset.

MEDIAN <- median(complete.dat$BPXSY1) MEDIAN ##  124 require(tidyverse) complete.dat0 <- complete.dat %>% filter(BPXSY1 < MEDIAN) complete.dat1 <- complete.dat %>% filter(BPXSY1 >= MEDIAN) nrow(complete.dat) ##  814 nrow(complete.dat0) ##  376 nrow(complete.dat1) ##  438 TABLE1_CONT0 <- rbind(summarize_cont(complete.dat0$BPXSY1,   "SBP"),
summarize_cont(complete.dat0$RIDAGEYR, "Age")) TABLE1_CAT0 <- rbind(summarize_cat(complete.dat0$smoker,   "Smoker"),
summarize_cat(complete.dat0$RIAGENDR, "Gender"), summarize_cat(complete.dat0$RIDRETH3, "Race"),
summarize_cat(complete.dat0$DMDEDUC2, "Education"), summarize_cat(complete.dat0$income,   "Income"))
TABLE1_CONT1 <- rbind(summarize_cont(complete.dat1$BPXSY1, "SBP"), summarize_cont(complete.dat1$RIDAGEYR, "Age"))
TABLE1_CAT1 <- rbind(summarize_cat(complete.dat1$smoker, "Smoker"), summarize_cat(complete.dat1$RIAGENDR, "Gender"),
summarize_cat(complete.dat1$RIDRETH3, "Race"), summarize_cat(complete.dat1$DMDEDUC2, "Education"),
summarize_cat(complete.dat1$income, "Income")) # Merge - but do not use a join function or merge() as that might put things in the wrong order # Assuming you created all the above in the same row-ordering, the following will work # The minus signs tell R to drop the columns that are in common from the latter 2 tables TABLE1_CONT_ALL <- cbind(TABLE1_CONT, TABLE1_CONT0[,-1], TABLE1_CONT1[,-1]) TABLE1_CAT_ALL <- cbind(TABLE1_CAT, TABLE1_CAT0[,-(1:2)], TABLE1_CAT1[,-(1:2)]) TABLE1_CONT_ALL ## x center spread min max nmiss stats center ## 1 SBP 127.35 19.97 88 216 0 mean/sd 111.2 ## 2 Age 51.99 17.88 20 80 0 mean/sd 44.0 ## spread min max nmiss stats center spread min max ## 1 7.605 88 122 0 mean/sd 141.21 16.58 124 216 ## 2 15.989 20 80 0 mean/sd 58.84 16.55 20 80 ## nmiss stats ## 1 0 mean/sd ## 2 0 mean/sd TABLE1_CAT_ALL ## x level n p nmiss ## 1 Smoker Never 442 0.54300 0 ## 2 Smoker Past 219 0.26904 NA ## 3 Smoker Current 153 0.18796 NA ## 4 Gender Male 409 0.50246 0 ## 5 Gender Female 405 0.49754 NA ## 6 Race Mexican American 118 0.14496 0 ## 7 Race Other Hispanic 64 0.07862 NA ## 8 Race Non-Hispanic White 297 0.36486 NA ## 9 Race Non-Hispanic Black 174 0.21376 NA ## 10 Race Non-Hispanic Asian 112 0.13759 NA ## 11 Race Other/Multi 49 0.06020 NA ## 12 Education < 9 58 0.07125 0 ## 13 Education 9-11 93 0.11425 NA ## 14 Education HS/GED 205 0.25184 NA ## 15 Education Some college/AA 273 0.33538 NA ## 16 Education College grad 185 0.22727 NA ## 17 Income <$25,000 211 0.25921     0
## 18    Income $25,000 to <$55,000 259 0.31818    NA
## 19    Income             $55,000+ 344 0.42260 NA ## n p nmiss n p nmiss ## 1 207 0.55053 0 235 0.53653 0 ## 2 87 0.23138 NA 132 0.30137 NA ## 3 82 0.21809 NA 71 0.16210 NA ## 4 175 0.46543 0 234 0.53425 0 ## 5 201 0.53457 NA 204 0.46575 NA ## 6 63 0.16755 0 55 0.12557 0 ## 7 28 0.07447 NA 36 0.08219 NA ## 8 137 0.36436 NA 160 0.36530 NA ## 9 63 0.16755 NA 111 0.25342 NA ## 10 63 0.16755 NA 49 0.11187 NA ## 11 22 0.05851 NA 27 0.06164 NA ## 12 24 0.06383 0 34 0.07763 0 ## 13 33 0.08777 NA 60 0.13699 NA ## 14 92 0.24468 NA 113 0.25799 NA ## 15 134 0.35638 NA 139 0.31735 NA ## 16 93 0.24734 NA 92 0.21005 NA ## 17 83 0.22074 0 128 0.29224 0 ## 18 110 0.29255 NA 149 0.34018 NA ## 19 183 0.48670 NA 161 0.36758 NA ### 3.3.5 Formatting While these tables have all the information you want, they are not formatted nicely for inclusion in a publication. There are many options for how to get from an R table to a publication-ready table. A method that the author has used often is to write the R table to a CSV file, use a spreadsheet program to format the results (e.g., rounding, concatenating), paste the formatted table into a word processing document, and format the table (headers, lines, etc.) in the word processing program. The code below assumes you have a folder called “Results” in the same location as your R project. write.csv(TABLE1_CONT_ALL, "Results/table1_cont_all.csv") write.csv(TABLE1_CAT_ALL, "Results/table1_cat_all.csv") Tables 3.1 and 3.2 are nicely formatted “Table 1”s (for categorical and continuous variables, respectively) after splitting by systolic blood pressure. Table 3.1: Descriptive statistics for categorical variables, overall and by systolic blood pressure (based on median split) Systolic Blood Pressure (mmHg) Overall N = 814 < 124 N = 376 (46.2%) >= 124 N = 438 (53.8%) Variable Level N % N % N % Smoking status Never 442 54.3 207 55.1 235 53.7 Past 219 26.9 87 23.1 132 30.1 Current 153 18.8 82 21.8 71 16.2 Gender Male 409 50.2 175 46.5 234 53.4 Female 405 49.8 201 53.5 204 46.6 Race/ethnicity Mexican American 118 14.5 63 16.8 55 12.6 Other Hispanic 64 7.9 28 7.4 36 8.2 Non-Hispanic White 297 36.5 137 36.4 160 36.5 Non-Hispanic Black 174 21.4 63 16.8 111 25.3 Non-Hispanic Asian 112 13.8 63 16.8 49 11.2 Other/Multi 49 6.0 22 5.9 27 6.2 Education < 9 58 7.1 24 6.4 34 7.8 9-11 93 11.4 33 8.8 60 13.7 HS/GED 205 25.2 92 24.5 113 25.8 Some college/AA 273 33.5 134 35.6 139 31.7 College grad 185 22.7 93 24.7 92 21.0 Income <$25,000 211 25.9 83 22.1 128 29.2
$25,000 to <$55,000 259 31.8 110 29.3 149 34.0
\$55,000+ 344 42.3 183 48.7 161 36.8
Table 3.2: Descriptive statistics for continuous variables, overall and by systolic blood pressure (based on median split)
Systolic Blood Pressure (mmHg)
Overall
N = 814
< 124
N = 376 (46.2%)
>= 124
N = 438 (53.8%)
Variable Mean SD Mean SD Mean SD
SBP (mmHg) 127.3 20.0 111.2 7.6 141.2 16.6
Age (years) 52.0 17.9 44.0 16.0 58.8 16.5

## 3.4 To Do

• Not all levels are showing in the x-axis for barplots