5.3 Complete case analysis dataset

We will use a complete case analysis in this chapter, that is, we exclude all individuals (cases) that have a missing value for any of the analysis variables (see Section 3.2.1). Complete case analysis is already the default for the lm() function. However, if we want all our results, in particular the descriptive statistics in our “Table 1”, to be based on the same sample used to fit the regression model, then we must explicitly remove individuals with missing values ahead of time. See Chapter 9 for how to handle missing data using multiple imputation.
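To illustrate why the samples can differ, here is a minimal sketch using a small made-up data frame (the object toy and its variables are hypothetical, not part of NHANES). It assumes R's default setting na.action = na.omit, under which lm() silently drops rows with a missing value in any model variable, while a descriptive statistic computed on the original data still uses every row.

```r
# Hypothetical toy data: one missing value in x and one in z
toy <- data.frame(
  y = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8),
  x = c(1, 2, NA, 4, 5, 6),
  z = c(0, 1, 0, 1, NA, 1)
)

# lm() drops incomplete rows (assuming the default na.action = na.omit)
fit <- lm(y ~ x + z, data = toy)

nrow(toy)    # number of rows in the dataset
nobs(fit)    # number of rows actually used to fit the model
mean(toy$y)  # descriptive statistic based on all rows, not the model's sample
```

Comparing nrow() to nobs() is a quick way to check whether a fitted model used fewer observations than the dataset contains.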

Example 5.1 (continued): Use summary() to assess the extent of missing data in the variables in our analysis and create a complete case analysis dataset.

load("Data/nhanes1718_adult_fast_sub_rmph.Rdata")
nhanesf <- nhanes_adult_fast_sub
rm(nhanes_adult_fast_sub)

nhanesf %>%
  select(LBDGLUSI, BMXWAIST, smoker, RIDAGEYR,
         RIAGENDR, RIDRETH3, income) %>% 
  summary()
##     LBDGLUSI        BMXWAIST         smoker       RIDAGEYR      RIAGENDR                 RIDRETH3  
##  Min.   : 2.61   Min.   : 63.2   Never  :579   Min.   :20.0   Male  :457   Mexican American  :120  
##  1st Qu.: 5.33   1st Qu.: 88.3   Past   :264   1st Qu.:34.0   Female:543   Other Hispanic    : 71  
##  Median : 5.72   Median : 97.8   Current:157   Median :47.0                Non-Hispanic White:602  
##  Mean   : 6.09   Mean   :100.5                 Mean   :47.9                Non-Hispanic Black:115  
##  3rd Qu.: 6.22   3rd Qu.:111.0                 3rd Qu.:61.0                Non-Hispanic Asian: 48  
##  Max.   :19.00   Max.   :169.5                 Max.   :80.0                Other/Multi       : 44  
##                  NA's   :35                                                                        
##                   income   
##  < $25,000           :164  
##  $25,000 to < $55,000:224  
##  $55,000+            :489  
##  NA's                :123  
##                            
##                            
## 
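The summary() output reports missing values one variable at a time. A more direct tally, including how many rows are missing more than one variable, could be sketched as follows. The data frame toy_miss and its values are made up for illustration; the same base R calls could be applied to the selected NHANES variables.

```r
# Hypothetical toy data with overlapping missingness
toy_miss <- data.frame(
  waist  = c(90, NA, 101, NA, 110, 95),
  income = c("low", "high", NA, NA, "high", "low")
)

# Number of missing values in each variable
n_missing_by_var <- colSums(is.na(toy_miss))
n_missing_by_var

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(toy_miss))
n_incomplete

# Cross-tabulate missingness to see how many rows are missing both variables
table(waist_missing  = is.na(toy_miss$waist),
      income_missing = is.na(toy_miss$income))
```

Because a row can be missing more than one variable, the number of incomplete rows can be smaller than the sum of the per-variable missing counts.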

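Waist circumference (35 missing) and income (123 missing) have missing values. The code below creates a complete case dataset by first using select() to keep only the analysis variables and then using drop_na() to remove every row that has a missing value in any of the remaining variables. Note that this approach removes all other variables from our dataset. See Section 3.2.1 for an alternative method that retains all the other variables.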

nhanesf.complete <- nhanesf %>% 
  select(LBDGLUSI, BMXWAIST, smoker, RIDAGEYR,
         RIAGENDR, RIDRETH3, income) %>% 
  drop_na()
nrow(nhanesf)
## [1] 1000
nrow(nhanesf.complete)
## [1] 857

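While the full dataset has 1000 observations, the complete case dataset has only 857, so 143 individuals (14.3%) were excluded. This is fewer than the 35 + 123 = 158 missing values reported by summary() because 15 individuals were missing both waist circumference and income.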
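If the other variables in the dataset need to be kept, one possible approach (not necessarily the one described in Section 3.2.1) is to pass the analysis variables directly to drop_na() rather than calling select() first. The sketch below uses a made-up tibble; toy_cc and its column names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: y and x are analysis variables, other is not
toy_cc <- tibble(
  id    = 1:5,
  y     = c(5.1, 6.0, NA, 5.5, 7.2),
  x     = c(90, NA, 101, 88, 110),
  other = c("a", "b", "c", NA, "e")
)

# Approach used in this section: keep only the analysis variables,
# then drop rows with any missing value
toy_a <- toy_cc %>%
  select(y, x) %>%
  drop_na()

# Alternative: keep all columns, drop rows missing y or x only
toy_b <- toy_cc %>%
  drop_na(y, x)

dim(toy_a)
dim(toy_b)
```

Both versions keep the same rows, but the second retains id and other, and a missing value in other does not cause a row to be dropped.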