4.3 Exclude observations with missing data

Many analyses use what is known as a complete case analysis in which you filter the dataset to only include observations with no missing values on any variable in your analysis. In base R, use na.omit() to remove all observations with missing data on ANY variable in the dataset, or use subset() to filter out cases that are missing on a subset of variables. An alternative to na.omit() is na.exclude(). They are slightly different in some special circumstances. See ?na.exclude for more information.

# Use summary() to see which variables have missing (NA) values
summary(mydat)
##        ID           Age           AgeGp         Sex             Yrs_From_Dx  
##  Min.   :  1   Min.   :42.0   Min.   :1.0   Length:530         Min.   : 0.0  
##  1st Qu.:162   1st Qu.:54.0   1st Qu.:2.0   Class :character   1st Qu.: 2.0  
##  Median :294   Median :59.0   Median :2.0   Mode  :character   Median : 6.0  
##  Mean   :291   Mean   :60.7   Mean   :2.6                      Mean   : 8.4  
##  3rd Qu.:427   3rd Qu.:66.0   3rd Qu.:3.0                      3rd Qu.:10.0  
##  Max.   :559   Max.   :90.0   Max.   :5.0                      Max.   :69.0  
##                                                                NA's   :15    
##       CDAI          CDAI_YN         DAS_28        DAS28_YN    Steroids_GT_5  
##  Min.   :  0.0   Min.   :1.00   Min.   : 0.0   Min.   :1.00   Min.   :0.000  
##  1st Qu.:  6.0   1st Qu.:1.00   1st Qu.: 1.8   1st Qu.:1.00   1st Qu.:0.000  
##  Median : 10.0   Median :1.00   Median : 2.5   Median :1.00   Median :0.000  
##  Mean   : 36.6   Mean   :1.39   Mean   : 2.9   Mean   :1.12   Mean   :0.234  
##  3rd Qu.: 18.0   3rd Qu.:2.00   3rd Qu.: 3.3   3rd Qu.:1.00   3rd Qu.:0.000  
##  Max.   :999.0   Max.   :2.00   Max.   :23.0   Max.   :2.00   Max.   :1.000  
##  NA's   :320                    NA's   :464                   NA's   :1      
##      DMARDs        Biologics        sDMARDS       OsteopScreen        FIPS      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   : 1043  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:17153  
##  Median :1.000   Median :0.000   Median :0.000   Median :1.000   Median :26149  
##  Mean   :0.718   Mean   :0.372   Mean   :0.051   Mean   :0.586   Mean   :28678  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:41025  
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :55073  
##  NA's   :1       NA's   :1       NA's   :1       NA's   :8
# Number of rows in full dataset
nrow(mydat)
## [1] 530
# Exclude rows that have missing data in ANY variable
mydat_no_NA <- na.omit(mydat)
summary(mydat_no_NA)
##        ID           Age           AgeGp          Sex             Yrs_From_Dx   
##  Min.   :102   Min.   :43.0   Min.   :1.00   Length:13          Min.   : 0.00  
##  1st Qu.:109   1st Qu.:51.0   1st Qu.:2.00   Class :character   1st Qu.: 4.00  
##  Median :138   Median :56.0   Median :2.00   Mode  :character   Median : 7.00  
##  Mean   :192   Mean   :56.3   Mean   :2.23                      Mean   : 7.85  
##  3rd Qu.:199   3rd Qu.:59.0   3rd Qu.:2.00                      3rd Qu.: 8.00  
##  Max.   :481   Max.   :70.0   Max.   :4.00                      Max.   :24.00  
##       CDAI          CDAI_YN      DAS_28        DAS28_YN Steroids_GT_5  
##  Min.   : 0.00   Min.   :2   Min.   :0.77   Min.   :2   Min.   :0.000  
##  1st Qu.: 0.00   1st Qu.:2   1st Qu.:1.30   1st Qu.:2   1st Qu.:0.000  
##  Median : 2.00   Median :2   Median :2.30   Median :2   Median :0.000  
##  Mean   : 3.58   Mean   :2   Mean   :2.16   Mean   :2   Mean   :0.231  
##  3rd Qu.: 5.00   3rd Qu.:2   3rd Qu.:2.52   3rd Qu.:2   3rd Qu.:0.000  
##  Max.   :14.00   Max.   :2   Max.   :4.39   Max.   :2   Max.   :1.000  
##      DMARDs        Biologics        sDMARDS        OsteopScreen        FIPS      
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   : 5111  
##  1st Qu.:1.000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:13065  
##  Median :1.000   Median :0.000   Median :0.0000   Median :1.000   Median :17135  
##  Mean   :0.769   Mean   :0.385   Mean   :0.0769   Mean   :0.538   Mean   :21111  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:26149  
##  Max.   :1.000   Max.   :1.000   Max.   :1.0000   Max.   :1.000   Max.   :47187
nrow(mydat_no_NA)
## [1] 13

Using na.omit() left us with only 13 rows out of 530! In this example, missing values for CDAI and DAS_28 indicate that someone was not given a test. Let’s not exclude those cases. To exclude observations that have missing data for only a subset of variables, create a logical vector that is a series of calls to is.na(). Include all the variables that have NA values you want to exclude.

MISSING <- is.na(mydat$Yrs_From_Dx) |
           is.na(mydat$Steroids_GT_5) |
           is.na(mydat$DMARDs) |
           is.na(mydat$Biologics) |
           is.na(mydat$sDMARDS) |
           is.na(mydat$OsteopScreen)

# Count the number of rows flagged for deletion
sum(MISSING)
## [1] 21
# Use ! to include those that are NOT missing
mydat_no_NA <- subset(mydat, 
                      subset = !MISSING)

# Count the number of rows kept
nrow(mydat_no_NA)
## [1] 509
# Check for missing values again
summary(mydat_no_NA)
##        ID           Age           AgeGp          Sex             Yrs_From_Dx   
##  Min.   :  1   Min.   :42.0   Min.   :1.00   Length:509         Min.   : 0.00  
##  1st Qu.:171   1st Qu.:53.0   1st Qu.:2.00   Class :character   1st Qu.: 2.00  
##  Median :299   Median :59.0   Median :2.00   Mode  :character   Median : 6.00  
##  Mean   :296   Mean   :60.1   Mean   :2.55                      Mean   : 8.42  
##  3rd Qu.:428   3rd Qu.:65.0   3rd Qu.:3.00                      3rd Qu.:10.00  
##  Max.   :559   Max.   :90.0   Max.   :5.00                      Max.   :69.00  
##                                                                                
##       CDAI          CDAI_YN        DAS_28        DAS28_YN    Steroids_GT_5 
##  Min.   :  0.0   Min.   :1.0   Min.   : 0.0   Min.   :1.00   Min.   :0.00  
##  1st Qu.:  6.0   1st Qu.:1.0   1st Qu.: 1.8   1st Qu.:1.00   1st Qu.:0.00  
##  Median : 10.0   Median :1.0   Median : 2.5   Median :1.00   Median :0.00  
##  Mean   : 37.1   Mean   :1.4   Mean   : 2.9   Mean   :1.13   Mean   :0.24  
##  3rd Qu.: 18.0   3rd Qu.:2.0   3rd Qu.: 3.3   3rd Qu.:1.00   3rd Qu.:0.00  
##  Max.   :999.0   Max.   :2.0   Max.   :23.0   Max.   :2.00   Max.   :1.00  
##  NA's   :303                   NA's   :443                                 
##      DMARDs        Biologics        sDMARDS        OsteopScreen        FIPS      
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   : 1043  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:17153  
##  Median :1.000   Median :0.000   Median :0.0000   Median :1.000   Median :26149  
##  Mean   :0.731   Mean   :0.383   Mean   :0.0511   Mean   :0.586   Mean   :28548  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:40025  
##  Max.   :1.000   Max.   :1.000   Max.   :1.0000   Max.   :1.000   Max.   :55073  
## 

Now only CDAI and DAS_28 have NA values (since we listed all the other variables that had NA values in MISSING).

The tidyverse function drop_na() is easier to use as you can omit cases with missing data on all variables or a subset.

# Exclude rows that have missing data in ANY variable
mydat_tibble_no_NA <- mydat_tibble %>% 
  drop_na()

summary(mydat_tibble_no_NA)

nrow(mydat_tibble_no_NA)

# Exclude rows that have missing data in a SUBSET of variables
mydat_tibble_no_NA <- mydat_tibble %>% 
  drop_na(Yrs_From_Dx, Steroids_GT_5, DMARDs,
          Biologics, sDMARDS, OsteopScreen)

summary(mydat_tibble_no_NA)

nrow(mydat_tibble_no_NA)