4.3 Exclude observations with missing data
Many analyses use what is known as a complete case analysis in which you filter the dataset to only include observations with no missing values on any variable in your analysis. In base R, use na.omit()
to remove all observations with missing data on ANY variable in the dataset, or use subset()
to filter out cases that are missing on a subset of variables. An alternative to na.omit()
is na.exclude()
. They are slightly different in some special circumstances. See ?na.exclude
for more information.
## ID Age AgeGp Sex Yrs_From_Dx
## Min. : 1 Min. :42.0 Min. :1.0 Length:530 Min. : 0.0
## 1st Qu.:162 1st Qu.:54.0 1st Qu.:2.0 Class :character 1st Qu.: 2.0
## Median :294 Median :59.0 Median :2.0 Mode :character Median : 6.0
## Mean :291 Mean :60.7 Mean :2.6 Mean : 8.4
## 3rd Qu.:427 3rd Qu.:66.0 3rd Qu.:3.0 3rd Qu.:10.0
## Max. :559 Max. :90.0 Max. :5.0 Max. :69.0
## NA's :15
## CDAI CDAI_YN DAS_28 DAS28_YN Steroids_GT_5
## Min. : 0.0 Min. :1.00 Min. : 0.0 Min. :1.00 Min. :0.000
## 1st Qu.: 6.0 1st Qu.:1.00 1st Qu.: 1.8 1st Qu.:1.00 1st Qu.:0.000
## Median : 10.0 Median :1.00 Median : 2.5 Median :1.00 Median :0.000
## Mean : 36.6 Mean :1.39 Mean : 2.9 Mean :1.12 Mean :0.234
## 3rd Qu.: 18.0 3rd Qu.:2.00 3rd Qu.: 3.3 3rd Qu.:1.00 3rd Qu.:0.000
## Max. :999.0 Max. :2.00 Max. :23.0 Max. :2.00 Max. :1.000
## NA's :320 NA's :464 NA's :1
## DMARDs Biologics sDMARDS OsteopScreen FIPS
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000 Min. : 1043
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:17153
## Median :1.000 Median :0.000 Median :0.000 Median :1.000 Median :26149
## Mean :0.718 Mean :0.372 Mean :0.051 Mean :0.586 Mean :28678
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:41025
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.000 Max. :55073
## NA's :1 NA's :1 NA's :1 NA's :8
## [1] 530
# Exclude rows that have missing data in ANY variable
mydat_no_NA <- na.omit(mydat)
summary(mydat_no_NA)
## ID Age AgeGp Sex Yrs_From_Dx
## Min. :102 Min. :43.0 Min. :1.00 Length:13 Min. : 0.00
## 1st Qu.:109 1st Qu.:51.0 1st Qu.:2.00 Class :character 1st Qu.: 4.00
## Median :138 Median :56.0 Median :2.00 Mode :character Median : 7.00
## Mean :192 Mean :56.3 Mean :2.23 Mean : 7.85
## 3rd Qu.:199 3rd Qu.:59.0 3rd Qu.:2.00 3rd Qu.: 8.00
## Max. :481 Max. :70.0 Max. :4.00 Max. :24.00
## CDAI CDAI_YN DAS_28 DAS28_YN Steroids_GT_5
## Min. : 0.00 Min. :2 Min. :0.77 Min. :2 Min. :0.000
## 1st Qu.: 0.00 1st Qu.:2 1st Qu.:1.30 1st Qu.:2 1st Qu.:0.000
## Median : 2.00 Median :2 Median :2.30 Median :2 Median :0.000
## Mean : 3.58 Mean :2 Mean :2.16 Mean :2 Mean :0.231
## 3rd Qu.: 5.00 3rd Qu.:2 3rd Qu.:2.52 3rd Qu.:2 3rd Qu.:0.000
## Max. :14.00 Max. :2 Max. :4.39 Max. :2 Max. :1.000
## DMARDs Biologics sDMARDS OsteopScreen FIPS
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.000 Min. : 5111
## 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:13065
## Median :1.000 Median :0.000 Median :0.0000 Median :1.000 Median :17135
## Mean :0.769 Mean :0.385 Mean :0.0769 Mean :0.538 Mean :21111
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:26149
## Max. :1.000 Max. :1.000 Max. :1.0000 Max. :1.000 Max. :47187
## [1] 13
Using na.omit()
left us with only 13 rows out of 530! In this example, missing values for CDAI
and DAS_28
indicate that someone was not given a test. Let’s not exclude those cases. To exclude observations that have missing data for only a subset of variables, create a logical vector that is a series of calls to is.na()
. Include all the variables that have NA
values you want to exclude.
MISSING <- is.na(mydat$Yrs_From_Dx) |
is.na(mydat$Steroids_GT_5) |
is.na(mydat$DMARDs) |
is.na(mydat$Biologics) |
is.na(mydat$sDMARDS) |
is.na(mydat$OsteopScreen)
# Count the number of rows flagged for deletion
sum(MISSING)
## [1] 21
# Use ! to include those that are NOT missing
mydat_no_NA <- subset(mydat,
subset = !MISSING)
# Count the number of rows kept
nrow(mydat_no_NA)
## [1] 509
## ID Age AgeGp Sex Yrs_From_Dx
## Min. : 1 Min. :42.0 Min. :1.00 Length:509 Min. : 0.00
## 1st Qu.:171 1st Qu.:53.0 1st Qu.:2.00 Class :character 1st Qu.: 2.00
## Median :299 Median :59.0 Median :2.00 Mode :character Median : 6.00
## Mean :296 Mean :60.1 Mean :2.55 Mean : 8.42
## 3rd Qu.:428 3rd Qu.:65.0 3rd Qu.:3.00 3rd Qu.:10.00
## Max. :559 Max. :90.0 Max. :5.00 Max. :69.00
##
## CDAI CDAI_YN DAS_28 DAS28_YN Steroids_GT_5
## Min. : 0.0 Min. :1.0 Min. : 0.0 Min. :1.00 Min. :0.00
## 1st Qu.: 6.0 1st Qu.:1.0 1st Qu.: 1.8 1st Qu.:1.00 1st Qu.:0.00
## Median : 10.0 Median :1.0 Median : 2.5 Median :1.00 Median :0.00
## Mean : 37.1 Mean :1.4 Mean : 2.9 Mean :1.13 Mean :0.24
## 3rd Qu.: 18.0 3rd Qu.:2.0 3rd Qu.: 3.3 3rd Qu.:1.00 3rd Qu.:0.00
## Max. :999.0 Max. :2.0 Max. :23.0 Max. :2.00 Max. :1.00
## NA's :303 NA's :443
## DMARDs Biologics sDMARDS OsteopScreen FIPS
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.000 Min. : 1043
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:17153
## Median :1.000 Median :0.000 Median :0.0000 Median :1.000 Median :26149
## Mean :0.731 Mean :0.383 Mean :0.0511 Mean :0.586 Mean :28548
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:40025
## Max. :1.000 Max. :1.000 Max. :1.0000 Max. :1.000 Max. :55073
##
Now only CDAI and DAS_28 have NA
values (since we listed all the other variables that had NA
values in MISSING
).
The tidyverse function drop_na()
is easier to use as you can omit cases with missing data on all variables or a subset.
# Exclude rows that have missing data in ANY variable
mydat_tibble_no_NA <- mydat_tibble %>%
drop_na()
summary(mydat_tibble_no_NA)
nrow(mydat_tibble_no_NA)
# Exclude rows that have missing data in a SUBSET of variables
mydat_tibble_no_NA <- mydat_tibble %>%
drop_na(Yrs_From_Dx, Steroids_GT_5, DMARDs,
Biologics, sDMARDS, OsteopScreen)
summary(mydat_tibble_no_NA)
nrow(mydat_tibble_no_NA)