3.7 Set values to missing
In our modified dataset, the variable CDAI has some observations with a value of 999. In many public datasets, 999 is used as a missing value code and should be set to missing. Assume here that 999 is a missing value code. In general, refer to the documentation for the dataset you are working with to determine which values are missing value codes and which are valid values.
Failure to set missing value codes to missing can seriously bias an analysis. For example, if you are computing the mean of a variable that ranges from 0 to 100, then including missing value codes with values of 999 will upwardly bias your results.
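To make the size of this bias concrete, here is a small sketch using a made-up vector (the values in scores are hypothetical and are not taken from the CDAI data) in which two 999 codes are mixed with valid values between 0 and 100.

```r
# Hypothetical scores on a 0-100 scale; 999 is a missing value code
scores <- c(10, 20, 30, 40, 999, 999)

# Naive mean treats the two 999s as real measurements
naive_mean <- mean(scores)
naive_mean

# Recode 999 to NA, then exclude missing values when computing the mean
scores_clean <- scores
scores_clean[scores_clean == 999] <- NA
clean_mean <- mean(scores_clean, na.rm = TRUE)
clean_mean
```

In this toy example the naive mean is about 349.7, which is well outside the valid range of the variable, while the mean after recoding is 25.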
# Examine the values of CDAI, including missing values
table(mydat$CDAI, useNA = "ifany") # 320 missing values, 5 999s
##
##    0    1  1.5    2    3    4    5    6    7  7.5    8  8.5    9  9.5   10   11   12
##   14    4    1    7    4    8    8   12    8    1   15    1   15    1   11    5    7
##   13 13.5   14 14.5   15   16   17   18   19   20   21   22   23   24   25   28   31
##    5    1    8    1    4    6    8    4    6    5    3    3    3    2    1    3    1
##   32   34   35   36   37   39 39.5   40 40.5   41   51   71  999 <NA>
##    2    1    2    1    2    4    1    2    1    1    1    1    5  320
# Set 999 to NA
mydat$CDAI[mydat$CDAI == 999] <- NA
# Examine the results
table(mydat$CDAI, useNA = "ifany") # 325 missing values, no 999s
##
##    0    1  1.5    2    3    4    5    6    7  7.5    8  8.5    9  9.5   10   11   12
##   14    4    1    7    4    8    8   12    8    1   15    1   15    1   11    5    7
##   13 13.5   14 14.5   15   16   17   18   19   20   21   22   23   24   25   28   31
##    5    1    8    1    4    6    8    4    6    5    3    3    3    2    1    3    1
##   32   34   35   36   37   39 39.5   40 40.5   41   51   71 <NA>
##    2    1    2    1    2    4    1    2    1    1    1    1  325
We can see that the five 999 values are now NA.
A logical condition using == will change one value to NA. If you have multiple missing value codes, use %in% followed by the vector of values you want to replace with NA. For example, if CDAI had multiple missing value codes 777, 888, and 999, you would replace the == 999 condition with %in% c(777, 888, 999).
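A minimal sketch of the multiple-code case, assuming 777, 888, and 999 are all missing value codes (a hypothetical coding scheme). The toy vector x and the name na_codes are illustrative stand-ins, with x playing the role of mydat$CDAI:

```r
# Toy vector standing in for mydat$CDAI (values are made up)
x <- c(5, 777, 12, 888, NA, 999, 40)

# Hypothetical missing value codes
na_codes <- c(777, 888, 999)

# %in% returns TRUE or FALSE (never NA), so existing NAs are left untouched
x[x %in% na_codes] <- NA

# Examine the results
table(x, useNA = "ifany")
```

Applied to the dataset, the same pattern would be mydat$CDAI[mydat$CDAI %in% c(777, 888, 999)] <- NA.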
In the tidyverse, use na_if() to convert one value to NA and case_when() for multiple values:
# Use print to view more than just a few rows
mydat_tibble %>%
  count(CDAI) %>%
  print(n = Inf) # 320 missing values, 5 999s

# Use na_if() to set a single value to missing
mydat_tibble <- mydat_tibble %>%
  mutate(CDAI = na_if(CDAI, 999))

mydat_tibble %>%
  count(CDAI) %>%
  print(n = Inf) # 325 missing values, no 999s

# Use case_when() to set multiple values to missing
mydat_tibble <- mydat_tibble %>%
  mutate(CDAI = case_when(CDAI %in% c(777, 888, 999) ~ as.double(NA),
                          TRUE ~ CDAI))
NOTE: The first condition in case_when() sets the values to NA, but you have to specify the type of NA, which is usually double or character. If you get an error, try a different type, such as as.integer() if the variable is integer or as.logical() if the variable contains TRUE and FALSE values.
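As a sketch of the type matching described in the note, base R provides typed missing constants (NA_real_, NA_integer_, NA_character_) that are equivalent to as.double(NA), as.integer(NA), and as.character(NA). Checking a column's type with typeof() suggests which one to use. The vectors below are made up for illustration, and replace() is used here only as a base R stand-in for the case_when() step.

```r
# Made-up columns of different types; 999 and "unknown" are hypothetical codes
dbl_col <- c(1.5, 999, 3)
int_col <- c(1L, 999L, 3L)
chr_col <- c("a", "unknown", "c")

# typeof() shows which typed NA matches each column
typeof(dbl_col) # "double"
typeof(int_col) # "integer"
typeof(chr_col) # "character"

# The typed NA constants are the same objects as as.<type>(NA)
same_dbl <- identical(NA_real_, as.double(NA))
same_int <- identical(NA_integer_, as.integer(NA))
same_chr <- identical(NA_character_, as.character(NA))

# Replacing a code with the matching typed NA preserves the column type
int_clean <- replace(int_col, int_col == 999L, NA_integer_)
chr_clean <- replace(chr_col, chr_col == "unknown", NA_character_)
typeof(int_clean) # "integer"
typeof(chr_clean) # "character"
```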
An alternative is the replace_with_na() function in the naniar package (Tierney et al. 2021). See Replacing values with NA (Tierney 2021) for more information.