3.7 Set values to missing

In our modified dataset, the variable CDAI has some observations with a value of 999. In many public datasets, 999 is used as a missing value code and should be set to missing. Assume here that 999 is a missing value code. In general, refer to the documentation for the dataset you are working with to determine which values are missing value codes and which are valid values.

Failure to set missing value codes to missing can seriously bias an analysis. For example, if you are computing the mean of a variable that ranges from 0 to 100, then including missing value codes of 999 in the calculation will bias the mean upward.
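As a quick illustration of the size of this bias, the sketch below uses a small made-up vector (score is hypothetical and not part of mydat) in which a single 999 code is mixed in with four valid values.

```r
# Hypothetical scores on a 0-100 scale; 999 is a missing value code
score <- c(10, 20, 30, 40, 999)

# Treating 999 as a real value inflates the mean
mean(score)
## [1] 219.8

# Recode 999 to NA, then exclude NA values when averaging
score[score == 999] <- NA
mean(score, na.rm = TRUE)
## [1] 25
```

A single miscoded observation out of five moves the mean from 25 to about 220, far outside the valid range of the variable.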

# Examine the variable values
table(mydat$CDAI, exclude = NULL) # 320 missing values, 5 999s
## 
##    0    1  1.5    2    3    4    5    6    7  7.5    8  8.5    9  9.5   10   11   12 
##   14    4    1    7    4    8    8   12    8    1   15    1   15    1   11    5    7 
##   13 13.5   14 14.5   15   16   17   18   19   20   21   22   23   24   25   28   31 
##    5    1    8    1    4    6    8    4    6    5    3    3    3    2    1    3    1 
##   32   34   35   36   37   39 39.5   40 40.5   41   51   71  999 <NA> 
##    2    1    2    1    2    4    1    2    1    1    1    1    5  320

# Set 999 to NA
mydat$CDAI[mydat$CDAI == 999] <- NA

# Examine the results
table(mydat$CDAI, exclude = NULL) # 325 missing values, no 999s
## 
##    0    1  1.5    2    3    4    5    6    7  7.5    8  8.5    9  9.5   10   11   12 
##   14    4    1    7    4    8    8   12    8    1   15    1   15    1   11    5    7 
##   13 13.5   14 14.5   15   16   17   18   19   20   21   22   23   24   25   28   31 
##    5    1    8    1    4    6    8    4    6    5    3    3    3    2    1    3    1 
##   32   34   35   36   37   39 39.5   40 40.5   41   51   71 <NA> 
##    2    1    2    1    2    4    1    2    1    1    1    1  325

The five 999 values are now NA, which raises the number of missing values from 320 to 325.

A logical condition using == handles a single missing value code. If you have multiple missing value codes, use %in% followed by the vector of values you want to replace with NA. For example, if CDAI had the missing value codes 777, 888, and 999, you would use the following:

mydat$CDAI[mydat$CDAI %in% c(777, 888, 999)] <- NA
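The two operators also differ in how they treat values that are already missing. The sketch below uses a made-up vector x (not the CDAI variable) to show that == returns NA where the value is NA, while %in% always returns TRUE or FALSE.

```r
# Hypothetical vector with three missing value codes and one existing NA
x <- c(5, 777, NA, 999, 12, 888)

# == compares against one code and returns NA for the existing NA
x == 999
## [1] FALSE FALSE    NA  TRUE FALSE FALSE

# %in% checks against several codes at once and never returns NA
x %in% c(777, 888, 999)
## [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

# Replace all three codes with NA in one step
x[x %in% c(777, 888, 999)] <- NA
x
## [1]  5 NA NA NA 12 NA
```

Both forms work for assigning a single NA as shown in this section. The difference matters mostly if the same logical vector is reused to subset rows, where an NA in the index produces an NA row.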

In the tidyverse, use na_if() to convert a single value to NA and case_when() to convert multiple values:

# Use print to view more than just a few rows
mydat_tibble %>% 
  count(CDAI) %>% 
  print(n = Inf) # 320 missing values, 5 999s

# Use na_if() to set a single value to missing
mydat_tibble <- mydat_tibble %>% 
  mutate(CDAI = na_if(CDAI, 999))

mydat_tibble %>% 
  count(CDAI) %>% 
  print(n = Inf) # 325 missing values, no 999s

# Use case_when() to set multiple values to missing
mydat_tibble <- mydat_tibble %>% 
  mutate(CDAI = case_when(CDAI %in% c(777, 888, 999) ~ as.double(NA),
                          TRUE                       ~ CDAI))

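NOTE: The first condition in case_when() sets the matching values to NA. In dplyr versions before 1.1.0, every right-hand side of case_when() must have the same type, so the NA has to be given the type of the variable, which is usually double or character. If you get a type error, try a different type, such as as.integer(NA) if the variable is integer or as.logical(NA) if the variable contains TRUE and FALSE values. The typed constants NA_real_, NA_integer_, and NA_character_ are equivalent shorthand. In dplyr 1.1.0 and later, a plain NA is also accepted because the right-hand sides are cast to a common type.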
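To make the point about types concrete, here is a minimal sketch using a made-up tibble (toy, score, and site are hypothetical names, not part of the book's data). It recodes a double column and a character column, pairing each with the NA constant of the matching type.

```r
library(dplyr)

# Hypothetical data: a numeric score and a character site label
toy <- tibble(
  score = c(12, 999, 30, 888),
  site  = c("A", "unknown", "B", "N/A")
)

toy <- toy %>%
  mutate(
    # Double column: use NA_real_ (same as as.double(NA))
    score = case_when(score %in% c(888, 999) ~ NA_real_,
                      TRUE                   ~ score),
    # Character column: use NA_character_ (same as as.character(NA))
    site  = case_when(site %in% c("unknown", "N/A") ~ NA_character_,
                      TRUE                          ~ site)
  )

toy
```

Typed NA constants work in both older and newer versions of dplyr, so they are a safe default when the installed version is unknown.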

An alternative is the replace_with_na() function in the naniar package (Tierney et al. 2021). See Replacing values with NA (Tierney 2021) for more information.
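A minimal sketch of that approach is shown below, assuming naniar is installed and using a made-up data frame (toy is hypothetical). The replace argument takes a named list that maps each variable to the values that should become NA.

```r
library(naniar)

# Hypothetical data frame with three missing value codes in CDAI
toy <- data.frame(id = 1:5, CDAI = c(10, 777, 25, 888, 999))

# Replace the listed codes with NA in the CDAI column only
toy_clean <- replace_with_na(toy, replace = list(CDAI = c(777, 888, 999)))

# Check the recoded column
toy_clean$CDAI
```

Because the list is named by variable, different variables can be given different missing value codes in a single call. The vignette cited above covers the related scoped variants.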

References

Tierney, Nicholas. 2021. “Replacing Values with NA.” https://cran.r-project.org/web/packages/naniar/vignettes/replace-with-na.html.
Tierney, Nicholas, Di Cook, Miles McBain, and Colin Fay. 2021. Naniar: Data Structures, Summaries, and Visualisations for Missing Data. https://CRAN.R-project.org/package=naniar.