3.1 Fix mis-spellings or other invalid values

You typically find and fix invalid values by listing the values a variable takes, deciding which need fixing, subsetting the variable to isolate the invalid values, and finally replacing them with valid ones.

In base R, you can subset with brackets [] and logical expressions.

# Create a copy of the original variable
mydat$Sex_orig <- mydat$Sex

# List all the possible values
table(mydat$Sex, useNA = "ifany")
## 
##      f      F female      m      M 
##      2    420      6      3     99

# Fix any mis-spellings / invalid values
mydat$Sex[mydat$Sex %in% c("f", "female")] <- "F"
mydat$Sex[mydat$Sex ==  "m"              ] <- "M"

# Compare before and after
table(mydat$Sex_orig, mydat$Sex, useNA = "ifany")
##         
##            F   M
##   f        2   0
##   F      420   0
##   female   6   0
##   m        0   3
##   M        0  99
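
The same list, isolate, replace, and compare steps can be tried on a small made-up vector. This is only a sketch: `sex`, `valid`, `invalid`, and `sex_fixed` are hypothetical names introduced for illustration and are not part of `mydat`. It uses `%in%` for every comparison because `%in%` treats a missing value as not matching, so any `NA` is simply left as missing.

```r
# Hypothetical toy data, not the mydat used above
sex   <- c("F", "f", "female", "M", "m", "F", NA)
valid <- c("F", "M")

# Isolate the values that are neither valid nor missing
invalid <- setdiff(unique(sex[!is.na(sex)]), valid)
invalid

# Replace them using bracket subsetting with logical expressions
sex_fixed <- sex
sex_fixed[sex_fixed %in% c("f", "female")] <- "F"
sex_fixed[sex_fixed %in% "m"]              <- "M"

# Confirm that only valid values (or NA) remain
stopifnot(all(sex_fixed %in% c(valid, NA)))

# Compare before and after
table(sex, sex_fixed, useNA = "ifany")
```

Listing the leftover values with `setdiff()` before and after the fix is one way to confirm that nothing was missed; the cross-tabulation at the end plays the same role as the before-and-after table above.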

In the tidyverse, you can use case_when(), which takes a series of “if-then”, “else if-then” statements and, for each case, uses the first condition that is true. For example, in the code below, if the value of Sex is “f” or “female”, the value is replaced with “F”; otherwise, if the value is “m”, it is replaced with “M”; otherwise the value is left as-is. The statement TRUE ~ Sex ensures that all cases not covered by the previously listed conditions keep their current values (since TRUE is always true). Without this line, all other values would be set to NA.

# Create a copy of the original variable
mydat_tibble <- mydat_tibble %>% 
  mutate(Sex_orig = Sex)

# List all the possible values
mydat_tibble %>%
  count(Sex)

# Fix any mis-spellings / invalid values
mydat_tibble <- mydat_tibble %>% 
  mutate(Sex = case_when(Sex %in% c("f", "female") ~ "F",
                         Sex ==   "m"              ~ "M",
                         TRUE                      ~ Sex))

# Compare before and after
mydat_tibble %>% 
  count(Sex_orig, Sex)
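
To see what the TRUE ~ Sex fallback does, the sketch below runs case_when() twice on a small made-up tibble, once with the fallback and once without. The names `toy`, `with_fallback`, and `no_fallback` are hypothetical and used only for illustration; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, not the mydat_tibble used above
toy <- tibble(Sex = c("F", "f", "female", "M", "m"))

toy <- toy %>%
  mutate(
    # With the fallback: values matching no condition keep their current value
    with_fallback = case_when(Sex %in% c("f", "female") ~ "F",
                              Sex ==   "m"              ~ "M",
                              TRUE                      ~ Sex),
    # Without the fallback: values matching no condition become NA
    no_fallback   = case_when(Sex %in% c("f", "female") ~ "F",
                              Sex ==   "m"              ~ "M")
  )

# Compare the original values with both results
toy %>%
  count(Sex, with_fallback, no_fallback)
```

In the `no_fallback` column, the values that were already valid (“F” and “M”) end up as NA because no condition matched them. In dplyr 1.1.0 and later, the fallback can also be written with the `.default` argument, as in `.default = Sex`, in place of `TRUE ~ Sex`.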