# Chapter 44 Missing values in the tidyverse

## 44.1 What is a missing value?

library(tidyverse)

penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <dbl>

• `"NA"` in quotes is a character string
• `NA` without quotes is a special marker in R

The R help page (`?NA`) describes it this way: "NA is a logical constant of length 1 which contains a missing value indicator."

• It is a logical value
• It stands for a missing value
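
The difference between the quoted string and the marker can be checked directly. This is a minimal base R sketch; the vector `x` is made up for illustration.

```r
# NA is a logical constant; "NA" in quotes is an ordinary string
class(NA)      # "logical"
class("NA")    # "character"

is.na(NA)      # TRUE
is.na("NA")    # FALSE: the string "NA" is not missing

# Illustrative vector mixing a real string, the string "NA", and a true NA
x <- c("a", "NA", NA)
is.na(x)       # FALSE FALSE TRUE
```

Only the last element of `x` is missing; the second is just a two-letter string.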

## 44.2 Computing with NA

• In arithmetic

c(NA, 1) + 2
## [1] NA  3
• In logical operations

isTRUE(NA)
## [1] FALSE
isFALSE(NA)
## [1] FALSE

A logical value is therefore in one of three states: `TRUE`, `NA`, or `FALSE`.

# Some logical operations do not return NA
c(TRUE, FALSE) & NA
## [1]    NA FALSE
c(TRUE, FALSE) | NA
## [1] TRUE   NA
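
Both behaviors in this section follow from reading `NA` as "unknown". The sketch below uses only base R; the variable names are illustrative.

```r
# Arithmetic: an unknown number plus 2 is still unknown
na_plus_two <- NA + 2                           # NA
total       <- sum(c(1, NA, 3))                 # NA
total_known <- sum(c(1, NA, 3), na.rm = TRUE)   # 4, the NA is dropped first

# Logic: the answer is known whenever the NA operand cannot change it
FALSE & NA   # FALSE: FALSE & anything is FALSE
TRUE | NA    # TRUE:  TRUE | anything is TRUE
TRUE & NA    # NA: the result depends on the unknown value
FALSE | NA   # NA: the result depends on the unknown value
```

Most summary functions such as `sum()` and `mean()` accept `na.rm = TRUE` to drop missing values before computing.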

## 44.3 How to test for NA?

c(1, 2, NA, 4) %>% is.na()
## [1] FALSE FALSE  TRUE FALSE
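
A common mistake is to test for missing values with `==`. Because any comparison with `NA` is itself `NA`, `is.na()` is the tool to use. A minimal base R sketch (variable names are illustrative):

```r
x <- c(1, 2, NA, 4)

eq_result <- x == NA     # NA NA NA NA: comparing with an unknown is unknown
na_flags  <- is.na(x)    # FALSE FALSE TRUE FALSE

anyNA(x)                 # TRUE: is there at least one NA?
na_pos <- which(na_flags)   # 3: position of the missing value
kept   <- x[!na_flags]      # 1 2 4: the non-missing values
```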

## 44.4 Coercion

c(TRUE, FALSE, NA) %>% class()
## [1] "logical"

c(1, 2, NA, 4) %>% class()
## [1] "numeric"

c("1", "2", NA, "4") %>% class()
## [1] "character"

c(TRUE, NA, FALSE) 
## [1]  TRUE    NA FALSE
c(TRUE, NA, FALSE) %>% class()
## [1] "logical"

c(1, 2, TRUE, 4) 
## [1] 1 2 1 4
c(1, 2, TRUE, 4) %>% class()
## [1] "numeric"

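When a logical value is combined with numbers, `TRUE` is converted to 1 and `FALSE` to 0. In the same way, the logical `NA` is converted to the numeric `NA_real_`.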

| Logical | Numeric    |
|---------|------------|
| `TRUE`  | `1`        |
| `NA`    | `NA_real_` |
| `FALSE` | `0`        |

c(1, 2, NA, 4) 
## [1]  1  2 NA  4
c(1, 2, NA, 4) %>% class()
## [1] "numeric"
c(1, 2, NA_real_, 4)
## [1]  1  2 NA  4
c(1, 2, NA_real_, 4) %>% class()
## [1] "numeric"

| Logical | Character       |
|---------|-----------------|
| `TRUE`  | `"TRUE"`        |
| `NA`    | `NA_character_` |
| `FALSE` | `"FALSE"`       |

c("1", "2", TRUE, "4")
## [1] "1"    "2"    "TRUE" "4"
c("1", "2", NA, "4")
## [1] "1" "2" NA  "4"
c("1", "2", NA_character_, "4") 
## [1] "1" "2" NA  "4"

c(TRUE, NA) %>%
  purrr::map(., ~ is.logical(.))
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
c("a", NA, NA_character_) %>%
  purrr::map(., ~ is.character(.))
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
c(123, NA, NA_real_) %>%
  purrr::map(., ~ is.numeric(.))
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
c(NA_real_, NA_complex_, NA_character_, NA_integer_, NA) %>% # coercion to character type
  purrr::map(., ~ is.character(.))
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
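
The `map()` calls above return `TRUE` for every element because the coercion has already happened when `c()` builds the vector. The typed `NA` constants make this visible. A minimal base R sketch:

```r
# Each atomic type has its own NA constant
typeof(NA)              # "logical"
typeof(NA_integer_)     # "integer"
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"

# Inside c(), the plain logical NA takes on the type of its neighbours
int_vec <- c(1L, NA)
dbl_vec <- c(1.5, NA)
chr_vec <- c("a", NA)

typeof(int_vec)   # "integer"
typeof(dbl_vec)   # "double"
typeof(chr_vec)   # "character"
```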

## 44.5 How to count NAs?

c(1, 2, NA, 4) %>% is.na() %>% as.integer() %>% sum()
## [1] 1

c(1, 2, NA, 4) %>% is.na() %>% sum()
## [1] 1

sum_of_na <- function(x) {
  sum(is.na(x))
}

c(1, 2, NA, 4) %>% sum_of_na()
## [1] 1
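
Summing works because `TRUE` counts as 1 and `FALSE` as 0. The same reasoning gives the proportion of missing values through `mean()`. In this sketch, `prop_of_na` is a hypothetical helper named to match `sum_of_na` above; it is not part of any package.

```r
x <- c(1, 2, NA, 4)

n_missing <- sum(is.na(x))    # 1: number of NAs
n_present <- sum(!is.na(x))   # 3: number of non-missing values

# Hypothetical helper: share of elements that are missing
prop_of_na <- function(x) {
  mean(is.na(x))
}

p_missing <- prop_of_na(x)    # 0.25
```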

## 44.6 Applying this in the tidyverse

penguins$bill_length_mm %>% sum_of_na()
## [1] 2

penguins %>% summarise(
  N1 = sum_of_na(bill_length_mm),
  N2 = sum_of_na(bill_depth_mm)
)
## # A tibble: 1 × 2
##      N1    N2
##   <int> <int>
## 1     2     2

penguins %>% summarise(
  across(everything(), sum_of_na)
)
## # A tibble: 1 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##     <int>  <int>          <int>         <int>             <int>       <int>
## 1       0      0              2             2                 2           2
## # ℹ 2 more variables: sex <int>, year <int>

penguins %>% summarise(
  across(everything(), ~ sum(is.na(.x)))
)
## # A tibble: 1 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##     <int>  <int>          <int>         <int>             <int>       <int>
## 1       0      0              2             2                 2           2
## # ℹ 2 more variables: sex <int>, year <int>
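
As a cross-check outside dplyr, the per-column counts can also be obtained in base R, since `is.na()` on a data frame returns a logical matrix. The data frame `df` below is a made-up toy example, not the penguins data.

```r
# Toy data frame with one NA in x, one NA in y, and none in z
df <- data.frame(
  x = c(1, NA, 3),
  y = c("a", "b", NA),
  z = c(TRUE, TRUE, FALSE)
)

# is.na(df) is a 3 x 3 logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(df))
na_counts
# x y z
# 1 1 0
```

The result is a named numeric vector, not a tibble, so it suits a quick check more than a pipeline.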

d <- tibble(x = c(1, 3, 6, NA, 8, NA))
d
## # A tibble: 6 × 1
##       x
##   <dbl>
## 1     1
## 2     3
## 3     6
## 4    NA
## 5     8
## 6    NA
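d %>% mutate(
  is_even = case_when(
    x %% 2 == 0 ~ "even",
    x %% 2 == 1 ~ "not even",
    # wrong: plain NA is logical, but the other branches return character
    # (this is an error in dplyr versions before 1.1.0)
    TRUE ~ NA
  )
)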

d %>% mutate(
  is_even = case_when(
    x %% 2 == 0 ~ "even",
    x %% 2 == 1 ~ "not even",
    TRUE ~ NA_character_
  )
)
## # A tibble: 6 × 2
##       x is_even
##   <dbl> <chr>
## 1     1 not even
## 2     3 not even
## 3     6 even
## 4    NA <NA>
## 5     8 even
## 6    NA <NA>

## 44.7 Exercises

• How would you rewrite the example above using `dplyr::if_else()` instead of `dplyr::case_when()`?

• In the penguins data, find the rows that contain missing values. A row counts even if it has only one `NA`.

## 44.8 More

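• `Inf` = infinity, e.g. `(pi / 0) %>% is.infinite()`
• `NaN` = Not a Number, e.g. `(0 / 0) %>% is.nan()` or `sqrt(-1) %>% is.nan()`
• `NULL` = the empty object, e.g. `c() %>% is.null()`

The parentheses around the divisions are required: `%>%` binds more tightly than `/`, so `pi / 0 %>% is.infinite()` is evaluated as `pi / is.infinite(0)` and returns `Inf` rather than `TRUE`.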
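
These three special values are easy to confuse with `NA`. The base R sketch below contrasts them; the variable names are illustrative.

```r
is_inf  <- is.infinite(1 / 0)   # TRUE
is_nan  <- is.nan(0 / 0)        # TRUE
is_null <- is.null(NULL)        # TRUE

# NaN counts as missing for is.na(), but NA is not NaN
nan_is_na <- is.na(NaN)    # TRUE
na_is_nan <- is.nan(NA)    # FALSE

# NULL has length 0 and vanishes inside c(); NA occupies a slot
length(NULL)                        # 0
length(NA)                          # 1
len_with_null <- length(c(1, NULL)) # 1
len_with_na   <- length(c(1, NA))   # 2
```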