6.6 Handling missing values
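Tidying changes how data is presented, which brings up the topic of missing values. When we speak loosely of a "missing value", we usually mean one of the following two kinds of missingness: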
Explicitly missing: flagged with NA in the data.
Implicitly missing: a value that simply does not appear in the data.
R for Data Science sums up the two kinds as follows:
> An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
A small data frame makes the difference concrete:
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks
#> # A tibble: 7 x 3
#> year qtr return
#> <dbl> <dbl> <dbl>
#> 1 2015 1 1.88
#> 2 2015 2 0.59
#> 3 2015 3 0.35
#> 4 2015 4 NA
#> 5 2016 2 0.92
#> 6 2016 3 0.17
#> # ... with 1 more row
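The NA in return for the fourth observation of stocks is easy to spot because it is explicitly missing. The other missing value is implicit: the observation for (year = 2016, qtr = 1) does not appear in the dataset at all.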
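As a rough sketch of how the two kinds could be counted in code: the explicit value is visible to is.na(), while the implicit one has to be found by comparing the data against every combination that ought to exist. The use of tidyr::expand() with dplyr::anti_join() here is one possible approach, not something taken from the text above, and the names n_explicit and implicit are purely illustrative.

```r
library(tibble)
library(dplyr)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Explicit missing values: rows that exist but hold NA
n_explicit <- sum(is.na(stocks$return))

# Implicit missing values: (year, qtr) combinations with no row at all.
# expand() lists all combinations of the observed values;
# anti_join() keeps only those that are absent from stocks.
implicit <- stocks %>%
  tidyr::expand(year, qtr) %>%
  anti_join(stocks, by = c("year", "qtr"))

n_explicit
implicit
```

This assumes that every observed year should contain every observed quarter; if that is not true for a given dataset, the "expected" combinations would need to be defined differently.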
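Changing how a dataset is represented can turn implicit missing values into explicit ones. For example, using pivot_wider() to build a pivot table with qtr as rows, year as columns, and return as the values produces a cell for the combination (year = 2016, qtr = 1):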
stocks %>%
pivot_wider(names_from = year, values_from = return)
#> pivot_wider: reorganized (year, return) into (2015, 2016) [was 7x3, now 4x3]
#> # A tibble: 4 x 3
#> qtr `2015` `2016`
#> <dbl> <dbl> <dbl>
#> 1 1 1.88 NA
#> 2 2 0.59 0.92
#> 3 3 0.35 0.17
#> 4 4 NA 2.66
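Widening is not the only way to surface an implicit missing value. tidyr also provides complete(), which is not used in this section; the following is a minimal sketch, assuming that every combination of the observed year and qtr values should have a row. The name stocks_complete is illustrative.

```r
library(tibble)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# complete() adds a row for each (year, qtr) combination that is absent,
# filling the remaining columns (here: return) with NA
stocks_complete <- complete(stocks, year, qtr)

stocks_complete
```

Unlike the pivot_wider() approach, this keeps the data in long form, so year stays a numeric column.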
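Applying pivot_longer() to the widened table does not recover the original data frame: the result has one more row than before, because the formerly implicit missing value is now explicit (and year comes back as a character column):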
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(-qtr, names_to = "year", values_to = "return")
#> pivot_wider: reorganized (year, return) into (2015, 2016) [was 7x3, now 4x3]
#> pivot_longer: reorganized (2015, 2016) into (year, return) [was 4x3, now 8x3]
#> # A tibble: 8 x 3
#> qtr year return
#> <dbl> <chr> <dbl>
#> 1 1 2015 1.88
#> 2 1 2016 NA
#> 3 2 2015 0.59
#> 4 2 2016 0.92
#> 5 3 2015 0.35
#> 6 3 2016 0.17
#> # ... with 2 more rows
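If the analyst considers these missing values unimportant, setting values_drop_na = TRUE makes pivot_longer() drop every row whose value is NA. This removes both the originally explicit and the originally implicit missing value: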
## the output now has one row fewer than the original stocks
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(-qtr, names_to = "year", values_to = "return",
values_drop_na = TRUE)
#> pivot_wider: reorganized (year, return) into (2015, 2016) [was 7x3, now 4x3]
#> pivot_longer: reorganized (2015, 2016) into (year, return) [was 4x3, now 6x3]
#> # A tibble: 6 x 3
#> qtr year return
#> <dbl> <chr> <dbl>
#> 1 1 2015 1.88
#> 2 2 2015 0.59
#> 3 2 2016 0.92
#> 4 3 2015 0.35
#> 5 3 2016 0.17
#> 6 4 2016 2.66
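To make the effect of values_drop_na = TRUE more tangible, the sketch below compares it with dropping the NA rows in a separate step afterwards. tidyr::drop_na() is brought in here only for the comparison and is not part of the original example; the object names are illustrative.

```r
library(tibble)
library(dplyr)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

wide <- stocks %>%
  pivot_wider(names_from = year, values_from = return)

# Option 1: drop NA rows while lengthening
dropped_during <- wide %>%
  pivot_longer(-qtr, names_to = "year", values_to = "return",
               values_drop_na = TRUE)

# Option 2: lengthen first, then drop rows where return is NA
dropped_after <- wide %>%
  pivot_longer(-qtr, names_to = "year", values_to = "return") %>%
  drop_na(return)

nrow(dropped_during)
nrow(dropped_after)
```

Both routes keep the same six observations, so the choice between them is mostly a matter of whether the intermediate table with explicit NA rows is needed.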
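fill() is designed specifically for filling in missing values. It takes the columns to be filled and replaces each NA with the nearest non-missing value in that column. The .direction argument controls the direction of filling: .direction = "up" fills from the bottom up, so each NA is replaced by the next non-missing value in the rows below it; .direction = "down" (the default) does the opposite, carrying the last non-missing value above it downward.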
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
treatment %>%
fill(person, .direction = "up")
#> fill: changed 2 values (50%) of 'person' (2 fewer NA)
#> # A tibble: 4 x 3
#> person treatment response
#> <chr> <dbl> <dbl>
#> 1 Derrick Whitmore 1 7
#> 2 Katherine Burke 2 10
#> 3 Katherine Burke 3 9
#> 4 Katherine Burke 1 4
treatment %>%
fill(person, .direction = "down")
#> fill: changed 2 values (50%) of 'person' (2 fewer NA)
#> # A tibble: 4 x 3
#> person treatment response
#> <chr> <dbl> <dbl>
#> 1 Derrick Whitmore 1 7
#> 2 Derrick Whitmore 2 10
#> 3 Derrick Whitmore 3 9
#> 4 Katherine Burke 1 4
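The difference between the directions is easiest to see on a column with NA at the start, in the middle, and at the end. The sketch below uses a hypothetical toy column x (not from the treatment data) and also shows "downup", a further value that .direction accepts: it fills down first and then fills any remaining leading NA upward.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: NA at the start, in the middle, and at the end
df <- tibble(x = c(NA, "a", NA, "b", NA))

down   <- fill(df, x, .direction = "down")    # leading NA has nothing above it
up     <- fill(df, x, .direction = "up")      # trailing NA has nothing below it
downup <- fill(df, x, .direction = "downup")  # down first, then up

down$x
up$x
downup$x
```

With "down" the first value stays NA and with "up" the last one does, because there is no neighbour in the fill direction to borrow from; "downup" leaves no NA in this example.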
More useful methods for dealing with missing values (in tidyr and other packages) are discussed in 17.