第 96 章 Pandas vs. dplyr谁是数据分析中最好用的宏包
本章,我们用一些例子来比较数据科学领域Pandas vs. dplyr 两个宏包的用法
首先,需要加载这个宏包
library(reticulate)
use_python("F:/Anaconda3/python.exe")
#use_condaenv("Anaconda3", required = TRUE)
#py_config()
96.2 过滤
96.2.1 问题1 – 找出2007年的所用记录.
这是用 Pandas 方法,
## country continent year lifeExp pop gdpPercap
## 11 Afghanistan Asia 2007 43.828 31889923 974.580338
## 23 Albania Europe 2007 76.423 3600523 5937.029526
## 35 Algeria Africa 2007 72.301 33333216 6223.367465
## 47 Angola Africa 2007 42.731 12420476 4797.231267
## 59 Argentina Americas 2007 75.320 40301927 12779.379640
## ... ... ... ... ... ... ...
## 1655 Vietnam Asia 2007 74.249 85262356 2441.576404
## 1667 West Bank and Gaza Asia 2007 73.422 4018332 3025.349798
## 1679 Yemen, Rep. Asia 2007 62.698 22211743 2280.769906
## 1691 Zambia Africa 2007 42.384 11746035 1271.211593
## 1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
##
## [142 rows x 6 columns]
这是用 dplyr 方法,
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # ℹ 132 more rows
96.2.2 问题2 – 找出2007年中 continent为 Americas 的记录
这是用 Pandas 方法,
## country continent year lifeExp pop gdpPercap
## 59 Argentina Americas 2007 75.320 40301927 12779.379640
## 143 Bolivia Americas 2007 65.554 9119152 3822.137084
## 179 Brazil Americas 2007 72.390 190010647 9065.800825
## 251 Canada Americas 2007 80.653 33390141 36319.235010
## 287 Chile Americas 2007 78.553 16284741 13171.638850
## 311 Colombia Americas 2007 72.889 44227550 7006.580419
## 359 Costa Rica Americas 2007 78.782 4133884 9645.061420
## 395 Cuba Americas 2007 78.273 11416987 8948.102923
## 443 Dominican Republic Americas 2007 72.235 9319622 6025.374752
## 455 Ecuador Americas 2007 74.994 13755680 6873.262326
## 479 El Salvador Americas 2007 71.878 6939688 5728.353514
## 611 Guatemala Americas 2007 70.259 12572928 5186.050003
## 647 Haiti Americas 2007 60.916 8502814 1201.637154
## 659 Honduras Americas 2007 70.198 7483763 3548.330846
## 791 Jamaica Americas 2007 72.567 2780132 7320.880262
## 995 Mexico Americas 2007 76.195 108700891 11977.574960
## 1115 Nicaragua Americas 2007 72.899 5675356 2749.320965
## 1187 Panama Americas 2007 75.537 3242173 9809.185636
## 1199 Paraguay Americas 2007 71.752 6667147 4172.838464
## 1211 Peru Americas 2007 71.421 28674757 7408.905561
## 1259 Puerto Rico Americas 2007 78.746 3942491 19328.709010
## 1559 Trinidad and Tobago Americas 2007 69.819 1056608 18008.509240
## 1619 United States Americas 2007 78.242 301139947 42951.653090
## 1631 Uruguay Americas 2007 76.384 3447496 10611.462990
## 1643 Venezuela Americas 2007 73.747 26084662 11415.805690
这是用 dplyr 方法
## # A tibble: 25 × 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Argentina Americas 2007 75.3 40301927 12779.
## 2 Bolivia Americas 2007 65.6 9119152 3822.
## 3 Brazil Americas 2007 72.4 190010647 9066.
## 4 Canada Americas 2007 80.7 33390141 36319.
## 5 Chile Americas 2007 78.6 16284741 13172.
## 6 Colombia Americas 2007 72.9 44227550 7007.
## 7 Costa Rica Americas 2007 78.8 4133884 9645.
## 8 Cuba Americas 2007 78.3 11416987 8948.
## 9 Dominican Republic Americas 2007 72.2 9319622 6025.
## 10 Ecuador Americas 2007 75.0 13755680 6873.
## # ℹ 15 more rows
96.2.3 问题3 – 找出2007年中 continent为 Americas,且只包含美国 的记录
这是用 Pandas 方法,
gapminder[(gapminder['year'] == 2007) &
(gapminder['continent'] == 'Americas') &
(gapminder['country'] == 'United States')]
## country continent year lifeExp pop gdpPercap
## 1619 United States Americas 2007 78.242 301139947 42951.65309
这是用 dplyr 方法
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 United States Americas 2007 78.2 301139947 42952.
96.3 统计
我们再看看Pandas 和 dplyr 在常用的统计分析方面的表现。
96.3.1 问题1 – 计算2007年全球寿命均值
这是用 Pandas 方法,
## 67.00742253521126
这是用 dplyr 方法
## # A tibble: 1 × 1
## `mean(lifeExp)`
## <dbl>
## 1 67.0
96.3.2 问题2 – 计算2007年每一个洲的寿命均值
这是用 Pandas 方法,
这是用 dplyr 方法
## # A tibble: 5 × 2
## continent `mean(lifeExp)`
## <chr> <dbl>
## 1 Africa 54.8
## 2 Americas 73.6
## 3 Asia 70.7
## 4 Europe 77.6
## 5 Oceania 80.7
96.3.3 问题3 – 计算2007年每一个洲的人口总数然后降序输出。
这是用 Pandas 方法,
gapminder[gapminder['year'] == 2007].groupby(by='continent').sum()['pop'].sort_values(ascending=False)
## continent
## Asia 3811953827
## Africa 929539692
## Americas 898871184
## Europe 586098529
## Oceania 24549947
## Name: pop, dtype: int64
这是用 dplyr 方法
gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarize(total_pop = sum(pop)) %>%
arrange(desc(total_pop))
## # A tibble: 5 × 2
## continent total_pop
## <chr> <dbl>
## 1 Asia 3811953827
## 2 Africa 929539692
## 3 Americas 898871184
## 4 Europe 586098529
## 5 Oceania 24549947
96.4 创建新变量
96.4.1 问题1 – 创建国内生产总值(GDP)变量,即人口数量乘以人均GDP
这是用 Pandas 方法,
## country continent year lifeExp pop gdpPercap GDP
## 0 Afghanistan Asia 1952 28.801 8425333 779.445314 6.567086e+09
## 1 Afghanistan Asia 1957 30.332 9240934 820.853030 7.585449e+09
## 2 Afghanistan Asia 1962 31.997 10267083 853.100710 8.758856e+09
## 3 Afghanistan Asia 1967 34.020 11537966 836.197138 9.648014e+09
## 4 Afghanistan Asia 1972 36.088 13079460 739.981106 9.678553e+09
这是用 dplyr 方法
## # A tibble: 1,704 × 7
## country continent year lifeExp pop gdpPercap GDP
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
## # ℹ 1,694 more rows
96.4.2 问题2 – 人均GDP排名前90%的前十个国家
这是用 Pandas 方法,
gapminder_2007 = gapminder[gapminder['year'] == 2007]
gapminder_2007['percentile'] = gapminder_2007['gdpPercap'].rank(pct=True)
gapminder_2007.sort_values(by='percentile', ascending=False)[:10]
## country continent year ... gdpPercap GDP percentile
## 1151 Norway Europe 2007 ... 49357.19017 2.284214e+11 1.000000
## 863 Kuwait Asia 2007 ... 47306.98978 1.185305e+11 0.992958
## 1367 Singapore Asia 2007 ... 47143.17964 2.146433e+11 0.985915
## 1619 United States Americas 2007 ... 42951.65309 1.293446e+13 0.978873
## 755 Ireland Europe 2007 ... 40675.99635 1.671412e+11 0.971831
## 671 Hong Kong, China Asia 2007 ... 39724.97867 2.772967e+11 0.964789
## 1487 Switzerland Europe 2007 ... 37506.41907 2.833483e+11 0.957746
## 1091 Netherlands Europe 2007 ... 36797.93332 6.097643e+11 0.950704
## 251 Canada Americas 2007 ... 36319.23501 1.212704e+12 0.943662
## 695 Iceland Europe 2007 ... 36180.78919 1.092410e+10 0.936620
##
## [10 rows x 8 columns]
这是用 dplyr 方法
gapminder %>%
filter(year == 2007) %>%
mutate(percentile = ntile(gdpPercap, 100)) %>%
arrange(desc(percentile)) %>%
top_n(10, wt = percentile)
## # A tibble: 10 × 7
## country continent year lifeExp pop gdpPercap percentile
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Norway Europe 2007 80.2 4627926 49357. 100
## 2 Kuwait Asia 2007 77.6 2505559 47307. 99
## 3 Singapore Asia 2007 80.0 4553009 47143. 98
## 4 United States Americas 2007 78.2 301139947 42952. 97
## 5 Ireland Europe 2007 78.9 4109086 40676. 96
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725. 95
## 7 Switzerland Europe 2007 81.7 7554661 37506. 94
## 8 Netherlands Europe 2007 79.8 16570613 36798. 93
## 9 Canada Americas 2007 80.7 33390141 36319. 92
## 10 Iceland Europe 2007 81.8 301931 36181. 91