第 94 章 Pandas vs. dplyr谁是数据分析中最好用的宏包
本章,我们用一些例子来比较数据科学领域Pandas vs. dplyr 两个宏包的用法
首先,需要加载这个宏包
library(reticulate)
use_python("E:/Anaconda3/python.exe")
#use_condaenv("Anaconda3", required = TRUE)
#py_config()
94.1 加载数据
这是用 Pandas 方法,
import pandas as pd
= pd.read_csv("./demo_data/gapminder.csv") gapminder
这是用 dplyr 方法
这个过程,两者没什么区别。再往下看
94.2 过滤
94.2.1 问题1 – 找出2007年的所用记录.
这是用 Pandas 方法,
'year'] == 2007] gapminder[gapminder[
## country continent year lifeExp pop gdpPercap
## 11 Afghanistan Asia 2007 43.828 31889923 974.580338
## 23 Albania Europe 2007 76.423 3600523 5937.029526
## 35 Algeria Africa 2007 72.301 33333216 6223.367465
## 47 Angola Africa 2007 42.731 12420476 4797.231267
## 59 Argentina Americas 2007 75.320 40301927 12779.379640
## ... ... ... ... ... ... ...
## 1655 Vietnam Asia 2007 74.249 85262356 2441.576404
## 1667 West Bank and Gaza Asia 2007 73.422 4018332 3025.349798
## 1679 Yemen, Rep. Asia 2007 62.698 22211743 2280.769906
## 1691 Zambia Africa 2007 42.384 11746035 1271.211593
## 1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
##
## [142 rows x 6 columns]
这是用 dplyr 方法,
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 2007 43.8 3.19e7 975.
## 2 Albania Europe 2007 76.4 3.60e6 5937.
## 3 Algeria Africa 2007 72.3 3.33e7 6223.
## 4 Angola Africa 2007 42.7 1.24e7 4797.
## 5 Argentina Americas 2007 75.3 4.03e7 12779.
## 6 Australia Oceania 2007 81.2 2.04e7 34435.
## # … with 136 more rows
94.2.2 问题2 – 找出2007年中 continent为 Americas 的记录
这是用 Pandas 方法,
'year'] == 2007) & (gapminder['continent'] == 'Americas')] gapminder[(gapminder[
## country continent year lifeExp pop gdpPercap
## 59 Argentina Americas 2007 75.320 40301927 12779.379640
## 143 Bolivia Americas 2007 65.554 9119152 3822.137084
## 179 Brazil Americas 2007 72.390 190010647 9065.800825
## 251 Canada Americas 2007 80.653 33390141 36319.235010
## 287 Chile Americas 2007 78.553 16284741 13171.638850
## 311 Colombia Americas 2007 72.889 44227550 7006.580419
## 359 Costa Rica Americas 2007 78.782 4133884 9645.061420
## 395 Cuba Americas 2007 78.273 11416987 8948.102923
## 443 Dominican Republic Americas 2007 72.235 9319622 6025.374752
## 455 Ecuador Americas 2007 74.994 13755680 6873.262326
## 479 El Salvador Americas 2007 71.878 6939688 5728.353514
## 611 Guatemala Americas 2007 70.259 12572928 5186.050003
## 647 Haiti Americas 2007 60.916 8502814 1201.637154
## 659 Honduras Americas 2007 70.198 7483763 3548.330846
## 791 Jamaica Americas 2007 72.567 2780132 7320.880262
## 995 Mexico Americas 2007 76.195 108700891 11977.574960
## 1115 Nicaragua Americas 2007 72.899 5675356 2749.320965
## 1187 Panama Americas 2007 75.537 3242173 9809.185636
## 1199 Paraguay Americas 2007 71.752 6667147 4172.838464
## 1211 Peru Americas 2007 71.421 28674757 7408.905561
## 1259 Puerto Rico Americas 2007 78.746 3942491 19328.709010
## 1559 Trinidad and Tobago Americas 2007 69.819 1056608 18008.509240
## 1619 United States Americas 2007 78.242 301139947 42951.653090
## 1631 Uruguay Americas 2007 76.384 3447496 10611.462990
## 1643 Venezuela Americas 2007 73.747 26084662 11415.805690
这是用 dplyr 方法
## # A tibble: 25 × 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Argentina Americas 2007 75.3 40301927 12779.
## 2 Bolivia Americas 2007 65.6 9119152 3822.
## 3 Brazil Americas 2007 72.4 190010647 9066.
## 4 Canada Americas 2007 80.7 33390141 36319.
## 5 Chile Americas 2007 78.6 16284741 13172.
## 6 Colombia Americas 2007 72.9 44227550 7007.
## # … with 19 more rows
94.2.3 问题3 – 找出2007年中 continent为 Americas,且只包含美国 的记录
这是用 Pandas 方法,
'year'] == 2007) &
gapminder[(gapminder['continent'] == 'Americas') &
(gapminder['country'] == 'United States')] (gapminder[
## country continent year lifeExp pop gdpPercap
## 1619 United States Americas 2007 78.242 301139947 42951.65309
这是用 dplyr 方法
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 United Stat… Americas 2007 78.2 3.01e8 42952.
94.3 统计
我们再看看Pandas 和 dplyr 在常用的统计分析方面的表现。
94.3.1 问题1 – 计算2007年全球寿命均值
这是用 Pandas 方法,
'year'] == 2007]['lifeExp'].mean() gapminder[gapminder[
## 67.00742253521126
这是用 dplyr 方法
## # A tibble: 1 × 1
## `mean(lifeExp)`
## <dbl>
## 1 67.0
94.3.2 问题2 – 计算2007年每一个洲的寿命均值
这是用 Pandas 方法,
'year'] == 2007].groupby(by='continent').mean()['lifeExp'] gapminder[gapminder[
## continent
## Africa 54.806038
## Americas 73.608120
## Asia 70.728485
## Europe 77.648600
## Oceania 80.719500
## Name: lifeExp, dtype: float64
这是用 dplyr 方法
## # A tibble: 5 × 2
## continent `mean(lifeExp)`
## <chr> <dbl>
## 1 Africa 54.8
## 2 Americas 73.6
## 3 Asia 70.7
## 4 Europe 77.6
## 5 Oceania 80.7
94.3.3 问题3 – 计算2007年每一个洲的人口总数然后降序输出。
这是用 Pandas 方法,
'year'] == 2007].groupby(by='continent').sum()['pop'].sort_values(ascending=False) gapminder[gapminder[
## continent
## Asia 3811953827
## Africa 929539692
## Americas 898871184
## Europe 586098529
## Oceania 24549947
## Name: pop, dtype: int64
这是用 dplyr 方法
gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarize(total_pop = sum(pop)) %>%
arrange(desc(total_pop))
## # A tibble: 5 × 2
## continent total_pop
## <chr> <dbl>
## 1 Asia 3811953827
## 2 Africa 929539692
## 3 Americas 898871184
## 4 Europe 586098529
## 5 Oceania 24549947
94.4 创建新变量
94.4.1 问题1 – 创建国内生产总值(GDP)变量,即人口数量乘以人均GDP
这是用 Pandas 方法,
'GDP'] = gapminder['pop'] * gapminder['gdpPercap']
gapminder[ gapminder.head()
## country continent year lifeExp pop gdpPercap GDP
## 0 Afghanistan Asia 1952 28.801 8425333 779.445314 6.567086e+09
## 1 Afghanistan Asia 1957 30.332 9240934 820.853030 7.585449e+09
## 2 Afghanistan Asia 1962 31.997 10267083 853.100710 8.758856e+09
## 3 Afghanistan Asia 1967 34.020 11537966 836.197138 9.648014e+09
## 4 Afghanistan Asia 1972 36.088 13079460 739.981106 9.678553e+09
这是用 dplyr 方法
## # A tibble: 1,704 × 7
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8.43e6 779.
## 2 Afghanistan Asia 1957 30.3 9.24e6 821.
## 3 Afghanistan Asia 1962 32.0 1.03e7 853.
## 4 Afghanistan Asia 1967 34.0 1.15e7 836.
## 5 Afghanistan Asia 1972 36.1 1.31e7 740.
## 6 Afghanistan Asia 1977 38.4 1.49e7 786.
## # … with 1,698 more rows, and 1 more variable:
## # GDP <dbl>
94.4.2 问题2 – 人均GDP排名前90%的前十个国家
这是用 Pandas 方法,
= gapminder[gapminder['year'] == 2007]
gapminder_2007 'percentile'] = gapminder_2007['gdpPercap'].rank(pct=True) gapminder_2007[
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
='percentile', ascending=False)[:10] gapminder_2007.sort_values(by
## country continent year ... gdpPercap GDP percentile
## 1151 Norway Europe 2007 ... 49357.19017 2.284214e+11 1.000000
## 863 Kuwait Asia 2007 ... 47306.98978 1.185305e+11 0.992958
## 1367 Singapore Asia 2007 ... 47143.17964 2.146433e+11 0.985915
## 1619 United States Americas 2007 ... 42951.65309 1.293446e+13 0.978873
## 755 Ireland Europe 2007 ... 40675.99635 1.671412e+11 0.971831
## 671 Hong Kong, China Asia 2007 ... 39724.97867 2.772967e+11 0.964789
## 1487 Switzerland Europe 2007 ... 37506.41907 2.833483e+11 0.957746
## 1091 Netherlands Europe 2007 ... 36797.93332 6.097643e+11 0.950704
## 251 Canada Americas 2007 ... 36319.23501 1.212704e+12 0.943662
## 695 Iceland Europe 2007 ... 36180.78919 1.092410e+10 0.936620
##
## [10 rows x 8 columns]
这是用 dplyr 方法
gapminder %>%
filter(year == 2007) %>%
mutate(percentile = ntile(gdpPercap, 100)) %>%
arrange(desc(percentile)) %>%
top_n(10, wt = percentile)
## # A tibble: 10 × 7
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Norway Europe 2007 80.2 4.63e6 49357.
## 2 Kuwait Asia 2007 77.6 2.51e6 47307.
## 3 Singapore Asia 2007 80.0 4.55e6 47143.
## 4 United Stat… Americas 2007 78.2 3.01e8 42952.
## 5 Ireland Europe 2007 78.9 4.11e6 40676.
## 6 Hong Kong, … Asia 2007 82.2 6.98e6 39725.
## # … with 4 more rows, and 1 more variable:
## # percentile <int>