第 94 章 Pandas vs. dplyr谁是数据分析中最好用的宏包

本章,我们用一些例子来比较数据科学领域Pandas vs. dplyr 两个宏包的用法

首先,需要加载这个宏包

library(reticulate)
use_python("E:/Anaconda3/python.exe") 
#use_condaenv("Anaconda3", required = TRUE)
#py_config()

94.1 加载数据

这是用 Pandas 方法,

import pandas as pd

gapminder = pd.read_csv("./demo_data/gapminder.csv")

这是用 dplyr 方法

library(dplyr)

gapminder <- readr::read_csv("./demo_data/gapminder.csv")

这个过程,两者没什么区别。再往下看

94.2 过滤

94.2.1 问题1 – 找出2007年的所用记录.

这是用 Pandas 方法,

gapminder[gapminder['year'] == 2007]
##                  country continent  year  lifeExp       pop     gdpPercap
## 11           Afghanistan      Asia  2007   43.828  31889923    974.580338
## 23               Albania    Europe  2007   76.423   3600523   5937.029526
## 35               Algeria    Africa  2007   72.301  33333216   6223.367465
## 47                Angola    Africa  2007   42.731  12420476   4797.231267
## 59             Argentina  Americas  2007   75.320  40301927  12779.379640
## ...                  ...       ...   ...      ...       ...           ...
## 1655             Vietnam      Asia  2007   74.249  85262356   2441.576404
## 1667  West Bank and Gaza      Asia  2007   73.422   4018332   3025.349798
## 1679         Yemen, Rep.      Asia  2007   62.698  22211743   2280.769906
## 1691              Zambia    Africa  2007   42.384  11746035   1271.211593
## 1703            Zimbabwe    Africa  2007   43.487  12311143    469.709298
## 
## [142 rows x 6 columns]

这是用 dplyr 方法,

gapminder %>%
  filter(year == 2007)
## # A tibble: 142 × 6
##   country     continent  year lifeExp     pop gdpPercap
##   <chr>       <chr>     <dbl>   <dbl>   <dbl>     <dbl>
## 1 Afghanistan Asia       2007    43.8  3.19e7      975.
## 2 Albania     Europe     2007    76.4  3.60e6     5937.
## 3 Algeria     Africa     2007    72.3  3.33e7     6223.
## 4 Angola      Africa     2007    42.7  1.24e7     4797.
## 5 Argentina   Americas   2007    75.3  4.03e7    12779.
## 6 Australia   Oceania    2007    81.2  2.04e7    34435.
## # … with 136 more rows

94.2.2 问题2 – 找出2007年中 continent为 Americas 的记录

这是用 Pandas 方法,

gapminder[(gapminder['year'] == 2007) & (gapminder['continent'] == 'Americas')]
##                   country continent  year  lifeExp        pop     gdpPercap
## 59              Argentina  Americas  2007   75.320   40301927  12779.379640
## 143               Bolivia  Americas  2007   65.554    9119152   3822.137084
## 179                Brazil  Americas  2007   72.390  190010647   9065.800825
## 251                Canada  Americas  2007   80.653   33390141  36319.235010
## 287                 Chile  Americas  2007   78.553   16284741  13171.638850
## 311              Colombia  Americas  2007   72.889   44227550   7006.580419
## 359            Costa Rica  Americas  2007   78.782    4133884   9645.061420
## 395                  Cuba  Americas  2007   78.273   11416987   8948.102923
## 443    Dominican Republic  Americas  2007   72.235    9319622   6025.374752
## 455               Ecuador  Americas  2007   74.994   13755680   6873.262326
## 479           El Salvador  Americas  2007   71.878    6939688   5728.353514
## 611             Guatemala  Americas  2007   70.259   12572928   5186.050003
## 647                 Haiti  Americas  2007   60.916    8502814   1201.637154
## 659              Honduras  Americas  2007   70.198    7483763   3548.330846
## 791               Jamaica  Americas  2007   72.567    2780132   7320.880262
## 995                Mexico  Americas  2007   76.195  108700891  11977.574960
## 1115            Nicaragua  Americas  2007   72.899    5675356   2749.320965
## 1187               Panama  Americas  2007   75.537    3242173   9809.185636
## 1199             Paraguay  Americas  2007   71.752    6667147   4172.838464
## 1211                 Peru  Americas  2007   71.421   28674757   7408.905561
## 1259          Puerto Rico  Americas  2007   78.746    3942491  19328.709010
## 1559  Trinidad and Tobago  Americas  2007   69.819    1056608  18008.509240
## 1619        United States  Americas  2007   78.242  301139947  42951.653090
## 1631              Uruguay  Americas  2007   76.384    3447496  10611.462990
## 1643            Venezuela  Americas  2007   73.747   26084662  11415.805690

这是用 dplyr 方法

gapminder %>%
  filter(
    year == 2007,
    continent == "Americas"
  )
## # A tibble: 25 × 6
##   country   continent  year lifeExp       pop gdpPercap
##   <chr>     <chr>     <dbl>   <dbl>     <dbl>     <dbl>
## 1 Argentina Americas   2007    75.3  40301927    12779.
## 2 Bolivia   Americas   2007    65.6   9119152     3822.
## 3 Brazil    Americas   2007    72.4 190010647     9066.
## 4 Canada    Americas   2007    80.7  33390141    36319.
## 5 Chile     Americas   2007    78.6  16284741    13172.
## 6 Colombia  Americas   2007    72.9  44227550     7007.
## # … with 19 more rows

94.2.3 问题3 – 找出2007年中 continent为 Americas,且只包含美国 的记录

这是用 Pandas 方法,

gapminder[(gapminder['year'] == 2007) & 
          (gapminder['continent'] == 'Americas') &
          (gapminder['country'] == 'United States')]
##             country continent  year  lifeExp        pop    gdpPercap
## 1619  United States  Americas  2007   78.242  301139947  42951.65309

这是用 dplyr 方法

gapminder %>%
  filter(
    year == 2007,
    continent == "Americas",
    country == "United States"
  )
## # A tibble: 1 × 6
##   country      continent  year lifeExp    pop gdpPercap
##   <chr>        <chr>     <dbl>   <dbl>  <dbl>     <dbl>
## 1 United Stat… Americas   2007    78.2 3.01e8    42952.

94.3 统计

我们再看看Pandas 和 dplyr 在常用的统计分析方面的表现。

94.3.1 问题1 – 计算2007年全球寿命均值

这是用 Pandas 方法,

gapminder[gapminder['year'] == 2007]['lifeExp'].mean()
## 67.00742253521126

这是用 dplyr 方法

gapminder %>%
  filter(year == 2007) %>%
  summarize(mean(lifeExp))
## # A tibble: 1 × 1
##   `mean(lifeExp)`
##             <dbl>
## 1            67.0

94.3.2 问题2 – 计算2007年每一个洲的寿命均值

这是用 Pandas 方法,

gapminder[gapminder['year'] == 2007].groupby(by='continent').mean()['lifeExp']
## continent
## Africa      54.806038
## Americas    73.608120
## Asia        70.728485
## Europe      77.648600
## Oceania     80.719500
## Name: lifeExp, dtype: float64

这是用 dplyr 方法

gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(mean(lifeExp))
## # A tibble: 5 × 2
##   continent `mean(lifeExp)`
##   <chr>               <dbl>
## 1 Africa               54.8
## 2 Americas             73.6
## 3 Asia                 70.7
## 4 Europe               77.6
## 5 Oceania              80.7

94.3.3 问题3 – 计算2007年每一个洲的人口总数然后降序输出。

这是用 Pandas 方法,

gapminder[gapminder['year'] == 2007].groupby(by='continent').sum()['pop'].sort_values(ascending=False)
## continent
## Asia        3811953827
## Africa       929539692
## Americas     898871184
## Europe       586098529
## Oceania       24549947
## Name: pop, dtype: int64

这是用 dplyr 方法

gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(total_pop = sum(pop)) %>%
  arrange(desc(total_pop))
## # A tibble: 5 × 2
##   continent  total_pop
##   <chr>          <dbl>
## 1 Asia      3811953827
## 2 Africa     929539692
## 3 Americas   898871184
## 4 Europe     586098529
## 5 Oceania     24549947

94.4 创建新变量

94.4.1 问题1 – 创建国内生产总值(GDP)变量,即人口数量乘以人均GDP

这是用 Pandas 方法,

gapminder['GDP'] = gapminder['pop'] * gapminder['gdpPercap']
gapminder.head()
##        country continent  year  lifeExp       pop   gdpPercap           GDP
## 0  Afghanistan      Asia  1952   28.801   8425333  779.445314  6.567086e+09
## 1  Afghanistan      Asia  1957   30.332   9240934  820.853030  7.585449e+09
## 2  Afghanistan      Asia  1962   31.997  10267083  853.100710  8.758856e+09
## 3  Afghanistan      Asia  1967   34.020  11537966  836.197138  9.648014e+09
## 4  Afghanistan      Asia  1972   36.088  13079460  739.981106  9.678553e+09

这是用 dplyr 方法

gapminder %>%
  mutate(GDP = pop * gdpPercap)
## # A tibble: 1,704 × 7
##   country     continent  year lifeExp     pop gdpPercap
##   <chr>       <chr>     <dbl>   <dbl>   <dbl>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8.43e6      779.
## 2 Afghanistan Asia       1957    30.3  9.24e6      821.
## 3 Afghanistan Asia       1962    32.0  1.03e7      853.
## 4 Afghanistan Asia       1967    34.0  1.15e7      836.
## 5 Afghanistan Asia       1972    36.1  1.31e7      740.
## 6 Afghanistan Asia       1977    38.4  1.49e7      786.
## # … with 1,698 more rows, and 1 more variable:
## #   GDP <dbl>

94.4.2 问题2 – 人均GDP排名前90%的前十个国家

这是用 Pandas 方法,

gapminder_2007 = gapminder[gapminder['year'] == 2007]
gapminder_2007['percentile'] = gapminder_2007['gdpPercap'].rank(pct=True)
## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
gapminder_2007.sort_values(by='percentile', ascending=False)[:10]
##                country continent  year  ...    gdpPercap           GDP  percentile
## 1151            Norway    Europe  2007  ...  49357.19017  2.284214e+11    1.000000
## 863             Kuwait      Asia  2007  ...  47306.98978  1.185305e+11    0.992958
## 1367         Singapore      Asia  2007  ...  47143.17964  2.146433e+11    0.985915
## 1619     United States  Americas  2007  ...  42951.65309  1.293446e+13    0.978873
## 755            Ireland    Europe  2007  ...  40675.99635  1.671412e+11    0.971831
## 671   Hong Kong, China      Asia  2007  ...  39724.97867  2.772967e+11    0.964789
## 1487       Switzerland    Europe  2007  ...  37506.41907  2.833483e+11    0.957746
## 1091       Netherlands    Europe  2007  ...  36797.93332  6.097643e+11    0.950704
## 251             Canada  Americas  2007  ...  36319.23501  1.212704e+12    0.943662
## 695            Iceland    Europe  2007  ...  36180.78919  1.092410e+10    0.936620
## 
## [10 rows x 8 columns]

这是用 dplyr 方法

gapminder %>%
  filter(year == 2007) %>%
  mutate(percentile = ntile(gdpPercap, 100)) %>%
  arrange(desc(percentile)) %>%
  top_n(10, wt = percentile)
## # A tibble: 10 × 7
##   country      continent  year lifeExp    pop gdpPercap
##   <chr>        <chr>     <dbl>   <dbl>  <dbl>     <dbl>
## 1 Norway       Europe     2007    80.2 4.63e6    49357.
## 2 Kuwait       Asia       2007    77.6 2.51e6    47307.
## 3 Singapore    Asia       2007    80.0 4.55e6    47143.
## 4 United Stat… Americas   2007    78.2 3.01e8    42952.
## 5 Ireland      Europe     2007    78.9 4.11e6    40676.
## 6 Hong Kong, … Asia       2007    82.2 6.98e6    39725.
## # … with 4 more rows, and 1 more variable:
## #   percentile <int>

94.5 小节

根据以上有限的对比,感觉dplyr语法要简练和清晰些。但是,这并不意味我们一定在Pandas 和 dplyr做取舍,因为解决现实问题,往往取决于我们对工具的熟练程度。以上只是个人观点(哈哈哈)。