第 36 章 探索性数据分析-诺奖获得者

探索性数据分析(exporatory data analysis)是各种知识的综合运用。本章通过一个案例,讲解探索性数据分析的基本思路,也算是对前面几章内容的一次总结复习。

36.1 探索性

  • 数据准备(对数据要做到心中有数)

    • 描述变量
    • 数据结构
    • 缺失值及其处理
  • 数据探索(围绕探索的目标)

    • 数据规整
    • 可视化
    • 建模

36.2 数据集

这是一个诺贝尔奖获得者的数据集,

36.3 导入数据

library(tidyverse)
library(lubridate)
df <- read_csv("./demo_data/nobel_winners.csv")
df
## # A tibble: 969 x 18
##    prize_year category prize motivation prize_share
##         <dbl> <chr>    <chr> <chr>      <chr>      
##  1       1901 Chemist~ The ~ "\"in rec~ 1/1        
##  2       1901 Literat~ The ~ "\"in spe~ 1/1        
##  3       1901 Medicine The ~ "\"for hi~ 1/1        
##  4       1901 Peace    The ~  <NA>      1/2        
##  5       1901 Peace    The ~  <NA>      1/2        
##  6       1901 Physics  The ~ "\"in rec~ 1/1        
##  7       1902 Chemist~ The ~ "\"in rec~ 1/1        
##  8       1902 Literat~ The ~ "\"the gr~ 1/1        
##  9       1902 Medicine The ~ "\"for hi~ 1/1        
## 10       1902 Peace    The ~  <NA>      1/2        
## # ... with 959 more rows, and 13 more variables:
## #   laureate_id <dbl>, laureate_type <chr>,
## #   full_name <chr>, birth_date <date>,
## #   birth_city <chr>, birth_country <chr>,
## #   gender <chr>, organization_name <chr>,
## #   organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>
# 如果是xlsx格式
readxl::read_excel("myfile.xlsx")

# 如果是csv格式
readr::read_csv("myfile.csv")

这里有个小小的提示:

  • 路径(包括文件名), 不要用中文和空格
  • 数据框中变量,也不要有中文和空格(可用下划线代替空格)

36.4 数据结构

一行就是一个诺奖获得者的记录? 确定?

缺失值及其处理

df %>% map_df(~ sum(is.na(.)))
## # A tibble: 1 x 18
##   prize_year category prize motivation prize_share
##        <int>    <int> <int>      <int>       <int>
## 1          0        0     0         88           0
## # ... with 13 more variables: laureate_id <int>,
## #   laureate_type <int>, full_name <int>,
## #   birth_date <int>, birth_city <int>,
## #   birth_country <int>, gender <int>,
## #   organization_name <int>, organization_city <int>,
## #   organization_country <int>, death_date <int>,
## #   death_city <int>, death_country <int>

性别缺失怎么造成的?

df %>% count(laureate_type)
## # A tibble: 2 x 2
##   laureate_type     n
##   <chr>         <int>
## 1 Individual      939
## 2 Organization     30

36.5 我们想探索哪些问题?

你想关心哪些问题,可能是

  • 每个学科颁过多少次奖?
  • 这些大神都是哪个年代的人?
  • 性别比例
  • 平均年龄和获奖数量
  • 最年轻的诺奖获得者是谁?
  • 中国诺奖获得者有哪些?
  • 得奖的时候多大年龄?
  • 获奖者所在国家的经济情况?
  • 有大神多次获得诺贝尔奖,而且在不同科学领域获奖?
  • 出生地分布?工作地分布?迁移模式?
  • GDP经济与诺奖模型?
  • 诺奖分享情况?

36.6 每个学科颁过多少次奖

df %>% count(category)
## # A tibble: 6 x 2
##   category       n
##   <chr>      <int>
## 1 Chemistry    194
## 2 Economics     83
## 3 Literature   113
## 4 Medicine     227
## 5 Peace        130
## 6 Physics      222
df %>%
  count(category) %>%
  ggplot(aes(x = category, y = n, fill = category)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.25) +
  labs(title = "不同学科诺贝奖获奖次数对比", x = "学科", y = "数量") +
  theme(legend.position = "none")

df %>%
  count(category) %>%
  ggplot(aes(x = fct_reorder(category, n), y = n, fill = category)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.25) +
  labs(title = "不同学科诺贝奖获奖次数对比", x = "学科", y = "数量") +
  theme(legend.position = "none")

也可以使用别人定义好的配色方案

library(ggthemr) # install.packages("devtools")
# devtools::install_github('cttobin/ggthemr')
ggthemr("dust")

df %>%
  count(category) %>%
  ggplot(aes(x = fct_reorder(category, n), y = n, fill = category)) +
  geom_col() +
  labs(title = "不同学科诺贝奖获奖次数对比", x = "学科", y = "数量") +
  theme(legend.position = "none")

这个配色方案感觉挺好看的呢,比较适合我这种又挑剔又懒惰的人。

当然,也可以自己DIY,或者使用配色网站的主题方案(https://learnui.design/tools/data-color-picker.html#palette)

df %>%
  count(category) %>%
  ggplot(aes(x = fct_reorder(category, n), y = n)) +
  geom_col(fill = c("#003f5c", "#444e86", "#955196", "#dd5182", "#ff6e54", "#ffa600")) +
  labs(title = "不同学科诺贝奖获奖次数对比", x = "学科", y = "数量") +
  theme(legend.position = "none")

让图骚动起来吧

library(gganimate) # install.packages("gganimate", dependencies = T)

df %>%
  count(category) %>%
  mutate(category = fct_reorder(category, n)) %>%
  ggplot(aes(x = category, y = n)) +
  geom_text(aes(label = n), vjust = -0.25) +
  geom_col(fill = c("#003f5c", "#444e86", "#955196", "#dd5182", "#ff6e54", "#ffa600")) +
  labs(title = "不同学科诺贝奖获奖次数对比", x = "学科", y = "数量") +
  theme(legend.position = "none") +
  transition_states(category) +
  shadow_mark(past = TRUE)

和ggplot2的分面一样,动态图可以增加数据展示的维度。

36.7 看看我们伟大的祖国

df %>%
  dplyr::filter(birth_country == "China") %>%
  dplyr::select(full_name, prize_year, category)
## # A tibble: 12 x 3
##    full_name              prize_year category  
##    <chr>                       <dbl> <chr>     
##  1 Walter Houser Brattain       1956 Physics   
##  2 Chen Ning Yang               1957 Physics   
##  3 Tsung-Dao (T.D.) Lee         1957 Physics   
##  4 Edmond H. Fischer            1992 Medicine  
##  5 Daniel C. Tsui               1998 Physics   
##  6 Gao Xingjian                 2000 Literature
##  7 Charles Kuen Kao             2009 Physics   
##  8 Charles Kuen Kao             2009 Physics   
##  9 Ei-ichi Negishi              2010 Chemistry 
## 10 Liu Xiaobo                   2010 Peace     
## 11 Mo Yan                       2012 Literature
## 12 Youyou Tu                    2015 Medicine

我们发现获奖者有多个地址,就会有重复的情况,比如 Charles Kuen Kao在2009年Physics有两次,为什么重复计数了呢?

下面我们去重吧, 去重可以用distinct()函数

dt <- tibble::tribble(
  ~x, ~y, ~z,
  1, 1, "a",
  1, 1, "b",
  1, 2, "c",
  1, 2, "d"
)

dt
## # A tibble: 4 x 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     1     1 a    
## 2     1     1 b    
## 3     1     2 c    
## 4     1     2 d
dt %>% distinct_at(vars(x), .keep_all = T)
## # A tibble: 1 x 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     1     1 a
dt %>% distinct_at(vars(x, y), .keep_all = T)
## # A tibble: 2 x 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     1     1 a    
## 2     1     2 c
nobel_winners <- df %>%
  mutate_if(is.character, tolower) %>%
  distinct_at(vars(full_name, prize_year, category), .keep_all = TRUE) %>%
  mutate(
    decade = 10 * (prize_year %/% 10),
    prize_age = prize_year - year(birth_date)
  )

nobel_winners
## # A tibble: 911 x 20
##    prize_year category prize motivation prize_share
##         <dbl> <chr>    <chr> <chr>      <chr>      
##  1       1901 chemist~ the ~ "\"in rec~ 1/1        
##  2       1901 literat~ the ~ "\"in spe~ 1/1        
##  3       1901 medicine the ~ "\"for hi~ 1/1        
##  4       1901 peace    the ~  <NA>      1/2        
##  5       1901 peace    the ~  <NA>      1/2        
##  6       1901 physics  the ~ "\"in rec~ 1/1        
##  7       1902 chemist~ the ~ "\"in rec~ 1/1        
##  8       1902 literat~ the ~ "\"the gr~ 1/1        
##  9       1902 medicine the ~ "\"for hi~ 1/1        
## 10       1902 peace    the ~  <NA>      1/2        
## # ... with 901 more rows, and 15 more variables:
## #   laureate_id <dbl>, laureate_type <chr>,
## #   full_name <chr>, birth_date <date>,
## #   birth_city <chr>, birth_country <chr>,
## #   gender <chr>, organization_name <chr>,
## #   organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>,
## #   decade <dbl>, prize_age <dbl>

这是时候,我们才对数据有了一个初步的了解

再来看看我的祖国

nobel_winners %>%
  dplyr::filter(birth_country == "china") %>%
  dplyr::select(full_name, prize_year, category)
## # A tibble: 11 x 3
##    full_name              prize_year category  
##    <chr>                       <dbl> <chr>     
##  1 walter houser brattain       1956 physics   
##  2 chen ning yang               1957 physics   
##  3 tsung-dao (t.d.) lee         1957 physics   
##  4 edmond h. fischer            1992 medicine  
##  5 daniel c. tsui               1998 physics   
##  6 gao xingjian                 2000 literature
##  7 charles kuen kao             2009 physics   
##  8 ei-ichi negishi              2010 chemistry 
##  9 liu xiaobo                   2010 peace     
## 10 mo yan                       2012 literature
## 11 youyou tu                    2015 medicine

36.8 哪些大神多次获得诺贝尔奖

nobel_winners %>% count(full_name, sort = T)
## # A tibble: 904 x 2
##    full_name                                          n
##    <chr>                                          <int>
##  1 "comité international de la croix rouge (inte~     3
##  2 "frederick sanger"                                 2
##  3 "john bardeen"                                     2
##  4 "linus carl pauling"                               2
##  5 "marie curie, née sklodowska"                      2
##  6 "office of the united nations high commission~     2
##  7 " lie ducommun"                                    1
##  8 "a. michael spence"                                1
##  9 "aage niels bohr"                                  1
## 10 "aaron ciechanover"                                1
## # ... with 894 more rows
nobel_winners %>%
  group_by(full_name) %>%
  mutate(
    number_prize = n(),
    number_cateory = n_distinct(category)
  ) %>%
  arrange(desc(number_prize), full_name) %>%
  dplyr::filter(number_cateory == 2)
## # A tibble: 4 x 22
## # Groups:   full_name [2]
##   prize_year category prize motivation prize_share
##        <dbl> <chr>    <chr> <chr>      <chr>      
## 1       1954 chemist~ the ~ "\"for hi~ 1/1        
## 2       1962 peace    the ~  <NA>      1/1        
## 3       1903 physics  the ~ "\"in rec~ 1/4        
## 4       1911 chemist~ the ~ "\"in rec~ 1/1        
## # ... with 17 more variables: laureate_id <dbl>,
## #   laureate_type <chr>, full_name <chr>,
## #   birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>,
## #   organization_name <chr>, organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>,
## #   decade <dbl>, prize_age <dbl>, number_prize <int>,
## #   number_cateory <int>

36.9 大神在得奖的时候是多大年龄?

nobel_winners %>%
  count(prize_age) %>%
  ggplot(aes(x = prize_age, y = n)) +
  geom_col()

nobel_winners %>%
  group_by(category) %>%
  summarise(mean_prize_age = mean(prize_age, na.rm = T))
## # A tibble: 6 x 2
##   category   mean_prize_age
##   <chr>               <dbl>
## 1 chemistry            58.0
## 2 economics            67.2
## 3 literature           64.7
## 4 medicine             58.0
## 5 peace                61.4
## 6 physics              55.4
nobel_winners %>%
  mutate(category = fct_reorder(category, prize_age, median, na.rm = TRUE)) %>%
  ggplot(aes(category, prize_age)) +
  geom_point() +
  geom_boxplot() +
  coord_flip()

nobel_winners %>%
  dplyr::filter(!is.na(prize_age)) %>%
  group_by(decade, category) %>%
  summarize(
    average_age = mean(prize_age),
    median_age = median(prize_age)
  ) %>%
  ggplot(aes(decade, average_age, color = category)) +
  geom_line()

library(ggridges)

nobel_winners %>%
  ggplot(aes(
    x = prize_age,
    y = category,
    fill = category
  )) +
  geom_density_ridges()

他们60多少岁才得诺奖,大家才23或24岁,还年轻,不用焦虑喔。

nobel_winners %>%

  ggplot(aes(x = prize_age, fill = category, color = category)) +
  geom_density() +
  facet_wrap(vars(category)) +
  theme(legend.position = "none")

有同学说要一个个的画,至于group_split()函数,下次课在讲

nobel_winners %>%
  group_split(category) %>%
  map(
    ~ ggplot(data = .x, aes(x = prize_age)) +
      geom_density() +
      ggtitle(.x$category)
  )
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

也可以用强大的group_by() + group_map()组合,我们会在第 21 章讲到

nobel_winners %>%
  group_by(category) %>%
  group_map(
    ~ ggplot(data = .x, aes(x = prize_age)) +
      geom_density() +
      ggtitle(.y)
  )

36.10 性别比例

nobel_winners %>%
  dplyr::filter(laureate_type == "individual") %>%
  count(category, gender) %>%
  group_by(category) %>%
  mutate(prop = n / sum(n))
## # A tibble: 12 x 4
## # Groups:   category [6]
##    category   gender     n    prop
##    <chr>      <chr>  <int>   <dbl>
##  1 chemistry  female     4 0.0229 
##  2 chemistry  male     171 0.977  
##  3 economics  female     1 0.0128 
##  4 economics  male      77 0.987  
##  5 literature female    14 0.124  
##  6 literature male      99 0.876  
##  7 medicine   female    12 0.0569 
##  8 medicine   male     199 0.943  
##  9 peace      female    14 0.14   
## 10 peace      male      86 0.86   
## 11 physics    female     2 0.00980
## 12 physics    male     202 0.990

各年代性别比例

nobel_winners %>%
  dplyr::filter(laureate_type == "individual") %>%
  # mutate(decade = glue::glue("{round(prize_year - 1, -1)}s")) %>%
  count(decade, category, gender) %>%
  group_by(decade, category) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(decade, category, fill = prop)) +
  geom_tile(size = 0.7) +
  # geom_text(aes(label = scales::percent(prop, accuracy = .01))) +
  geom_text(aes(label = scales::number(prop, accuracy = .01))) +
  facet_grid(vars(gender)) +
  scale_fill_gradient(low = "#FDF4E9", high = "#834C0D")

library(ggbeeswarm) # install.packages("ggbeeswarm")

nobel_winners %>%
  ggplot(aes(
    x = category,
    y = prize_age,
    colour = gender,
    alpha = gender
  )) +
  ggbeeswarm::geom_beeswarm() +
  coord_flip() +
  scale_color_manual(values = c("#BB1288", "#5867A6")) +
  scale_alpha_manual(values = c(1, .4)) +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(
    title = "诺奖获得者性别不平衡",
    subtitle = "1901年-2016年数据",
    colour = "Gender",
    alpha = "Gender",
    x = "学科",
    y = "获奖年龄"
  )

nobel_winners %>%
  count(decade,
    category,
    gender = coalesce(gender, laureate_type)
  ) %>%
  group_by(decade, category) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(decade, n, fill = gender)) +
  geom_col() +
  facet_wrap(~category) +
  labs(
    x = "Decade",
    y = "# of nobel prize winners",
    fill = "Gender",
    title = "Nobel Prize gender distribution over time"
  )

36.11 这些大神都是哪个年代出生的人?

nobel_winners %>%
  select(category, birth_date) %>%
  mutate(year = floor(year(birth_date) / 10) * 10) %>%
  count(category, year) %>%
  dplyr::filter(!is.na(year)) %>%
  ggplot(aes(x = year, y = n)) +
  geom_col() +
  scale_x_continuous(breaks = seq(1810, 1990, 20)) +
  geom_text(aes(label = n), vjust = -0.25) +
  facet_wrap(vars(category))

课堂练习,哪位同学能把图弄得好看些?

36.12 最年轻的诺奖获得者?

nobel_winners %>%
  dplyr::filter(prize_age == min(prize_age, na.rm = T))
## # A tibble: 1 x 20
##   prize_year category prize motivation prize_share
##        <dbl> <chr>    <chr> <chr>      <chr>      
## 1       2014 peace    the ~ "\"for th~ 1/2        
## # ... with 15 more variables: laureate_id <dbl>,
## #   laureate_type <chr>, full_name <chr>,
## #   birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>,
## #   organization_name <chr>, organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>,
## #   decade <dbl>, prize_age <dbl>
nobel_winners %>%
  dplyr::filter(
    rank(prize_year - year(birth_date)) == 1
  )
## # A tibble: 1 x 20
##   prize_year category prize motivation prize_share
##        <dbl> <chr>    <chr> <chr>      <chr>      
## 1       2014 peace    the ~ "\"for th~ 1/2        
## # ... with 15 more variables: laureate_id <dbl>,
## #   laureate_type <chr>, full_name <chr>,
## #   birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>,
## #   organization_name <chr>, organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>,
## #   decade <dbl>, prize_age <dbl>
nobel_winners %>%
  arrange(
    prize_year - year(birth_date)
  )
## # A tibble: 911 x 20
##    prize_year category prize motivation prize_share
##         <dbl> <chr>    <chr> <chr>      <chr>      
##  1       2014 peace    the ~ "\"for th~ 1/2        
##  2       1915 physics  the ~ "\"for th~ 1/2        
##  3       1932 physics  the ~ "\"for th~ 1/1        
##  4       1933 physics  the ~ "\"for th~ 1/2        
##  5       1936 physics  the ~ "\"for hi~ 1/2        
##  6       1957 physics  the ~ "\"for th~ 1/2        
##  7       1923 medicine the ~ "\"for th~ 1/2        
##  8       1961 physics  the ~ "\"for hi~ 1/2        
##  9       1976 peace    the ~  <NA>      1/2        
## 10       2011 peace    the ~ "\"for th~ 1/3        
## # ... with 901 more rows, and 15 more variables:
## #   laureate_id <dbl>, laureate_type <chr>,
## #   full_name <chr>, birth_date <date>,
## #   birth_city <chr>, birth_country <chr>,
## #   gender <chr>, organization_name <chr>,
## #   organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>,
## #   decade <dbl>, prize_age <dbl>
nobel_winners %>%
  top_n(1, year(birth_date) - prize_year)
## # A tibble: 1 x 20
##   prize_year category prize motivation prize_share
##        <dbl> <chr>    <chr> <chr>      <chr>      
## 1       2014 peace    the ~ "\"for th~ 1/2        
## # ... with 15 more variables: laureate_id <dbl>,
## #   laureate_type <chr>, full_name <chr>,
## #   birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>,
## #   organization_name <chr>, organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>,
## #   decade <dbl>, prize_age <dbl>

36.13 平均年龄和获奖数量

df1 <- nobel_winners %>%
  group_by(category) %>%
  summarise(
    mean_prise_age = mean(prize_age, na.rm = T),
    total_num = n()
  )
df1
## # A tibble: 6 x 3
##   category   mean_prise_age total_num
##   <chr>               <dbl>     <int>
## 1 chemistry            58.0       175
## 2 economics            67.2        78
## 3 literature           64.7       113
## 4 medicine             58.0       211
## 5 peace                61.4       130
## 6 physics              55.4       204
df1 %>%
  ggplot(aes(mean_prise_age, total_num)) +
  geom_point(aes(color = category)) +
  geom_smooth(method = lm, se = FALSE)

36.14 出生地与工作地分布

nobel_winners_clean <- nobel_winners %>%
  mutate_at(
    vars(birth_country, death_country),
    ~ ifelse(str_detect(., "\\("), str_extract(., "(?<=\\().*?(?=\\))"), .)
  ) %>%
  mutate_at(
    vars(birth_country, death_country),
    ~ case_when(
      . == "scotland" ~ "united kingdom",
      . == "northern ireland" ~ "united kingdom",
      str_detect(., "czech") ~ "czechia",
      str_detect(., "germany") ~ "germany",
      TRUE ~ .
    )
  ) %>%
  select(full_name, prize_year, category, birth_date, birth_country, gender, organization_name, organization_country, death_country)
nobel_winners_clean %>% count(death_country, sort = TRUE)
## # A tibble: 45 x 2
##    death_country                n
##    <chr>                    <int>
##  1 <NA>                       329
##  2 united states of america   203
##  3 united kingdom              79
##  4 germany                     56
##  5 france                      51
##  6 sweden                      28
##  7 switzerland                 26
##  8 italy                       14
##  9 russia                      11
## 10 spain                       10
## # ... with 35 more rows

36.15 迁移模式

nobel_winners_clean %>%
  mutate(
    colour = case_when(
      death_country == "united states of america" ~ "#FF2B4F",
      death_country == "germany" ~ "#fcab27",
      death_country == "united kingdom" ~ "#3686d3",
      death_country == "france" ~ "#88398a",
      death_country == "switzerland" ~ "#20d4bc",
      TRUE ~ "gray60"
    )
  ) %>%
  ggplot(aes(
    x = 0,
    y = fct_rev(factor(birth_country)),
    xend = death_country,
    yend = 1,
    colour = colour,
    alpha = (colour != "gray60")
  )) +
  geom_curve(
    curvature = -0.5,
    arrow = arrow(length = unit(0.01, "npc"))
  ) +
  scale_x_discrete() +
  scale_y_discrete() +
  scale_color_identity() +
  scale_alpha_manual(values = c(0.1, 0.2), guide = F) +
  scale_size_manual(values = c(0.1, 0.4), guide = F) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    plot.background = element_rect(fill = "#F0EFF1", colour = "#F0EFF1"),
    legend.position = "none",
    axis.text.x = element_text(angle = 40, hjust = 1)
  )

36.16 地图

library(here)
library(sf)
library(countrycode)

# countrycode('Albania', 'country.name', 'iso3c')

nobel_winners_birth_country <- nobel_winners_clean %>%
  count(birth_country) %>%
  filter(!is.na(birth_country)) %>%
  mutate(ISO3 = countrycode(birth_country,
    origin = "country.name", destination = "iso3c"
  ))


global <-
  sf::st_read("./demo_data/worldmap/TM_WORLD_BORDERS_SIMPL-0.3.shp") %>%
  st_transform(4326)
## Reading layer `TM_WORLD_BORDERS_SIMPL-0.3' from data source `G:\R_for_Data_Science\demo_data\worldmap\TM_WORLD_BORDERS_SIMPL-0.3.shp' using driver `ESRI Shapefile'
## Simple feature collection with 246 features and 11 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -180 ymin: -90 xmax: 180 ymax: 83.57
## geographic CRS: WGS 84
global %>%
  full_join(nobel_winners_birth_country, by = "ISO3") %>%
  ggplot() +
  geom_sf(aes(fill = n),
    color = "white",
    size = 0.1
  ) +
  labs(
    x = NULL, y = NULL,
    title = "Nobel Winners by country",
    subtitle = "color of map indicates number of Nobel lauretes",
    fill = "num of Nobel lauretes",
    caption = "Made: wang_minjie"
  ) +
  scale_fill_gradientn(colors = c("royalblue1", "magenta", "orange", "gold"), na.value = "white") +
  # scale_fill_gradient(low = "wheat1", high = "red") +
  theme_void() +
  theme(
    legend.position = c(0.1, 0.3),
    plot.background = element_rect(fill = "gray")
  )

# Determine to 10 Countries
topCountries <- nobel_winners_clean %>%
  count(birth_country, sort = TRUE) %>%
  na.omit() %>%
  top_n(8)

topCountries
## # A tibble: 8 x 2
##   birth_country                n
##   <chr>                    <int>
## 1 united states of america   259
## 2 united kingdom              99
## 3 germany                     80
## 4 france                      54
## 5 sweden                      29
## 6 poland                      26
## 7 russia                      26
## 8 japan                       24
df4 <- nobel_winners_clean %>%
  filter(birth_country %in% topCountries$birth_country) %>%
  group_by(birth_country, category, prize_year) %>%
  summarise(prizes = n()) %>%
  mutate(cumPrizes = cumsum(prizes))

df4
## # A tibble: 489 x 5
## # Groups:   birth_country, category [47]
##    birth_country category  prize_year prizes cumPrizes
##    <chr>         <chr>          <dbl>  <int>     <int>
##  1 france        chemistry       1906      1         1
##  2 france        chemistry       1912      2         3
##  3 france        chemistry       1913      1         4
##  4 france        chemistry       1935      2         6
##  5 france        chemistry       1970      1         7
##  6 france        chemistry       1987      1         8
##  7 france        chemistry       2016      1         9
##  8 france        economics       1983      1         1
##  9 france        economics       1988      1         2
## 10 france        economics       2014      1         3
## # ... with 479 more rows
library(gganimate)
df4 %>%
  mutate(prize_year = as.integer(prize_year)) %>%
  ggplot(aes(x = birth_country, y = category, color = birth_country)) +
  geom_point(aes(size = cumPrizes), alpha = 0.6) +
  # geom_text(aes(label = cumPrizes)) +
  scale_size_continuous(range = c(2, 30)) +
  transition_reveal(prize_year) +
  labs(
    title = "诺奖获得者最多的10个国家",
    subtitle = "Year: {frame_along}",
    y = "Category"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 22),
    axis.title = element_blank()
  ) +
  scale_color_brewer(palette = "RdYlBu") +
  theme(legend.position = "none") +
  theme(plot.margin = margin(5.5, 5.5, 5.5, 5.5))

36.17 出生地和工作地不一样的占比

nobel_winners_clean %>%
  select(category, birth_country, death_country) %>%
  mutate(immigration = if_else(birth_country == death_country, 0, 1))
## # A tibble: 911 x 4
##    category   birth_country death_country  immigration
##    <chr>      <chr>         <chr>                <dbl>
##  1 chemistry  netherlands   germany                  1
##  2 literature france        france                   0
##  3 medicine   poland        germany                  1
##  4 peace      switzerland   switzerland              0
##  5 peace      france        france                   0
##  6 physics    germany       germany                  0
##  7 chemistry  germany       germany                  0
##  8 literature germany       germany                  0
##  9 medicine   india         united kingdom           1
## 10 peace      switzerland   switzerland              0
## # ... with 901 more rows

36.18 诺奖分享者

nobel_winners %>%
  separate(prize_share, into = c("num", "deno"), sep = "/", remove = FALSE)
nobel_winners %>%
  filter(category == "medicine") %>%
  mutate(
    num_a = as.numeric(str_sub(prize_share, 1, 1)),
    num_b = as.numeric(str_sub(prize_share, -1)),
    share = num_a / num_b,
    year = prize_year %% 10,
    decade = 10 * (prize_year %/% 10)
  ) %>%
  group_by(prize_year) %>%
  mutate(n = row_number()) %>%
  ggplot() +
  geom_col(aes(x = "", y = share, fill = as.factor(n)),
    show.legend = FALSE
  ) +
  coord_polar("y") +
  facet_grid(decade ~ year, switch = "both") +
  labs(title = "每年诺贝尔奖分享情况") +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", vjust = 8),
    strip.text.x = element_text(
      size = 7,
      margin = margin(t = 5)
    ),
    strip.text.y = element_text(
      size = 7,
      angle = 180, hjust = 1, margin = margin(r = 10)
    )
  )

36.19 其它

没有回答的问题,大家自己花时间探索下。

36.20 延伸阅读

  • 有些图可以再美化下