第 76 章 探索性数据分析-诺奖获得者

探索性数据分析(exporatory data analysis)是各种知识的综合运用。本章通过一个案例,讲解探索性数据分析的基本思路,也算是对前面几章内容的一次总结复习。

76.1 探索性

  • 数据准备(对数据要做到心中有数)

    • 描述变量
    • 数据结构
    • 缺失值及其处理
  • 数据探索(围绕探索的目标)

    • 数据规整
    • 可视化
    • 建模

76.2 数据集

这是一个诺贝尔奖获得者的数据集,

76.3 导入数据

df <- read_csv("./demo_data/nobel_winners.csv")
df
## # A tibble: 969 × 18
##   prize_year category   prize    motivation prize_share
##        <dbl> <chr>      <chr>    <chr>      <chr>      
## 1       1901 Chemistry  The Nob… "\"in rec… 1/1        
## 2       1901 Literature The Nob… "\"in spe… 1/1        
## 3       1901 Medicine   The Nob… "\"for hi… 1/1        
## 4       1901 Peace      The Nob…  <NA>      1/2        
## 5       1901 Peace      The Nob…  <NA>      1/2        
## 6       1901 Physics    The Nob… "\"in rec… 1/1        
## # … with 963 more rows, and 13 more variables:
## #   laureate_id <dbl>, laureate_type <chr>,
## #   full_name <chr>, birth_date <date>,
## #   birth_city <chr>, birth_country <chr>,
## #   gender <chr>, organization_name <chr>,
## #   organization_city <chr>,
## #   organization_country <chr>, death_date <date>, …

如果是xlsx格式

readxl::read_excel("myfile.xlsx")

如果是csv格式

readr::read_csv("myfile.csv")

这里有个小小的提示:

  • 路径(包括文件名), 不要用中文和空格
  • 数据框中变量,也不要有中文和空格(可用下划线代替空格)

76.4 数据结构

一行就是一个诺奖获得者的记录? 确定?

缺失值及其处理

df %>% map_df(~ sum(is.na(.)))
## # A tibble: 1 × 18
##   prize_year category prize motivation prize_share
##        <int>    <int> <int>      <int>       <int>
## 1          0        0     0         88           0
## # … with 13 more variables: laureate_id <int>,
## #   laureate_type <int>, full_name <int>,
## #   birth_date <int>, birth_city <int>,
## #   birth_country <int>, gender <int>,
## #   organization_name <int>, organization_city <int>,
## #   organization_country <int>, death_date <int>,
## #   death_city <int>, death_country <int>

性别缺失怎么造成的?

df %>% count(laureate_type)
## # A tibble: 2 × 2
##   laureate_type     n
##   <chr>         <int>
## 1 Individual      939
## 2 Organization     30

76.5 我们想探索哪些问题?

你想关心哪些问题,可能是

  • 每个学科颁过多少次奖?
  • 这些大神都是哪个年代的人?
  • 性别比例
  • 平均年龄和获奖数量
  • 最年轻的诺奖获得者是谁?
  • 中国诺奖获得者有哪些?
  • 得奖的时候多大年龄?
  • 获奖者所在国家的经济情况?
  • 有大神多次获得诺贝尔奖,而且在不同科学领域获奖?
  • 出生地分布?工作地分布?迁移模式?
  • GDP经济与诺奖模型?
  • 诺奖分享情况?

76.6 每个学科颁过多少次奖

df %>% count(category)
## # A tibble: 6 × 2
##   category       n
##   <chr>      <int>
## 1 Chemistry    194
## 2 Economics     83
## 3 Literature   113
## 4 Medicine     227
## 5 Peace        130
## 6 Physics      222
df %>%
  count(category) %>%
  ggplot(aes(x = category, y = n, fill = category)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.25) +
  theme(legend.position = "none")
df %>%
  count(category) %>%
  ggplot(aes(x = fct_reorder(category, n), y = n, fill = category)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.25) +
  labs(title = "Number of Nobel prizes in different disciplines") +
  theme(legend.position = "none")

也可以使用别人定义好的配色方案

library(ggthemr) # install.packages("devtools")
# devtools::install_github('cttobin/ggthemr')
ggthemr("dust")

df %>%
  count(category) %>%
  ggplot(aes(x = fct_reorder(category, n), y = n, fill = category)) +
  geom_col() +
  labs(title = "Number of Nobel prizes in different disciplines") +
  theme(legend.position = "none")

这个配色方案感觉挺好看的呢,比较适合我这种又挑剔又懒惰的人。

当然,也可以自己DIY,或者使用配色网站的主题方案(https://learnui.design/tools/data-color-picker.html#palette)

df %>%
  count(category) %>%
  ggplot(aes(x = fct_reorder(category, n), y = n)) +
  geom_col(fill = c("#003f5c", "#444e86", "#955196", "#dd5182", "#ff6e54", "#ffa600")) +
  labs(title = "Number of Nobel prizes in different disciplines") +
  theme(legend.position = "none")

让图骚动起来吧

library(gganimate) # install.packages("gganimate", dependencies = T)

df %>%
  count(category) %>%
  mutate(category = fct_reorder(category, n)) %>%
  ggplot(aes(x = category, y = n)) +
  geom_text(aes(label = n), vjust = -0.25) +
  geom_col(fill = c("#003f5c", "#444e86", "#955196", "#dd5182", "#ff6e54", "#ffa600")) +
  labs(title = "Number of Nobel prizes in different disciplines") +
  theme(legend.position = "none") +
  transition_states(category) +
  shadow_mark(past = TRUE)

和ggplot2的分面一样,动态图可以增加数据展示的维度。

76.7 看看我们伟大的祖国

df %>%
  dplyr::filter(birth_country == "China") %>%
  dplyr::select(full_name, prize_year, category)
## # A tibble: 12 × 3
##   full_name              prize_year category  
##   <chr>                       <dbl> <chr>     
## 1 Walter Houser Brattain       1956 Physics   
## 2 Chen Ning Yang               1957 Physics   
## 3 Tsung-Dao (T.D.) Lee         1957 Physics   
## 4 Edmond H. Fischer            1992 Medicine  
## 5 Daniel C. Tsui               1998 Physics   
## 6 Gao Xingjian                 2000 Literature
## # … with 6 more rows

我们发现获奖者有多个地址,就会有重复的情况,比如 Charles Kuen Kao在2009年Physics有两次,为什么重复计数了呢?

下面我们去重吧, 去重可以用distinct()函数

dt <- tibble::tribble(
  ~x, ~y, ~z,
  1, 1, "a",
  1, 1, "b",
  1, 2, "c",
  1, 2, "d"
)

dt
## # A tibble: 4 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     1     1 a    
## 2     1     1 b    
## 3     1     2 c    
## 4     1     2 d
dt %>% distinct_at(vars(x), .keep_all = T)
## # A tibble: 1 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     1     1 a
dt %>% distinct_at(vars(x, y), .keep_all = T)
## # A tibble: 2 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     1     1 a    
## 2     1     2 c
nobel_winners <- df %>%
  mutate_if(is.character, tolower) %>%
  distinct_at(vars(full_name, prize_year, category), .keep_all = TRUE) %>%
  mutate(
    decade = 10 * (prize_year %/% 10),
    prize_age = prize_year - year(birth_date)
  )

nobel_winners
## # A tibble: 911 × 20
##   prize_year category   prize    motivation prize_share
##        <dbl> <chr>      <chr>    <chr>      <chr>      
## 1       1901 chemistry  the nob… "\"in rec… 1/1        
## 2       1901 literature the nob… "\"in spe… 1/1        
## 3       1901 medicine   the nob… "\"for hi… 1/1        
## 4       1901 peace      the nob…  <NA>      1/2        
## 5       1901 peace      the nob…  <NA>      1/2        
## 6       1901 physics    the nob… "\"in rec… 1/1        
## # … with 905 more rows, and 15 more variables:
## #   laureate_id <dbl>, laureate_type <chr>,
## #   full_name <chr>, birth_date <date>,
## #   birth_city <chr>, birth_country <chr>,
## #   gender <chr>, organization_name <chr>,
## #   organization_city <chr>,
## #   organization_country <chr>, death_date <date>, …

这是时候,我们才对数据有了一个初步的了解

再来看看我的祖国

nobel_winners %>%
  dplyr::filter(birth_country == "china") %>%
  dplyr::select(full_name, prize_year, category)
## # A tibble: 11 × 3
##   full_name              prize_year category  
##   <chr>                       <dbl> <chr>     
## 1 walter houser brattain       1956 physics   
## 2 chen ning yang               1957 physics   
## 3 tsung-dao (t.d.) lee         1957 physics   
## 4 edmond h. fischer            1992 medicine  
## 5 daniel c. tsui               1998 physics   
## 6 gao xingjian                 2000 literature
## # … with 5 more rows

76.8 哪些大神多次获得诺贝尔奖

nobel_winners %>% count(full_name, sort = T)
## # A tibble: 904 × 2
##   full_name                                           n
##   <chr>                                           <int>
## 1 comité international de la croix rouge (intern…     3
## 2 frederick sanger                                    2
## 3 john bardeen                                        2
## 4 linus carl pauling                                  2
## 5 marie curie, née sklodowska                         2
## 6 office of the united nations high commissioner…     2
## # … with 898 more rows
nobel_winners %>%
  group_by(full_name) %>%
  mutate(
    number_prize = n(),
    number_cateory = n_distinct(category)
  ) %>%
  arrange(desc(number_prize), full_name) %>%
  dplyr::filter(number_cateory == 2)
## # A tibble: 4 × 22
## # Groups:   full_name [2]
##   prize_year category  prize     motivation prize_share
##        <dbl> <chr>     <chr>     <chr>      <chr>      
## 1       1954 chemistry the nobe… "\"for hi… 1/1        
## 2       1962 peace     the nobe…  <NA>      1/1        
## 3       1903 physics   the nobe… "\"in rec… 1/4        
## 4       1911 chemistry the nobe… "\"in rec… 1/1        
## # … with 17 more variables: laureate_id <dbl>,
## #   laureate_type <chr>, full_name <chr>,
## #   birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>,
## #   organization_name <chr>, organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>, …

76.9 大神在得奖的时候是多大年龄?

nobel_winners %>%
  count(prize_age) %>%
  ggplot(aes(x = prize_age, y = n)) +
  geom_col()
nobel_winners %>%
  group_by(category) %>%
  summarise(mean_prize_age = mean(prize_age, na.rm = T))
## # A tibble: 6 × 2
##   category   mean_prize_age
##   <chr>               <dbl>
## 1 chemistry            58.0
## 2 economics            67.2
## 3 literature           64.7
## 4 medicine             58.0
## 5 peace                61.4
## 6 physics              55.4
nobel_winners %>%
  mutate(category = fct_reorder(category, prize_age, median, na.rm = TRUE)) %>%
  ggplot(aes(category, prize_age)) +
  geom_point() +
  geom_boxplot() +
  coord_flip()
nobel_winners %>%
  dplyr::filter(!is.na(prize_age)) %>%
  group_by(decade, category) %>%
  summarize(
    average_age = mean(prize_age),
    median_age = median(prize_age)
  ) %>%
  ggplot(aes(decade, average_age, color = category)) +
  geom_line()
library(ggridges)

nobel_winners %>%
  ggplot(aes(
    x = prize_age,
    y = category,
    fill = category
  )) +
  geom_density_ridges()

他们60多少岁才得诺奖,大家才23或24岁,还年轻,不用焦虑喔。

nobel_winners %>%

  ggplot(aes(x = prize_age, fill = category, color = category)) +
  geom_density() +
  facet_wrap(vars(category)) +
  theme(legend.position = "none")

有同学说要一个个的画,至于group_split()函数,下次课在讲

nobel_winners %>%
  group_split(category) %>%
  map(
    ~ ggplot(data = .x, aes(x = prize_age)) +
      geom_density() +
      ggtitle(.x$category)
  )
## [[1]]
## 
## [[2]]
## 
## [[3]]
## 
## [[4]]
## 
## [[5]]
## 
## [[6]]

也可以用强大的group_by() + group_map()组合,我们会在第 37 章讲到

nobel_winners %>%
  group_by(category) %>%
  group_map(
    ~ ggplot(data = .x, aes(x = prize_age)) +
      geom_density() +
      ggtitle(.y)
  )

76.10 性别比例

nobel_winners %>%
  dplyr::filter(laureate_type == "individual") %>%
  count(category, gender) %>%
  group_by(category) %>%
  mutate(prop = n / sum(n))
## # A tibble: 12 × 4
## # Groups:   category [6]
##   category   gender     n   prop
##   <chr>      <chr>  <int>  <dbl>
## 1 chemistry  female     4 0.0229
## 2 chemistry  male     171 0.977 
## 3 economics  female     1 0.0128
## 4 economics  male      77 0.987 
## 5 literature female    14 0.124 
## 6 literature male      99 0.876 
## # … with 6 more rows

各年代性别比例

nobel_winners %>%
  dplyr::filter(laureate_type == "individual") %>%
  # mutate(decade = glue::glue("{round(prize_year - 1, -1)}s")) %>%
  count(decade, category, gender) %>%
  group_by(decade, category) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(decade, category, fill = prop)) +
  geom_tile(size = 0.7) +
  # geom_text(aes(label = scales::percent(prop, accuracy = .01))) +
  geom_text(aes(label = scales::number(prop, accuracy = .01))) +
  facet_grid(vars(gender)) +
  scale_fill_gradient(low = "#FDF4E9", high = "#834C0D")
library(ggbeeswarm) # install.packages("ggbeeswarm")

nobel_winners %>%
  ggplot(aes(
    x = category,
    y = prize_age,
    colour = gender,
    alpha = gender
  )) +
  ggbeeswarm::geom_beeswarm() +
  coord_flip() +
  scale_color_manual(values = c("#BB1288", "#5867A6")) +
  scale_alpha_manual(values = c(1, .4)) +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(
    title = "Gender imbalance of Nobel laureates",
    subtitle = "data frome 1901-2016",
    colour = "Gender",
    alpha = "Gender",
    y = "age in prize"
  )
nobel_winners %>%
  count(decade,
    category,
    gender = coalesce(gender, laureate_type)
  ) %>%
  group_by(decade, category) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(decade, n, fill = gender)) +
  geom_col() +
  facet_wrap(~category) +
  labs(
    x = "Decade",
    y = "# of nobel prize winners",
    fill = "Gender",
    title = "Nobel Prize gender distribution over time"
  )

76.11 这些大神都是哪个年代出生的人?

nobel_winners %>%
  select(category, birth_date) %>%
  mutate(year = floor(year(birth_date) / 10) * 10) %>%
  count(category, year) %>%
  dplyr::filter(!is.na(year)) %>%
  ggplot(aes(x = year, y = n)) +
  geom_col() +
  scale_x_continuous(breaks = seq(1810, 1990, 20)) +
  geom_text(aes(label = n), vjust = -0.25) +
  facet_wrap(vars(category))

课堂练习,哪位同学能把图弄得好看些?

76.12 最年轻的诺奖获得者?

nobel_winners %>%
  dplyr::filter(prize_age == min(prize_age, na.rm = T))
## # A tibble: 1 × 20
##   prize_year category prize      motivation prize_share
##        <dbl> <chr>    <chr>      <chr>      <chr>      
## 1       2014 peace    the nobel… "\"for th… 1/2        
## # … with 15 more variables: laureate_id <dbl>,
## #   laureate_type <chr>, full_name <chr>,
## #   birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>,
## #   organization_name <chr>, organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>, …
nobel_winners %>%
  dplyr::filter(
    rank(prize_year - year(birth_date)) == 1
  )
## # A tibble: 1 × 20
##   prize_year category prize      motivation prize_share
##        <dbl> <chr>    <chr>      <chr>      <chr>      
## 1       2014 peace    the nobel… "\"for th… 1/2        
## # … with 15 more variables: laureate_id <dbl>,
## #   laureate_type <chr>, full_name <chr>,
## #   birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>,
## #   organization_name <chr>, organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>, …
nobel_winners %>%
  arrange(
    prize_year - year(birth_date)
  )
## # A tibble: 911 × 20
##   prize_year category prize      motivation prize_share
##        <dbl> <chr>    <chr>      <chr>      <chr>      
## 1       2014 peace    the nobel… "\"for th… 1/2        
## 2       1915 physics  the nobel… "\"for th… 1/2        
## 3       1932 physics  the nobel… "\"for th… 1/1        
## 4       1933 physics  the nobel… "\"for th… 1/2        
## 5       1936 physics  the nobel… "\"for hi… 1/2        
## 6       1957 physics  the nobel… "\"for th… 1/2        
## # … with 905 more rows, and 15 more variables:
## #   laureate_id <dbl>, laureate_type <chr>,
## #   full_name <chr>, birth_date <date>,
## #   birth_city <chr>, birth_country <chr>,
## #   gender <chr>, organization_name <chr>,
## #   organization_city <chr>,
## #   organization_country <chr>, death_date <date>, …
nobel_winners %>%
  top_n(1, year(birth_date) - prize_year)
## # A tibble: 1 × 20
##   prize_year category prize      motivation prize_share
##        <dbl> <chr>    <chr>      <chr>      <chr>      
## 1       2014 peace    the nobel… "\"for th… 1/2        
## # … with 15 more variables: laureate_id <dbl>,
## #   laureate_type <chr>, full_name <chr>,
## #   birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>,
## #   organization_name <chr>, organization_city <chr>,
## #   organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>, …

76.13 平均年龄和获奖数量

df1 <- nobel_winners %>%
  group_by(category) %>%
  summarise(
    mean_prise_age = mean(prize_age, na.rm = T),
    total_num = n()
  )
df1
## # A tibble: 6 × 3
##   category   mean_prise_age total_num
##   <chr>               <dbl>     <int>
## 1 chemistry            58.0       175
## 2 economics            67.2        78
## 3 literature           64.7       113
## 4 medicine             58.0       211
## 5 peace                61.4       130
## 6 physics              55.4       204
df1 %>%
  ggplot(aes(mean_prise_age, total_num)) +
  geom_point(aes(color = category)) +
  geom_smooth(method = lm, se = FALSE)

76.14 出生地与工作地分布

nobel_winners_clean <- nobel_winners %>%
  mutate_at(
    vars(birth_country, death_country),
    ~ ifelse(str_detect(., "\\("), str_extract(., "(?<=\\().*?(?=\\))"), .)
  ) %>%
  mutate_at(
    vars(birth_country, death_country),
    ~ case_when(
      . == "scotland" ~ "united kingdom",
      . == "northern ireland" ~ "united kingdom",
      str_detect(., "czech") ~ "czechia",
      str_detect(., "germany") ~ "germany",
      TRUE ~ .
    )
  ) %>%
  select(full_name, prize_year, category, birth_date, birth_country, gender, organization_name, organization_country, death_country)
nobel_winners_clean %>% count(death_country, sort = TRUE)
## # A tibble: 45 × 2
##   death_country                n
##   <chr>                    <int>
## 1 <NA>                       329
## 2 united states of america   203
## 3 united kingdom              79
## 4 germany                     56
## 5 france                      51
## 6 sweden                      28
## # … with 39 more rows

76.15 迁移模式

nobel_winners_clean %>%
  mutate(
    colour = case_when(
      death_country == "united states of america" ~ "#FF2B4F",
      death_country == "germany" ~ "#fcab27",
      death_country == "united kingdom" ~ "#3686d3",
      death_country == "france" ~ "#88398a",
      death_country == "switzerland" ~ "#20d4bc",
      TRUE ~ "gray60"
    )
  ) %>%
  ggplot(aes(
    x = 0,
    y = fct_rev(factor(birth_country)),
    xend = death_country,
    yend = 1,
    colour = colour,
    alpha = (colour != "gray60")
  )) +
  geom_curve(
    curvature = -0.5,
    arrow = arrow(length = unit(0.01, "npc"))
  ) +
  scale_x_discrete() +
  scale_y_discrete() +
  scale_color_identity() +
  scale_alpha_manual(values = c(0.1, 0.2), guide = F) +
  scale_size_manual(values = c(0.1, 0.4), guide = F) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    plot.background = element_rect(fill = "#F0EFF1", colour = "#F0EFF1"),
    legend.position = "none",
    axis.text.x = element_text(angle = 40, hjust = 1)
  )

76.16 地图

library(here)
library(sf)
library(countrycode)

# countrycode('Albania', 'country.name', 'iso3c')

nobel_winners_birth_country <- nobel_winners_clean %>%
  count(birth_country) %>%
  filter(!is.na(birth_country)) %>%
  mutate(ISO3 = countrycode(birth_country,
    origin = "country.name", destination = "iso3c"
  ))


global <-
  sf::st_read("./demo_data/worldmap/TM_WORLD_BORDERS_SIMPL-0.3.shp") %>%
  st_transform(4326)
## Reading layer `TM_WORLD_BORDERS_SIMPL-0.3' from data source `E:\R_for_Data_Science\demo_data\worldmap\TM_WORLD_BORDERS_SIMPL-0.3.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 246 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.57
## Geodetic CRS:  WGS 84
global %>%
  full_join(nobel_winners_birth_country, by = "ISO3") %>%
  ggplot() +
  geom_sf(aes(fill = n),
    color = "white",
    size = 0.1
  ) +
  labs(
    x = NULL, y = NULL,
    title = "Nobel Winners by country",
    subtitle = "color of map indicates number of Nobel lauretes",
    fill = "num of Nobel lauretes",
    caption = "Made: wang_minjie"
  ) +
  scale_fill_gradientn(colors = c("royalblue1", "magenta", "orange", "gold"), na.value = "white") +
  # scale_fill_gradient(low = "wheat1", high = "red") +
  theme_void() +
  theme(
    legend.position = c(0.1, 0.3),
    plot.background = element_rect(fill = "gray")
  )
# Determine to 10 Countries
topCountries <- nobel_winners_clean %>%
  count(birth_country, sort = TRUE) %>%
  na.omit() %>%
  top_n(8)

topCountries
## # A tibble: 8 × 2
##   birth_country                n
##   <chr>                    <int>
## 1 united states of america   259
## 2 united kingdom              99
## 3 germany                     80
## 4 france                      54
## 5 sweden                      29
## 6 poland                      26
## # … with 2 more rows
df4 <- nobel_winners_clean %>%
  filter(birth_country %in% topCountries$birth_country) %>%
  group_by(birth_country, category, prize_year) %>%
  summarise(prizes = n()) %>%
  mutate(cumPrizes = cumsum(prizes))

df4
## # A tibble: 489 × 5
## # Groups:   birth_country, category [47]
##   birth_country category  prize_year prizes cumPrizes
##   <chr>         <chr>          <dbl>  <int>     <int>
## 1 france        chemistry       1906      1         1
## 2 france        chemistry       1912      2         3
## 3 france        chemistry       1913      1         4
## 4 france        chemistry       1935      2         6
## 5 france        chemistry       1970      1         7
## 6 france        chemistry       1987      1         8
## # … with 483 more rows
library(gganimate)
df4 %>%
  mutate(prize_year = as.integer(prize_year)) %>%
  ggplot(aes(x = birth_country, y = category, color = birth_country)) +
  geom_point(aes(size = cumPrizes), alpha = 0.6) +
  # geom_text(aes(label = cumPrizes)) +
  scale_size_continuous(range = c(2, 30)) +
  transition_reveal(prize_year) +
  labs(
    title = "Top 10 countries with Nobel Prize winners",
    subtitle = "Year: {frame_along}",
    y = "Category"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 22),
    axis.title = element_blank()
  ) +
  scale_color_brewer(palette = "RdYlBu") +
  theme(legend.position = "none") +
  theme(plot.margin = margin(5.5, 5.5, 5.5, 5.5))

76.17 出生地和工作地不一样的占比

nobel_winners_clean %>%
  select(category, birth_country, death_country) %>%
  mutate(immigration = if_else(birth_country == death_country, 0, 1))
## # A tibble: 911 × 4
##   category   birth_country death_country immigration
##   <chr>      <chr>         <chr>               <dbl>
## 1 chemistry  netherlands   germany                 1
## 2 literature france        france                  0
## 3 medicine   poland        germany                 1
## 4 peace      switzerland   switzerland             0
## 5 peace      france        france                  0
## 6 physics    germany       germany                 0
## # … with 905 more rows

76.18 诺奖分享者

nobel_winners %>%
  separate(prize_share, into = c("num", "deno"), sep = "/", remove = FALSE)
nobel_winners %>%
  filter(category == "medicine") %>%
  mutate(
    num_a = as.numeric(str_sub(prize_share, 1, 1)),
    num_b = as.numeric(str_sub(prize_share, -1)),
    share = num_a / num_b,
    year = prize_year %% 10,
    decade = 10 * (prize_year %/% 10)
  ) %>%
  group_by(prize_year) %>%
  mutate(n = row_number()) %>%
  ggplot() +
  geom_col(aes(x = "", y = share, fill = as.factor(n)),
    show.legend = FALSE
  ) +
  coord_polar("y") +
  facet_grid(decade ~ year, switch = "both") +
  labs(title = "Annual Nobel Prize sharing") +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", vjust = 8),
    strip.text.x = element_text(
      size = 7,
      margin = margin(t = 5)
    ),
    strip.text.y = element_text(
      size = 7,
      angle = 180, hjust = 1, margin = margin(r = 10)
    )
  )

76.19 其它

没有回答的问题,大家自己花时间探索下。

76.20 延伸阅读

  • 有些图可以再美化下