Chapter 40 Exploratory Data Analysis: Height and Weight

library(tidyverse)

40.1 Case study

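This is a data set of heights and weights. After reading it in, we count the missing values in each column; there are none.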

d <- read_csv("./demo_data/weight-height.csv")
d
## # A tibble: 10,000 x 3
##    Gender Height Weight
##    <chr>   <dbl>  <dbl>
##  1 Male     73.8   242.
##  2 Male     68.8   162.
##  3 Male     74.1   213.
##  4 Male     71.7   220.
##  5 Male     69.9   206.
##  6 Male     67.3   152.
##  7 Male     68.8   184.
##  8 Male     68.3   168.
##  9 Male     67.0   176.
## 10 Male     63.5   156.
## # ... with 9,990 more rows
d %>% summarise(
  across(everything(), ~ sum(is.na(.)))
)
## # A tibble: 1 x 3
##   Gender Height Weight
##    <int>  <int>  <int>
## 1      0      0      0

40.2 Visualization

40.2.1 Plot the height distribution by gender

The conventional answer

d %>%
  ggplot(aes(x = Height, fill = Gender)) +
  geom_density(alpha = 0.5)

d %>%
  ggplot(aes(x = Height, fill = Gender)) +
  geom_density(alpha = 0.5) +
  facet_wrap(vars(Gender))

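40.3 Something more advanced

We have just seen faceting: the full data set is split by a variable, and each resulting subset is shown in its own panel.

This works well for comparing subsets with one another. Having seen each subset, though, we usually also want to see the whole, and where each subset sits within it: its distribution, state or position. In other words, we want to compare each subset with the whole.

So we want both comparisons: subset against subset, and subset against the whole.

The approach: highlight each subset in its own facet, and draw the whole data set in every facet as a gray background.

  • Step one: draw each subset in its own facet: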
d %>%
  ggplot(aes(x = Height)) +
  geom_density() +
  facet_wrap(vars(Gender))
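  • Step two: add the whole data set as a background layer. Step one facets the plot, which splits the data by Gender, but the background layer should ignore the faceting and show everyone, so that it is identical in every panel. To get this, give this geom_density() layer data that facet_wrap() cannot split, that is, the data with the faceting column removed: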
d %>%
  ggplot(aes(x = Height)) +
  geom_density(
    data = d %>% select(-Gender)
  ) +
  geom_density() +
  facet_wrap(vars(Gender))
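  • Step three: adjust the y axis. We want to keep the shape of the density curves but show counts on the y axis rather than density values, so that the whole and the parts share one scale. Mapping y = after_stat(count) does this: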
d %>%
  ggplot(aes(x = Height, y = after_stat(count))) +
  geom_density(
    data = d %>% select(-Gender)
  ) +
  geom_density() +
  facet_wrap(vars(Gender))
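
Steps two and three can be checked on a small simulated data set. The sketch below is only an illustration: the tibble `toy` and its columns `g` and `x` are made-up stand-ins for `d`, `Gender` and `Height`. It inspects the computed layer data to confirm that a layer without the faceting column is repeated in every panel, and that `after_stat(count)` is the density multiplied by the number of observations.

```r
library(ggplot2)
library(dplyr)

set.seed(1)
# Hypothetical stand-in for d: two groups of unequal size
toy <- tibble(
  g = rep(c("a", "b"), times = c(300, 100)),
  x = c(rnorm(300, mean = 0), rnorm(100, mean = 2))
)

p <- ggplot(toy, aes(x = x, y = after_stat(count))) +
  geom_density(data = select(toy, -g)) + # background: no faceting column
  geom_density() +                       # foreground: split by the facet
  facet_wrap(vars(g))

bg <- layer_data(p, 1) # computed data of the background layer
fg <- layer_data(p, 2) # computed data of the per-group layer

# The background layer appears in both panels, each time built from all rows
distinct(bg, PANEL, n)
# The foreground layer uses only the rows that belong to its panel
distinct(fg, PANEL, n)
# count is the density rescaled by the number of observations
all.equal(bg$count, bg$density * bg$n)
```

Because the two groups differ in size here, the count scale also makes the smaller group's curve visibly lower, which a plain density scale would hide.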

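“Male” and “Female” are the groups that already exist in Gender. In the next plot, the background layer adds one more group, “all people”, so the plot has three groups in total, each with its own fill and color. We can therefore define all three in one place and pass them to scale_fill_manual() and scale_color_manual().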

density_colors <- c(
  "Male" = "#247BA0",
  "Female" = "#F25F5C",
  "all people" = "grey85"
)
d %>%
  ggplot(aes(x = Height, y = after_stat(count))) +
  geom_density(
    data = d %>% select(-Gender),
    aes(fill = "all people", color = "all people")
  ) +
  geom_density(aes(color = Gender, fill = Gender)) +
  facet_wrap(vars(Gender)) +
  scale_fill_manual(name = NULL, values = density_colors) +
  scale_color_manual(name = NULL, values = density_colors) +
  theme_minimal() +
  theme(legend.position = "bottom")
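
The role of the constant string inside aes() can be seen by looking at the colors each layer ends up with. This is a minimal sketch on simulated data; `toy`, `g`, `x`, `toy_colors` and the label "everyone" are hypothetical names chosen for the illustration.

```r
library(ggplot2)
library(dplyr)

set.seed(2)
# Hypothetical stand-in for d
toy <- tibble(
  g = rep(c("a", "b"), each = 200),
  x = c(rnorm(200, mean = 0), rnorm(200, mean = 2))
)

# One named vector covers the real groups and the extra background group
toy_colors <- c("a" = "#247BA0", "b" = "#F25F5C", "everyone" = "grey85")

p <- ggplot(toy, aes(x = x, y = after_stat(count))) +
  geom_density(
    data = select(toy, -g),
    aes(fill = "everyone", color = "everyone") # constant acts as a third level
  ) +
  geom_density(aes(fill = g, color = g)) +
  facet_wrap(vars(g)) +
  scale_fill_manual(name = NULL, values = toy_colors) +
  scale_color_manual(name = NULL, values = toy_colors)

unique(layer_data(p, 1)$fill) # fill used by the background layer
unique(layer_data(p, 2)$fill) # fills used by the per-group layer
```

Writing the string inside aes() rather than as a fixed `fill = "grey85"` argument is what puts the background group into the legend alongside the two real groups.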

40.3.1 Complete code

density_colors <- c(
  "Male" = "#247BA0",
  "Female" = "#F25F5C",
  "all people" = "grey80"
)

scales::show_col(density_colors)

d %>%
  ggplot(aes(x = Height, y = after_stat(count))) +
  geom_density(
    data = d %>% dplyr::select(-Gender),
    aes(fill = "all people", color = "all people")
  ) +
  geom_density(aes(color = Gender, fill = Gender)) +
  facet_wrap(vars(Gender)) +
  scale_fill_manual(name = NULL, values = density_colors) +
  scale_color_manual(name = NULL, values = density_colors) +
  theme_minimal() +
  theme(legend.position = "bottom")

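Alternatively, with a different theme. This version also passes a function as the layer's data argument, so the plot data does not have to be named again: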

density_colors <- c(
  "Male" = "#56B4E9",
  "Female" = "#EF8A17",
  "all participants" = "grey85"
)

d %>%
  ggplot(aes(x = Height, y = after_stat(count))) +
  geom_density(
    data = function(x) dplyr::select(x, -Gender),
    aes(fill = "all participants", color = "all participants")
  ) +
  geom_density(aes(fill = Gender, color = Gender)) +
  facet_wrap(vars(Gender)) +
  scale_color_manual(name = NULL, values = density_colors) +
  scale_fill_manual(name = NULL, values = density_colors) +
  cowplot::theme_minimal_hgrid(16) +
  theme(legend.position = "bottom", legend.justification = "center")
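
In the version above, the layer's `data` argument is a function rather than a data frame; ggplot2 calls such a function with the plot's data and uses the result for that layer. The sketch below, on simulated data with hypothetical names (`toy`, `g`, `x`, `drop_group`), compares the function form with the explicit data-frame form.

```r
library(ggplot2)
library(dplyr)

set.seed(3)
# Hypothetical stand-in for d
toy <- tibble(
  g = rep(c("a", "b"), each = 150),
  x = c(rnorm(150, mean = 0), rnorm(150, mean = 2))
)

# Hypothetical helper: remove the faceting column
drop_group <- function(df) select(df, -g)

# Function form: ggplot2 applies drop_group() to the plot data
p_fun <- ggplot(toy, aes(x = x)) +
  geom_density(data = drop_group) +
  geom_density() +
  facet_wrap(vars(g))

# Data-frame form: the modified data is passed explicitly
p_df <- ggplot(toy, aes(x = x)) +
  geom_density(data = drop_group(toy)) +
  geom_density() +
  facet_wrap(vars(g))

# Both forms give the same computed background layer
all.equal(layer_data(p_fun, 1), layer_data(p_df, 1))
```

The function form is convenient at the end of a pipeline, where the data reaching ggplot() may not have a name of its own.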

40.3.2 Plot the weight distribution by gender

d %>%
  ggplot(aes(x = Weight, fill = Gender)) +
  geom_density(alpha = 0.5)

40.4 Modeling

40.4.1 Scatter plot of height and weight

d %>%
  ggplot(aes(x = Height, y = Weight, color = Gender)) +
  geom_point()

40.4.2 A linear model of weight on height

fit <- lm(Weight ~ 1 + Height, data = d)
summary(fit)
## 
## Call:
## lm(formula = Weight ~ 1 + Height, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51.93  -8.24  -0.12   8.26  46.84 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -350.7372     2.1115    -166   <2e-16 ***
## Height         7.7173     0.0318     243   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.2 on 9998 degrees of freedom
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.855 
## F-statistic: 5.9e+04 on 1 and 9998 DF,  p-value: <2e-16
broom::tidy(fit)
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  -351.      2.11       -166.       0
## 2 Height          7.72    0.0318      243.       0
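
To read the fitted line, plug a height into the estimated equation. The sketch below copies the rounded coefficients from the output above; `predict_weight` is a hypothetical helper written for this illustration, and the result is in whatever units the data set uses.

```r
# Coefficients copied from the summary(fit) output above
intercept <- -350.7372
slope <- 7.7173

# Hypothetical helper: fitted weight for a given height
predict_weight <- function(height) intercept + slope * height

# Fitted weight at a height of 70
predict_weight(70)

# Two heights one unit apart differ in fitted weight by exactly the slope
predict_weight(71) - predict_weight(70)
```

With the model object at hand, `predict(fit, newdata = tibble(Height = 70))` should give the same value up to the rounding of the copied coefficients.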

40.4.3 Linear models of weight on height, by gender

d %>%
  group_by(Gender) %>%
  group_modify(
    ~ broom::tidy(lm(Weight ~ 1 + Height, data = .))
  )
## # A tibble: 4 x 6
## # Groups:   Gender [2]
##   Gender term      estimate std.error statistic p.value
##   <chr>  <chr>        <dbl>     <dbl>     <dbl>   <dbl>
## 1 Female (Interce~  -246.      3.36       -73.3       0
## 2 Female Height        5.99    0.0526     114.        0
## 3 Male   (Interce~  -224.      3.41       -65.8       0
## 4 Male   Height        5.96    0.0494     121.        0
d %>%
  ggplot(aes(x = Height, y = Weight, group = Gender)) +
  geom_point(aes(color = Gender)) +
  geom_smooth(method = lm)
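
The grouped fit in this section is equivalent to filtering the data to one group and fitting that subset on its own. The sketch below checks this on simulated data; `toy`, `g`, `x`, `y` and the coefficients used to simulate them are assumptions made for the illustration, not values from the height and weight data.

```r
library(dplyr)

set.seed(4)
# Hypothetical data: two groups with slightly different linear relationships
toy <- tibble(
  g = rep(c("a", "b"), each = 100),
  x = rnorm(200, mean = 65, sd = 3),
  y = ifelse(g == "a", -240 + 6.0 * x, -220 + 5.9 * x) + rnorm(200, sd = 10)
)

# One regression per group, returned as a single tidy table
by_group <- toy %>%
  group_by(g) %>%
  group_modify(~ broom::tidy(lm(y ~ 1 + x, data = .x))) %>%
  ungroup()
by_group

# The same model fitted by hand on the rows of group "a" only
fit_a <- lm(y ~ 1 + x, data = filter(toy, g == "a"))

# The estimates for group "a" agree
all.equal(
  by_group %>% filter(g == "a") %>% pull(estimate),
  unname(coef(fit_a))
)
```

group_modify() applies the function to each group's rows and binds the results, adding the grouping column back, which is why each group contributes one row per model term.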