第 83 章 探索性数据分析-大学生职业决策

83.1 预备知识

library(tidyverse)

example <- 
 tibble::tribble(
   ~name, ~english, ~chinese, ~math, ~sport, ~psy, ~edu,
     "A",     133,    100,    102,     56,    89,   89,
     "B",     120,    120,     86,     88,    45,   75,
     "C",      98,    109,    114,     87,    NA,   84,
     "D",     120,     78,    106,     68,    86,   69,
     "E",     110,     99,    134,     98,    75,   70,
     "F",      NA,    132,    130,     NA,    68,   88
   )

example
## # A tibble: 6 × 7
##   name  english chinese  math sport   psy   edu
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A         133     100   102    56    89    89
## 2 B         120     120    86    88    45    75
## 3 C          98     109   114    87    NA    84
## 4 D         120      78   106    68    86    69
## 5 E         110      99   134    98    75    70
## 6 F          NA     132   130    NA    68    88

83.1.1 缺失值检查

我们需要判断每一列的缺失值

example %>% 
  summarise(
    na_in_english = sum(is.na(english)),
    na_in_chinese = sum(is.na(chinese)),
    na_in_math    = sum(is.na(math)),
    na_in_sport   = sum(is.na(sport)),
    na_in_psy     = sum(is.na(math)),   # tpyo here
    na_in_edu     = sum(is.na(edu))
  )
## # A tibble: 1 × 6
##   na_in_english na_in_chinese na_in_math na_in_sport
##           <int>         <int>      <int>       <int>
## 1             1             0          0           1
## # … with 2 more variables: na_in_psy <int>,
## #   na_in_edu <int>

我们发现,这种写法比较笨,而且容易出错,比如na_in_psy = sum(is.na(math)) 就写错了。那么有没有既偷懒又安全的方法呢?有的。但代价是需要学会across()函数,大家可以在Console中输入?dplyr::across查看帮助文档,或者看第 39 章。

example %>% 
  summarise(
    across(everything(), mean)
  )
## # A tibble: 1 × 7
##    name english chinese  math sport   psy   edu
##   <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    NA      NA    106.   112    NA    NA  79.2
example %>% 
  summarise(
    across(everything(), function(x) sum(is.na(x)) )
  )
## # A tibble: 1 × 7
##    name english chinese  math sport   psy   edu
##   <int>   <int>   <int> <int> <int> <int> <int>
## 1     0       1       0     0     1     1     0

83.1.2 数据预处理

  • 直接丢弃缺失值所在的行
example %>% drop_na()
## # A tibble: 4 × 7
##   name  english chinese  math sport   psy   edu
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A         133     100   102    56    89    89
## 2 B         120     120    86    88    45    75
## 3 D         120      78   106    68    86    69
## 4 E         110      99   134    98    75    70
  • 均值代替缺失值
d <- example %>% 
  mutate(
    across(where(is.numeric), ~ if_else(is.na(.), mean(., na.rm = T), .))
  )
d
## # A tibble: 6 × 7
##   name  english chinese  math sport   psy   edu
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A        133      100   102  56    89      89
## 2 B        120      120    86  88    45      75
## 3 C         98      109   114  87    72.6    84
## 4 D        120       78   106  68    86      69
## 5 E        110       99   134  98    75      70
## 6 F        116.     132   130  79.4  68      88
  • 计算总分/均值
d %>% 
  rowwise() %>% 
  mutate(
    total = sum(c_across(-name))
  )
## # A tibble: 6 × 8
## # Rowwise: 
##   name  english chinese  math sport   psy   edu total
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A        133      100   102  56    89      89  569 
## 2 B        120      120    86  88    45      75  534 
## 3 C         98      109   114  87    72.6    84  565.
## 4 D        120       78   106  68    86      69  527 
## 5 E        110       99   134  98    75      70  586 
## 6 F        116.     132   130  79.4  68      88  614.
d %>% 
  rowwise() %>% 
  mutate(
    mean = mean(c_across(-name))
  )
## # A tibble: 6 × 8
## # Rowwise: 
##   name  english chinese  math sport   psy   edu  mean
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A        133      100   102  56    89      89  94.8
## 2 B        120      120    86  88    45      75  89  
## 3 C         98      109   114  87    72.6    84  94.1
## 4 D        120       78   106  68    86      69  87.8
## 5 E        110       99   134  98    75      70  97.7
## 6 F        116.     132   130  79.4  68      88 102.
  • 数据标准化处理
standard <- function(x) {
  (x - mean(x)) / sd(x)
}
d %>% 
  mutate(
    across(where(is.numeric), standard)
  )
## # A tibble: 6 × 7
##   name  english chinese   math  sport    psy    edu
##   <chr>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 A       1.44   -0.339 -0.555 -1.54   1.04   1.10 
## 2 B       0.326   0.731 -1.44   0.566 -1.75  -0.464
## 3 C      -1.56    0.143  0.111  0.500  0      0.538
## 4 D       0.326  -1.51  -0.333 -0.750  0.852 -1.13 
## 5 E      -0.531  -0.392  1.22   1.22   0.153 -1.02 
## 6 F       0       1.37   0.999  0     -0.292  0.984

83.2 开始

83.2.1 文件管理中需要注意的地方

感谢康钦虹同学提供的数据,但这里有几点需要注意的地方:

事项 问题 解决办法
文件名 excel的文件名是中文 用英文,比如 data.xlsx
列名 列名中有-号,大小写不统一 规范列名,或用janitor::clean_names()偷懒
预处理 直接在原始数据中新增 不要在原始数据上改动,统计工作可以在R里实现
文件管理 没有层级 新建data文件夹装数据,与code.Rmd并列
data <- readxl::read_excel("demo_data/career-decision.xlsx", skip = 1) %>% 
        janitor::clean_names()

#glimpse(data)
d <- data %>% select(1:61)
#glimpse(d)

83.2.2 缺失值检查

## # A tibble: 1 × 61
##     sex majoy grade  from    z1    z2    z3    z4    z5
##   <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1     0     0     0     0     0     0     0     0     0
## # … with 52 more variables: z6 <int>, z7 <int>,
## #   z8 <int>, z9 <int>, z10 <int>, z11 <int>,
## #   z12 <int>, z13 <int>, z14 <int>, z15 <int>,
## #   z16 <int>, z17 <int>, z18 <int>, j1 <int>,
## #   j2 <int>, j3 <int>, j4 <int>, j5 <int>, j6 <int>,
## #   j7 <int>, j8 <int>, j9 <int>, j10 <int>,
## #   j11 <int>, j12 <int>, j13 <int>, j14 <int>, …

没有缺失值,挺好

83.2.3 数据预处理

采用利克特式 5 点计分… (这方面你们懂得比我多)

d <- d %>%
  rowwise() %>%
  mutate(
    environment_exploration          = sum(c_across(z1:z5)),
    self_exploration                 = sum(c_across(z6:z9)),
    objective_system_exploration     = sum(c_across(z10:z15)),
    info_quantity_exploration        = sum(c_across(z16:z18)),
    
    self_evaluation                  = sum(c_across(j1:j6)),
    information_collection           = sum(c_across(j7:j15)),
    target_select                    = sum(c_across(j16:j24)),
    formulate                        = sum(c_across(j25:j32)),
    problem_solving                  = sum(c_across(j33:j39)),

    career_exploration               = sum(c_across(z1:z18)),
    career_decision_making           = sum(c_across(j1:j39))
  ) %>% 
  select(-starts_with("z"), -starts_with("j")) %>% 
  ungroup() %>% 
  mutate(pid = 1:n(), .before = sex) %>%
  mutate(
    across(c(pid, sex, majoy, grade, from), as_factor)
  )

#glimpse(d)

83.2.4 标准化

standard <- function(x) {
  (x - mean(x)) / sd(x)
}

d <- d %>% 
  mutate(
    across(where(is.numeric), standard)
  )
d
## # A tibble: 304 × 16
##   pid   sex   majoy grade from  environment_exploration
##   <fct> <fct> <fct> <fct> <fct>                   <dbl>
## 1 1     1     4     4     2                     -1.63  
## 2 2     1     4     4     1                     -1.87  
## 3 3     2     4     4     2                      0.0802
## 4 4     2     4     4     1                     -1.87  
## 5 5     2     4     4     1                     -0.895 
## 6 6     1     1     4     3                     -0.651 
## # … with 298 more rows, and 10 more variables:
## #   self_exploration <dbl>,
## #   objective_system_exploration <dbl>,
## #   info_quantity_exploration <dbl>,
## #   self_evaluation <dbl>,
## #   information_collection <dbl>, target_select <dbl>,
## #   formulate <dbl>, problem_solving <dbl>, …

83.3 探索

83.3.1 想探索的问题

  • 不同性别(或者年级,生源地,专业)下,各指标分值的差异性
  • 两个变量的相关分析和回归分析
  • 更多(欢迎大家提出了喔)

83.3.2 男生女生在职业探索上有所不同?

以性别为例。因为性别变量是男女,仅仅2组,所以检查男女在各自指标上的均值差异,可以用T检验。

d %>% 
  group_by(sex) %>% 
  summarise(
   across(where(is.numeric), mean)
)
## # A tibble: 2 × 12
##   sex   environment_exploration self_exploration
##   <fct>                   <dbl>            <dbl>
## 1 1                      -0.147          -0.0829
## 2 2                       0.165           0.0933
## # … with 9 more variables:
## #   objective_system_exploration <dbl>,
## #   info_quantity_exploration <dbl>,
## #   self_evaluation <dbl>,
## #   information_collection <dbl>, target_select <dbl>,
## #   formulate <dbl>, problem_solving <dbl>,
## #   career_exploration <dbl>, …

你可以给这个图颜色弄得更好看点?

library(ggridges)
d %>% 
  ggplot(aes(x = career_exploration, y = sex, fill = sex)) +
  geom_density_ridges()
t_test_eq <- t.test(career_exploration ~ sex, data = d, var.equal = TRUE) %>% 
  broom::tidy()
t_test_eq
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>
## 1   -0.367    -0.173     0.194     -3.24 0.00132
## # … with 5 more variables: parameter <dbl>,
## #   conf.low <dbl>, conf.high <dbl>, method <chr>,
## #   alternative <chr>
t_test_uneq <- t.test(career_exploration ~ sex, data = d, var.equal = FALSE) %>% 
  broom::tidy()
t_test_uneq 
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>
## 1   -0.367    -0.173     0.194     -3.27 0.00121
## # … with 5 more variables: parameter <dbl>,
## #   conf.low <dbl>, conf.high <dbl>, method <chr>,
## #   alternative <chr>

当然,也可以用第 55 章介绍的统计推断的方法

library(infer)

obs_diff <- d %>% 
  specify(formula = career_exploration ~ sex) %>% 
  calculate("diff in means", order = c("1", "2"))
obs_diff
## Response: career_exploration (numeric)
## Explanatory: sex (factor)
## # A tibble: 1 × 1
##     stat
##    <dbl>
## 1 -0.367
null_dist <- d %>% 
  specify(formula = career_exploration ~ sex) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 5000, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("1", "2"))
null_dist
## Response: career_exploration (numeric)
## Explanatory: sex (factor)
## Null Hypothesis: independence
## # A tibble: 5,000 × 2
##   replicate    stat
##       <int>   <dbl>
## 1         1 -0.0321
## 2         2  0.0912
## 3         3 -0.0333
## 4         4  0.0363
## 5         5 -0.135 
## 6         6 -0.0321
## # … with 4,994 more rows
null_dist %>%  
  visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two_sided")
null_dist %>%  
  get_p_value(obs_stat = obs_diff, direction = "two_sided") %>% 
  #get_p_value(obs_stat = obs_diff, direction = "less") %>% 
  mutate(p_value_clean = scales::pvalue(p_value))
## # A tibble: 1 × 2
##   p_value p_value_clean
##     <dbl> <chr>        
## 1  0.0012 0.001

也可以用tidyverse的方法一次性的搞定所有指标

d %>%
  pivot_longer(
    cols = -c(pid, sex, majoy, grade, from),
    names_to = "index",
    values_to = "value"
  ) %>% 
  group_by(index) %>% 
  summarise(
    broom::tidy( t.test(value ~ sex, data = cur_data()))
  ) %>% 
  select(index, estimate, statistic, p.value) %>% 
  arrange(p.value)
## # A tibble: 11 × 4
##   index                  estimate statistic    p.value
##   <chr>                     <dbl>     <dbl>      <dbl>
## 1 career_decision_making   -0.494     -4.53 0.00000862
## 2 problem_solving          -0.470     -4.26 0.0000270 
## 3 target_select            -0.449     -4.07 0.0000609 
## 4 formulate                -0.411     -3.72 0.000235  
## 5 information_collection   -0.411     -3.70 0.000253  
## 6 self_evaluation          -0.404     -3.65 0.000315  
## # … with 5 more rows

83.3.3 来自不同地方的学生在职业探索上有所不同?

以生源地为例。因为生源地有3类,所以可以使用方差分析。

aov(career_exploration ~ from, data = d) %>%
  TukeyHSD(which = "from") %>%
  broom::tidy()
## # A tibble: 3 × 7
##   term  contrast null.value estimate conf.low conf.high
##   <chr> <chr>         <dbl>    <dbl>    <dbl>     <dbl>
## 1 from  2-1               0   0.382    0.0623     0.701
## 2 from  3-1               0   0.287   -0.0386     0.613
## 3 from  3-2               0  -0.0943  -0.446      0.257
## # … with 1 more variable: adj.p.value <dbl>
library(ggridges)
d %>% 
  ggplot(aes(x = career_exploration, y = from, fill = from)) +
  geom_density_ridges()

也可以一次性的搞定所有指标

d %>%
  pivot_longer(
    cols = -c(pid, sex, majoy, grade, from),
    names_to = "index",
    values_to = "value"
  ) %>% 
  group_by(index) %>% 
  summarise(
    broom::tidy( aov(value ~ from, data = cur_data()))
  ) %>% 
  select(index, term, statistic, p.value) %>% 
  filter(term != "Residuals") %>% 
  arrange(p.value)
## # A tibble: 11 × 4
## # Groups:   index [11]
##   index                  term  statistic     p.value
##   <chr>                  <chr>     <dbl>       <dbl>
## 1 problem_solving        from      14.6  0.000000918
## 2 career_decision_making from      14.2  0.00000126 
## 3 formulate              from      12.2  0.00000781 
## 4 information_collection from      10.2  0.0000527  
## 5 self_evaluation        from       8.91 0.000174   
## 6 target_select          from       8.45 0.000270   
## # … with 5 more rows

83.3.4 职业探索和决策之间有关联?

可以用第 50 章线性模型来探索

lm(career_decision_making  ~ career_exploration, data = d)
## 
## Call:
## lm(formula = career_decision_making ~ career_exploration, data = d)
## 
## Coefficients:
##        (Intercept)  career_exploration  
##           2.15e-15            7.83e-01

不要因为我讲课讲的很垃圾,就错过了R的美,瑕不掩瑜啦。要相信自己,你们是川师研究生中最聪明的。