Chapter 48 Simulation and Sampling 2


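library(tidyverse)
library(infer)

# Drop rows with missing values so every penguin has a sex and a bill length
penguins <- palmerpenguins::penguins %>% drop_na()

# Permutation-based null distribution of the difference in means, using infer
penguins %>%
  specify(formula = bill_length_mm ~ sex) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(
    stat = "diff in means",
    order = c("male", "female")
  )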

48.1 Reproducing the sampling procedure behind infer's diff in means


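  • Independence assumption. Assume that y and x are independent, i.e. y is unrelated to x: however one of them changes, the other is not affected.
  • Null hypothesis. x has two groups (male, female), and the mean of y is the same in both groups, so the difference in means is 0.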
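
The null hypothesis above can be written compactly. The symbols here are notation introduced for illustration (they do not appear in the original text): \mu denotes a population mean of bill_length_mm, and \bar{y} a sample mean. The test statistic matches the order = c("male", "female") argument used with infer.

```latex
H_0:\ \mu_{\text{male}} - \mu_{\text{female}} = 0,
\qquad
\hat{\delta} = \bar{y}_{\text{male}} - \bar{y}_{\text{female}}
```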


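48.1.1 Sampling

  • The explanatory variable column x is left untouched.
  • The response variable column y is shuffled (a permutation, i.e. sampling without replacement) and written back into the column.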


tbl <- tibble(
  y = 1:4,
  x = c("a", "a", "b", "b")
)

# Shuffle the response column once, by hand
y <- tbl[[1]]
y_prime <- sample(y, size = length(y), replace = FALSE)
tbl[1] <- y_prime

# The same step as a function; the first column of df must be the response
permute_once <- function(df) {
  y <- df[[1]]
  y_prime <- sample(y, size = length(y), replace = FALSE)
  df[1] <- y_prime
  df
}

tbl %>% permute_once()
1:100 %>% 
   purrr::map(~ permute_once(tbl)) 


# Repeat the shuffle `reps` times, stack the results, and label each replicate
permuate_repeat <- function(df, reps = 30){
  df_out <- 
    purrr::map_dfr(.x = 1:reps, .f = ~ permute_once(df)) %>% 
    dplyr::mutate(replicate = rep(1:reps, each = nrow(df)))

  df_out
}

tbl %>% permuate_repeat(reps = 1000)

48.1.2 Computing the null distribution

For each replicate, compute the mean of group a and the mean of group b, then take the difference between the two means.

null_dist <- tbl %>% 
  permuate_repeat(reps = 1000) %>% 
  group_by(replicate, x) %>% 
  summarise(ybar = mean(y)) %>% 
  group_by(replicate) %>% 
  summarise(
    stat = ybar[x == "a"] - ybar[x == "b"]
  )
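
For the toy table with y = 1:4 and two groups of size two, the null distribution is small enough to enumerate exactly, which gives something to compare the simulated stat values against. The sketch below uses base R only; the names exact_stats and exact_dist are introduced here for illustration. It relies on the fact that shuffling y over a fixed x is equivalent to choosing which two observations carry the label "a".

```r
# Toy data from the text: four responses, two groups of size two
y <- 1:4

# Every way of choosing which two observations get the label "a";
# the remaining two observations are labelled "b"
exact_stats <- combn(
  length(y), 2,
  FUN = function(idx) mean(y[idx]) - mean(y[-idx])
)

# Exact null distribution of mean(a) - mean(b)
exact_dist <- table(exact_stats) / length(exact_stats)
exact_dist
```

With enough replicates, the relative frequencies of stat in the simulated null distribution should be close to these exact probabilities.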

48.1.3 Visualization

null_dist %>% 
  ggplot(aes(x = stat)) +
  geom_histogram(bins = 15, color = "white")

48.1.4 Applying it to penguins

samples <- penguins %>% 
  select(bill_length_mm, sex) %>% 
  permuate_repeat(reps = 5000)

null_dist <- samples %>% 
  group_by(replicate, sex) %>% 
  summarise(ybar = mean(bill_length_mm)) %>% 
  group_by(replicate) %>% 
  summarise(
    stat = ybar[sex == "male"] - ybar[sex == "female"]
  )

null_dist %>% 
  ggplot(aes(x = stat)) +
  geom_histogram(bins = 15, color = "white")
penguins %>%
  group_by(sex) %>%
  summarize(avg_rating = mean(bill_length_mm, na.rm = TRUE)) %>%
  mutate(diff_means = avg_rating - lag(avg_rating))
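
# One-sided p-value: share of permuted differences larger than the observed
# male - female difference (3.757792, computed above)
p_value <- sum(null_dist$stat > 3.757792) / length(null_dist$stat)
p_value
## [1] 0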
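
A result of 0 means that none of the permuted differences exceeded the observed one. It does not mean the true p-value is exactly zero. One commonly suggested adjustment adds 1 to both the count and the number of replicates, so the estimate is never smaller than 1 / (reps + 1). The sketch below is a minimal base-R illustration of that idea. perm_p_value is a hypothetical helper, not part of infer, and toy_null is made-up data rather than the penguins results.

```r
# Hypothetical helper (not part of infer): one-sided permutation p-value
# with the "+1" adjustment, so the estimate can never be exactly zero
perm_p_value <- function(null_stats, observed) {
  n_extreme <- sum(null_stats >= observed)
  (n_extreme + 1) / (length(null_stats) + 1)
}

# Illustrative null statistics, made up for this example
toy_null <- c(-2, -1, 0, 0, 1, 2)

p_none <- perm_p_value(toy_null, observed = 3)  # no permuted value reaches 3
p_one  <- perm_p_value(toy_null, observed = 2)  # exactly one value reaches 2
c(p_none, p_one)
```

Applied to the chapter's data, the call would be perm_p_value(null_dist$stat, 3.757792), assuming null_dist holds the 5000 permuted differences computed above.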