第 56 章 多层线性模型

56.1 分组数据

在实验设计和数据分析中,我们可能经常会遇到分组的数据结构。所谓的分组,就是每一次观察,属于某个特定的组,比如考察学生的成绩,这些学生属于某个班级,班级又属于某个学校。有时候发现这种分组的数据,会给数据分析带来很多有意思的内容。

56.2 案例

我们从一个有意思的案例开始。

不同院系教职员工的收入

一般情况下,不同的院系,制定教师收入的依据和标准可能是不同的。我们假定有一份大学教职员的收入清单,这个学校包括信息学院、外国语学院、社会政治学、生物学院、统计学院共五个机构,我们通过数据建模,探索这个学校的薪酬制定规则

create_data <- function() {
  df <- tibble(
    ids = 1:100,
    department = rep(c("sociology", "biology", "english", "informatics", "statistics"), 20),
    bases = rep(c(40000, 50000, 60000, 70000, 80000), 20) * runif(100, .9, 1.1),
    experience = floor(runif(100, 0, 10)),
    raises = rep(c(2000, 500, 500, 1700, 500), 20) * runif(100, .9, 1.1)
  )


  df <- df %>% mutate(
    salary = bases + experience * raises
  )
  df
}
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'lubridate' was built under R version 4.2.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.2.3
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Warning: package 'modelr' was built under R version 4.2.3
## Warning: package 'broom' was built under R version 4.2.3
## 
## Attaching package: 'broom'
## 
## The following object is masked from 'package:modelr':
## 
##     bootstrap
library(broom.mixed)

df <- create_data()
df
## # A tibble: 100 × 6
##      ids department   bases experience raises salary
##    <int> <chr>        <dbl>      <dbl>  <dbl>  <dbl>
##  1     1 sociology   43472.          6  1861. 54636.
##  2     2 biology     49262.          5   545. 51989.
##  3     3 english     56677.          4   482. 58605.
##  4     4 informatics 67269.          1  1599. 68868.
##  5     5 statistics  79641.          8   544. 83994.
##  6     6 sociology   41590.          7  2184. 56876.
##  7     7 biology     48098.          9   541. 52972.
##  8     8 english     57433.          7   536. 61181.
##  9     9 informatics 64145.          4  1794. 71323.
## 10    10 statistics  83165.          3   458. 84541.
## # ℹ 90 more rows

56.3 线性模型

薪酬制定规则一:假定教师收入主要取决于他从事工作的时间,也就说说工作时间越长收入越高。意味着,每个院系的起始薪酬(起薪)是一样的,并有相同的年度增长率。那么,这个收入问题就是一个简单线性模型:

\[\hat{y} = \alpha + \beta_1x_1 + ... + \beta_nx_n\]

具体到我们的案例中,薪酬模型可以写为 \[ \hat{salary_i} = \alpha + \beta * experience_i \]

通过这个等式,可以计算出各个系数,即截距\(\alpha\)就是起薪,斜率\(\beta\)就是年度增长率。确定了斜率和截距,也就确定了每个教职员工的收入曲线。

# Model without respect to grouping
m1 <- lm(salary ~ experience, data = df)
m1
## 
## Call:
## lm(formula = salary ~ experience, data = df)
## 
## Coefficients:
## (Intercept)   experience  
##       59509         1033
broom::tidy(m1)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   59509.     2382.     25.0  2.75e-44
## 2 experience     1033.      467.      2.21 2.93e- 2
df %>% modelr::add_predictions(m1)
## # A tibble: 100 × 7
##      ids department   bases experience raises salary   pred
##    <int> <chr>        <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
##  1     1 sociology   43472.          6  1861. 54636. 65710.
##  2     2 biology     49262.          5   545. 51989. 64676.
##  3     3 english     56677.          4   482. 58605. 63643.
##  4     4 informatics 67269.          1  1599. 68868. 60542.
##  5     5 statistics  79641.          8   544. 83994. 67777.
##  6     6 sociology   41590.          7  2184. 56876. 66743.
##  7     7 biology     48098.          9   541. 52972. 68810.
##  8     8 english     57433.          7   536. 61181. 66743.
##  9     9 informatics 64145.          4  1794. 71323. 63643.
## 10    10 statistics  83165.          3   458. 84541. 62609.
## # ℹ 90 more rows
# Model without respect to grouping
df %>%
  add_predictions(m1) %>%
  ggplot(aes(x = experience, y = salary)) +
  geom_point() +
  geom_line(aes(x = experience, y = pred)) +
  labs(x = "Experience", y = "Predicted Salary") +
  ggtitle("linear model Salary Prediction") +
  scale_colour_discrete("Department")

注意到,对每个教师来说,不管来自哪个学院的,系数\(\alpha\)\(\beta\)是一样的,是固定的,因此这种简单线性模型也称之为固定效应模型。

事实上,这种线性模型的方法太过于粗狂,构建的线性直线不能反映收入随院系的变化

56.4 变化的截距

薪酬制定规则二,假定不同的院系起薪不同,但年度增长率是相同的。

这种统计模型,相比于之前的固定效应模型(简单线性模型)而言,加入了截距会随所在学院不同而变化的思想,统计模型写为

\[\hat{y_i} = \alpha_{j[i]} + \beta x_i\]

这个等式中,斜率\(\beta\)代表着年度增长率,是一个固定值,也就前面说的固定效应项,而截距\(\alpha\)代表着起薪,随学院变化,是五个值,因为一个学院对应一个,称之为变化效应项(也叫随机效应项)。这里模型中既有固定效应项又有变化效应项,因此称之为混合效应模型

教师\(i\),他所在的学院\(j\),记为\(j[i]\),那么教师\(i\)所在学院\(j\)对应的\(\alpha\),很自然的记为\(\alpha_{j[i]}\)

# Model with varying intercept
m2 <- lmer(salary ~ experience + (1 | department), data = df)
m2
## Linear mixed model fit by REML ['lmerMod']
## Formula: salary ~ experience + (1 | department)
##    Data: df
## REML criterion at convergence: 1911.831
## Random effects:
##  Groups     Name        Std.Dev.
##  department (Intercept) 14055   
##  Residual                3499   
## Number of obs: 100, groups:  department, 5
## Fixed Effects:
## (Intercept)   experience  
##     59921.8        936.3
broom.mixed::tidy(m2, effects = "fixed")
## # A tibble: 2 × 5
##   effect term        estimate std.error statistic
##   <chr>  <chr>          <dbl>     <dbl>     <dbl>
## 1 fixed  (Intercept)   59922.     6320.      9.48
## 2 fixed  experience      936.      130.      7.20
broom.mixed::tidy(m2, effects = "ran_vals")
## # A tibble: 5 × 6
##   effect   group      level       term        estimate std.error
##   <chr>    <chr>      <chr>       <chr>          <dbl>     <dbl>
## 1 ran_vals department biology     (Intercept)  -10420.      781.
## 2 ran_vals department english     (Intercept)   -2442.      781.
## 3 ran_vals department informatics (Intercept)   11860.      781.
## 4 ran_vals department sociology   (Intercept)  -15808.      781.
## 5 ran_vals department statistics  (Intercept)   16810.      781.
df %>%
  add_predictions(m2) %>%
  ggplot(aes(
    x = experience, y = salary, group = department,
    colour = department
  )) +
  geom_point() +
  geom_line(aes(x = experience, y = pred)) +
  labs(x = "Experience", y = "Predicted Salary") +
  ggtitle("Varying Intercept Salary Prediction") +
  scale_colour_discrete("Department")

这种模型,我们就能看到院系不同 带来的员工收入的差别。

56.5 变化的斜率

薪酬制定规则三,不同的院系起始薪酬是相同的,但年度增长率不同。

与薪酬模型规则二的统计模型比较,我们只需要把变化的截距变成变化的斜率,那么统计模型可写为

\[\hat{y_i} = \alpha + \beta_{j[i]}x_i\]

这里,截距(\(\alpha\))对所有教师而言是固定不变的,而斜率(\(\beta\))会随学院不同而变化,5个学院对应着5个斜率。

# Model with varying slope
m3 <- lmer(salary ~ experience + (0 + experience | department), data = df)
m3
## Linear mixed model fit by REML ['lmerMod']
## Formula: salary ~ experience + (0 + experience | department)
##    Data: df
## REML criterion at convergence: 2071.328
## Random effects:
##  Groups     Name       Std.Dev.
##  department experience 2376    
##  Residual              8226    
## Number of obs: 100, groups:  department, 5
## Fixed Effects:
## (Intercept)   experience  
##       60121          852
broom.mixed::tidy(m3, effects = "fixed")
## # A tibble: 2 × 5
##   effect term        estimate std.error statistic
##   <chr>  <chr>          <dbl>     <dbl>     <dbl>
## 1 fixed  (Intercept)   60121.     1502.    40.0  
## 2 fixed  experience      852.     1104.     0.772
broom.mixed::tidy(m3, effects = "ran_vals")
## # A tibble: 5 × 6
##   effect   group      level       term       estimate std.error
##   <chr>    <chr>      <chr>       <chr>         <dbl>     <dbl>
## 1 ran_vals department biology     experience   -1876.      402.
## 2 ran_vals department english     experience    -335.      289.
## 3 ran_vals department informatics experience    2261.      389.
## 4 ran_vals department sociology   experience   -2604.      393.
## 5 ran_vals department statistics  experience    2553.      352.
df %>%
  add_predictions(m3) %>%
  ggplot(aes(
    x = experience, y = salary, group = department,
    colour = department
  )) +
  geom_point() +
  geom_line(aes(x = experience, y = pred)) +
  labs(x = "Experience", y = "Predicted Salary") +
  ggtitle("Varying slope Salary Prediction") +
  scale_colour_discrete("Department")

56.6 变化的斜率 + 变化的截距

薪酬制定规则四,不同的学院起始薪酬和年度增长率也不同。

这可能是最现实的一种情形了,它实际上是规则二和规则三的一种组合,要求截距和斜率都会随学院的不同变化,数学上记为

\[\hat{y_i} = \alpha_{j[i]} + \beta_{j[i]}x_i\] 具体来说,教师\(i\),所在的学院\(j\), 他的入职的起始收入表示为 (\(\alpha_{j[i]}\)),年度增长率表示为(\(\beta_{j[i]}\)).

# Model with varying slope and intercept
m4 <- lmer(salary ~ experience + (1 + experience | department), data = df)
m4
## Linear mixed model fit by REML ['lmerMod']
## Formula: salary ~ experience + (1 + experience | department)
##    Data: df
## REML criterion at convergence: 1906.88
## Random effects:
##  Groups     Name        Std.Dev. Corr 
##  department (Intercept) 14398.1       
##             experience    464.3  -0.23
##  Residual                3313.2       
## Number of obs: 100, groups:  department, 5
## Fixed Effects:
## (Intercept)   experience  
##     59758.9        983.8
broom.mixed::tidy(m4, effects = "fixed")
## # A tibble: 2 × 5
##   effect term        estimate std.error statistic
##   <chr>  <chr>          <dbl>     <dbl>     <dbl>
## 1 fixed  (Intercept)   59759.     6471.      9.24
## 2 fixed  experience      984.      242.      4.06
broom.mixed::tidy(m4, effects = "ran_vals")
## # A tibble: 10 × 6
##    effect   group      level       term        estimate std.error
##    <chr>    <chr>      <chr>       <chr>          <dbl>     <dbl>
##  1 ran_vals department biology     (Intercept)   -8913.     1054.
##  2 ran_vals department english     (Intercept)   -1369.     1687.
##  3 ran_vals department informatics (Intercept)   10660.     1209.
##  4 ran_vals department sociology   (Intercept)  -17902.     1257.
##  5 ran_vals department statistics  (Intercept)   17524.     1154.
##  6 ran_vals department biology     experience     -442.      220.
##  7 ran_vals department english     experience     -203.      259.
##  8 ran_vals department informatics experience      302.      245.
##  9 ran_vals department sociology   experience      522.      257.
## 10 ran_vals department statistics  experience     -179.      213.
df %>%
  add_predictions(m4) %>%
  ggplot(aes(
    x = experience, y = salary, group = department,
    colour = department
  )) +
  geom_point() +
  geom_line(aes(x = experience, y = pred)) +
  labs(x = "Experience", y = "Predicted Salary") +
  ggtitle("Varying Intercept and Slopes Salary Prediction") +
  scale_colour_discrete("Department")

56.7 信息池

56.7.1 提问

问题:薪酬制定规则四中,不同的院系起薪不同,年度增长率也不同,我们得出了5组不同的截距和斜率,那么是不是可以等价为,先按照院系分5组,然后各算各的截距和斜率? 比如

df %>%
  group_by(department) %>%
  group_modify(
    ~ broom::tidy(lm(salary ~ 1 + experience, data = .))
  )
## # A tibble: 10 × 6
## # Groups:   department [5]
##    department  term        estimate std.error statistic  p.value
##    <chr>       <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 biology     (Intercept)   51297.      937.     54.7  1.81e-21
##  2 biology     experience      399.      208.      1.92 7.13e- 2
##  3 english     (Intercept)   58929.     2046.     28.8  1.65e-16
##  4 english     experience      687.      324.      2.12 4.81e- 2
##  5 informatics (Intercept)   69934.     1327.     52.7  3.55e-21
##  6 informatics experience     1420.      284.      4.99 9.40e- 5
##  7 sociology   (Intercept)   40988.     1047.     39.2  7.13e-19
##  8 sociology   experience     1715.      227.      7.57 5.33e- 7
##  9 statistics  (Intercept)   77470.     1630.     47.5  2.24e-20
## 10 statistics  experience      771.      315.      2.45 2.49e- 2

分组各自回归,与这里的(变化的截距+变化的斜率)模型,不是一回事。

56.7.2 信息共享

  • 完全共享
broom::tidy(m1)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   59509.     2382.     25.0  2.75e-44
## 2 experience     1033.      467.      2.21 2.93e- 2
complete_pooling <-
  broom::tidy(m1) %>%
  dplyr::select(term, estimate) %>%
  tidyr::pivot_wider(
    names_from = term,
    values_from = estimate
  ) %>%
  dplyr::rename(Intercept = `(Intercept)`, slope = experience) %>%
  dplyr::mutate(pooled = "complete_pool") %>%
  dplyr::select(pooled, Intercept, slope)

complete_pooling
## # A tibble: 1 × 3
##   pooled        Intercept slope
##   <chr>             <dbl> <dbl>
## 1 complete_pool    59509. 1033.
  • 部分共享
fix_effect <- broom.mixed::tidy(m4, effects = "fixed")
fix_effect
fix_effect$estimate[1]
fix_effect$estimate[2]
var_effect <- broom.mixed::tidy(m4, effects = "ran_vals")
var_effect
# random effects plus fixed effect parameters
partial_pooling <- var_effect %>%
  dplyr::select(level, term, estimate) %>%
  tidyr::pivot_wider(
    names_from = term,
    values_from = estimate
  ) %>%
  dplyr::rename(Intercept = `(Intercept)`, estimate = experience) %>%
  dplyr::mutate(
    Intercept = Intercept + fix_effect$estimate[1],
    estimate = estimate + fix_effect$estimate[2]
  ) %>%
  dplyr::mutate(pool = "partial_pool") %>%
  dplyr::select(pool, level, Intercept, estimate)

partial_pooling
partial_pooling <-
  coef(m4)$department %>%
  tibble::rownames_to_column() %>%
  dplyr::rename(level = rowname, Intercept = `(Intercept)`, slope = experience) %>%
  dplyr::mutate(pooled = "partial_pool") %>%
  dplyr::select(pooled, level, Intercept, slope)

partial_pooling
##         pooled       level Intercept     slope
## 1 partial_pool     biology  50845.60  541.3142
## 2 partial_pool     english  58389.97  780.7522
## 3 partial_pool informatics  70418.78 1286.2142
## 4 partial_pool   sociology  41857.11 1505.3033
## 5 partial_pool  statistics  77282.90  805.1879
  • 不共享
no_pool <- df %>%
  dplyr::group_by(department) %>%
  dplyr::group_modify(
    ~ broom::tidy(lm(salary ~ 1 + experience, data = .))
  )
no_pool
## # A tibble: 10 × 6
## # Groups:   department [5]
##    department  term        estimate std.error statistic  p.value
##    <chr>       <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 biology     (Intercept)   51297.      937.     54.7  1.81e-21
##  2 biology     experience      399.      208.      1.92 7.13e- 2
##  3 english     (Intercept)   58929.     2046.     28.8  1.65e-16
##  4 english     experience      687.      324.      2.12 4.81e- 2
##  5 informatics (Intercept)   69934.     1327.     52.7  3.55e-21
##  6 informatics experience     1420.      284.      4.99 9.40e- 5
##  7 sociology   (Intercept)   40988.     1047.     39.2  7.13e-19
##  8 sociology   experience     1715.      227.      7.57 5.33e- 7
##  9 statistics  (Intercept)   77470.     1630.     47.5  2.24e-20
## 10 statistics  experience      771.      315.      2.45 2.49e- 2
un_pooling <- no_pool %>%
  dplyr::select(department, term, estimate) %>%
  tidyr::pivot_wider(
    names_from = term,
    values_from = estimate
  ) %>%
  dplyr::rename(Intercept = `(Intercept)`, slope = experience) %>%
  dplyr::mutate(pooled = "no_pool") %>%
  dplyr::select(pooled, level = department, Intercept, slope)

un_pooling
## # A tibble: 5 × 4
## # Groups:   level [5]
##   pooled  level       Intercept slope
##   <chr>   <chr>           <dbl> <dbl>
## 1 no_pool biology        51297.  399.
## 2 no_pool english        58929.  687.
## 3 no_pool informatics    69934. 1420.
## 4 no_pool sociology      40988. 1715.
## 5 no_pool statistics     77470.  771.

56.7.3 可视化

library(ggrepel)

un_pooling %>%
  dplyr::bind_rows(partial_pooling) %>%
  ggplot(aes(x = Intercept, y = slope)) +
  purrr::map(
    c(seq(from = 0.1, to = 0.9, by = 0.1)),
    .f = function(level) {
      stat_ellipse(
        geom = "polygon", type = "norm",
        size = 0, alpha = 1 / 10, fill = "gray10",
        level = level
      )
    }
  ) +
  geom_point(aes(group = pooled, color = pooled)) +
  geom_line(aes(group = level), size = 1 / 4) +
  # geom_point(data = complete_pooling, size = 4, color = "red") +
  geom_text_repel(
    data = . %>% filter(pooled == "no_pool"),
    aes(label = level)
  ) +
  scale_color_manual(
    name = "information pool",
    values = c(
      "no_pool" = "black",
      "partial_pool" = "red" # ,
      # "complete_pool" = "#A65141"
    ),
    labels = c(
      "no_pool" = "no share",
      "partial_pool" = "partial share" # ,
      # "complete_pool" = "complete share"
    )
  ) #+
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# theme_classic()

56.8 更多

  • 解释模型的含义
lmer(salary ~ 1 + (0 + experience | department), data = df)
# vs
lmer(salary ~ 1 + experience + (0 + experience | department), data = df)
lmer(salary ~ 1 + (1 + experience | department), data = df)
## Linear mixed model fit by REML ['lmerMod']
## Formula: salary ~ 1 + (1 + experience | department)
##    Data: df
## REML criterion at convergence: 1927.814
## Random effects:
##  Groups     Name        Std.Dev. Corr 
##  department (Intercept) 15656         
##             experience   1070    -0.44
##  Residual                3313         
## Number of obs: 100, groups:  department, 5
## Fixed Effects:
## (Intercept)  
##       66047
# vs
lmer(salary ~ 1 + (1 | department) + (0 + experience | department), data = df)
## Linear mixed model fit by REML ['lmerMod']
## Formula: salary ~ 1 + (1 | department) + (0 + experience | department)
##    Data: df
## REML criterion at convergence: 1928.025
## Random effects:
##  Groups       Name        Std.Dev.
##  department   (Intercept) 14391   
##  department.1 experience   1064   
##  Residual                  3314   
## Number of obs: 100, groups:  department, 5
## Fixed Effects:
## (Intercept)  
##       60018