第 56 章 多层线性模型
56.1 分组数据
在实验设计和数据分析中,我们可能经常会遇到分组的数据结构。所谓的分组,就是每一次观察,属于某个特定的组,比如考察学生的成绩,这些学生属于某个班级,班级又属于某个学校。有时候发现这种分组的数据,会给数据分析带来很多有意思的内容。
56.2 案例
我们从一个有意思的案例开始。
不同院系教职员工的收入
一般情况下,不同的院系,制定教师收入的依据和标准可能是不同的。我们假定有一份大学教职员的收入清单,这个学校包括信息学院、外国语学院、社会政治学、生物学院、统计学院共五个机构,我们通过数据建模,探索这个学校的薪酬制定规则。
create_data <- function() {
df <- tibble(
ids = 1:100,
department = rep(c("sociology", "biology", "english", "informatics", "statistics"), 20),
bases = rep(c(40000, 50000, 60000, 70000, 80000), 20) * runif(100, .9, 1.1),
experience = floor(runif(100, 0, 10)),
raises = rep(c(2000, 500, 500, 1700, 500), 20) * runif(100, .9, 1.1)
)
df <- df %>% mutate(
salary = bases + experience * raises
)
df
}
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'lubridate' was built under R version 4.2.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.2.3
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Warning: package 'modelr' was built under R version 4.2.3
## Warning: package 'broom' was built under R version 4.2.3
##
## Attaching package: 'broom'
##
## The following object is masked from 'package:modelr':
##
## bootstrap
library(broom.mixed)
df <- create_data()
df
## # A tibble: 100 × 6
## ids department bases experience raises salary
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 sociology 43472. 6 1861. 54636.
## 2 2 biology 49262. 5 545. 51989.
## 3 3 english 56677. 4 482. 58605.
## 4 4 informatics 67269. 1 1599. 68868.
## 5 5 statistics 79641. 8 544. 83994.
## 6 6 sociology 41590. 7 2184. 56876.
## 7 7 biology 48098. 9 541. 52972.
## 8 8 english 57433. 7 536. 61181.
## 9 9 informatics 64145. 4 1794. 71323.
## 10 10 statistics 83165. 3 458. 84541.
## # ℹ 90 more rows
56.3 线性模型
薪酬制定规则一:假定教师收入主要取决于他从事工作的时间,也就说说工作时间越长收入越高。意味着,每个院系的起始薪酬(起薪)是一样的,并有相同的年度增长率。那么,这个收入问题就是一个简单线性模型:
\[\hat{y} = \alpha + \beta_1x_1 + ... + \beta_nx_n\]
具体到我们的案例中,薪酬模型可以写为 \[ \hat{salary_i} = \alpha + \beta * experience_i \]
通过这个等式,可以计算出各个系数,即截距\(\alpha\)就是起薪,斜率\(\beta\)就是年度增长率。确定了斜率和截距,也就确定了每个教职员工的收入曲线。
# Model without respect to grouping
m1 <- lm(salary ~ experience, data = df)
m1
##
## Call:
## lm(formula = salary ~ experience, data = df)
##
## Coefficients:
## (Intercept) experience
## 59509 1033
broom::tidy(m1)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 59509. 2382. 25.0 2.75e-44
## 2 experience 1033. 467. 2.21 2.93e- 2
df %>% modelr::add_predictions(m1)
## # A tibble: 100 × 7
## ids department bases experience raises salary pred
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 sociology 43472. 6 1861. 54636. 65710.
## 2 2 biology 49262. 5 545. 51989. 64676.
## 3 3 english 56677. 4 482. 58605. 63643.
## 4 4 informatics 67269. 1 1599. 68868. 60542.
## 5 5 statistics 79641. 8 544. 83994. 67777.
## 6 6 sociology 41590. 7 2184. 56876. 66743.
## 7 7 biology 48098. 9 541. 52972. 68810.
## 8 8 english 57433. 7 536. 61181. 66743.
## 9 9 informatics 64145. 4 1794. 71323. 63643.
## 10 10 statistics 83165. 3 458. 84541. 62609.
## # ℹ 90 more rows
# Model without respect to grouping
df %>%
add_predictions(m1) %>%
ggplot(aes(x = experience, y = salary)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = "Experience", y = "Predicted Salary") +
ggtitle("linear model Salary Prediction") +
scale_colour_discrete("Department")
注意到,对每个教师来说,不管来自哪个学院的,系数\(\alpha\)和\(\beta\)是一样的,是固定的,因此这种简单线性模型也称之为固定效应模型。
事实上,这种线性模型的方法太过于粗狂,构建的线性直线不能反映收入随院系的变化。
56.4 变化的截距
薪酬制定规则二,假定不同的院系起薪不同,但年度增长率是相同的。
这种统计模型,相比于之前的固定效应模型(简单线性模型)而言,加入了截距会随所在学院不同而变化的思想,统计模型写为
\[\hat{y_i} = \alpha_{j[i]} + \beta x_i\]
这个等式中,斜率\(\beta\)代表着年度增长率,是一个固定值,也就前面说的固定效应项,而截距\(\alpha\)代表着起薪,随学院变化,是五个值,因为一个学院对应一个,称之为变化效应项(也叫随机效应项)。这里模型中既有固定效应项又有变化效应项,因此称之为混合效应模型。
教师\(i\),他所在的学院\(j\),记为\(j[i]\),那么教师\(i\)所在学院\(j\)对应的\(\alpha\),很自然的记为\(\alpha_{j[i]}\)
# Model with varying intercept
m2 <- lmer(salary ~ experience + (1 | department), data = df)
m2
## Linear mixed model fit by REML ['lmerMod']
## Formula: salary ~ experience + (1 | department)
## Data: df
## REML criterion at convergence: 1911.831
## Random effects:
## Groups Name Std.Dev.
## department (Intercept) 14055
## Residual 3499
## Number of obs: 100, groups: department, 5
## Fixed Effects:
## (Intercept) experience
## 59921.8 936.3
broom.mixed::tidy(m2, effects = "fixed")
## # A tibble: 2 × 5
## effect term estimate std.error statistic
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 fixed (Intercept) 59922. 6320. 9.48
## 2 fixed experience 936. 130. 7.20
broom.mixed::tidy(m2, effects = "ran_vals")
## # A tibble: 5 × 6
## effect group level term estimate std.error
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 ran_vals department biology (Intercept) -10420. 781.
## 2 ran_vals department english (Intercept) -2442. 781.
## 3 ran_vals department informatics (Intercept) 11860. 781.
## 4 ran_vals department sociology (Intercept) -15808. 781.
## 5 ran_vals department statistics (Intercept) 16810. 781.
df %>%
add_predictions(m2) %>%
ggplot(aes(
x = experience, y = salary, group = department,
colour = department
)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = "Experience", y = "Predicted Salary") +
ggtitle("Varying Intercept Salary Prediction") +
scale_colour_discrete("Department")
这种模型,我们就能看到院系不同 带来的员工收入的差别。
56.5 变化的斜率
薪酬制定规则三,不同的院系起始薪酬是相同的,但年度增长率不同。
与薪酬模型规则二的统计模型比较,我们只需要把变化的截距变成变化的斜率,那么统计模型可写为
\[\hat{y_i} = \alpha + \beta_{j[i]}x_i\]
这里,截距(\(\alpha\))对所有教师而言是固定不变的,而斜率(\(\beta\))会随学院不同而变化,5个学院对应着5个斜率。
# Model with varying slope
m3 <- lmer(salary ~ experience + (0 + experience | department), data = df)
m3
## Linear mixed model fit by REML ['lmerMod']
## Formula: salary ~ experience + (0 + experience | department)
## Data: df
## REML criterion at convergence: 2071.328
## Random effects:
## Groups Name Std.Dev.
## department experience 2376
## Residual 8226
## Number of obs: 100, groups: department, 5
## Fixed Effects:
## (Intercept) experience
## 60121 852
broom.mixed::tidy(m3, effects = "fixed")
## # A tibble: 2 × 5
## effect term estimate std.error statistic
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 fixed (Intercept) 60121. 1502. 40.0
## 2 fixed experience 852. 1104. 0.772
broom.mixed::tidy(m3, effects = "ran_vals")
## # A tibble: 5 × 6
## effect group level term estimate std.error
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 ran_vals department biology experience -1876. 402.
## 2 ran_vals department english experience -335. 289.
## 3 ran_vals department informatics experience 2261. 389.
## 4 ran_vals department sociology experience -2604. 393.
## 5 ran_vals department statistics experience 2553. 352.
df %>%
add_predictions(m3) %>%
ggplot(aes(
x = experience, y = salary, group = department,
colour = department
)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = "Experience", y = "Predicted Salary") +
ggtitle("Varying slope Salary Prediction") +
scale_colour_discrete("Department")
56.6 变化的斜率 + 变化的截距
薪酬制定规则四,不同的学院起始薪酬和年度增长率也不同。
这可能是最现实的一种情形了,它实际上是规则二和规则三的一种组合,要求截距和斜率都会随学院的不同变化,数学上记为
\[\hat{y_i} = \alpha_{j[i]} + \beta_{j[i]}x_i\] 具体来说,教师\(i\),所在的学院\(j\), 他的入职的起始收入表示为 (\(\alpha_{j[i]}\)),年度增长率表示为(\(\beta_{j[i]}\)).
# Model with varying slope and intercept
m4 <- lmer(salary ~ experience + (1 + experience | department), data = df)
m4
## Linear mixed model fit by REML ['lmerMod']
## Formula: salary ~ experience + (1 + experience | department)
## Data: df
## REML criterion at convergence: 1906.88
## Random effects:
## Groups Name Std.Dev. Corr
## department (Intercept) 14398.1
## experience 464.3 -0.23
## Residual 3313.2
## Number of obs: 100, groups: department, 5
## Fixed Effects:
## (Intercept) experience
## 59758.9 983.8
broom.mixed::tidy(m4, effects = "fixed")
## # A tibble: 2 × 5
## effect term estimate std.error statistic
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 fixed (Intercept) 59759. 6471. 9.24
## 2 fixed experience 984. 242. 4.06
broom.mixed::tidy(m4, effects = "ran_vals")
## # A tibble: 10 × 6
## effect group level term estimate std.error
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 ran_vals department biology (Intercept) -8913. 1054.
## 2 ran_vals department english (Intercept) -1369. 1687.
## 3 ran_vals department informatics (Intercept) 10660. 1209.
## 4 ran_vals department sociology (Intercept) -17902. 1257.
## 5 ran_vals department statistics (Intercept) 17524. 1154.
## 6 ran_vals department biology experience -442. 220.
## 7 ran_vals department english experience -203. 259.
## 8 ran_vals department informatics experience 302. 245.
## 9 ran_vals department sociology experience 522. 257.
## 10 ran_vals department statistics experience -179. 213.
df %>%
add_predictions(m4) %>%
ggplot(aes(
x = experience, y = salary, group = department,
colour = department
)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = "Experience", y = "Predicted Salary") +
ggtitle("Varying Intercept and Slopes Salary Prediction") +
scale_colour_discrete("Department")
56.7 信息池
56.7.1 提问
问题:薪酬制定规则四中,不同的院系起薪不同,年度增长率也不同,我们得出了5组不同的截距和斜率,那么是不是可以等价为,先按照院系分5组,然后各算各的截距和斜率? 比如
df %>%
group_by(department) %>%
group_modify(
~ broom::tidy(lm(salary ~ 1 + experience, data = .))
)
## # A tibble: 10 × 6
## # Groups: department [5]
## department term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 biology (Intercept) 51297. 937. 54.7 1.81e-21
## 2 biology experience 399. 208. 1.92 7.13e- 2
## 3 english (Intercept) 58929. 2046. 28.8 1.65e-16
## 4 english experience 687. 324. 2.12 4.81e- 2
## 5 informatics (Intercept) 69934. 1327. 52.7 3.55e-21
## 6 informatics experience 1420. 284. 4.99 9.40e- 5
## 7 sociology (Intercept) 40988. 1047. 39.2 7.13e-19
## 8 sociology experience 1715. 227. 7.57 5.33e- 7
## 9 statistics (Intercept) 77470. 1630. 47.5 2.24e-20
## 10 statistics experience 771. 315. 2.45 2.49e- 2
分组各自回归,与这里的(变化的截距+变化的斜率)模型,不是一回事。
56.7.2 信息共享
- 完全共享
broom::tidy(m1)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 59509. 2382. 25.0 2.75e-44
## 2 experience 1033. 467. 2.21 2.93e- 2
complete_pooling <-
broom::tidy(m1) %>%
dplyr::select(term, estimate) %>%
tidyr::pivot_wider(
names_from = term,
values_from = estimate
) %>%
dplyr::rename(Intercept = `(Intercept)`, slope = experience) %>%
dplyr::mutate(pooled = "complete_pool") %>%
dplyr::select(pooled, Intercept, slope)
complete_pooling
## # A tibble: 1 × 3
## pooled Intercept slope
## <chr> <dbl> <dbl>
## 1 complete_pool 59509. 1033.
- 部分共享
fix_effect <- broom.mixed::tidy(m4, effects = "fixed")
fix_effect
fix_effect$estimate[1]
fix_effect$estimate[2]
var_effect <- broom.mixed::tidy(m4, effects = "ran_vals")
var_effect
# random effects plus fixed effect parameters
partial_pooling <- var_effect %>%
dplyr::select(level, term, estimate) %>%
tidyr::pivot_wider(
names_from = term,
values_from = estimate
) %>%
dplyr::rename(Intercept = `(Intercept)`, estimate = experience) %>%
dplyr::mutate(
Intercept = Intercept + fix_effect$estimate[1],
estimate = estimate + fix_effect$estimate[2]
) %>%
dplyr::mutate(pool = "partial_pool") %>%
dplyr::select(pool, level, Intercept, estimate)
partial_pooling
partial_pooling <-
coef(m4)$department %>%
tibble::rownames_to_column() %>%
dplyr::rename(level = rowname, Intercept = `(Intercept)`, slope = experience) %>%
dplyr::mutate(pooled = "partial_pool") %>%
dplyr::select(pooled, level, Intercept, slope)
partial_pooling
## pooled level Intercept slope
## 1 partial_pool biology 50845.60 541.3142
## 2 partial_pool english 58389.97 780.7522
## 3 partial_pool informatics 70418.78 1286.2142
## 4 partial_pool sociology 41857.11 1505.3033
## 5 partial_pool statistics 77282.90 805.1879
- 不共享
no_pool <- df %>%
dplyr::group_by(department) %>%
dplyr::group_modify(
~ broom::tidy(lm(salary ~ 1 + experience, data = .))
)
no_pool
## # A tibble: 10 × 6
## # Groups: department [5]
## department term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 biology (Intercept) 51297. 937. 54.7 1.81e-21
## 2 biology experience 399. 208. 1.92 7.13e- 2
## 3 english (Intercept) 58929. 2046. 28.8 1.65e-16
## 4 english experience 687. 324. 2.12 4.81e- 2
## 5 informatics (Intercept) 69934. 1327. 52.7 3.55e-21
## 6 informatics experience 1420. 284. 4.99 9.40e- 5
## 7 sociology (Intercept) 40988. 1047. 39.2 7.13e-19
## 8 sociology experience 1715. 227. 7.57 5.33e- 7
## 9 statistics (Intercept) 77470. 1630. 47.5 2.24e-20
## 10 statistics experience 771. 315. 2.45 2.49e- 2
un_pooling <- no_pool %>%
dplyr::select(department, term, estimate) %>%
tidyr::pivot_wider(
names_from = term,
values_from = estimate
) %>%
dplyr::rename(Intercept = `(Intercept)`, slope = experience) %>%
dplyr::mutate(pooled = "no_pool") %>%
dplyr::select(pooled, level = department, Intercept, slope)
un_pooling
## # A tibble: 5 × 4
## # Groups: level [5]
## pooled level Intercept slope
## <chr> <chr> <dbl> <dbl>
## 1 no_pool biology 51297. 399.
## 2 no_pool english 58929. 687.
## 3 no_pool informatics 69934. 1420.
## 4 no_pool sociology 40988. 1715.
## 5 no_pool statistics 77470. 771.
56.7.3 可视化
library(ggrepel)
un_pooling %>%
dplyr::bind_rows(partial_pooling) %>%
ggplot(aes(x = Intercept, y = slope)) +
purrr::map(
c(seq(from = 0.1, to = 0.9, by = 0.1)),
.f = function(level) {
stat_ellipse(
geom = "polygon", type = "norm",
size = 0, alpha = 1 / 10, fill = "gray10",
level = level
)
}
) +
geom_point(aes(group = pooled, color = pooled)) +
geom_line(aes(group = level), size = 1 / 4) +
# geom_point(data = complete_pooling, size = 4, color = "red") +
geom_text_repel(
data = . %>% filter(pooled == "no_pool"),
aes(label = level)
) +
scale_color_manual(
name = "information pool",
values = c(
"no_pool" = "black",
"partial_pool" = "red" # ,
# "complete_pool" = "#A65141"
),
labels = c(
"no_pool" = "no share",
"partial_pool" = "partial share" # ,
# "complete_pool" = "complete share"
)
) #+
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# theme_classic()
56.8 更多
- 解释模型的含义
lmer(salary ~ 1 + (0 + experience | department), data = df)
# vs
lmer(salary ~ 1 + experience + (0 + experience | department), data = df)
lmer(salary ~ 1 + (1 + experience | department), data = df)
## Linear mixed model fit by REML ['lmerMod']
## Formula: salary ~ 1 + (1 + experience | department)
## Data: df
## REML criterion at convergence: 1927.814
## Random effects:
## Groups Name Std.Dev. Corr
## department (Intercept) 15656
## experience 1070 -0.44
## Residual 3313
## Number of obs: 100, groups: department, 5
## Fixed Effects:
## (Intercept)
## 66047
# vs
lmer(salary ~ 1 + (1 | department) + (0 + experience | department), data = df)
## Linear mixed model fit by REML ['lmerMod']
## Formula: salary ~ 1 + (1 | department) + (0 + experience | department)
## Data: df
## REML criterion at convergence: 1928.025
## Random effects:
## Groups Name Std.Dev.
## department (Intercept) 14391
## department.1 experience 1064
## Residual 3314
## Number of obs: 100, groups: department, 5
## Fixed Effects:
## (Intercept)
## 60018
- 课后阅读文献,读完后大家一起分享
- 课后阅读 Understanding mixed effects models through data simulation,