第 17 章 广义线性模型

主要参考 David W. Hosmer 和 Stanley Lemeshow (Hosmer and Lemeshow 2000)

主要参考 Peter McCullagh 和 John Nelder (McCullagh and Nelder 1989) Annette J. Dobson 和 Adrian G. Barnett (Dobson and Barnett 2018) 逻辑回归模型主要参考 Applied Logistic Regression (Hosmer and Lemeshow 2000) 和 Discrete Choice Methods with Simulation (Train 2009)

每个模型都给出 rstan/rstanarm 中的实现方式

将线性回归模型,方差分析模型,协方差分析模型统一地看待

Generalized Least Squares 广义最小二乘

GLM General Linear Models 一般线性模型 lm

  • 允许线性模型方差非齐性、相关nlme::gls
  • 也包括线性混合效应模型

参考文献 An Introduction to Generalized Linear Models (Dobson and Barnett 2018) 和 Generalized Linear Models (McCullagh and Nelder 1989)

  • plsRglm 偏最小二乘回归方法用于广义线性模型
  • glmnet Lasso and Elastic-Net Regularized Generalized Linear Models

It’s not meant for sampling weights. It’s meant for precision weights. How best to include sampling weights in mixed models is a research problem at the moment, but you can rely on getting the wrong answer if you just use the weights = argument.

— Thomas Lumley19

17.1 介绍

模型结构,模型种类,参数估计办法,相当于综述

响应变量分别服从二项分布、多项分布、对数正态分布、泊松分布、伽马分布

17.2 理论基础

分两个段落分别介绍指数族和 GLM

\[ f(y;\theta,\phi) = \exp[(a(y) b(\theta) + c(\theta))/f(\phi) + d(y,\phi)] \]

泊松分布 (with \(\lambda \to \theta\), \(x \to y\)) (\(\phi=1\)):

\[\begin{equation} \begin{split} f(y,\theta) & = \exp(-\theta) \theta^y/(y!) \\ & = \exp\left( \underbrace{y}_{a(y)} \underbrace{\log \theta}_{b(\theta)} + \underbrace{(-\theta)}_{c(\theta)} + \underbrace{(- \log(y!))}_{d(y)} \right) \end{split} \end{equation}\]

17.2.1 岭回归

Geometry and properties of generalized ridge regression in high dimensions http://web.ccs.miami.edu/~hishwaran/papers/IR.conmath2014.pdf

这篇文章借助三维几何图形展示高维情形下的广义岭回归

17.2.2 Lasso

glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models https://glmnet.stanford.edu

17.2.3 最优子集回归

bestglm: Best Subset GLM and Regression Utilities

17.3 吸烟喝酒和食道癌的关系

存在有序分类数据

酒精的作用 effects of alcohol, tobacco and interaction, age-adjusted 数据集描述见 help(esoph)

head(esoph)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7
str(esoph)
## 'data.frame':	88 obs. of  5 variables:
##  $ agegp    : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 ..
##  $ alcgp    : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 ..
##  $ tobgp    : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4..
##  $ ncases   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ncontrols: num  40 10 6 5 27 7 4 7 2 1 ...
library(ggplot2)
library(patchwork)
p1 <- ggplot(data = esoph, aes(x = agegp, y = ncases / ncontrols, color = agegp)) +
  geom_boxplot(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE) +
  theme_minimal()

p2 <- ggplot(data = esoph, aes(x = alcgp, y = ncases / ncontrols, color = alcgp)) +
  geom_boxplot(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE) +
  theme_minimal()

p3 <- ggplot(data = esoph, aes(x = tobgp, y = ncases / ncontrols, color = tobgp)) +
  geom_boxplot(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE) +
  theme_minimal()

p1 / (p2 + p3)

fit_esoph_glm <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp * alcgp,
  data = esoph, family = binomial(link = "logit")
)
library(Rcpp)
fit_esoph_brm <- brms::brm(ncases | trials(ncases + ncontrols) ~ agegp + tobgp * alcgp,
  data = esoph, family = binomial(link = "logit"), refresh = 0
)

17.4 自然流产和人工流产后的不育

help(infert)

head(infert)
##   education age parity induced case spontaneous stratum
## 1    0-5yrs  26      6       1    1           2       1
## 2    0-5yrs  42      1       1    1           0       2
## 3    0-5yrs  39      6       2    1           0       3
## 4    0-5yrs  34      4       2    1           0       4
## 5   6-11yrs  35      3       1    1           1       5
## 6   6-11yrs  36      4       2    1           1       6
##   pooled.stratum
## 1              3
## 2              1
## 3              4
## 4              2
## 5             32
## 6             36
str(infert)
## 'data.frame':	248 obs. of  8 variables:
##  $ education     : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 ..
##  $ age           : num  26 42 39 34 35 36 23 32 21 28 ...
##  $ parity        : num  6 1 6 4 3 4 1 2 1 2 ...
##  $ induced       : num  1 1 2 2 1 2 0 0 0 0 ...
##  $ case          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ spontaneous   : num  2 0 0 0 1 1 0 0 1 0 ...
##  $ stratum       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pooled.stratum: num  3 1 4 2 32 36 6 22 5 19 ...

存在无序分类变量

infert_glm_1 <- glm(case ~ spontaneous + induced,
  data = infert, family = binomial()
)
summary(infert_glm_1)
## 
## Call:
## glm(formula = case ~ spontaneous + induced, family = binomial(), 
##     data = infert)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6678  -0.8360  -0.5772   0.9030   1.9362  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.7079     0.2677  -6.380 1.78e-10 ***
## spontaneous   1.1972     0.2116   5.657 1.54e-08 ***
## induced       0.4181     0.2056   2.033    0.042 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 316.17  on 247  degrees of freedom
## Residual deviance: 279.61  on 245  degrees of freedom
## AIC: 285.61
## 
## Number of Fisher Scoring iterations: 4

考虑其他潜在的因素

infert_glm_2 <- glm(case ~ age + parity + education + spontaneous + induced,
  data = infert, family = binomial()
)
summary(infert_glm_2)
## 
## Call:
## glm(formula = case ~ age + parity + education + spontaneous + 
##     induced, family = binomial(), data = infert)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7603  -0.8162  -0.4956   0.8349   2.6536  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.14924    1.41220  -0.814   0.4158    
## age               0.03958    0.03120   1.269   0.2046    
## parity           -0.82828    0.19649  -4.215 2.49e-05 ***
## education6-11yrs -1.04424    0.79255  -1.318   0.1876    
## education12+ yrs -1.40321    0.83416  -1.682   0.0925 .  
## spontaneous       2.04591    0.31016   6.596 4.21e-11 ***
## induced           1.28876    0.30146   4.275 1.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 316.17  on 247  degrees of freedom
## Residual deviance: 257.80  on 241  degrees of freedom
## AIC: 271.8
## 
## Number of Fisher Scoring iterations: 4

实际上应该使用条件逻辑回归, 调用 survival

library(survival)
infert_glm_3 <- clogit(case ~ spontaneous + induced + strata(stratum),
  data = infert
)
summary(infert_glm_3)
## Call:
## coxph(formula = Surv(rep(1, 248L), case) ~ spontaneous + induced + 
##     strata(stratum), data = infert, method = "exact")
## 
##   n= 248, number of events= 83 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## spontaneous 1.9859    7.2854   0.3524 5.635 1.75e-08 ***
## induced     1.4090    4.0919   0.3607 3.906 9.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## spontaneous     7.285     0.1373     3.651    14.536
## induced         4.092     0.2444     2.018     8.298
## 
## Concordance= 0.776  (se = 0.044 )
## Likelihood ratio test= 53.15  on 2 df,   p=3e-12
## Wald test            = 31.84  on 2 df,   p=1e-07
## Score (logrank) test = 48.44  on 2 df,   p=3e-11

17.5 细菌数据集

流感嗜血杆菌的细菌与中耳炎患儿

data(bacteria, package = "MASS")
# 惩罚拟似然
fit_glmmpql <- MASS::glmmPQL(y ~ trt + I(week > 2),
  random = ~ 1 | ID, verbose = FALSE,
  family = binomial, data = bacteria
)
summary(fit_glmmpql)
## Linear mixed-effects model fit by maximum likelihood
##  Data: bacteria 
##   AIC BIC logLik
##    NA  NA     NA
## 
## Random effects:
##  Formula: ~1 | ID
##         (Intercept)  Residual
## StdDev:    1.410637 0.7800511
## 
## Variance function:
##  Structure: fixed weights
##  Formula: ~invwt 
## Fixed effects: y ~ trt + I(week > 2) 
##                     Value Std.Error  DF   t-value p-value
## (Intercept)      3.412014 0.5185033 169  6.580506  0.0000
## trtdrug         -1.247355 0.6440635  47 -1.936696  0.0588
## trtdrug+        -0.754327 0.6453978  47 -1.168779  0.2484
## I(week > 2)TRUE -1.607257 0.3583379 169 -4.485311  0.0000
##  Correlation: 
##                 (Intr) trtdrg trtdr+
## trtdrug         -0.598              
## trtdrug+        -0.571  0.460       
## I(week > 2)TRUE -0.537  0.047 -0.001
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -5.1985361  0.1572336  0.3513075  0.4949482  1.7448845 
## 
## Number of Observations: 220
## Number of Groups: 50
# 拉普拉斯近似
fit_glmer <- lme4::glmer(y ~ trt + I(week > 2) + (1 | ID),
  family = binomial, data = bacteria
)
summary(fit_glmer)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: y ~ trt + I(week > 2) + (1 | ID)
##    Data: bacteria
## 
##      AIC      BIC   logLik deviance df.resid 
##    202.3    219.2    -96.1    192.3      215 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.5615  0.1359  0.3022  0.4217  1.1276 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  ID     (Intercept) 1.543    1.242   
## Number of obs: 220, groups:  ID, 50
## 
## Fixed effects:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       3.5479     0.6958   5.099 3.41e-07 ***
## trtdrug          -1.3667     0.6770  -2.019 0.043517 *  
## trtdrug+         -0.7826     0.6831  -1.146 0.251927    
## I(week > 2)TRUE  -1.5985     0.4759  -3.359 0.000783 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) trtdrg trtdr+
## trtdrug     -0.593              
## trtdrug+    -0.537  0.487       
## I(wk>2)TRUE -0.656  0.126  0.064

17.6 研究婴儿出生体重低的相关危险因素

在线性回归的基础上,响应变量是离散的类别,且无序 (Hasan, Wang, and Mahani 2016)

birthwt 数据是 1986 年在马萨诸塞州斯普林菲尔德的 Baystate 医疗中心收集的,用于研究婴儿出生体重低的相关危险因素

# 加载数据
# library(MASS)
data(birthwt, package = "MASS")
# 查看 birthwt 数据集 `help(birthwt)`
head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622
str(birthwt)
## 'data.frame':	189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

low 表示婴儿出生体重小于 2.5kg,age 表示母亲的年龄(年),lwt 母亲最后一次月经期间的体重(磅),race 母亲的种族(1 =白人,2 =黑人,3 =其他)。,smoke 怀孕期间的吸烟状况,ptl 以前早产的次数,ht 高血压病史,ui 子宫过敏,ftv 妊娠头三个月的医生就诊次数,bwt 出生体重(克)

with(birthwt, tapply(lwt, ui, var))
##        0        1 
## 940.8472 783.7196
t.test(lwt ~ ui, data = birthwt, var.equal = TRUE)
## 
## 	Two Sample t-test
## 
## data:  lwt by ui
## t = 2.1138, df = 187, p-value = 0.03586
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.8753389 25.3544748
## sample estimates:
## mean in group 0 mean in group 1 
##        131.7578        118.6429
t.test(lwt ~ ui, data = birthwt)
## 
## 	Welch Two Sample t-test
## 
## data:  lwt by ui
## t = 2.2547, df = 39.163, p-value = 0.02982
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   1.351128 24.878685
## sample estimates:
## mean in group 0 mean in group 1 
##        131.7578        118.6429
# birthwt$ui <- as.factor(birthwt$ui)
# library(lattice)
# bwplot(lwt ~ ui, data = birthwt, pch = "|")

boxplot(lwt ~ ui, data = birthwt)

# 重新编码,数据预处理,方便代入模型
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+" # 除了前两个水平外,其余的都编码为 2+
  data.frame(
    low = factor(low), age, lwt, race, smoke = (smoke > 0),
    ptd, ht = (ht > 0), ui = (ui > 0), ftv
  )
})
# 查看编码后的数据
head(bwt)
##   low age lwt  race smoke   ptd    ht    ui ftv
## 1   0  19 182 black FALSE FALSE FALSE  TRUE   0
## 2   0  33 155 other FALSE FALSE FALSE FALSE  2+
## 3   0  20 105 white  TRUE FALSE FALSE FALSE   1
## 4   0  21 108 white  TRUE FALSE FALSE  TRUE  2+
## 5   0  18 107 white  TRUE FALSE FALSE  TRUE   0
## 6   0  21 124 other FALSE FALSE FALSE FALSE   0
str(bwt)
## 'data.frame':	189 obs. of  9 variables:
##  $ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : Factor w/ 3 levels "white","black",..: 2 3 1 1 1 3 1 3 1 ..
##  $ smoke: logi  FALSE FALSE TRUE TRUE TRUE FALSE ...
##  $ ptd  : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ht   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ui   : logi  TRUE FALSE FALSE TRUE TRUE FALSE ...
##  $ ftv  : Factor w/ 3 levels "0","1","2+": 1 3 2 3 1 1 2 2 2 1 ...

广义线性模型拟合,二项逻辑回归,响应变量为婴儿出生的体重,以2.5kg为界,它被编码成二分类变量 0或1

options(contrasts = c("contr.treatment", "contr.poly"))
glm(formula = low ~ ., family = binomial, data = bwt)
## 
## Call:  glm(formula = low ~ ., family = binomial, data = bwt)
## 
## Coefficients:
## (Intercept)          age          lwt    raceblack    raceother  
##     0.82302     -0.03723     -0.01565      1.19241      0.74068  
##   smokeTRUE      ptdTRUE       htTRUE       uiTRUE         ftv1  
##     0.75553      1.34376      1.91317      0.68020     -0.43638  
##       ftv2+  
##     0.17901  
## 
## Degrees of Freedom: 188 Total (i.e. Null);  178 Residual
## Null Deviance:	    234.7 
## Residual Deviance: 195.5 	AIC: 217.5

多项逻辑回归

library(nnet)
(bwt.mu <- multinom(formula = low ~ ., data = bwt))
## # weights:  12 (11 variable)
## initial  value 131.004817 
## iter  10 value 98.029803
## final  value 97.737759 
## converged
## Call:
## multinom(formula = low ~ ., data = bwt)
## 
## Coefficients:
## (Intercept)         age         lwt   raceblack   raceother 
##  0.82320102 -0.03723828 -0.01565359  1.19240391  0.74065606 
##   smokeTRUE     ptdTRUE      htTRUE      uiTRUE        ftv1 
##  0.75550487  1.34375901  1.91320116  0.68020207 -0.43638470 
##       ftv2+ 
##  0.17900392 
## 
## Residual Deviance: 195.4755 
## AIC: 217.4755
summary(bwt.mu)
## Call:
## multinom(formula = low ~ ., data = bwt)
## 
## Coefficients:
##                  Values  Std. Err.
## (Intercept)  0.82320102 1.24476766
## age         -0.03723828 0.03870437
## lwt         -0.01565359 0.00708079
## raceblack    1.19240391 0.53598076
## raceother    0.74065606 0.46176615
## smokeTRUE    0.75550487 0.42503626
## ptdTRUE      1.34375901 0.48063449
## htTRUE       1.91320116 0.72076133
## uiTRUE       0.68020207 0.46434974
## ftv1        -0.43638470 0.47941107
## ftv2+        0.17900392 0.45639129
## 
## Residual Deviance: 195.4755 
## AIC: 217.4755

计算 Z 分数和 P 值

z <- summary(bwt.mu)$coefficients / summary(bwt.mu)$standard.errors
z
## (Intercept)         age         lwt   raceblack   raceother 
##   0.6613291  -0.9621210  -2.2107121   2.2247140   1.6039635 
##   smokeTRUE     ptdTRUE      htTRUE      uiTRUE        ftv1 
##   1.7775069   2.7958023   2.6544170   1.4648486  -0.9102516 
##       ftv2+ 
##   0.3922159
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
## (Intercept)         age         lwt   raceblack   raceother 
## 0.508401310 0.335988847 0.027055777 0.026100443 0.108722092 
##   smokeTRUE     ptdTRUE      htTRUE      uiTRUE        ftv1 
## 0.075484881 0.005177106 0.007944557 0.142962228 0.362689827 
##       ftv2+ 
## 0.694898695

模型解释

17.7 哥本哈根住房状况调查

响应变量是离散类别,且存在强弱,等级,大小之分

调用函数 MASS::polr

数据集 housing 哥本哈根住房状况调查中的次数分布表,Sat 住户对目前居住环境的满意程度,是一个有序的因子变量,Infl 住户对物业管理的感知影响程度,Type 租赁住宿类型,如塔楼、中庭、公寓、露台,Cont 联系居民可与其他居民联系(低、高),Freq 每个类中的居民人数,调查的人数

data("housing", package = "MASS")
# 查看数据 help(housing)
head(housing)
##      Sat   Infl  Type Cont Freq
## 1    Low    Low Tower  Low   21
## 2 Medium    Low Tower  Low   21
## 3   High    Low Tower  Low   28
## 4    Low Medium Tower  Low   34
## 5 Medium Medium Tower  Low   22
## 6   High Medium Tower  Low   36
str(housing)
## 'data.frame':	72 obs. of  5 variables:
##  $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 ..
##  $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ..
##  $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1..
##  $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: int  21 21 28 34 22 36 10 11 36 61 ...

居民对居住环境满意度 Sat 三个等级的有序回归

options(contrasts = c("contr.treatment", "contr.poly"))
house.plr <- MASS::polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
house.plr
## Call:
## MASS::polr(formula = Sat ~ Infl + Type + Cont, data = housing, 
##     weights = Freq)
## 
## Coefficients:
##    InflMedium      InflHigh TypeApartment    TypeAtrium 
##     0.5663937     1.2888191    -0.5723501    -0.3661866 
##   TypeTerrace      ContHigh 
##    -1.0910149     0.3602841 
## 
## Intercepts:
##  Low|Medium Medium|High 
##  -0.4961353   0.6907083 
## 
## Residual Deviance: 3479.149 
## AIC: 3495.149

拟合结果

summary(house.plr, digits = 3)
## Call:
## MASS::polr(formula = Sat ~ Infl + Type + Cont, data = housing, 
##     weights = Freq)
## 
## Coefficients:
##                Value Std. Error t value
## InflMedium     0.566     0.1047    5.41
## InflHigh       1.289     0.1272   10.14
## TypeApartment -0.572     0.1192   -4.80
## TypeAtrium    -0.366     0.1552   -2.36
## TypeTerrace   -1.091     0.1515   -7.20
## ContHigh       0.360     0.0955    3.77
## 
## Intercepts:
##             Value  Std. Error t value
## Low|Medium  -0.496  0.125     -3.974 
## Medium|High  0.691  0.125      5.505 
## 
## Residual Deviance: 3479.149 
## AIC: 3495.149

再计算一下 P 值,置信区间

ctable <- coef(summary(house.plr))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
ctable <- cbind(ctable, "p value" = p)
# confidence intervals 计算置信区间
ci <- confint(house.plr)
exp(coef(house.plr))
##    InflMedium      InflHigh TypeApartment    TypeAtrium 
##     1.7619017     3.6284990     0.5641979     0.6933734 
##   TypeTerrace      ContHigh 
##     0.3358754     1.4337368
## OR and CI
exp(cbind(OR = coef(house.plr), ci))
##                      OR     2.5 %    97.5 %
## InflMedium    1.7619017 1.4356845 2.1639915
## InflHigh      3.6284990 2.8319659 4.6626461
## TypeApartment 0.5641979 0.4462124 0.7121941
## TypeAtrium    0.6933734 0.5114084 0.9398410
## TypeTerrace   0.3358754 0.2492277 0.4514276
## ContHigh      1.4337368 1.1892931 1.7296674

模型解释

参考文档 help(housing) 包含泊松回归、多项回归、比例风险模型,以及 https://www.analyticsvidhya.com/blog/2016/02/multinomial-ordinal-logistic-regression/

好好看文档 help(housing) 和对应的参考书籍,把原理弄清楚

有序因子变量是如何实现编码的

17.8 癫痫病发作次数

纵向数据 (Thall and Vail 1990),考虑了过度发散 overdispersion 异方差 heteroscedasticity 观测不独立

数据集 epil 记录癫痫发作的次数及病人的特征,下面是数据建模分析过程

data(epil, package = "MASS")
fit_glm_epil <- glm(y ~ lbase * trt + lage + V4,
  family = poisson,
  data = epil
)
summary(fit_glm_epil, cor = FALSE)

epil2 <- epil[epil$period == 1, ]
epil2["period"] <- rep(0, 59)
epil2["y"] <- epil2["base"]
epil["time"] <- 1
epil2["time"] <- 4
epil2 <- rbind(epil, epil2)
epil2$pred <- unclass(epil2$trt) * (epil2$period > 0)
epil2$subject <- factor(epil2$subject)

epil3 <- aggregate(
  epil2, list(epil2$subject, epil2$period > 0),
  function(x) if (is.numeric(x)) sum(x) else x[1]
)
epil3$pred <- factor(epil3$pred,
  labels = c("base", "placebo", "drug")
)
contrasts(epil3$pred) <- structure(MASS::contr.sdif(3),
  dimnames = list(NULL, c("placebo-base", "drug-placebo"))
)

summary(glm(y ~ pred + factor(subject) + offset(log(time)),
  family = poisson, data = epil3
), cor = FALSE)

summary(MASS::glmmPQL(y ~ lbase * trt + lage + V4,
  random = ~ 1 | subject,
  family = poisson, data = epil
))

summary(MASS::glmmPQL(y ~ pred,
  random = ~ 1 | subject,
  family = poisson, data = epil3
))

17.9 对数线性模型

gllm: Generalised log-Linear Model

Routines for log-linear models of incomplete contingency tables, including some latent class models, via EM and Fisher scoring approaches. Allows bootstrapping. See Espeland and Hui (1987) doi:10.2307/2531553 for general approach.

17.10 泊松回归模型

加载数据

data(beall.webworms, package = "agridat")

查看数据

head(beall.webworms)
##   row col y block trt spray lead
## 1   1   1 1    B1  T1     N    N
## 2   2   1 0    B1  T1     N    N
## 3   3   1 1    B1  T1     N    N
## 4   4   1 3    B1  T1     N    N
## 5   5   1 6    B1  T1     N    N
## 6   6   1 0    B2  T1     N    N

描述响应变量的分布

hist(beall.webworms$y, main = "Histogram of Worm Count", xlab = "Number of Worms")

boxplot(y ~ trt, data = beall.webworms)

抖动图

plot(y ~ trt, data = beall.webworms, xlab = "Types of Worms", ylab = "Worm Count")

library(ggplot2)
ggplot(beall.webworms, aes(trt, y)) +
  geom_boxplot(colour = "red") +
  geom_jitter() +
  labs(x = "Types of Worms", y = "Worm Count") +
  theme_minimal()

pois.mod <- glm(y ~ trt, data = beall.webworms, family = "poisson")
summary(pois.mod)
## 
## Call:
## glm(formula = y ~ trt, family = "poisson", data = beall.webworms)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6733  -1.0046  -0.9081   0.6141   4.2771  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.33647    0.04688   7.177 7.12e-13 ***
## trtT2       -1.02043    0.09108 -11.204  < 2e-16 ***
## trtT3       -0.49628    0.07621  -6.512 7.41e-11 ***
## trtT4       -1.22246    0.09829 -12.438  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1955.9  on 1299  degrees of freedom
## Residual deviance: 1720.4  on 1296  degrees of freedom
## AIC: 3125.5
## 
## Number of Fisher Scoring iterations: 6

模型系数 T2 的解释,这里 GLM 使用了对数联系函数(log link function) ,因此 -1.02 是对数变换后的值,T2的系数实际是 0.3605949 ,实际意义是相对于T1,T2 类型的蠕虫数量是 T1 的 0.3605949 倍

The first valuable information is related to the residuals of the model, which should be symmetrical as for any normal linear model. From this output we can see that minimum and maximum, as well as the first and third quartiles, are similar, so this assumption is confirmed. Then we can see that the variable trt (i.e. treatment factor) is highly significant for the model, with very low p-values. The statistical test in this case is not a t-test, as in the output of the function lm, but a Wald Test (Wald Test). This test computes the probability that the coefficient is 0, if the p is significant it means the chances the coefficient is zero are very low so the variable should be included in the model since it has an effect on y.

Another important information is the deviance, particularly the residual deviance. As a general rule, this value should be lower or in line than the residuals degrees of freedom for the model to be good. In this case the fact that the residual deviance is high (even though not dramatically) may suggests the explanatory power of the model is low. We will see below how to obtain p-value for the significance of the model.

par(mfrow = c(2, 2))
plot(pois.mod)

predict(pois.mod, newdata = data.frame(trt = c("T1", "T2")))
##          1          2 
##  0.3364722 -0.6839588

模型的P值

1 - pchisq(deviance(pois.mod), df.residual(pois.mod))
## [1] 1.709743e-14

模型选择

pois.mod2 <- glm(y ~ block + spray * lead, data = beall.webworms, family = "poisson")

两模型的 AIC 比较

AIC(pois.mod, pois.mod2)
##           df      AIC
## pois.mod   4 3125.478
## pois.mod2 16 3027.438

假设响应变量 Y 服从泊松分布,意味着随机变量 Y 的期望和方差相等

mean(beall.webworms$y)
## [1] 0.7923077
var(beall.webworms$y)
## [1] 1.290164

实际上方差比均值大,这种情况称之为过度发散 (overdispersed),分布应该修正为拟(似然)泊松分布

pois.mod3 <- glm(y ~ trt, data = beall.webworms, family = c("quasipoisson"))
summary(pois.mod3)
## 
## Call:
## glm(formula = y ~ trt, family = c("quasipoisson"), data = beall.webworms)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6733  -1.0046  -0.9081   0.6141   4.2771  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.33647    0.05457   6.166 9.32e-10 ***
## trtT2       -1.02043    0.10601  -9.626  < 2e-16 ***
## trtT3       -0.49628    0.08870  -5.595 2.69e-08 ***
## trtT4       -1.22246    0.11440 -10.686  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.35472)
## 
##     Null deviance: 1955.9  on 1299  degrees of freedom
## Residual deviance: 1720.4  on 1296  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

计算得知发散参数(dispersion parameter) 是 1.35472,可见数据Y并不是发散得离谱,泊松分布可能仍然是对这个数据的合理假设

AER 包是书籍 Applied Econometrics with R 的配套材料 (Kleiber and Zeileis 2008),可用于直接检验发散参数是否大于1

# AER::dispersiontest(pois.mod, alternative="greater")

如果数据真的过度离散,就应该使用负二项分布作为响应变量的拟合分布,拟合它就采用 MASS 包 (Venables and Ripley 2002) 提供的 glm.nb函数

NB.mod1 <- MASS::glm.nb(y ~ trt, data = beall.webworms)
summary(NB.mod1)
## 
## Call:
## MASS::glm.nb(formula = y ~ trt, data = beall.webworms, init.theta = 2.004130573, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4572  -0.9488  -0.8660   0.5340   2.7698  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.33647    0.06110   5.507 3.65e-08 ***
## trtT2       -1.02043    0.10661  -9.572  < 2e-16 ***
## trtT3       -0.49628    0.09423  -5.267 1.39e-07 ***
## trtT4       -1.22246    0.11283 -10.834  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.0041) family taken to be 1)
## 
##     Null deviance: 1442.7  on 1299  degrees of freedom
## Residual deviance: 1275.3  on 1296  degrees of freedom
## AIC: 3053
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.004 
##           Std. Err.:  0.325 
## 
##  2 x log-likelihood:  -3042.969

两个模型的方差分析

anova(pois.mod, pois.mod2, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: y ~ trt
## Model 2: y ~ block + spray * lead
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      1296     1720.4                          
## 2      1284     1598.4 12   122.04 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

从方差分析比较的结果来看,P值告诉我们,两模型是显著不同的,由上面对两模型的 AIC 计算结果来看,模型 pois.mod2 比模型 pois.mod 要好,模型的 AIC 值越小,表明拟合得越准确。

参考文献

Dobson, Annette J., and Adrian G. Barnett. 2018. An Introduction to Generalized Linear Models. Fourth. Boca Raton, Florida: Chapman; Hall/CRC. https://www.crcpress.com/p/book/9781138741515.

Hasan, Asad, Zhiyu Wang, and Alireza S. Mahani. 2016. “Fast Estimation of Multinomial Logit Models: R Package mnlogit.” Journal of Statistical Software 75 (3): 1–24. https://doi.org/10.18637/jss.v075.i03.

Hosmer, David W., and Stanley Lemeshow. 2000. Applied Logistic Regression. Second. New York, NY: John Wiley & Sons.

Ishwaran, H., and J. S. Rao. 2005. “Spike and Slab Variable Selection: Frequentist and Bayesian Strategies.” Ann. Statist. 33 (2): 730–73. http://arXiv.org/abs/math/0505633v1.

Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics with R. New York: Springer-Verlag. https://CRAN.R-project.org/package=AER.

McCullagh, Peter, and John Nelder. 1989. Generalized Linear Models. Second. London: Chapman; Hall/CRC. https://www.crcpress.com/p/book/9780412317606.

Thall, Peter F., and Stephen C. Vail. 1990. “Some Covariance Models for Longitudinal Count Data with Overdispersion.” Biometrics 46 (3): 657–71. https://www.jstor.org/stable/2532086.

Train, Kenneth E. 2009. Discrete Choice Methods with Simulation. Second. New York, NY: Cambridge University Press.

Venables, W. N., and Brian D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York, NY: Springer-Verlag. http://www.stats.ox.ac.uk/pub/MASS4.