Chapter 13 Parameter Estimation

Jeremy Koster: My students were looking at the estimated varying intercepts for each higher-level group (or the “BLUP’s”, as some people seem to call them).

Douglas Bates: As Alan James once said, “these values are just like the BLUPs - Best Linear Unbiased Predictors - except that they aren’t linear and they aren’t unbiased and there is no clear sense in which they are ‘best’, but other than that …”

— Jeremy Koster and Douglas Bates

13.1 Point Estimation

  • Method of moments estimation
  • Maximum likelihood estimation
  • Least squares estimation
  • Equivariant estimation
  • Robust estimation

Parameter estimation for one-parameter and multi-parameter models, e.g. the exponential, Poisson, binomial, and normal distributions, and the derivations of the various estimators for the linear model.
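As a minimal illustration of the one-parameter case (an example of my own, not taken from the textbook): for the exponential distribution with rate \(\lambda\), the method of moments and maximum likelihood both give \(\hat{\lambda} = 1/\bar{x}\), which a quick simulation confirms.

# Exponential(rate = lambda): method of moments and MLE both give 1 / mean(x)
set.seed(2019)
lambda <- 2
x <- rexp(n = 1000, rate = lambda)
1 / mean(x) # moment / ML estimate, close to the true value 2
# the same estimate by numerically minimizing the negative log-likelihood
nll <- function(l) -sum(dexp(x, rate = l, log = TRUE))
optimize(nll, interval = c(0.01, 10))$minimum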

We should also consider the case where \((X^{\top}X)^{-1}\) does not exist. In the sense of minimizing the mean squared error, the estimator \(\hat{\beta}\) of \(\beta\) need not be unbiased, which motivates the ridge estimator \(\hat{\beta}_{ridge}\), the shrinkage estimator \(\hat{\beta}_{jse}\), the principal components estimator \(\hat{\beta}_{pca}\), and the partial least squares estimator \(\hat{\beta}_{pls}\). Compared with \(\hat{\beta}_{pca}\), \(\hat{\beta}_{pls}\) takes the response variable into account. Chapter 5, Section 5 of 数理统计引论, on classes of linear estimators, starts from improving the LS estimator: by sacrificing some bias, i.e. using a biased estimator, one achieves a smaller overall mean squared error (陈希孺 1981).
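A minimal sketch of the ridge estimator \(\hat{\beta}_{ridge} = (X^{\top}X + \lambda I)^{-1} X^{\top} y\) on simulated, nearly collinear data; the data, the value of \(\lambda\), and the choice to penalize all coefficients (including the intercept) are my own simplifications for illustration.

# ridge estimate when X^T X is ill-conditioned (simulated collinear data)
set.seed(2019)
n <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01) # nearly collinear with x1
X <- cbind(1, x1, x2)
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
lambda <- 0.1
beta_ols <- solve(t(X) %*% X, t(X) %*% y)                      # unstable here
beta_ridge <- solve(t(X) %*% X + lambda * diag(3), t(X) %*% y) # shrunken, stable
cbind(ols = beta_ols, ridge = beta_ridge)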

Can the James-Stein estimator be viewed as a kind of shrinkage estimator? Since it sacrifices some unbiasedness in exchange for a reduction in overall variance, there should be some connection with the estimators above.

13.1.1 Method of Moments Estimation

13.1.2 Least Squares Estimation

On nonlinear least squares: the point of the quote below is that for nonlinear models one should not report things like ANOVA and \(R^2\).

As one of the developers of the nls function I would like to state that the lack of automatic ANOVA, \(R^2\) and \(adj. R^2\) from nls is a feature, not a bug :-)

— Douglas Bates

Least squares is a nonparametric estimation method (it makes no assumption about the distribution of the data; it only requires that the prediction error be minimized), whereas maximum likelihood is a parametric estimation method (the observed data are assumed to follow a parametric multivariate distribution).
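To make the contrast concrete (a sketch on made-up data): for the linear model with normal errors the two coincide, since minimizing the residual sum of squares and maximizing the normal log-likelihood give the same \(\hat{\beta}\).

# least squares and maximum likelihood coincide for the normal linear model
set.seed(2019)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.5)
coef(lm(y ~ x)) # least squares estimate
# maximize the normal log-likelihood over (beta0, beta1, log(sigma))
nll <- function(par) {
  -sum(dnorm(y, mean = par[1] + par[2] * x, sd = exp(par[3]), log = TRUE))
}
optim(c(0, 0, 0), nll)$par[1:2] # essentially the same beta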

Nonlinear least squares estimation

# Nonlinear least-squares using nlm()
# demo(nlm)

# Helical Valley Function
# nonlinear least squares

theta <- function(x1, x2) (atan(x2 / x1) + (if (x1 <= 0) pi else 0)) / (2 * pi)
## a more concise formulation
theta <- function(x1, x2) atan2(x2, x1) / (2 * pi)
# objective function (sum of squares of the three residual components)
f <- function(x) {
  f1 <- 10 * (x[3] - 10 * theta(x[1], x[2]))
  f2 <- 10 * (sqrt(x[1]^2 + x[2]^2) - 1)
  f3 <- x[3]
  return(f1^2 + f2^2 + f3^2)
}

## explore surface {at x3 = 0}
x <- seq(-1, 2, length.out = 50)
y <- seq(-1, 1, length.out = 50)
z <- apply(as.matrix(expand.grid(x, y)), 1, function(x) f(c(x, 0)))

contour(x, y, matrix(log10(z), 50, 50))

nlm.f <- nlm(f, c(-1, 0, 0), hessian = TRUE)

points(rbind(nlm.f$estim[1:2]), col = "red", pch = 20)
### the Rosenbrock banana valley function

fR <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
  100 * (x2 - x1 * x1)^2 + (1 - x1)^2
}

## explore surface
fx <- function(x) { ## `vectorized' version of fR()
  x1 <- x[, 1]
  x2 <- x[, 2]
  100 * (x2 - x1 * x1)^2 + (1 - x1)^2
}
x <- seq(-2, 2, length.out = 100)
y <- seq(-0.5, 1.5, length.out = 100)
z <- fx(expand.grid(x, y))
op <- par(mfrow = c(2, 1), mar = 0.1 + c(3, 3, 0, 0))
contour(x, y, matrix(log10(z), length(x)))

nlm.f2 <- nlm(fR, c(-1.2, 1), hessian = TRUE)
points(rbind(nlm.f2$estim[1:2]), col = "red", pch = 20)

## Zoom in :
rect(0.9, 0.9, 1.1, 1.1, border = "orange", lwd = 2)
x <- y <- seq(0.9, 1.1, length.out = 100)
z <- fx(expand.grid(x, y))
contour(x, y, matrix(log10(z), length(x)))
mtext("zoomed in")
box(col = "orange")
points(rbind(nlm.f2$estim[1:2]), col = "red", pch = 20)
par(op)
with(
  nlm.f2,
  stopifnot(
    all.equal(estimate, c(1, 1), tol = 1e-5),
    minimum < 1e-11, abs(gradient) < 1e-6, code %in% 1:2
  )
)
fg <- function(x) {
  gr <- function(x1, x2) {
    c(-400 * x1 * (x2 - x1 * x1) - 2 * (1 - x1), 200 * (x2 - x1 * x1))
  }
  x1 <- x[1]
  x2 <- x[2]
  structure(100 * (x2 - x1 * x1)^2 + (1 - x1)^2,
    gradient = gr(x1, x2)
  )
}

nfg <- nlm(fg, c(-1.2, 1), hessian = TRUE)
str(nfg)
with(
  nfg,
  stopifnot(
    minimum < 1e-17, all.equal(estimate, c(1, 1)),
    abs(gradient) < 1e-7, code %in% 1:2
  )
)
## or use deriv to find the derivatives

fd <- deriv(~ 100 * (x2 - x1 * x1)^2 + (1 - x1)^2, c("x1", "x2"))
fdd <- function(x1, x2) {}
body(fdd) <- fd

nlfd <- nlm(function(x) fdd(x[1], x[2]), c(-1.2, 1), hessian = TRUE)
str(nlfd)
with(
  nlfd,
  stopifnot(
    minimum < 1e-17, all.equal(estimate, c(1, 1)),
    abs(gradient) < 1e-7, code %in% 1:2
  )
)
fgh <- function(x) {
  gr <- function(x1, x2) {
    c(-400 * x1 * (x2 - x1 * x1) - 2 * (1 - x1), 200 * (x2 - x1 * x1))
  }
  h <- function(x1, x2) {
    a11 <- 2 - 400 * x2 + 1200 * x1 * x1
    a21 <- -400 * x1
    matrix(c(a11, a21, a21, 200), 2, 2)
  }
  x1 <- x[1]
  x2 <- x[2]
  structure(100 * (x2 - x1 * x1)^2 + (1 - x1)^2,
    gradient = gr(x1, x2),
    hessian  =  h(x1, x2)
  )
}

nlfgh <- nlm(fgh, c(-1.2, 1), hessian = TRUE)

str(nlfgh)
## NB: This did _NOT_ converge for R version <= 3.4.0
with(
  nlfgh,
  stopifnot(
    minimum < 1e-15, # see 1.13e-17 .. slightly worse than above
    all.equal(estimate, c(1, 1), tol = 9e-9), # see 1.236e-9
    abs(gradient) < 7e-7, code %in% 1:2
  )
) # g[1] = 1.3e-7
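Besides hand-coded objective functions passed to nlm(), nonlinear least squares fits are usually obtained with nls(), which the Bates quote above refers to. A minimal sketch on simulated exponential-decay data; the model and starting values are my own choices.

# nonlinear least squares with nls() on simulated exponential-decay data
set.seed(2019)
x <- seq(0, 5, length.out = 50)
y <- 2 * exp(-1.5 * x) + rnorm(50, sd = 0.05)
fit <- nls(y ~ a * exp(b * x), start = list(a = 1, b = -1))
summary(fit)$coefficients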

13.1.3 Maximum Likelihood Estimation

The textbook devotes only a short sentence to this, but there is a lot worth expanding on. A mathematician proposed one of the core ideas of statistics: what was he studying when he came up with it, and why did it receive so little attention afterwards? Although this may be a little off topic, it may offer readers other insights. How did Fisher arrive at the same idea again a full 100 years later, and what did he do to make it widely accepted and applied?

In statistical decision theory, any statistical inference should depend on a loss function, which maximum likelihood estimation does not take into account; this is one of its limitations. Also worth discussing: the relationship between the Lasso and Bayesian priors, and their relationship to loss functions.

Should it be called 最大似然估计 or 极大似然估计 in Chinese? 极大似然估计, of course; if someone tells you it is 最大似然估计, that is simply wrong. In the end the two terms come down to the difference between a local extremum (极值) and a global maximum (最值).

The textbook's definitions and properties are introduced in later chapters.

Explain why REML is introduced for linear models to reduce bias.
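The bias issue is already visible in the simplest case (a sketch, not the textbook's derivation): for an i.i.d. normal sample the ML estimate of \(\sigma^2\) divides by \(n\) and is biased downward by the factor \((n-1)/n\), because it ignores the degree of freedom spent on estimating the mean; the REML-type estimate divides by \(n-1\) and is unbiased.

# ML estimate of sigma^2 divides by n and is biased downward;
# the REML-type estimate divides by n - 1 and is unbiased (simulation check)
set.seed(2019)
n <- 5
est <- replicate(10^4, {
  x <- rnorm(n, mean = 0, sd = 1) # true sigma^2 = 1
  c(ml = sum((x - mean(x))^2) / n, reml = var(x))
})
rowMeans(est) # ml is about (n - 1) / n = 0.8, reml is about 1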

Maximum likelihood estimation was proposed by Fisher.

  • Marginal likelihood
  • Conditional likelihood
  • Complete likelihood
  • Hierarchical likelihood
  • Partial likelihood

Quasi-likelihood estimation

Maximum likelihood estimation

The likelihood function

  • Profile likelihood
  • Restricted likelihood
  • Penalized quasi-likelihood (PQL) / marginal quasi-likelihood (MQL)
  • Distributions: marginal distribution, conditional distribution
  • Likelihoods: marginal likelihood, conditional likelihood
  • Maximum likelihood, abbreviated ML
  • Restricted maximum likelihood, abbreviated REML
  • Penalized quasi-likelihood, abbreviated PQL; marginal quasi-likelihood, abbreviated MQL; profile maximum likelihood, abbreviated PML

Penalized maximum likelihood estimates are calculated using optimization methods such as the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS).

The BFGS quasi-Newton method and samplers: https://bookdown.org/rdpeng/advstatcomp
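A minimal sketch of penalized maximum likelihood: a ridge-penalized logistic regression fitted with optim() and method = "L-BFGS-B"; the simulated data and the penalty \(\lambda\) are made up for illustration.

# penalized maximum likelihood: ridge-penalized logistic regression via L-BFGS-B
set.seed(2019)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))
lambda <- 0.5
pen_nll <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE)) + lambda * sum(beta^2)
}
optim(c(0, 0), pen_nll, method = "L-BFGS-B")$par # coefficients shrunken towards zero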

13.2 Interval Estimation

13.2.1 Confidence Intervals and Fiducial Intervals

Parameter estimation for the binomial distribution, including point estimation and interval estimation (Clopper and Pearson 1934).

Given a sample of size n = 10 from a 0-1 (Bernoulli) distribution whose success probability p takes the values 0.1, 0.2, ..., 1, and a confidence level of 95%, estimate the lower and upper limits of p when the observed x equals 1, 2, 3, ..., 10.

set.seed(2019)
x <- rbinom(n = 1, size = 10, prob = 0.1) # interpret the result

Toss a coin 10 times and observe 2 heads; estimate the probability of heads.

Having observed 2 heads, give an interval (p_low, p_up) for p with 95% confidence.
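Before deriving the limits by hand, note that binom.test() already returns the Clopper-Pearson interval and can serve as a check; for 2 heads in 10 tosses the 95% interval is roughly (0.025, 0.556).

# Clopper-Pearson 95% interval for 2 successes in 10 trials
binom.test(x = 2, n = 10, conf.level = 0.95)$conf.int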

Plot the estimated limits of p as a curve against x.

set.seed(2019)
p <- seq(from = 0, to = 1, length.out = 11)
# values of the success probability, the population parameter p
sapply(rep(p, each = 10), rbinom, n = 1, size = 10)
##   [1]  0  0  0  0  0  0  0  0  0  0  2  1  0  1  0  0  2  0  0  1  3
##  [22]  2  1  1  3  2  0  3  1  2  3  3  5  2  1  0  3  5  3  3  5  1
##  [43]  5  3  1  2  3  4  1  5  5  4  7  6  6  5  7  7  4  4  7  8  9
##  [64]  7  6  7  2  4  8  8  8  8  6  8  6  4  8  9  6  7  9  9  9  9
##  [85]  8  4  9  8  9  7 10  8  7 10  9 10  9  9  8 10 10 10 10 10 10
## [106] 10 10 10 10 10

Compute the lower and upper limits obtained from each sample.

The Clopper-Pearson method searches by summation, requiring the cumulative tail probability to satisfy

\[B(x; n, p) = \sum_{r = x}^{n} \binom{n}{r} p^r (1-p)^{n-r} = \alpha/2\]

where n is the number of trials (here 10), p is unknown and to be solved for, \(\alpha = 0.05\) is given, and \(1-\alpha\) is the confidence level, meaning that, for interval estimates constructed this way, in the long run we are 95% confident that the true value \(p\) falls within the interval.

Summing the upper tail from x to n and solving for p gives one value of p for each x; by construction the two tails, over \([0, x]\) and \([x, n]\), each account for cumulative probability \(\alpha/2\).
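Equivalently (a sketch using the standard beta-quantile form of the Clopper-Pearson limits rather than solving the sums directly): the lower limit is the \(\alpha/2\) quantile of Beta\((x, n-x+1)\) and the upper limit is the \(1-\alpha/2\) quantile of Beta\((x+1, n-x)\); this reproduces the roots found by hand below, e.g. about 0.444 for x = 8 and 0.555 for x = 9.

# Clopper-Pearson limits via beta quantiles, n = 10, alpha = 0.05
n <- 10
alpha <- 0.05
x <- 1:9 # the limits for x = 0 and x = n are 0 and 1 by convention
lower <- qbeta(alpha / 2, x, n - x + 1)
upper <- qbeta(1 - alpha / 2, x + 1, n - x)
cbind(x, lower, upper)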

# exact computation of the Clopper-Pearson limit for the binomial test
# (symbolic computation could also be used)
# x = 8: solve P(X >= 8) = 0.025 for p, with the binomial coefficients from choose()
fun <- function(p, r = 8, n = 10) {
  choose(n, n-2)*p^r*(1-p)^(n-r) + choose(n, n-1)*p^(n-1)*(1-p) + choose(n, n)*p^n - 0.025
}
uniroot(fun, lower = 0, upper = 1)
## $root
## [1] 0.4439038
## 
## $f.root
## [1] -2.707352e-07
## 
## $iter
## [1] 9
## 
## $init.it
## [1] NA
## 
## $estim.prec
## [1] 6.103516e-05
# x = 8, the same equation with the coefficients written out explicitly
fun <- function(p) {
  45*p^8*(1-p)^2 + 10*p^9*(1-p) + p^10 - 0.025
}
uniroot(fun, lower = 0, upper = 1)
## $root
## [1] 0.4439038
## 
## $f.root
## [1] -2.707352e-07
## 
## $iter
## [1] 9
## 
## $init.it
## [1] NA
## 
## $estim.prec
## [1] 6.103516e-05
# x = 9
fun <- function(x) {
  9 * x^10 - 10 * x^9 + 0.025
}
# compute the lower limit, about 0.555
uniroot(fun, lower = 0, upper = 1)
## $root
## [1] 0.5549828
## 
## $f.root
## [1] 3.773379e-07
## 
## $iter
## [1] 10
## 
## $init.it
## [1] NA
## 
## $estim.prec
## [1] 6.462529e-05
# x = 10
fun <- function(x) {
  x^10 - 0.025
}
# lower limit, about 0.691
uniroot(fun, lower = 0, upper = 1)
## $root
## [1] 0.6914996
## 
## $f.root
## [1] -1.194136e-06
## 
## $iter
## [1] 9
## 
## $init.it
## [1] NA
## 
## $estim.prec
## [1] 6.103516e-05

Cumulative binomial probabilities.

Find the smallest p for which the upper 0.025 quantile of the Binomial(10, p) distribution equals 9.

# given the tail probability, find the upper quantile

# equal to 9
qbinom(0.025, size = 10, prob = 0.565, lower.tail = F)
## [1] 9

Find the smallest p at which the function equals 0: find all the roots, then take the smallest one.

fun <- function(p, r = 9) qbinom(0.025, size = 10, prob = p, lower.tail = F) - r
# compute the p corresponding to each x
(p <- sapply(1:10, function(x) uniroot(fun, lower = 0, upper = 1, r = x)$root))
##  [1] 0.01666667 0.05333333 0.09375000 0.16000000 0.25000000
##  [6] 0.30000000 0.35000000 0.53333333 0.67500000 1.00000000
plot(x = 1:10, y = p)

# binomial test; diamond-shaped confidence band
set.seed(2019)

dat <- replicate(10^3, expr = {
  x = sample(0:1, size = 10, replace = TRUE, prob = c(0.8, 0.2))
  sum(x)/10
})

# success probability p = 0.2, each sample of size 10
dat <- rbinom(n = 10^3, size = 10, prob = 0.2)/10
table(dat)
## dat
##   0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 
## 119 274 304 180  82  30  10   1
# distribution plot; the y axis shows density
# right = TRUE: bins of the form (a, b], open on the left and closed on the right
hist(dat, probability = T, breaks = seq(from = -0.1, to = 1, by = 0.1))

# 0.2^10; bins are left-closed and right-open, [a, b)
hist(dat, probability = T, breaks = seq(from = 0, to = 1.1, by = 0.1),
     right = FALSE, xlim = c(0, 1.1))

# the distribution
library(ggplot2)
library(magrittr)
# does this plot carry any hidden information? what does the distribution look like?
# is there a connection with the binomial expansion?
dat1 <- as.data.frame(table(dat))


ggplot(data = dat1, aes(x = dat, y = Freq)) +
  geom_col()

ggplot(as.data.frame(dat), aes(x = dat)) +
  geom_histogram(bins = 12)

13.3 Least Angle Regression

  1. Efron, Bradley, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. 2004. “Least Angle Regression.” The Annals of Statistics 32 (2): 407–499. https://doi.org/10.1214/009053604000000067.

Variance reduction techniques; bias correction techniques.
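A minimal sketch with the lars package (assuming it is installed) on simulated data; the data and coefficients are my own, though the diabetes data used in the original paper would work just as well.

# least angle regression on simulated data, using the lars package
library(lars)
set.seed(2019)
n <- 100
p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(3, -2, 0, 0, 1) + rnorm(n))
fit <- lars(X, y, type = "lar")
coef(fit) # coefficient path: variables enter one at a time along the path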

13.4 The Jackknife

  1. Efron, B. 1979. “Bootstrap Methods: Another Look at the Jackknife.” The Annals of Statistics 7 (1): 1–26. https://doi.org/10.1214/aos/1176344552.
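A minimal base-R sketch of the jackknife: leave-one-out estimates of the bias and standard error of an estimator, here the sample mean of made-up exponential data.

# jackknife estimates of bias and standard error (here for the sample mean)
set.seed(2019)
x <- rexp(30, rate = 1)
n <- length(x)
theta_hat <- mean(x)
theta_loo <- sapply(1:n, function(i) mean(x[-i])) # leave-one-out estimates
bias_jack <- (n - 1) * (mean(theta_loo) - theta_hat)
se_jack <- sqrt((n - 1) / n * sum((theta_loo - mean(theta_loo))^2))
c(estimate = theta_hat, bias = bias_jack, se = se_jack)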

13.5 Resampling

13.6 The Delta Method

References

Clopper, C. J., and E. S. Pearson. 1934. “The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial.” Biometrika 26 (4): 404–13. https://doi.org/10.1093/biomet/26.4.404.

陈希孺. 1981. 数理统计引论. 北京: 科学出版社.