# 第 25 章 线性回归

library(tidyverse)

## 25.1 从一个案例开始

wages <- read_csv("./demo_data/wages.csv")

wages %>%
head()
## # A tibble: 6 x 6
##     earn height sex    race     ed   age
##    <dbl>  <dbl> <chr>  <chr> <dbl> <dbl>
## 1 79571.   73.9 male   white    16    49
## 2 96397.   66.2 female white    16    62
## 3 48711.   63.8 female white    16    33
## 4 80478.   63.2 female other    16    95
## 5 82089.   63.1 female white    17    43
## 6 15313.   64.5 female white    15    30

### 25.1.1 缺失值检查

wages %>% colnames()
## [1] "earn"   "height" "sex"    "race"   "ed"
## [6] "age"

# 如何检查数据是否有缺失值？
wages %>%
summarise(
earn_na = sum(is.na(earn)),
height_na = sum(is.na(height)),
sex_na = sum(is.na(sex)),
race_na = sum(is.na(race)),
ed_na = sum(is.na(ed)),
age_na = sum(is.na(age))
)
## # A tibble: 1 x 6
##   earn_na height_na sex_na race_na ed_na age_na
##     <int>     <int>  <int>   <int> <int>  <int>
## 1       0         0      0       0     0      0

wages %>%
summarise_all(
~ sum(is.na(.))
)
## # A tibble: 1 x 6
##    earn height   sex  race    ed   age
##   <int>  <int> <int> <int> <int> <int>
## 1     0      0     0     0     0     0

wages %>%
map_df(~ sum(is.na(.)))
## # A tibble: 1 x 6
##    earn height   sex  race    ed   age
##   <int>  <int> <int> <int> <int> <int>
## 1     0      0     0     0     0     0

### 25.1.2 变量简单统计

wages %>% count(sex)
## # A tibble: 2 x 2
##   sex        n
##   <chr>  <int>
## 1 female   859
## 2 male     520

wages %>%
group_by(sex) %>%
summarise(
n = n(),
mean_height = mean(height),
mean_earn = mean(earn)
)
## # A tibble: 2 x 4
##   sex        n mean_height mean_earn
##   <chr>  <int>       <dbl>     <dbl>
## 1 female   859        64.5    24246.
## 2 male     520        70.0    45993.

wages %>%
ggplot(aes(x = earn, color = sex)) +
geom_density()

1. 长的越高的人挣钱越多？

2. 是否男性就比女性挣的多？

3. 影响收入最大的变量是哪个？

4. 怎么判定我们建立的模型是不是很好？

## 25.2 线性回归模型

\begin{aligned} y &= \alpha + \beta x + \epsilon \\ \epsilon &\in \text{Normal}(\mu, \sigma) \end{aligned}

$$\epsilon$$ 代表误差项，它与$$x$$ 无关，且服从正态分布。 建立线性模型，就是要估计这里的系数$$\hat\alpha$$$$\hat\beta$$，即截距项和斜率项。常用的方法是最小二乘法（ordinary least squares (OLS) regression）： 就是我们估算的$$\hat\alpha$$$$\hat\beta$$, 要使得残差的平方和最小，即$$\sum_i(y_i - \hat y_i)^2$$或者叫$$\sum_i \epsilon_i^2$$最小。

## 25.3 使用lm() 函数

lm参数很多, 但很多我们都用不上，所以我们只关注其中重要的两个参数

lm(formula = y ~ x, data)

lm(y ~ x, data) 是最常用的线性模型函数(lm是linear model的缩写)。参数解释说明

• formula：指定回归模型的公式，对于简单的线性回归模型y ~ x.
• ~ 符号：代表“预测”，可以读做“y由x预测”。有些学科不同的表述，比如下面都是可以的
• response ~ explanatory
• dependent ~ independent
• outcome ~ predictors
• data：代表数据框，数据框包含了响应变量和独立变量

wages %>%
ggplot(aes(x = height, y = earn)) +
geom_point()

mod1 <- lm(
formula = earn ~ height,
data = wages
)

names(mod1)
##  [1] "coefficients"  "residuals"     "effects"
##  [4] "rank"          "fitted.values" "assign"
##  [7] "qr"            "df.residual"   "xlevels"
## [10] "call"          "terms"         "model"

print(mod1)
##
## Call:
## lm(formula = earn ~ height, data = wages)
##
## Coefficients:
## (Intercept)       height
##     -126523         2387

summary(mod1)
##
## Call:
## lm(formula = earn ~ height, data = wages)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -47903 -19744  -5184  11642 276796
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -126523      14076   -8.99   <2e-16 ***
## height          2387        211   11.31   <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29900 on 1377 degrees of freedom
## Multiple R-squared:  0.085,	Adjusted R-squared:  0.0844
## F-statistic:  128 on 1 and 1377 DF,  p-value: <2e-16

# predict(mod1) # predictions at original x values
wages %>% modelr::add_predictions(mod1)
## # A tibble: 1,379 x 7
##      earn height sex    race        ed   age   pred
##     <dbl>  <dbl> <chr>  <chr>    <dbl> <dbl>  <dbl>
##  1 79571.   73.9 male   white       16    49 49867.
##  2 96397.   66.2 female white       16    62 31581.
##  3 48711.   63.8 female white       16    33 25708.
##  4 80478.   63.2 female other       16    95 24395.
##  5 82089.   63.1 female white       17    43 24061.
##  6 15313.   64.5 female white       15    30 27522.
##  7 47104.   61.5 female white       12    53 20385.
##  8 50960.   73.3 male   white       17    50 48434.
##  9  3213.   72.2 male   hispanic    15    25 45928.
## 10 42997.   72.4 male   white       12    30 46310.
## # ... with 1,369 more rows

# resid(mod1)
wages %>%
modelr::add_residuals(mod1)
## # A tibble: 1,379 x 8
##      earn height sex   race     ed   age   pred   resid
##     <dbl>  <dbl> <chr> <chr> <dbl> <dbl>  <dbl>   <dbl>
##  1 79571.   73.9 male  white    16    49 49867.  29705.
##  2 96397.   66.2 fema~ white    16    62 31581.  64816.
##  3 48711.   63.8 fema~ white    16    33 25708.  23003.
##  4 80478.   63.2 fema~ other    16    95 24395.  56083.
##  5 82089.   63.1 fema~ white    17    43 24061.  58028.
##  6 15313.   64.5 fema~ white    15    30 27522. -12209.
##  7 47104.   61.5 fema~ white    12    53 20385.  26720.
##  8 50960.   73.3 male  white    17    50 48434.   2526.
##  9  3213.   72.2 male  hisp~    15    25 45928. -42715.
## 10 42997.   72.4 male  white    12    30 46310.  -3313.
## # ... with 1,369 more rows

## 25.4 模型的解释

mod1的解释：

• 对于斜率$$\beta = 2387$$意味着，当一个人的身高是68英寸时，他的预期收入$$earn = -126532 + 2387 \times 68= 35806$$ 美元， 换个方式说，身高$$height$$每增加一个1英寸, 收入$$earn$$会增加2387美元。

• 对于截距$$\alpha = -126532$$，即当身高为0时，期望的收入值-126532。呵呵，人的身高不可能为0，所以这是一种极端的理论情况，现实不可能发生。

wages %>%
ggplot(aes(x = height, y = earn)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE)

## 25.5 多元线性回归

\begin{aligned} earn &= \alpha + \beta_1 \text{height} + \beta_2 \text{ed} +\epsilon \\ \end{aligned}

R语言代码实现也很简单，只需要把变量ed增加在公式的右边

mod2 <- lm(earn ~ height + ed, data = wages)

mod2
##
## Call:
## lm(formula = earn ~ height + ed, data = wages)
##
## Coefficients:
## (Intercept)       height           ed
##     -161541         2087         4118

## 25.6 更多模型

lm(earn ~ sex, data = wages)
lm(earn ~ ed, data = wages)
lm(earn ~ age, data = wages)

lm(earn ~ height + sex, data = wages)
lm(earn ~ height + ed, data = wages)
lm(earn ~ height + age, data = wages)
lm(earn ~ height + race, data = wages)

lm(earn ~ height + sex + ed, data = wages)
lm(earn ~ height + sex + age, data = wages)
lm(earn ~ height + sex + race, data = wages)
lm(earn ~ height + ed + age, data = wages)
lm(earn ~ height + ed + race, data = wages)
lm(earn ~ height + age + race, data = wages)

lm(earn ~ height + sex + ed + age, data = wages)
lm(earn ~ height + sex + ed + race, data = wages)
lm(earn ~ height + sex + age + race, data = wages)
lm(earn ~ height + ed + age + race, data = wages)
lm(earn ~ sex + ed + age + race, data = wages)

lm(earn ~ height + sex + ed + age + race, data = wages)

## 25.7 变量重要性

lm(earn ~ height + ed + age, data = wages)
• 方法一，变量都做标准化处理后，再放到模型中计算，然后对比系数的绝对值
fit <- wages %>%
mutate_at(vars(earn, height, ed, age), scale) %>%
lm(earn ~ 1 + height + ed + age, data = .)

summary(fit)
##
## Call:
## lm(formula = earn ~ 1 + height + ed + age, data = .)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.921 -0.536 -0.121  0.353  8.298
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.77e-16   2.40e-02    0.00        1
## height       2.74e-01   2.43e-02   11.26  < 2e-16 ***
## ed           3.39e-01   2.43e-02   13.97  < 2e-16 ***
## age          1.55e-01   2.44e-02    6.35  2.9e-10 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.89 on 1375 degrees of freedom
## Multiple R-squared:  0.21,	Adjusted R-squared:  0.208
## F-statistic:  122 on 3 and 1375 DF,  p-value: <2e-16
caret::varImp(fit)
##        Overall
## height   11.26
## ed       13.97
## age       6.35

## 25.8 可能遇到的情形

### 25.8.1 截距项

# 包含截距，以下两者是等价的
lm(earn ~ 1 + height, data = wages)
lm(earn ~ height, data = wages)

# 去掉截距，以下两者是等价的
lm(earn ~ height - 1, data = wages)
lm(earn ~ 0 + height, data = wages)

### 25.8.2 只有截距项

lm(earn ~ 1, data = wages)
##
## Call:
## lm(formula = earn ~ 1, data = wages)
##
## Coefficients:
## (Intercept)
##       32446

wages %>%
summarise(
mean_wages = mean(earn)
)
## # A tibble: 1 x 1
##   mean_wages
##        <dbl>
## 1     32446.

### 25.8.3 分类变量

race变量就是数据框wages的一个分类变量，代表四个不同的种族。用分类变量做回归，本质上是各组之间的进行比较。

wages %>% distinct(race)
## # A tibble: 4 x 1
##   race
##   <chr>
## 1 white
## 2 other
## 3 hispanic
## 4 black
wages %>%
ggplot(aes(x = race, y = earn, fill = race)) +
geom_boxplot(position = position_dodge()) +
scale_y_continuous(limits = c(0, 20000))

mod3 <- lm(earn ~ race, data = wages)
mod3
##
## Call:
## lm(formula = earn ~ race, data = wages)
##
## Coefficients:
##  (Intercept)  racehispanic     raceother
##        28372         -2887          3905
##    racewhite
##         4993

tidyverse框架下，喜欢数据框的统计结果，因此，可用broom的tidy()函数将模型输出转换为数据框的形式

broom::tidy(mod3)
## # A tibble: 4 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    28372.     2781.    10.2   1.29e-23
## 2 racehispanic   -2887.     4515.    -0.639 5.23e- 1
## 3 raceother       3905.     6428.     0.608 5.44e- 1
## 4 racewhite       4993.     2929.     1.70  8.85e- 2

• 对 black 组的估计，earn = 28372.09 = 28372.09
• 对 hispanic组的估计，earn = 28372.09 + -2886.79 = 25485.30
• 对 other 组的估计，earn = 28372.09 + 3905.32 = 32277.41
• 对 white 组的估计，earn = 28372.09 + 4993.33 = 33365.42

27 章专题讨论方差分析

### 25.8.4 因子变量

hispanic组的估计最低，适合做基线，因此可以将race转换为因子变量，这样方便调整因子先后顺序

wages_fct <- wages %>%
mutate(race = factor(race, levels = c("hispanic", "white", "black", "other"))) %>%
select(earn, race)

head(wages_fct)
## # A tibble: 6 x 2
##     earn race
##    <dbl> <fct>
## 1 79571. white
## 2 96397. white
## 3 48711. white
## 4 80478. other
## 5 82089. white
## 6 15313. white

wages_fct替换wages，然后建立线性模型

mod4 <- lm(earn ~ race, data = wages_fct)
broom::tidy(mod4)
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   25485.     3557.     7.16  1.26e-12
## 2 racewhite      7880.     3674.     2.14  3.22e- 2
## 3 raceblack      2887.     4515.     0.639 5.23e- 1
## 4 raceother      6792.     6800.     0.999 3.18e- 1

lm(earn ~ sex, data = wages)

### 25.8.5 一个分类变量和一个连续变量

mod5 <- lm(earn ~ height + sex, data = wages)
coef(mod5)
## (Intercept)      height     sexmale
##    -32479.9       879.4     16874.2
• height = 879.424 当sex保持不变时，height变化引起的earn变化
• sexmale = 16874.158 当height保持不变时，sex变化(female变为male)引起的earn变化
p1 <- wages %>%
ggplot(aes(x = height, y = earn, color = sex)) +
geom_point(alpha = 0.1) +
geom_line(aes(y = predict(mod5))) +
scale_y_continuous(limits = c(0, 100000))
p1

### 25.8.6 偷懒的写法

. is shorthand for “everything else.”

lm(earn ~ height + sex + race + ed + age, data = wages)
lm(earn ~ ., data = wages)

lm(earn ~ height + sex + race + ed, data = wages)
lm(earn ~ . - age, data = wages)

R 语言很多时候都出现了.，不同的场景，含义是不一样的。我会在后面第 22 章专门讨论这个问题， 这是一个非常重要的问题

### 25.8.7 交互项

lm(earn ~ height + sex + height:sex, data = wages)
lm(earn ~ height * sex, data = wages)
lm(earn ~ (height + sex)^2, data = wages)
lm(earn ~ height:sex, data = wages)
lm(earn ~ height:sex:race, data = wages)
mod6 <- lm(earn ~ height + sex + height:sex, data = wages)
coef(mod6)
##    (Intercept)         height        sexmale
##       -12167.0          564.5       -30510.4
## height:sexmale
##          701.4
• 对于女性，height增长1个单位，引起earn的增长564.5102
• 对于男性，height增长1个单位，引起earn的增长564.5102 + 701.4065 = 1265.92
p2 <- wages %>%
ggplot(aes(x = height, y = earn, color = sex)) +
geom_point(alpha = 0.1) +
geom_line(aes(y = predict(mod6))) +
scale_y_continuous(limits = c(0, 100000))
p2

library(patchwork)

combined <- p1 + p2 & theme(legend.position = "bottom")
combined + plot_layout(guides = "collect")

### 25.8.8 虚拟变量

earn ~ height + sex + height:sex

wages %>% mutate(sexmale = if_else(sex == "female", 0, 1))
## # A tibble: 1,379 x 7
##      earn height sex    race        ed   age sexmale
##     <dbl>  <dbl> <chr>  <chr>    <dbl> <dbl>   <dbl>
##  1 79571.   73.9 male   white       16    49       1
##  2 96397.   66.2 female white       16    62       0
##  3 48711.   63.8 female white       16    33       0
##  4 80478.   63.2 female other       16    95       0
##  5 82089.   63.1 female white       17    43       0
##  6 15313.   64.5 female white       15    30       0
##  7 47104.   61.5 female white       12    53       0
##  8 50960.   73.3 male   white       17    50       1
##  9  3213.   72.2 male   hispanic    15    25       1
## 10 42997.   72.4 male   white       12    30       1
## # ... with 1,369 more rows

\begin{aligned} \text{female}\qquad earn &= \alpha + \beta_1 \text{height} +\epsilon \\ \text{male}\qquad earn &= \alpha + \beta_1 \text{height} + \beta_2 *1 +\beta_3 \text{(height*1)}+ \epsilon \\ &= \alpha + \beta_1 \text{height} + \beta_2 +\beta_3 \text{height}+ \epsilon \\ & = (\alpha + \beta_2) + (\beta_1 + \beta_3)\text{height} + \epsilon \\ \end{aligned} 我们再对比mod6结果中的系数$$\alpha, \beta_1, \beta_2, \beta_3$$

mod6
##
## Call:
## lm(formula = earn ~ height + sex + height:sex, data = wages)
##
## Coefficients:
##    (Intercept)          height         sexmale
##         -12167             565          -30510
## height:sexmale
##            701

• 对于女性，(截距$$\alpha$$，系数$$\beta_1$$)，height增长1个单位，引起earn的增长564.5102
• 对于男性，(截距$$\alpha + \beta_2$$，系数$$\beta_1 + \beta_3$$)，height增长1个单位，引起earn的增长564.5102 + 701.4065 = 1265.92

wages %>%
group_by(sex) %>%
group_modify(
~ broom::tidy(lm(earn ~ height, data = .))
)
## # A tibble: 4 x 6
## # Groups:   sex [2]
##   sex    term      estimate std.error statistic p.value
##   <chr>  <chr>        <dbl>     <dbl>     <dbl>   <dbl>
## 1 female (Interce~  -12167.    20160.    -0.604  0.546
## 2 female height        565.      312.     1.81   0.0710
## 3 male   (Interce~  -42677.    38653.    -1.10   0.270
## 4 male   height       1266.      551.     2.30   0.0221
wages %>%
ggplot(aes(x = height, y = earn, color = sex)) +
geom_smooth(method = lm, se = F)

wages %>%
ggplot(aes(x = height, y = earn, color = sex)) +
geom_line(aes(y = predict(mod6)))

mod7 <- lm(earn ~ height + height:sex, data = wages)
coef(mod7)
##    (Intercept)         height height:sexmale
##       -24632.7          757.5          251.3

\begin{aligned} \text{earn} &= \alpha + \beta_1 \text{height} + \beta_4 \text{(height*sexmale)}+ \epsilon \\ \end{aligned}

• 对于女性(截距$$\alpha$$，系数$$\beta_1$$)，height增长1个单位，引起earn的增长757.4661
• 对于男性(截距$$\alpha$$，系数$_1 + _4$)，height增长1个单位，引起earn的增长757.4661 + 251.2915 = 1008.758
• 注意到，mod6和mod7是两个不同的模型,
• mod7中男女拟合曲线在y轴的截距是相同的，而mod6在y轴的截距是不同的

### 25.8.9 predict vs fit

• fitted() , 模型一旦建立，可以使用拟合函数fitted()返回拟合值，建模和拟合使用的是同一数据
• predict()， 模型建立后，可以用新的数据进行预测，predict()要求数据框包含新的预测变量，如果没有提供，那么就使用建模时的预测变量进行预测，这种情况下，得出的结果和fitted()就时一回事了。

predict()函数和fitted()函数不同的地方，还在于predict()函数往往带有返回何种类型的选项，可以是具体数值，也可以是分类变量。具体会在第 32 章介绍。

### 25.8.10 回归和相关的关系

• 相关，比如求两个变量的相关系数cor(x, y)
• 回归，也是探寻自变量和因变量的关系，一般用来预测

r <- cor(wages$height, wages$earn)
print(r^2)
## [1] 0.08503

lm(formula = earn ~ height, data = wages) %>%
broom::glance() %>%
pull(r.squared)
## [1] 0.08503