2.2 The simple linear regression model
Here is how one might get the simple OLS coefficients in base R with the `lm()` function.
(model2 <- lm(data = glbwarm, govact ~ 1 + negemot))
#>
#> Call:
#> lm(formula = govact ~ 1 + negemot, data = glbwarm)
#>
#> Coefficients:
#> (Intercept) negemot
#> 2.757 0.514
For more detailed output, put the model object `model2` into the `summary()` function.
summary(model2)
#>
#> Call:
#> lm(formula = govact ~ 1 + negemot, data = glbwarm)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.329 -0.673 0.102 0.755 3.214
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.7573 0.0987 27.9 <2e-16 ***
#> negemot 0.5142 0.0255 20.2 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.11 on 813 degrees of freedom
#> Multiple R-squared: 0.334, Adjusted R-squared: 0.333
#> F-statistic: 407 on 1 and 813 DF, p-value: <2e-16
Here’s the Bayesian model in brms.
model2 <-
brm(data = glbwarm, family = gaussian,
govact ~ 1 + negemot,
chains = 4, cores = 4)
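Since we didn't specify any priors in that `brm()` call, brms fell back on its defaults (e.g., improper flat priors on the regression coefficients). If you wanted to use weakly-regularizing priors of your own, you could add a `prior` argument. Here's a minimal sketch; the particular prior values are illustrative assumptions, not recommendations from the text.

model2_weak_priors <-
  brm(data = glbwarm, family = gaussian,
      govact ~ 1 + negemot,
      # these prior settings are for illustration only
      prior = c(prior(normal(0, 10), class = Intercept),
                prior(normal(0, 10), class = b),
                prior(student_t(3, 0, 10), class = sigma)),
      chains = 4, cores = 4)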
There are several ways to get a brms model summary. A go-to is the `print()` function.
print(model2)
#> Family: gaussian
#> Links: mu = identity; sigma = identity
#> Formula: govact ~ 1 + negemot
#> Data: glbwarm (Number of observations: 815)
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#> total post-warmup samples = 4000
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> Intercept 2.76 0.10 2.56 2.96 4000 1.00
#> negemot 0.51 0.03 0.46 0.57 3803 1.00
#>
#> Family Specific Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> sigma 1.11 0.03 1.06 1.17 4000 1.00
#>
#> Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
#> is a crude measure of effective sample size, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
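Before interpreting the parameters, it's worth checking how the chains behaved. A simple first step is the `plot()` function, which returns density and trace plots for the model parameters.

plot(model2)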
The `summary()` function works very much the same way as `print()`. For a more focused look, you can use the `posterior_summary()` function:
posterior_summary(model2)
#> Estimate Est.Error Q2.5 Q97.5
#> b_Intercept 2.758 0.1021 2.556 2.956
#> b_negemot 0.514 0.0261 0.462 0.565
#> sigma 1.112 0.0275 1.060 1.167
#> lp__ -1248.621 1.2617 -1251.883 -1247.216
That also yields the log posterior, `lp__`, which we won't focus on directly in this project.

The `Q2.5` and `Q97.5` columns are the lower and upper bounds of the 95% credible intervals; the `Q` prefix stands for quantile. In this case, they're a renamed version of the `l-95% CI` and `u-95% CI` columns from our `print()` output.
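If you'd like to see where those quantiles come from, you can compute them directly from the posterior draws (more on the `posterior_samples()` function in just a bit). Here's a quick check for the `negemot` slope.

posterior_samples(model2) %>%
  summarise(Q2.5  = quantile(b_negemot, probs = .025),
            Q97.5 = quantile(b_negemot, probs = .975))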
To make a quick plot of the regression line, one can use the convenient `marginal_effects()` function.
marginal_effects(model2)
If you want to customize that output, you might nest it in `plot()`.
plot(marginal_effects(model2),
points = T,
point_args = c(height = .05, width = .05,
alpha = 1/2, size = 1/3))
It’s also useful to be able to work with the output of a brms model directly. For our first step, we’ll put our HMC draws into a data frame.
post <- posterior_samples(model2)
head(post)
#> b_Intercept b_negemot sigma lp__
#> 1 2.73 0.525 1.05 -1250
#> 2 2.86 0.505 1.11 -1249
#> 3 2.76 0.520 1.11 -1247
#> 4 2.98 0.446 1.12 -1251
#> 5 2.91 0.462 1.08 -1250
#> 6 2.77 0.512 1.12 -1247
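With the draws in a data frame, any posterior summary is just a little data wrangling away. For example, here's one way to compute the posterior mean of the `negemot` slope along with the proportion of draws greater than 0.5. That 0.5 threshold is arbitrary, chosen only to demonstrate the workflow.

post %>%
  summarise(mean_slope  = mean(b_negemot),
            p_above_0.5 = mean(b_negemot > 0.5))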
Next, we'll use the `fitted()` function to compute model-implied summaries of the expected `govact` values, given particular predictor values. Our `model2` only has `negemot` as a predictor, and we'll ask for the expected `govact` values for `negemot` ranging from 0 to 7.
nd <- tibble(negemot = seq(from = 0, to = 7, length.out = 30))
f_model2 <-
fitted(model2,
newdata = nd) %>%
as_tibble() %>%
bind_cols(nd)
f_model2
#> # A tibble: 30 x 5
#> Estimate Est.Error Q2.5 Q97.5 negemot
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2.76 0.102 2.56 2.96 0
#> 2 2.88 0.0963 2.69 3.07 0.241
#> 3 3.01 0.0906 2.83 3.18 0.483
#> 4 3.13 0.0849 2.96 3.29 0.724
#> 5 3.25 0.0794 3.10 3.41 0.966
#> 6 3.38 0.0739 3.23 3.52 1.21
#> # ... with 24 more rows
The first two columns should look familiar from the `print(model2)` output, above. The next two columns, `Q2.5` and `Q97.5`, are the lower and upper bounds of the 95% credible intervals, like we got from `posterior_summary()`. We got the final column with the `bind_cols(nd)` code.
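Note that `fitted()` summarizes the conditional mean of `govact`. If you wanted full posterior predictions for new observations, which also incorporate the residual standard deviation `sigma`, you'd swap in the `predict()` function. The workflow is otherwise the same; here's a sketch.

p_model2 <-
  predict(model2,
          newdata = nd) %>%
  as_tibble() %>%
  bind_cols(nd)

The resulting 95% intervals will be noticeably wider than those from `fitted()`, since they describe where new cases are expected to fall rather than where the mean lies.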
Here’s our bespoke version of Figure 2.4.
glbwarm %>%
group_by(negemot, govact) %>%
count() %>%
ggplot(aes(x = negemot)) +
geom_point(aes(y = govact, size = n)) +
geom_ribbon(data = f_model2,
aes(ymin = Q2.5,
ymax = Q97.5),
fill = "grey75", alpha = 3/4) +
geom_line(data = f_model2,
aes(y = Estimate)) +
annotate("text", x = 2.2, y = 7.5, label = "Cases with positive residuals", color = "red3") +
annotate("text", x = 4.75, y = .8, label = "Cases with negative residuals", color = "blue3") +
labs(x = expression(paste("NEGEMOT: Negative emotions about climate change (", italic("X"), ")")),
y = expression(paste("GOVACT: Support for governmentaction (", italic("Y"), ")"))) +
coord_cartesian(xlim = range(glbwarm$negemot)) +
theme_bw() +
theme(legend.position = "none")
2.2.1 Interpretation of the constant and regression coefficient
Nothing to recode here.
2.2.2 The standardized regression model
Same here.
2.2.3 Simple linear regression with a dichotomous antecedent variable
Here we fit a new model with the dichotomous variable `sex` as the sole predictor of `govact`.
model3 <-
brm(data = glbwarm, family = gaussian,
govact ~ 1 + sex,
chains = 4, cores = 4)
print(model3)
#> Family: gaussian
#> Links: mu = identity; sigma = identity
#> Formula: govact ~ 1 + sex
#> Data: glbwarm (Number of observations: 815)
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#> total post-warmup samples = 4000
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> Intercept 4.72 0.07 4.58 4.85 3872 1.00
#> sex -0.27 0.10 -0.45 -0.07 3960 1.00
#>
#> Family Specific Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> sigma 1.36 0.03 1.29 1.42 4000 1.00
#>
#> Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
#> is a crude measure of effective sample size, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
Our model output is very close to that in the text. If you just wanted the coefficients, you might use the `fixef()` function.
fixef(model3) %>% round(digits = 3)
#> Estimate Est.Error Q2.5 Q97.5
#> Intercept 4.716 0.068 4.58 4.850
#> sex -0.265 0.097 -0.45 -0.073
Though not necessary, we used the `round()` function to limit the number of decimal places in the output. You can get a little more information with the `posterior_summary()` function.
But since Bayesian estimation yields an entire posterior distribution, you can visualize that distribution in any number of ways. Because we’ll be using ggplot2, we’ll need to put the posterior draws into a data frame before plotting.
post <- posterior_samples(model3)
We could summarize the posterior with boxplots:
post %>%
rename(female = b_Intercept) %>%
mutate(male = female + b_sex) %>%
select(male, female) %>%
gather() %>%
ggplot(aes(x = key, y = value)) +
geom_boxplot(aes(fill = key)) +
theme_bw() +
theme(legend.position = "none")
Or with overlapping density plots:
post %>%
rename(female = b_Intercept) %>%
mutate(male = female + b_sex) %>%
select(male, female) %>%
gather() %>%
ggplot(aes(x = value, group = key, fill = key)) +
geom_density(color = "transparent", alpha = 3/4) +
theme_bw()
Or even with violin plots, superimposing the posterior medians and 95% intervals:
post %>%
rename(female = b_Intercept) %>%
mutate(male = female + b_sex) %>%
select(male, female) %>%
gather() %>%
ggplot(aes(x = key, y = value)) +
geom_violin(aes(fill = key), color = "transparent", alpha = 3/4) +
stat_summary(fun.y = median,
fun.ymin = function(i){quantile(i, probs = .025)},
fun.ymax = function(i){quantile(i, probs = .975)}) +
theme_bw() +
theme(legend.position = "none")
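For even more ideas, check out Matthew Kay's tidybayes package. For example, here's a half-eye version of the same plot. This is a sketch; it assumes a tidybayes version that exports the `stat_halfeye()` geom from ggdist.

library(tidybayes)

post %>%
  rename(female = b_Intercept) %>%
  mutate(male = female + b_sex) %>%
  select(male, female) %>%
  gather() %>%
  ggplot(aes(x = value, y = key)) +
  # half-eye: a density with the median and interval marked beneath
  stat_halfeye(.width = .95) +
  theme_bw()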
You can also get a sense of the model estimates for women and men with a little addition. Here we continue to use the `round()` function to simplify the output.
# for women
round(fixef(model3)[1, ], digits = 2)
#> Estimate Est.Error Q2.5 Q97.5
#> 4.72 0.07 4.58 4.85
# for men
round(fixef(model3)[1, ] + fixef(model3)[2, ], digits = 2)
#> Estimate Est.Error Q2.5 Q97.5
#> 4.45 0.17 4.13 4.78
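Beware, though: this kind of row-wise addition is a quick-and-dirty shortcut. Posterior means add just fine, but the `Est.Error` and quantile columns don't, strictly speaking, sum this way. For exact summaries for men, add the draws first and summarize afterward, as in this sketch with the `post` data frame from above.

post %>%
  # compute the expected value for men, draw by draw
  transmute(male = b_Intercept + b_sex) %>%
  summarise(Estimate  = mean(male),
            Est.Error = sd(male),
            Q2.5      = quantile(male, probs = .025),
            Q97.5     = quantile(male, probs = .975))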
Here's the partially-standardized model, first with `lm()`.
glbwarm <-
glbwarm %>%
mutate(govact_z = (govact - mean(govact))/sd(govact))
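As a quick sanity check, the new `govact_z` values should have a mean of 0 and a standard deviation of 1.

glbwarm %>%
  summarise(mean = mean(govact_z),
            sd   = sd(govact_z))

With that confirmed, on to the `lm()` fit.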
lm(data = glbwarm, govact_z ~ 1 + sex)
#>
#> Call:
#> lm(formula = govact_z ~ 1 + sex, data = glbwarm)
#>
#> Coefficients:
#> (Intercept) sex
#> 0.0963 -0.1972
Now we’ll use Bayes.
model3_p_z <-
brm(data = glbwarm, family = gaussian,
govact_z ~ 1 + sex,
chains = 4, cores = 4)
fixef(model3_p_z)
#> Estimate Est.Error Q2.5 Q97.5
#> Intercept 0.0979 0.0477 0.00333 0.1916
#> sex -0.1996 0.0688 -0.33517 -0.0664
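Within simulation variance, those Bayesian estimates are in good agreement with the OLS coefficients from the `lm()` fit above.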