2.2 The simple linear regression model

Here is how one might get the simple OLS coefficients in base R with the lm() function.

(model2 <- lm(data = glbwarm, govact ~ 1 + negemot))
#> 
#> Call:
#> lm(formula = govact ~ 1 + negemot, data = glbwarm)
#> 
#> Coefficients:
#> (Intercept)      negemot  
#>       2.757        0.514

For more detailed output, put the model object model2 into the summary() function.

summary(model2)
#> 
#> Call:
#> lm(formula = govact ~ 1 + negemot, data = glbwarm)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -4.329 -0.673  0.102  0.755  3.214 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.7573     0.0987    27.9   <2e-16 ***
#> negemot       0.5142     0.0255    20.2   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.11 on 813 degrees of freedom
#> Multiple R-squared:  0.334,  Adjusted R-squared:  0.333 
#> F-statistic:  407 on 1 and 813 DF,  p-value: <2e-16

Here’s the Bayesian model in brms.

model2 <- 
  brm(data = glbwarm, family = gaussian,
      govact ~ 1 + negemot,
      chains = 4, cores = 4)
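
Note that Hamiltonian Monte Carlo (HMC) relies on random draws, so your results will differ slightly from ours in the trailing decimals. If you’d like reproducible draws, brm() includes a seed argument.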

There are several ways to get a brms model summary. A go-to is with the print() function.

print(model2)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: govact ~ 1 + negemot 
#>    Data: glbwarm (Number of observations: 815) 
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 4000
#> 
#> Population-Level Effects: 
#>           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> Intercept     2.76      0.10     2.56     2.96       4000 1.00
#> negemot       0.51      0.03     0.46     0.57       3803 1.00
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> sigma     1.11      0.03     1.06     1.17       4000 1.00
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
#> is a crude measure of effective sample size, and Rhat is the potential 
#> scale reduction factor on split chains (at convergence, Rhat = 1).

The summary() function works very much the same way. To get a more focused look, you can use the posterior_summary() function:

posterior_summary(model2)
#>              Estimate Est.Error      Q2.5     Q97.5
#> b_Intercept     2.758    0.1021     2.556     2.956
#> b_negemot       0.514    0.0261     0.462     0.565
#> sigma           1.112    0.0275     1.060     1.167
#> lp__        -1248.621    1.2617 -1251.883 -1247.216

That also yields the log posterior, lp__, which you can learn more about in the Stan documentation. We won’t focus on the lp__ directly in this project.

But anyways, the Q2.5 and Q97.5 are the lower and upper limits of the 95% credible intervals; the Q prefix stands for quantile. In this case, these are a renamed version of the l-95% CI and u-95% CI columns from our print() output.
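
As an aside, both functions let you adjust those intervals. Here’s a quick sketch; the 89% width and the quantile values below are arbitrary, chosen just for illustration.

# summary() takes a prob argument to set the width of the intervals
summary(model2, prob = .89)

# posterior_summary() takes a probs argument for the quantiles themselves
posterior_summary(model2, probs = c(.055, .945))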

To make a quick plot of the regression line, one can use the convenient marginal_effects() function.

marginal_effects(model2)

If you want to customize that output, you might nest it in plot().

plot(marginal_effects(model2),
     points = T,
     point_args = c(height = .05, width = .05, 
                    alpha = 1/2, size = 1/3))
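
If you’d like to go further with ggplot2 code, you can save the underlying ggplot object rather than printing it. Here’s a minimal sketch, assuming the plot argument behaves as in the brms version we’re using.

# save the ggplot object instead of printing it
p <- plot(marginal_effects(model2),
          points = T,
          plot = F)[[1]]

# now adjust it like any other ggplot
p + theme_bw()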

It’s also useful to be able to work with the output of a brms model directly. For our first step, we’ll put our HMC draws into a data frame.

post <- posterior_samples(model2)

head(post)
#>   b_Intercept b_negemot sigma  lp__
#> 1        2.73     0.525  1.05 -1250
#> 2        2.86     0.505  1.11 -1249
#> 3        2.76     0.520  1.11 -1247
#> 4        2.98     0.446  1.12 -1251
#> 5        2.91     0.462  1.08 -1250
#> 6        2.77     0.512  1.12 -1247
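
One payoff of having the draws in a data frame is that we can reproduce the posterior_summary() output by hand. Here’s the idea for b_negemot.

post %>% 
  summarise(Estimate  = mean(b_negemot),
            Est.Error = sd(b_negemot),
            Q2.5      = quantile(b_negemot, probs = .025),
            Q97.5     = quantile(b_negemot, probs = .975))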

Next, we’ll use the fitted() function to return model-implied summaries of the expected govact values, given particular predictor values. Our first model only has negemot as a predictor, and we’ll ask for the expected govact values for negemot ranging from 0 to 7.

nd <- tibble(negemot = seq(from = 0, to = 7, length.out = 30))

f_model2 <-
  fitted(model2,
         newdata = nd) %>% 
  as_tibble() %>% 
  bind_cols(nd)

f_model2
#> # A tibble: 30 x 5
#>   Estimate Est.Error  Q2.5 Q97.5 negemot
#>      <dbl>     <dbl> <dbl> <dbl>   <dbl>
#> 1     2.76    0.102   2.56  2.96   0    
#> 2     2.88    0.0963  2.69  3.07   0.241
#> 3     3.01    0.0906  2.83  3.18   0.483
#> 4     3.13    0.0849  2.96  3.29   0.724
#> 5     3.25    0.0794  3.10  3.41   0.966
#> 6     3.38    0.0739  3.23  3.52   1.21 
#> # ... with 24 more rows

The first two columns should look familiar from the print(model2) output, above. The next two columns, Q2.5 and Q97.5, are the lower and upper limits of the 95% credible intervals, like we got from posterior_summary(). We got the final column with the bind_cols(nd) code.
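
Keep in mind that fitted() summarizes the posterior for the mean of govact. If you instead want intervals for new observations, the predict() function follows the same syntax but also incorporates sigma, which results in wider intervals. A quick sketch:

# the posterior-predictive counterpart to f_model2
p_model2 <-
  predict(model2,
          newdata = nd) %>% 
  as_tibble() %>% 
  bind_cols(nd)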

Here’s our bespoke version of Figure 2.4.

glbwarm %>% 
  group_by(negemot, govact) %>% 
  count() %>% 
  
  ggplot(aes(x = negemot)) +
  geom_point(aes(y = govact, size = n)) +
  geom_ribbon(data = f_model2,
              aes(ymin = Q2.5,
                  ymax = Q97.5),
              fill = "grey75", alpha = 3/4) +
  geom_line(data = f_model2,
            aes(y = Estimate)) +
  annotate("text", x = 2.2, y = 7.5, label = "Cases with positive residuals", color = "red3") +
  annotate("text", x = 4.75, y = .8, label = "Cases with negative residuals", color = "blue3") +
  labs(x = expression(paste("NEGEMOT: Negative emotions about climate change (", italic("X"), ")")),
       y = expression(paste("GOVACT: Support for governmentaction (", italic("Y"), ")"))) +
  coord_cartesian(xlim = range(glbwarm$negemot)) +
  theme_bw() +
  theme(legend.position = "none")
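
Note our use of coord_cartesian() near the end. Unlike setting limits within scale_x_continuous(), coord_cartesian() zooms the plot without dropping the rows in f_model2 that lie outside the range of the observed negemot values, so the fitted line and ribbon run smoothly to the edges of the panel.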

2.2.1 Interpretation of the constant and regression coefficient

Nothing to recode here.

2.2.2 The standardized regression model

Same here.

2.2.3 Simple linear regression with a dichotomous antecedent variable

Here we predict govact with the dichotomous variable sex.

model3 <- 
  brm(data = glbwarm, family = gaussian,
      govact ~ 1 + sex,
      chains = 4, cores = 4)
print(model3)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: govact ~ 1 + sex 
#>    Data: glbwarm (Number of observations: 815) 
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 4000
#> 
#> Population-Level Effects: 
#>           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> Intercept     4.72      0.07     4.58     4.85       3872 1.00
#> sex          -0.27      0.10    -0.45    -0.07       3960 1.00
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> sigma     1.36      0.03     1.29     1.42       4000 1.00
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
#> is a crude measure of effective sample size, and Rhat is the potential 
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Our model output is very close to that in the text. If you just wanted the coefficients, you might use the fixef() function.

fixef(model3) %>% round(digits = 3)
#>           Estimate Est.Error  Q2.5  Q97.5
#> Intercept    4.716     0.068  4.58  4.850
#> sex         -0.265     0.097 -0.45 -0.073

Though not necessary, we used the round() function to reduce the number of decimal places in the output.

You can get a little more information with the posterior_summary() function.
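
# relative to fixef(), this also returns rows for sigma and the lp__
posterior_summary(model3)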

But since Bayesian estimation yields an entire posterior distribution, you can visualize that distribution in any number of ways. Because we’ll be using ggplot2, we’ll need to put the posterior draws into a data frame before plotting.

post <- posterior_samples(model3)

We could summarize the posterior with boxplots:

post %>% 
  rename(female = b_Intercept) %>% 
  mutate(male = female + b_sex) %>% 
  select(male, female) %>% 
  gather() %>% 
  
  ggplot(aes(x = key, y = value)) +
  geom_boxplot(aes(fill = key)) +
  theme_bw() +
  theme(legend.position = "none")

Or with overlapping density plots:

post %>% 
  rename(female = b_Intercept) %>% 
  mutate(male = female + b_sex) %>% 
  select(male, female) %>% 
  gather() %>% 
  
  ggplot(aes(x = value, group = key, fill = key)) +
  geom_density(color = "transparent", alpha = 3/4) +
  theme_bw()

Or even with violin plots, superimposing the posterior medians and 95% intervals:

post %>% 
  rename(female = b_Intercept) %>% 
  mutate(male = female + b_sex) %>% 
  select(male, female) %>% 
  gather() %>% 
  
  ggplot(aes(x = key, y = value)) +
  geom_violin(aes(fill = key), color = "transparent", alpha = 3/4) +
  stat_summary(fun.y = median,
               fun.ymin = function(i){quantile(i, probs = .025)},
               fun.ymax = function(i){quantile(i, probs = .975)}) +
  theme_bw() +
  theme(legend.position = "none")
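
For even more ideas, see Matthew Kay’s tidybayes package. As a small taste, here’s a sketch of the same comparison with the tidybayes stat_halfeye() function, which assumes you have a fairly recent version of the package installed.

library(tidybayes)

post %>% 
  rename(female = b_Intercept) %>% 
  mutate(male = female + b_sex) %>% 
  select(male, female) %>% 
  gather() %>% 
  
  ggplot(aes(x = value, y = key)) +
  stat_halfeye(.width = .95) +
  theme_bw()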

You can also get a sense of the model estimates for women and men with a little addition. Here we continue to use the round() function to simplify the output.

# for women
round(fixef(model3)[1, ], digits = 2)
#>  Estimate Est.Error      Q2.5     Q97.5 
#>      4.72      0.07      4.58      4.85

# for men
round(fixef(model3)[1, ] + fixef(model3)[2, ], digits = 2)
#>  Estimate Est.Error      Q2.5     Q97.5 
#>      4.45      0.17      4.13      4.78
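
One caution: summing the fixef() rows like this is only really coherent for the Estimate column. Because the posteriors for the intercept and b_sex are correlated, the summed Est.Error and interval limits do not describe the posterior for the male mean. For those, work with the draws directly, like we did for the plots above.

# a proper posterior summary for the male mean, computed from the draws
post %>% 
  transmute(male = b_Intercept + b_sex) %>% 
  summarise(Estimate  = mean(male),
            Est.Error = sd(male),
            Q2.5      = quantile(male, probs = .025),
            Q97.5     = quantile(male, probs = .975))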

Here’s the partially-standardized model, first with lm().

glbwarm <-
  glbwarm %>% 
  mutate(govact_z = (govact - mean(govact))/sd(govact))

lm(data = glbwarm, govact_z ~ 1 + sex)
#> 
#> Call:
#> lm(formula = govact_z ~ 1 + sex, data = glbwarm)
#> 
#> Coefficients:
#> (Intercept)          sex  
#>      0.0963      -0.1972

Now we’ll use Bayes.

model3_p_z <- 
  brm(data = glbwarm, family = gaussian,
      govact_z ~ 1 + sex,
      chains = 4, cores = 4)
fixef(model3_p_z)
#>           Estimate Est.Error     Q2.5   Q97.5
#> Intercept   0.0979    0.0477  0.00333  0.1916
#> sex        -0.1996    0.0688 -0.33517 -0.0664
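
Once again, the Bayesian estimates are nearly identical to their OLS counterparts, within simulation variance.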