2.1 Correlation and prediction

Here we load a couple necessary packages, load the data, and take a peek.

library(tidyverse)

glbwarm <- read_csv("data/glbwarm/glbwarm.csv")

glimpse(glbwarm)
#> Observations: 815
#> Variables: 7
#> $ govact   <dbl> 3.6, 5.0, 6.6, 1.0, 4.0, 7.0, 6.8, 5.6, 6.0, 2.6, 1.4, 5.6, 7.0, 3.8, 3.4, 4.2, 1.0, 2.6...
#> $ posemot  <dbl> 3.67, 2.00, 2.33, 5.00, 2.33, 1.00, 2.33, 4.00, 5.00, 5.00, 1.00, 4.00, 1.00, 5.67, 3.00...
#> $ negemot  <dbl> 4.67, 2.33, 3.67, 5.00, 1.67, 6.00, 4.00, 5.33, 6.00, 2.00, 1.00, 4.00, 5.00, 4.67, 2.00...
#> $ ideology <int> 6, 2, 1, 1, 4, 3, 4, 5, 4, 7, 6, 4, 2, 4, 5, 2, 6, 4, 2, 4, 4, 2, 6, 4, 4, 3, 4, 5, 4, 5...
#> $ age      <int> 61, 55, 85, 59, 22, 34, 47, 65, 50, 60, 71, 60, 71, 59, 32, 36, 69, 70, 41, 48, 38, 63, ...
#> $ sex      <int> 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1...
#> $ partyid  <int> 2, 1, 1, 1, 1, 2, 1, 1, 2, 3, 2, 1, 1, 1, 1, 1, 2, 3, 1, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 1...

If you are new to tidyverse-style syntax, possibly the oddest component is the pipe (i.e., %>%). I’m not going to explain the %>% in this project, but you might learn more about in this brief clip, starting around minute 21:25 in this talk by Wickham, or in section 5.6.1 from Grolemund and Wickham’s R for Data Science. Really, all of chapter 5 of R4DS is great for new R and new tidyverse users, and their chapter 3 is a nice introduction to plotting with ggplot2.

Here is our version of Figure 2.1.

glbwarm %>% 
  group_by(negemot, govact) %>% 
  count() %>% 
  
  ggplot(aes(x = negemot, y = govact)) +
  geom_point(aes(size = n)) +
  labs(x = expression(paste("NEGEMOT: Negative emotions about climate change (", italic("X"), ")")),
       y = expression(paste("GOVACT: Support for governmentaction (", italic("Y"), ")"))) +
  theme_bw() +
  theme(legend.position = "none")

There are other ways to handle the overplotting issue, such as jittering.

glbwarm %>% 
  ggplot(aes(x = negemot, y = govact)) +
  geom_jitter(height = .05, width = .05, 
              alpha = 1/2, size = 1/3) +
  labs(x = expression(paste("NEGEMOT: Negative emotions about climate change (", italic("X"), ")")),
       y = expression(paste("GOVACT: Support for governmentaction (", italic("Y"), ")"))) +
  theme_bw()

The cor() function is a simple way to compute a Pearson’s correlation coefficient.

cor(glbwarm$negemot, glbwarm$govact)
#> [1] 0.578

If you want more plentiful output, the cor.test() function returns a \(t\)-value, the degrees of freedom, the corresponding \(p\)-value and the 95% confidence intervals, in addition to the Pearson’s correlation coefficient.

cor.test(glbwarm$negemot, glbwarm$govact)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  glbwarm$negemot and glbwarm$govact
#> t = 20, df = 800, p-value <2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.530 0.622
#> sample estimates:
#>   cor 
#> 0.578

To get the Bayesian version, we’ll open our focal statistical package, Bürkner’s brms. [And I’ll briefly note that you could also do many of these analyses with other packages, such as blavaan. I just prefer brms.]

library(brms)

We’ll start simple and just use the default priors and settings, but with the addition of parallel sampling via cores = 4.

model1 <- 
  brm(data = glbwarm, family = gaussian,
      cbind(negemot, govact) ~ 1,
      chains = 4, cores = 4)
print(model1)
#>  Family: MV(gaussian, gaussian) 
#>   Links: mu = identity; sigma = identity
#>          mu = identity; sigma = identity 
#> Formula: negemot ~ 1 
#>          govact ~ 1 
#>    Data: glbwarm (Number of observations: 815) 
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 4000
#> 
#> Population-Level Effects: 
#>                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> negemot_Intercept     3.56      0.05     3.45     3.67       3633 1.00
#> govact_Intercept      4.59      0.05     4.50     4.68       3647 1.00
#> 
#> Family Specific Parameters: 
#>                        Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> sigma_negemot              1.53      0.04     1.46     1.61       3603 1.00
#> sigma_govact               1.36      0.03     1.30     1.43       3558 1.00
#> rescor(negemot,govact)     0.58      0.02     0.53     0.62       3800 1.00
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
#> is a crude measure of effective sample size, and Rhat is the potential 
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Within the brms framework, \(\sigma\) of the Gaussian likelihood is considered a family-specific parameter (e.g., there is no \(\sigma\) for the Poisson distribution). When you have an intercept-only regression model with multiple variables, the covariance among their \(\sigma\) parameters are correlations. Since our multivariate model is of two variables, we have one covariance for the \(\sigmas\)s, rescor(negemot,govact), which is our Bayesian analogue to the Pearson’s correlation.

To learn more about the multivariate syntax in brms, click here or code vignette("brms_multivariate").

But to clarify the output:

  • ‘Estimate’ = the posterior mean, analogous to the frequentist point estimate
  • ‘Est.Error’ = the posterior \(SD\), analogous to the frequentist standard error
  • ‘l-95% CI’ = the lower-level of the percentile-based 95% Bayesian credible interval
  • ‘u-95% CI’ = the upper-level of the same