## 9.3 A caution on manual centering and standardization

It’s worthwhile considering the issue of listwise deletion when data are partially missing. By default, brms deletes rows with missingness (“NA” in R) in the predictors. However, brms allows users to perform one-step Bayesian imputation for missing values using the mi() syntax. First we’ll see what happens when you fit a model in brms when some of the negemot_z values are missing, but without using the mi() syntax. And of course before we do that, we’ll make a negemot_z_missing variable, which is identical to negemot_z except that about 10% of its values are missing.

```r
set.seed(815)
glbwarm <-
  glbwarm %>% 
  mutate(missing = rbinom(n = 815, size = 1, prob = .1)) %>% 
  mutate(negemot_z_missing = ifelse(missing == 1, NA, negemot_z))
```
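If the rbinom() line seems mysterious, here’s a minimal self-contained sketch of the same idea (a toy vector, not the glbwarm data):

```r
set.seed(815)

# flip 815 coins, each coming up 1 with probability .1
flips <- rbinom(n = 815, size = 1, prob = .1)

# the 1's mark the rows we'll knock out; they should make up roughly 10%
table(flips)
mean(flips)
```

In the mutate() code above, those 1’s are exactly the rows whose negemot_z values get replaced with NA.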

If you’ve never used rbinom() before, execute ?rbinom or look it up in your favorite web search engine. Here’s our listwise deletion model, which corresponds to what you’d get from a typical OLS-based program.

```r
model5 <-
  update(model3, newdata = glbwarm,
         govact_z ~ 1 + negemot_z_missing + age_z + negemot_z_missing:age_z,
         chains = 4, cores = 4)
```

Let’s compare the listwise deletion results with the model based on all the data.

```r
print(model3)
##  Family: gaussian
##   Links: mu = identity; sigma = identity
## Formula: govact_z ~ negemot_z + age_z + negemot_z:age_z
##    Data: glbwarm (Number of observations: 815)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept           0.01      0.03    -0.05     0.06       4000 1.00
## negemot_z           0.56      0.03     0.50     0.62       4000 1.00
## age_z              -0.06      0.03    -0.12    -0.01       4000 1.00
## negemot_z:age_z     0.13      0.03     0.07     0.19       4000 1.00
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     0.81      0.02     0.77     0.85       4000 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
```r
print(model5)
##  Family: gaussian
##   Links: mu = identity; sigma = identity
## Formula: govact_z ~ negemot_z_missing + age_z + negemot_z_missing:age_z
##    Data: glbwarm (Number of observations: 719)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                         Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept                   0.00      0.03    -0.06     0.06       4000 1.00
## negemot_z_missing           0.56      0.03     0.50     0.62       4000 1.00
## age_z                      -0.05      0.03    -0.11     0.00       4000 1.00
## negemot_z_missing:age_z     0.12      0.03     0.06     0.18       4000 1.00
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     0.81      0.02     0.77     0.86       4000 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

In this case, the model results were similar to those based on all the data because we used rbinom() to delete the predictor values completely at random. With real data and real-life missing data mechanisms, the situation isn’t often so rosy. But anyway, the real story, here, is the Data: glbwarm (Number of observations: n) line at the top of the print() outputs. The number, $n$, was 815 in the model using all the data and 719 for the one based on listwise deletion. That’s a lot of missing information.
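If you’d rather count the dropped rows directly than subtract by hand, a quick one-liner will do it (assuming the glbwarm data from above):

```r
# number of NA values in the predictor; this should match 815 - 719
glbwarm %>% 
  summarise(n_missing = sum(is.na(negemot_z_missing)))
```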

The mi() syntax will allow us to use all the rows in a model, even if one or more of the predictors contain missing values. The syntax makes the model a multivariate model in that now we’ll be modeling both govact_z and negemot_z_missing. There are multiple ways to write a multivariate model in brms. One nice way is to write the model for each criterion separately in a bf() statement. You combine the bf() statements together with the + operator. And for models like the ones in Hayes’s text, you’ll also want to tack on set_rescor(FALSE). You can do this within the brm() function, as usual. But I find that this clutters the code up more than I like. So another approach is to save the combination of bf() statements as an object.

```r
my_model <-
  bf(govact_z ~ 1 + mi(negemot_z_missing) + age_z + mi(negemot_z_missing):age_z) +
  bf(negemot_z_missing | mi() ~ 1) +
  set_rescor(FALSE)
```

With our multivariate formula saved as my_model, we’re ready to plug it into brm() and fit.

```r
model6 <-
  brm(data = glbwarm,
      family = gaussian,
      my_model,
      chains = 4, cores = 4)
```

Let’s see what we’ve done.

```r
print(model6)
##  Family: MV(gaussian, gaussian) 
##   Links: mu = identity; sigma = identity
##          mu = identity; sigma = identity 
## Formula: govact_z ~ 1 + mi(negemot_z_missing) + age_z + mi(negemot_z_missing):age_z 
##          negemot_z_missing | mi() ~ 1 
##    Data: glbwarm (Number of observations: 815) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## govactz_Intercept                     0.01      0.03    -0.05     0.06       4000 1.00
## negemotzmissing_Intercept             0.00      0.04    -0.07     0.08       4000 1.00
## govactz_age_z                        -0.07      0.03    -0.12    -0.01       4000 1.00
## govactz_minegemot_z_missing           0.56      0.03     0.50     0.62       4000 1.00
## govactz_minegemot_z_missing:age_z     0.13      0.03     0.07     0.19       4000 1.00
## 
## Family Specific Parameters: 
##                       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma_govactz             0.81      0.02     0.77     0.85       4000 1.00
## sigma_negemotzmissing     1.00      0.03     0.95     1.05       4000 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

When using the multivariate mi() syntax, your print() output becomes more complicated. Now we have a regression model for both govact_z and negemot_z_missing. At a minimum, each has its own intercept and residual variance (i.e., sigma). In the ‘Population-Level Effects’ section, the first part of the names for each regression coefficient clarifies which $Y$-variable it corresponds to (e.g., govactz_Intercept is the intercept for our primary $Y$-variable, govact_z). In the ‘Family Specific Parameters’ section, the sigmas are similarly labeled.

Perhaps most importantly, we see “Data: glbwarm (Number of observations: 815)” at the top of the output. The multivariate mi() syntax used all the available data. No listwise deletion necessary.

The print() output for our model obscured some of the results. To clarify what the mi() syntax did, let’s peek at the first columns returned by posterior_samples().

```r
post <- posterior_samples(model6)

post[, 1:20] %>% 
  glimpse()
## Observations: 4,000
## Variables: 20
## $ b_govactz_Intercept                   <dbl> -0.009990340, 0.021540051, 0.002501819, -0.034147985...
## $ b_negemotzmissing_Intercept           <dbl> 0.0169003031, -0.0314997166, 0.0268748216, 0.0229868...
## $ b_govactz_age_z                       <dbl> -0.049833149, -0.059868350, -0.075648875, -0.0778382...
## $ bsp_govactz_minegemot_z_missing       <dbl> 0.5816318, 0.5392590, 0.5393454, 0.4762133, 0.540401...
## $ bsp_govactz_minegemot_z_missing:age_z <dbl> 0.14544354, 0.10537405, 0.13267566, 0.13912678, 0.14...
## $ sigma_govactz                         <dbl> 0.8140491, 0.7789521, 0.8355927, 0.8168074, 0.789764...
## $ sigma_negemotzmissing                 <dbl> 1.0399030, 0.9872644, 1.0050982, 0.9943740, 0.967165...
## $ Ymi_negemotzmissing[7]                <dbl> 2.03642753, 0.01721568, 0.67960492, 0.64964174, 0.67...
## $ Ymi_negemotzmissing[22]               <dbl> 1.03507669, -0.32474974, 0.47110954, -0.38361393, 0....
## $ Ymi_negemotzmissing[31]               <dbl> -1.0684763, 1.1596497, -1.1248319, 0.8671490, -0.405...
## $ Ymi_negemotzmissing[55]               <dbl> -0.23866444, 1.11734470, -0.58935567, 0.27172464, -0...
## $ Ymi_negemotzmissing[60]               <dbl> -0.23798926, 1.76955914, -0.24605036, 1.70170122, 2....
## $ Ymi_negemotzmissing[66]               <dbl> -0.570751760, 0.910622983, -1.110464685, -0.59118488...
## $ Ymi_negemotzmissing[72]               <dbl> 0.07032315, 0.29686228, 1.90339923, 1.58553141, 0.68...
## $ Ymi_negemotzmissing[86]               <dbl> -0.58489937, -0.19960985, -0.17835803, -0.09381337, ...
## $ Ymi_negemotzmissing[87]               <dbl> -0.41169811, -0.61084609, -1.09517536, -2.48993414, ...
## $ Ymi_negemotzmissing[98]               <dbl> 0.47187314, -0.20029597, -0.74314411, -0.60489423, -...
## $ Ymi_negemotzmissing[103]              <dbl> 1.326050797, -0.414836975, 0.585970543, -0.004622591...
## $ Ymi_negemotzmissing[107]              <dbl> -0.455625258, -1.346992477, -0.670417852, -0.9688168...
## $ Ymi_negemotzmissing[131]              <dbl> -1.02817613, -0.38086307, -0.45937949, -0.83533452, ...
```

Columns b_govactz_Intercept through sigma_negemotzmissing were business as usual. But notice all the Ymi_negemotzmissing[i] columns. In each of these we see 4,000 posterior draws for the missing negemot_z_missing values. The [i] part of the column names indexes which row number the iterations correspond to. Since we made a lot of missing values in the data, I won’t go through them all. But we can focus on a few to get a sense of the results.
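If you’d like to confirm that those bracketed indices line up with the holes in the data, you can pull the NA rows directly (again assuming the glbwarm data from above):

```r
# rows of glbwarm where negemot_z_missing is NA; the first few should be
# 7, 22, 31, and so on, matching the Ymi_negemotzmissing[i] column names
which(is.na(glbwarm$negemot_z_missing)) %>% 
  head()
```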

```r
post %>% 
  select(`Ymi_negemotzmissing[7]`:`Ymi_negemotzmissing[131]`) %>% 
  gather(row, value) %>% 
  group_by(row) %>% 
  # Yep, that's right, we're summarizing as usual
  summarize(mean = mean(value),
            sd   = sd(value),
            ll   = quantile(value, probs = .025),
            ul   = quantile(value, probs = .975)) %>% 
  mutate_if(is.double, round, digits = 2) %>% 
  mutate(row = str_extract(row, "\\d+") %>% as.integer())  # this line just makes the row names easier to read
## # A tibble: 13 x 5
##      row   mean    sd    ll    ul
##    <int>  <dbl> <dbl> <dbl> <dbl>
##  1   103  0.65   0.84 -0.99  2.28
##  2   107 -0.58   0.75 -2.07  0.89
##  3   131 -0.45   0.76 -1.96  1.01
##  4    22  0.38   0.78 -1.16  1.89
##  5    31 -0.13   0.74 -1.56  1.33
##  6    55 -0.27   0.84 -1.89  1.39
##  7    60  0.48   0.78 -1.04  1.96
##  8    66 -0.34   0.81 -1.93  1.22
##  9     7  0.91   0.84 -0.7   2.59
## 10    72  0.34   0.75 -1.14  1.84
## 11    86 -0.33   0.83 -1.99  1.29
## 12    87 -0.67   0.73 -2.09  0.75
## 13    98 -0.18   0.85 -1.89  1.44
```

You may wonder why we didn’t just use mean imputation, wherein you fill in each missing value with the sample mean of its variable. Please don’t do that. It’s a sin against science and artificially decreases the variability in the data. Artificially reducing the variability results in artificially confident models (i.e., overly-narrow posterior $SD$s within the Bayesian paradigm and inflated type-I errors within the frequentist paradigm) and biased parameter estimates. There are much better methods at our disposal.
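To see the problem concretely, here’s a self-contained base-R simulation (toy data, unrelated to glbwarm) of how mean imputation shrinks the sample standard deviation:

```r
set.seed(1)

n <- 1e4
x <- rnorm(n, mean = 0, sd = 1)  # the complete data

# punch 20% holes in the data, completely at random
x_mis <- x
x_mis[sample(n, size = n * .2)] <- NA

# mean imputation: replace every NA with the observed mean
x_imp <- ifelse(is.na(x_mis), mean(x_mis, na.rm = TRUE), x_mis)

sd(x)      # close to 1
sd(x_imp)  # noticeably smaller--the imputed values add no variability
```

With 20% of the values pinned to a single number, the imputed standard deviation shrinks to roughly sqrt(.8), or about 89%, of the complete-data value.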

One of those better methods is multiple imputation. With multiple imputation, you create a small number of alternative data sets, typically 5, imputing plausible values into the missing-value slots. The imputed values vary across the data sets, and that uncertainty gets appropriately transmitted to the model. I know this might sound crazy, but it typically leads to much lower model bias when compared to mean imputation or listwise deletion.
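If you wanted to go the multiple-imputation route with brms, the workflow might look something like this sketch, pairing the mice package with brm_multiple(), which fits the model to each imputed data set and then combines the posterior draws. Consider it an outline rather than something we’ll run here:

```r
library(mice)

# create 5 imputed versions of the data (m = 5 is the mice default)
glbwarm_imputed <- mice(glbwarm, m = 5)

# fit the model to each of the 5 data sets and pool the posterior draws
model_mi <-
  brm_multiple(data = glbwarm_imputed,
               family = gaussian,
               govact_z ~ 1 + negemot_z_missing + age_z + negemot_z_missing:age_z,
               chains = 4, cores = 4)
```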

But we didn’t quite use multiple imputation. With one-step Bayesian imputation using the mi() syntax, you get an entire posterior distribution for each missing value. And if you have variables in the data set that might help predict what those missing values are, you’d just plug that into the model. Improving the imputation model can improve the subsequent substantive model.
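For example, if we thought age_z carried information about the missing negemot_z_missing values, we might expand the imputation part of the formula like so (a sketch under that assumption; we won’t fit it here):

```r
my_model_2 <-
  bf(govact_z ~ 1 + mi(negemot_z_missing) + age_z + mi(negemot_z_missing):age_z) +
  # age_z now helps predict the missing negemot_z_missing values
  bf(negemot_z_missing | mi() ~ 1 + age_z) +
  set_rescor(FALSE)
```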

For more on the mi() approach, see Bürkner’s vignette. McElreath lectured on this topic within the context of his *Statistical Rethinking* text, and you can find my effort to translate the chapter 14 code in McElreath’s text into brms here. For a more general introduction to missing data theory, check out van Buuren’s bookdown book, *Flexible Imputation of Missing Data, second edition*, or Enders’s great *Applied Missing Data Analysis*. You can find Enders lecturing on missing data here.

The take home message is there is no need to ignore missing data or use outdated procedures like listwise deletion. Be a champion and model your missing data with brms.