## 9.3 A caution on manual centering and standardization

It’s worthwhile considering the issue of listwise deletion when data are partially missing. The brms default is to delete rows with missingness (`NA` in R) in the predictors. However, brms allows users to perform one-step Bayesian imputation for missing values using the `mi()` syntax. First we’ll see what happens when you fit a model in brms when some of the `negemot_z` values are missing, but without using the `mi()` syntax. And of course before we do that, we’ll make a `negemot_z_missing` variable, which is identical to `negemot_z` except that about 10% of its values are missing.

```
set.seed(815)

glbwarm <-
  glbwarm %>%
  mutate(missing = rbinom(n = 815, size = 1, prob = .1)) %>%
  mutate(negemot_z_missing = ifelse(missing == 1, NA, negemot_z))
```
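If you’d like to convince yourself what that `rbinom()` line is doing, here’s a quick self-contained sanity check. With `size = 1`, `rbinom()` returns Bernoulli (0/1) draws, and `prob = .1` makes each draw a 1 about 10% of the time.

```r
set.seed(815)

# the same call as above: 815 Bernoulli draws with a 10% success rate
missing <- rbinom(n = 815, size = 1, prob = .1)

# all of the draws are 0's and 1's
table(missing)

# the proportion of 1's--i.e., the share of rows we're about to make missing
mean(missing)
```

Because these are random draws, the realized proportion won’t be exactly 10%; that’s just sampling variation.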

If you’ve never used `rbinom()` before, code `?rbinom` or look it up in your favorite web search engine. Here’s our listwise deletion model, which corresponds to what you’d get from a typical OLS-based program.

```
model5 <-
  update(model3,
         newdata = glbwarm,
         govact_z ~ 1 + negemot_z_missing + age_z + negemot_z_missing:age_z,
         chains = 4, cores = 4)
```

Let’s compare the listwise deletion results with the model based on all the data.

`print(model3)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: govact_z ~ negemot_z + age_z + negemot_z:age_z
## Data: glbwarm (Number of observations: 815)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 0.01 0.03 -0.05 0.06 4000 1.00
## negemot_z 0.56 0.03 0.50 0.62 4000 1.00
## age_z -0.06 0.03 -0.12 -0.01 4000 1.00
## negemot_z:age_z 0.13 0.03 0.07 0.19 4000 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 0.81 0.02 0.77 0.85 4000 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

`print(model5)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: govact_z ~ negemot_z_missing + age_z + negemot_z_missing:age_z
## Data: glbwarm (Number of observations: 719)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 0.00 0.03 -0.06 0.06 4000 1.00
## negemot_z_missing 0.56 0.03 0.50 0.62 4000 1.00
## age_z -0.05 0.03 -0.11 0.00 4000 1.00
## negemot_z_missing:age_z 0.12 0.03 0.06 0.18 4000 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 0.81 0.02 0.77 0.86 4000 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

In this case, the model results were similar to those based on all the data because we used `rbinom()` to delete the predictor values completely at random. With real data and real-life missing data mechanisms, the situation isn’t often so rosy. But anyway, the real story, here, is the `Data: glbwarm (Number of observations: n)` line at the top of the `print()` outputs. The number, \(n\), was 815 in the model using all the data and 719 for the one based on listwise deletion. That’s a lot of missing information.

The `mi()` syntax will allow us to use all the rows in a model, even if one or more of the predictors contain missing values. The syntax makes the model a multivariate model in that now we’ll be modeling both `govact_z` *and* `negemot_z_missing`. There are multiple ways to write a multivariate model in brms. One nice way is to write the model for each criterion separately in a `bf()` statement. You combine the `bf()` statements together with the `+` operator. And for models like the ones in Hayes’s text, you’ll also want to tack on `set_rescor(FALSE)`. You can do all of this within the `brm()` function, as usual. But I find that this clutters the code up more than I like. So another approach is to save the combination of `bf()` statements as an object.

```
my_model <-
  bf(govact_z ~ 1 + mi(negemot_z_missing) + age_z + mi(negemot_z_missing):age_z) +
  bf(negemot_z_missing | mi() ~ 1) +
  set_rescor(FALSE)
```

With our multivariate formula saved as `my_model`, we’re ready to plug it into `brm()` and fit.

```
model6 <-
  brm(data = glbwarm,
      family = gaussian,
      my_model,
      chains = 4, cores = 4)
```

Let’s see what we’ve done.

`print(model6)`

```
## Family: MV(gaussian, gaussian)
## Links: mu = identity; sigma = identity
## mu = identity; sigma = identity
## Formula: govact_z ~ 1 + mi(negemot_z_missing) + age_z + mi(negemot_z_missing):age_z
## negemot_z_missing | mi() ~ 1
## Data: glbwarm (Number of observations: 815)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## govactz_Intercept 0.01 0.03 -0.05 0.06 4000 1.00
## negemotzmissing_Intercept 0.00 0.04 -0.07 0.08 4000 1.00
## govactz_age_z -0.07 0.03 -0.12 -0.01 4000 1.00
## govactz_minegemot_z_missing 0.56 0.03 0.50 0.62 4000 1.00
## govactz_minegemot_z_missing:age_z 0.13 0.03 0.07 0.19 4000 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma_govactz 0.81 0.02 0.77 0.85 4000 1.00
## sigma_negemotzmissing 1.00 0.03 0.95 1.05 4000 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

When using the multivariate `mi()` syntax, your `print()` output becomes more complicated. Now we have a regression model for both `govact_z` and `negemot_z_missing`. At a minimum, each has its own intercept and residual variance (i.e., sigma). In the ‘Population-Level Effects’ section, the first part of the name for each regression coefficient clarifies which \(Y\)-variable it corresponds to (e.g., `govactz_Intercept` is the intercept for our primary \(Y\)-variable, `govact_z`). In the ‘Family Specific Parameters’ section, the sigmas are similarly labeled.

Perhaps most importantly, we see `Data: glbwarm (Number of observations: 815)` at the top of the output. The multivariate `mi()` syntax used all the available data. No listwise deletion necessary.

The `print()` output for our model obscured some of the results. To clarify what the `mi()` syntax did, let’s peek at the first 20 columns returned by `posterior_samples()`.

```
post <- posterior_samples(model6)

post[, 1:20] %>%
  glimpse()
```

```
## Observations: 4,000
## Variables: 20
## $ b_govactz_Intercept <dbl> -0.009990340, 0.021540051, 0.002501819, -0.034147985...
## $ b_negemotzmissing_Intercept <dbl> 0.0169003031, -0.0314997166, 0.0268748216, 0.0229868...
## $ b_govactz_age_z <dbl> -0.049833149, -0.059868350, -0.075648875, -0.0778382...
## $ bsp_govactz_minegemot_z_missing <dbl> 0.5816318, 0.5392590, 0.5393454, 0.4762133, 0.540401...
## $ `bsp_govactz_minegemot_z_missing:age_z` <dbl> 0.14544354, 0.10537405, 0.13267566, 0.13912678, 0.14...
## $ sigma_govactz <dbl> 0.8140491, 0.7789521, 0.8355927, 0.8168074, 0.789764...
## $ sigma_negemotzmissing <dbl> 1.0399030, 0.9872644, 1.0050982, 0.9943740, 0.967165...
## $ `Ymi_negemotzmissing[7]` <dbl> 2.03642753, 0.01721568, 0.67960492, 0.64964174, 0.67...
## $ `Ymi_negemotzmissing[22]` <dbl> 1.03507669, -0.32474974, 0.47110954, -0.38361393, 0....
## $ `Ymi_negemotzmissing[31]` <dbl> -1.0684763, 1.1596497, -1.1248319, 0.8671490, -0.405...
## $ `Ymi_negemotzmissing[55]` <dbl> -0.23866444, 1.11734470, -0.58935567, 0.27172464, -0...
## $ `Ymi_negemotzmissing[60]` <dbl> -0.23798926, 1.76955914, -0.24605036, 1.70170122, 2....
## $ `Ymi_negemotzmissing[66]` <dbl> -0.570751760, 0.910622983, -1.110464685, -0.59118488...
## $ `Ymi_negemotzmissing[72]` <dbl> 0.07032315, 0.29686228, 1.90339923, 1.58553141, 0.68...
## $ `Ymi_negemotzmissing[86]` <dbl> -0.58489937, -0.19960985, -0.17835803, -0.09381337, ...
## $ `Ymi_negemotzmissing[87]` <dbl> -0.41169811, -0.61084609, -1.09517536, -2.48993414, ...
## $ `Ymi_negemotzmissing[98]` <dbl> 0.47187314, -0.20029597, -0.74314411, -0.60489423, -...
## $ `Ymi_negemotzmissing[103]` <dbl> 1.326050797, -0.414836975, 0.585970543, -0.004622591...
## $ `Ymi_negemotzmissing[107]` <dbl> -0.455625258, -1.346992477, -0.670417852, -0.9688168...
## $ `Ymi_negemotzmissing[131]` <dbl> -1.02817613, -0.38086307, -0.45937949, -0.83533452, ...
```
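Before walking through those columns, a quick aside: the bracketed indices in the `Ymi_negemotzmissing[i]` names correspond to row numbers in `glbwarm`. Presuming the `glbwarm` data from above, you might confirm that with a line like the following; the first few values it returns should line up with the column names in the `glimpse()` output (7, 22, 31, and so on).

```r
# the rows of glbwarm for which negemot_z_missing is NA
which(is.na(glbwarm$negemot_z_missing)) %>%
  head(10)
```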

Columns `b_govactz_Intercept` through `sigma_negemotzmissing` were business as usual. But notice all the `Ymi_negemotzmissing[i]` columns. In each of these we see 4,000 posterior draws for one of the missing `negemot_z_missing` values. The `[i]` part of the column names indexes which row number the draws correspond to. Since we made a lot of missing values in the data, I won’t go through them all. But we can focus on a few to get a sense of the results.

```
post %>%
  select(`Ymi_negemotzmissing[7]`:`Ymi_negemotzmissing[131]`) %>%
  gather(row, value) %>%
  group_by(row) %>%
  # yep, that's right, we're summarizing as usual
  summarize(mean = mean(value),
            sd   = sd(value),
            ll   = quantile(value, probs = .025),
            ul   = quantile(value, probs = .975)) %>%
  mutate_if(is.double, round, digits = 2) %>%
  # this line just makes the row numbers easier to read
  mutate(row = str_extract(row, "\\d+") %>% as.integer())
```

```
## # A tibble: 13 x 5
## row mean sd ll ul
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 103 0.65 0.84 -0.99 2.28
## 2 107 -0.580 0.75 -2.07 0.89
## 3 131 -0.45 0.76 -1.96 1.01
## 4 22 0.38 0.78 -1.16 1.89
## 5 31 -0.13 0.74 -1.56 1.33
## 6 55 -0.27 0.84 -1.89 1.39
## 7 60 0.48 0.78 -1.04 1.96
## 8 66 -0.34 0.81 -1.93 1.22
## 9 7 0.91 0.84 -0.7 2.59
## 10 72 0.34 0.75 -1.14 1.84
## 11 86 -0.33 0.83 -1.99 1.29
## 12 87 -0.67 0.73 -2.09 0.75
## 13 98 -0.18 0.85 -1.89 1.44
```

Now, if you’re new to missing data theory, you might be tempted to handle those missing values with something simpler, like plugging the sample mean into each missing-value slot (i.e., mean imputation). Please don’t do that. It’s a sin against science and artificially decreases the variability in the data. Artificially reducing the variability results in artificially confident models (i.e., overly narrow posterior \(SD\)s within the Bayesian paradigm and inflated Type I error rates within the frequentist paradigm) and biased parameter estimates. There are much better methods at our disposal.
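Here’s a minimal simulation of that shrinkage. The variable `x` below is just a hypothetical standard-normal predictor, not part of the `glbwarm` data.

```r
set.seed(1)

n <- 1e4
x <- rnorm(n)  # the complete data

# delete about 10% of the values completely at random
x_observed <- ifelse(rbinom(n, size = 1, prob = .1) == 1, NA, x)

# mean imputation: plug the observed mean into every missing slot
x_imputed <- ifelse(is.na(x_observed),
                    mean(x_observed, na.rm = TRUE),
                    x_observed)

# the mean-imputed variable has a smaller SD than the original
sd(x)
sd(x_imputed)
```

With 10% of the values replaced by a constant, the imputed variable’s \(SD\) shrinks by roughly 5%, and that artificial precision propagates right into the model.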

One of those better methods is multiple imputation. With multiple imputation, you create a small number of alternative data sets (typically 5), imputing plausible values into each missing-value slot. The imputed values vary across the data sets, and that uncertainty gets appropriately transmitted to the model. I know this might sound crazy, but it typically leads to much lower model bias when compared to mean imputation or listwise deletion.
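If you wanted conventional multiple imputation, brms supports that workflow, too, via the `brm_multiple()` function. Here’s a sketch presuming the mice package for the imputation step and our `glbwarm` data; take it as an illustration of the general workflow rather than something we’ll fit here.

```r
library(mice)

# step 1: create m = 5 alternative data sets with imputed values
imputed <- mice(glbwarm, m = 5, printFlag = FALSE)

# step 2: fit the model to each imputed data set; brm_multiple() combines
# the posterior draws across the five fits
model_mi <-
  brm_multiple(data = imputed,
               family = gaussian,
               govact_z ~ 1 + negemot_z_missing + age_z + negemot_z_missing:age_z,
               chains = 4, cores = 4)
```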

But we didn’t quite use multiple imputation. With one-step Bayesian imputation using the `mi()` syntax, you get an entire posterior distribution for each missing value. And if you have variables in the data set that might help predict what those missing values are, you can just plug them into the model. Improving the imputation model can improve the subsequent substantive model.

For more on the `mi()` approach, see Bürkner’s vignette. McElreath lectured on this topic within the context of his *Statistical Rethinking* text, and you can find my effort to translate his Chapter 14 code into brms here. For a more general introduction to missing data theory, check out van Buuren’s bookdown book, *Flexible Imputation of Missing Data, Second Edition*, or Enders’ great *Applied Missing Data Analysis*. You can also find Enders lecturing on missing data here.

The take-home message is that there’s no need to ignore missing data or rely on outdated procedures like listwise deletion. Be a champion and model your missing data with brms.