# Chapter 6 More topics on Multiple Imputation and Regression Modelling

This Chapter is a follow-up on the previous Chapter 5 about data analysis with Multiple Imputation. In this Chapter, we will deal with some specific topics when you perform regression modeling in multiple imputed datasets.

## 6.1 Regression modeling with categorical covariates

For categorical covariates, SPSS does not generate a pooled p-value for the overall Wald test. This is equal to not presenting a pooled Chi-square value in SPSS because the overall Wald value is a Chi-square value that represents the relationship between variables with > 2 categories and the outcome. An example is shown in Figure 6.1, where the relationshp between a categorical version of the Tampa scale variable (categories 0 = low fear of movement, 1 = middle fear of movement and 2 is a high fear of movement) and the outcome Radiation (in the leg, 1=yes, 0=no) is pesented after MI in 5 imputed datasets. The overall Wald test and p-value is presented for each imputed dataset (in the row with a df of 2), but is missing for the pooled model. This is also the case for Cox regression models.

There are several procedures to derive a pooled p-value for categorical variables, the pooled sampling variance or D1 method, the multiple parameter Wald test or D2 method, and the Meng and Rubin pooling procedure (Van Buuren (2018); C. K. Enders (2010); Eekhout, Wiel, and Heymans (2017); Meng and Rubin (1992); Mistler (2013); Marshall et al. (2009)). A more simple procedures to derive a pooled p-value for the overall Wald test is just taking the median of the p-values from the overall Wald test of all imputed datasets, this rule is called the Median P Rule (MPR) (Eekhout, Wiel, and Heymans (2017)). These methods can only be performed in R.

## 6.2 Logistic regression with a categorical variable in R

Pooling of categorical variables can be done by using the psfmi package that is downloadable via Github. The package needs R version 3.4.4 or higher. The package contains a function called psfmi_lr for pooling of logistic regression models and psfmi_coxr, to pool right censored Cox regression models. Install the package by first running:

install.packages("devtools")

library(devtools)

devtools::install_github("mwheymans/psfmi")

library(psfmi)

Than run the following code to pool the logistic regression model with as independent variable the categorical Tampa scale variable and as outcome the Radiation variable. To derive the pooled p-value for the overall Wald test, the D1 method is used. Other settings that can be used for method=, are “D2” for the D2 pooling method, “MR” for the Meng and Rubin method and “MPR” for the Median P Rule.

library(foreign)
library(psfmi)
data <- read.spss(file="Backpain 150 Missing_Tampa_Cat Imp.sav", to.data.frame = TRUE) 
## re-encoding from UTF-8
psfmi_lr(data=data, nimp=5, impvar="Imputation_", Outcome="Radiation",
cat.predictors=c("Tampa_Cat"), method="D1")
##
##  Step 1
##
##  Variables included in model = factor(Tampa_Cat)
##
##  Pooled model (Rubin's Rules)
##                        est std.err signif   lower   upper     OR   L.OR
## (Intercept)        -0.7843  0.3330 0.0186 -1.4374 -0.1313 0.4564 0.2375
## factor(Tampa_Cat)2  0.3526  0.4319 0.4143 -0.4940  1.1993 1.4228 0.6102
## factor(Tampa_Cat)3  0.6304  0.4573 0.1691 -0.2698  1.5307 1.8784 0.7635
##                      U.OR
## (Intercept)        0.8770
## factor(Tampa_Cat)2 3.3178
## factor(Tampa_Cat)3 4.6214
##
##  Mixed Pooled p-values (D1 & RR)
##                   Chi.sq Pooled.P.value
## factor(Tampa_Cat)  1.002          0.368
##
##  Final results of Pooled model
##         with a p-value of 1

Pooling a multivariable regression model that contains a mix of continuous, dichotomous and categorical variables can easily be performed with the psfmi_lr function. With the next code example, pooling is done in 10 multiple imputed datasets that are stored in the file lbpmilr (see ?lbpmilr for more information about the dataset) . The relationship with the outcome variable “Chronic” is estimated, using a model with 2 dichotomous, 4 continuous and 2 categorical variables. The method to obtain a pooled p-value for the categorical variables is the Meng and Rubin procedure, indicated by method="MR". To obtain a pooled p-value for continuous and dichotomous variables Rubin’s Rules are used.

library(foreign)
library(psfmi)

psfmi_lr(data=lbpmilr, nimp=10, impvar="Impnr", Outcome="Chronic",
predictors=c("Gender", "Smoking", "Function", "JobControl", "JobDemands",
"SocialSupport"), cat.predictors = c("Carrying", "Satisfaction"),
method="MR", print.method = FALSE)
##
##  Step 1
##
##  Variables included in model = Gender Smoking Function JobControl JobDemands SocialSupport factor(Carrying) factor(Satisfaction)
##
##  Mixed Pooled p-values (RR / MR)
##
##  Pooled model (Rubin's Rules)
##                           est std.err signif   lower  upper     OR
## (Intercept)           -1.0450  2.5726 0.6846 -6.0875 3.9975 0.3517
## Gender                -0.4100  0.4635 0.3765 -1.3190 0.4989 0.6636
## Smoking                0.0334  0.3756 0.9291 -0.7028 0.7696 1.0340
## Function              -0.0828  0.0491 0.0915 -0.1790 0.0134 0.9205
## JobControl            -0.0050  0.0204 0.8050 -0.0450 0.0349 0.9950
## JobDemands             0.0057  0.0390 0.8838 -0.0707 0.0821 1.0057
## SocialSupport          0.0549  0.0585 0.3479 -0.0598 0.1697 1.0565
## factor(Carrying)2      1.3114  0.5276 0.0130  0.2765 2.3463 3.7115
## factor(Carrying)3      2.2395  0.5942 0.0002  1.0719 3.4070 9.3882
## factor(Satisfaction)2 -0.5464  0.4887 0.2642 -1.5073 0.4145 0.5790
## factor(Satisfaction)3 -1.1034  0.5860 0.0600 -2.2532 0.0465 0.3317
##                       lower.EXP upper.EXP
## (Intercept)              0.0023   54.4592
## Gender                   0.2674    1.6470
## Smoking                  0.4952    2.1590
## Function                 0.8361    1.0135
## JobControl               0.9560    1.0356
## JobDemands               0.9318    1.0855
## SocialSupport            0.9420    1.1849
## factor(Carrying)2        1.3186   10.4471
## factor(Carrying)3        2.9208   30.1761
## factor(Satisfaction)2    0.2215    1.5136
## factor(Satisfaction)3    0.1051    1.0476
##
##                      Mixed p-values (MR & RR)
## Gender                                0.37650
## Smoking                               0.92910
## Function                              0.09150
## JobControl                            0.80500
## JobDemands                            0.88380
## SocialSupport                         0.34790
## factor(Carrying)                      0.00067
## factor(Satisfaction)                  0.17475
##
##  Final results of Pooled model
##         with a p-value of 1

In the final table that is called “Mixed p-values (MR & RR)”, the pooled p-values can be found for all variables, including the categorical variables that are pooled by the Meng and Rubin procedure. For the dichotomous and continuous variables the pooled p-value estimated by Rubin’s Rules is reported. These are equal to the p-values in the first table under “Pooled model (Rubin’s Rules)”. You can set print.method = TRUE, to obtain the table with the pooled p-values by the Meng and Rubin method for all variables.

## 6.3 Cox Regression with a categorical variable in R

In the psfmi package, a function that is called psfmi_coxris also available to obtain overall pooled p-values for categorical variables in Cox regression models, because this is also not possible in SPSS. All pooling methods can be applied, except for the Meng and Rubin procedure. The Meng and Rubin procedure is not recommended to use for Cox regression models (Marshall et al. (2009)).

To pool a Cox regression model with a mix of continuous and dichotomous variables and 1 categorical variable over 10 impted datasets (see ?psfmi::lbpmicox for more information on the dataset) and pooling method “D1” use:

library(foreign)
library(psfmi)

psfmi_coxr(data=lbpmicox, nimp=10, impvar="Impnr", time="Time", status="Status",
"Previous", "Tampascale", "JobControl", "JobDemand", "Social"),
cat.predictors=c("Expect_cat"), method="D1", print.method = FALSE)
##
##  Step 1
##
##  Variables included in model = Duration Radiation Onset Function Age Previous Tampascale JobControl JobDemand Social factor(Expect_cat)
##
##  Pooled model (Rubin's Rules)
##                         est std.err signif   lower   upper     HR   L.HR
## Duration            -0.0078  0.0040 0.0513 -0.0156  0.0000 0.9922 0.9845
## Radiation           -0.0748  0.1534 0.6258 -0.3754  0.2258 0.9279 0.6870
## Onset               -0.0937  0.1759 0.5944 -0.4385  0.2511 0.9106 0.6450
## Function             0.0439  0.0169 0.0092  0.0109  0.0769 1.0449 1.0110
## Age                 -0.0089  0.0077 0.2485 -0.0240  0.0062 0.9911 0.9763
## Previous            -0.0995  0.1992 0.6176 -0.4899  0.2910 0.9053 0.6127
## Tampascale          -0.0235  0.0138 0.0893 -0.0506  0.0036 0.9768 0.9507
## JobControl          -0.0083  0.0083 0.3168 -0.0247  0.0080 0.9917 0.9756
## JobDemand           -0.0218  0.0155 0.1597 -0.0522  0.0086 0.9784 0.9491
## Social              -0.0515  0.0249 0.0387 -0.1002 -0.0027 0.9498 0.9047
## factor(Expect_cat)2  0.2435  0.2308 0.2914 -0.2088  0.6958 1.2757 0.8116
## factor(Expect_cat)3  0.2279  0.2002 0.2549 -0.1644  0.6203 1.2560 0.8484
##                       U.HR
## Duration            1.0000
## Onset               1.2854
## Function            1.0799
## Age                 1.0062
## Previous            1.3378
## Tampascale          1.0036
## JobControl          1.0080
## JobDemand           1.0086
## Social              0.9973
## factor(Expect_cat)2 2.0053
## factor(Expect_cat)3 1.8595
##
##  Pooled p-values (D1 & RR)
##                    Chi.sq Pooled.P.value
## Duration            3.797         0.0513
## Onset               0.283         0.5944
## Function            6.781         0.0092
## Age                 1.332         0.2485
## Previous            0.249         0.6176
## Tampascale          2.888         0.0893
## JobControl          1.002         0.3168
## JobDemand           1.977         0.1597
## Social              4.276         0.0387
## factor(Expect_cat)  0.728         0.4829
##
##  Final results of Pooled model
##         with a p-value of 1

The pooled p-values are shown in the table that is called “Pooled p-values (D1 & RR)”. For the categorical variable factor(Expect_cat), the p-value is pooled using method “D1” and for all other variables Rubin’s ules are used. These p-values are the same as the p-values in the “Pooled model (Rubin’s Rules)” table above in the output. You can set print.method = TRUE to obtain the pooled p-values by method “D1” for all variables.

For more information about pooling possibilities of logistic and Cox regression models with the psfmi package see ?psfmi::psfmi_lr or ?psfmi::psfmi_coxr.

## 6.4 Variable selection

Prediction models are frequently developed by using selection procedures in logistic and Cox regression models. As a selection procedure, backward selection is generally recommended (Moons et al. (2015)). However, variable selection in multiple imputed datasets may be challenging. When you apply variable selection in each imputed dataset, different variables may be selected between imputed datasets. The question is than what is the best procedure to merge these different selected models into one final model. Wood (Wood, White, and Royston (2008)) showed that the best choice to select variables from multiply imputed datasets is to start your selection procedure in the pooled model. It is possible to do this in SPSS when the model includes continuous and dichotomous variables. This is however not possible in SPSS when the model contains categorical variables. Because categorical variables are estimated and selected by using an overall Wald Chi-square value and as we have seen in the previous paragraphs that that is not possible in SPSS. For that reason we have developed the R psfmi package, that contains functions to pool and select variables in multiple imputed datasets.

### 6.4.1 Variable Selection with Logistic Regression models in R

For variable selection we use the psfmi package that was introduced in paragraph 6.2 (see that paragraph of how to install the package). With this package variable selection can be applied by using different methods to obtain pooled p-values and to select on basis of these pooled p-values. These methods were introduced in paragraph 6.1 and 6.2 and are called the “D1”, “D2”, “MR” (Meng and Rubin) and “MPR” (Median P Rule). Variable selection of the multivariable model that was pooled in paragraph 6.2 can be done by using the following code, where method “D1” is used and a p-value of 0.05 as selection criterium.

library(foreign)
library(psfmi)

psfmi_lr(data=lbpmilr, nimp=10, impvar="Impnr", Outcome="Chronic",
predictors=c("Gender", "Smoking", "Function", "JobControl", "JobDemands",
"SocialSupport"), cat.predictors = c("Carrying", "Satisfaction"),
p.crit = 0.05, method="D1", print.method = FALSE)
##
##  Step 1
##
##  Variables included in model = Gender Smoking Function JobControl JobDemands SocialSupport factor(Carrying) factor(Satisfaction)
##
##  Pooled model (Rubin's Rules)
##                           est std.err signif   lower  upper     OR   L.OR
## (Intercept)           -1.0450  2.5726 0.6846 -6.0875 3.9975 0.3517 0.0023
## Gender                -0.4100  0.4635 0.3765 -1.3190 0.4989 0.6636 0.2674
## Smoking                0.0334  0.3756 0.9291 -0.7028 0.7696 1.0340 0.4952
## Function              -0.0828  0.0491 0.0915 -0.1790 0.0134 0.9205 0.8361
## JobControl            -0.0050  0.0204 0.8050 -0.0450 0.0349 0.9950 0.9560
## JobDemands             0.0057  0.0390 0.8838 -0.0707 0.0821 1.0057 0.9318
## SocialSupport          0.0549  0.0585 0.3479 -0.0598 0.1697 1.0565 0.9420
## factor(Carrying)2      1.3114  0.5276 0.0130  0.2765 2.3463 3.7115 1.3186
## factor(Carrying)3      2.2395  0.5942 0.0002  1.0719 3.4070 9.3882 2.9208
## factor(Satisfaction)2 -0.5464  0.4887 0.2642 -1.5073 0.4145 0.5790 0.2215
## factor(Satisfaction)3 -1.1034  0.5860 0.0600 -2.2532 0.0465 0.3317 0.1051
##                          U.OR
## (Intercept)           54.4592
## Gender                 1.6470
## Smoking                2.1590
## Function               1.0135
## JobControl             1.0356
## JobDemands             1.0855
## SocialSupport          1.1849
## factor(Carrying)2     10.4471
## factor(Carrying)3     30.1761
## factor(Satisfaction)2  1.5136
## factor(Satisfaction)3  1.0476
##
##  Mixed Pooled p-values (D1 & RR)
##                      Chi.sq Pooled.P.value
## Gender                0.782         0.3765
## Smoking               0.008         0.9291
## Function              2.850         0.0915
## JobControl            0.061         0.8050
## JobDemands            0.021         0.8838
## SocialSupport         0.881         0.3479
## factor(Carrying)      7.342         0.0007
## factor(Satisfaction)  1.712         0.1812
##
##  Variable excluded at Step 1 is  Smoking
##
##  Step 2
##
##  Variables included in model = Gender Function JobControl JobDemands SocialSupport factor(Carrying) factor(Satisfaction)
##
##  Pooled model (Rubin's Rules)
##                           est std.err signif   lower  upper     OR   L.OR
## (Intercept)           -1.0362  2.5706 0.6869 -6.0748 4.0024 0.3548 0.0023
## Gender                -0.4081  0.4630 0.3782 -1.3160 0.4999 0.6649 0.2682
## Function              -0.0825  0.0488 0.0913 -0.1782 0.0133 0.9209 0.8368
## JobControl            -0.0050  0.0204 0.8067 -0.0450 0.0350 0.9950 0.9560
## JobDemands             0.0058  0.0389 0.8825 -0.0706 0.0821 1.0058 0.9319
## SocialSupport          0.0548  0.0585 0.3488 -0.0599 0.1695 1.0563 0.9419
## factor(Carrying)2      1.3163  0.5236 0.0120  0.2892 2.3434 3.7296 1.3354
## factor(Carrying)3      2.2397  0.5943 0.0002  1.0719 3.4075 9.3909 2.9211
## factor(Satisfaction)2 -0.5493  0.4870 0.2601 -1.5069 0.4084 0.5774 0.2216
## factor(Satisfaction)3 -1.1074  0.5836 0.0580 -2.2526 0.0378 0.3304 0.1051
##                          U.OR
## (Intercept)           54.7274
## Gender                 1.6485
## Function               1.0133
## JobControl             1.0356
## JobDemands             1.0856
## SocialSupport          1.1847
## factor(Carrying)2     10.4166
## factor(Carrying)3     30.1905
## factor(Satisfaction)2  1.5044
## factor(Satisfaction)3  1.0385
##
##  Mixed Pooled p-values (D1 & RR)
##                      Chi.sq Pooled.P.value
## Gender                0.777         0.3782
## Function              2.853         0.0913
## JobControl            0.060         0.8067
## JobDemands            0.022         0.8825
## SocialSupport         0.878         0.3488
## factor(Carrying)      7.347         0.0007
## factor(Satisfaction)  1.741         0.1761
##
##  Variable excluded at Step 2 is  JobDemands
##
##  Step 3
##
##  Variables included in model = Gender Function JobControl SocialSupport factor(Carrying) factor(Satisfaction)
##
##  Pooled model (Rubin's Rules)
##                           est std.err signif   lower  upper     OR   L.OR
## (Intercept)           -0.8219  2.1206 0.6983 -4.9782 3.3343 0.4396 0.0069
## Gender                -0.4215  0.4521 0.3512 -1.3079 0.4649 0.6561 0.2704
## Function              -0.0829  0.0487 0.0892 -0.1784 0.0127 0.9205 0.8366
## JobControl            -0.0051  0.0204 0.8020 -0.0450 0.0348 0.9949 0.9560
## SocialSupport          0.0550  0.0585 0.3470 -0.0597 0.1697 1.0566 0.9421
## factor(Carrying)2      1.3086  0.5207 0.0121  0.2873 2.3300 3.7011 1.3328
## factor(Carrying)3      2.2305  0.5906 0.0002  1.0702 3.3908 9.3046 2.9159
## factor(Satisfaction)2 -0.5450  0.4851 0.2619 -1.4988 0.4088 0.5798 0.2234
## factor(Satisfaction)3 -1.1056  0.5828 0.0581 -2.2491 0.0380 0.3310 0.1055
##                          U.OR
## (Intercept)           28.0598
## Gender                 1.5918
## Function               1.0128
## JobControl             1.0354
## SocialSupport          1.1849
## factor(Carrying)2     10.2777
## factor(Carrying)3     29.6911
## factor(Satisfaction)2  1.5050
## factor(Satisfaction)3  1.0388
##
##  Mixed Pooled p-values (D1 & RR)
##                      Chi.sq Pooled.P.value
## Gender                0.869         0.3512
## Function              2.890         0.0892
## JobControl            0.063         0.8020
## SocialSupport         0.884         0.3470
## factor(Carrying)      7.374         0.0007
## factor(Satisfaction)  1.740         0.1762
##
##  Variable excluded at Step 3 is  JobControl
##
##  Step 4
##
##  Variables included in model = Gender Function SocialSupport factor(Carrying) factor(Satisfaction)
##
##  Pooled model (Rubin's Rules)
##                           est std.err signif   lower  upper     OR   L.OR
## (Intercept)           -1.1578  1.6622 0.4861 -4.4160 2.1004 0.3142 0.0121
## Gender                -0.4009  0.4460 0.3688 -1.2754 0.4736 0.6697 0.2793
## Function              -0.0850  0.0479 0.0764 -0.1790 0.0090 0.9185 0.8361
## SocialSupport          0.0572  0.0579 0.3231 -0.0563 0.1707 1.0589 0.9453
## factor(Carrying)2      1.3072  0.5207 0.0122  0.2858 2.3286 3.6958 1.3308
## factor(Carrying)3      2.2199  0.5866 0.0002  1.0676 3.3722 9.2067 2.9084
## factor(Satisfaction)2 -0.5451  0.4842 0.2610 -1.4970 0.4069 0.5798 0.2238
## factor(Satisfaction)3 -1.0920  0.5788 0.0595 -2.2276 0.0437 0.3356 0.1078
##                          U.OR
## (Intercept)            8.1698
## Gender                 1.6057
## Function               1.0091
## SocialSupport          1.1862
## factor(Carrying)2     10.2631
## factor(Carrying)3     29.1439
## factor(Satisfaction)2  1.5022
## factor(Satisfaction)3  1.0447
##
##  Mixed Pooled p-values (D1 & RR)
##                      Chi.sq Pooled.P.value
## Gender                0.808         0.3688
## Function              3.141         0.0764
## SocialSupport         0.976         0.3231
## factor(Carrying)      7.377         0.0007
## factor(Satisfaction)  1.722         0.1794
##
##  Variable excluded at Step 4 is  Gender
##
##  Step 5
##
##  Variables included in model = Function SocialSupport factor(Carrying) factor(Satisfaction)
##
##  Pooled model (Rubin's Rules)
##                           est std.err signif   lower  upper     OR   L.OR
## (Intercept)           -1.5406  1.5983 0.3351 -4.6734 1.5922 0.2142 0.0093
## Function              -0.0890  0.0476 0.0615 -0.1824 0.0043 0.9148 0.8333
## SocialSupport          0.0600  0.0573 0.2951 -0.0523 0.1722 1.0618 0.9491
## factor(Carrying)2      1.3827  0.5114 0.0069  0.3798 2.3856 3.9857 1.4620
## factor(Carrying)3      2.2234  0.5864 0.0002  1.0714 3.3754 9.2386 2.9194
## factor(Satisfaction)2 -0.5132  0.4798 0.2853 -1.4563 0.4298 0.5985 0.2331
## factor(Satisfaction)3 -1.0439  0.5717 0.0681 -2.1653 0.0776 0.3521 0.1147
##                          U.OR
## (Intercept)            4.9145
## Function               1.0043
## SocialSupport          1.1879
## factor(Carrying)2     10.8657
## factor(Carrying)3     29.2358
## factor(Satisfaction)2  1.5370
## factor(Satisfaction)3  1.0807
##
##  Mixed Pooled p-values (D1 & RR)
##                      Chi.sq Pooled.P.value
## Function              3.496         0.0615
## SocialSupport         1.096         0.2951
## factor(Carrying)      7.467         0.0006
## factor(Satisfaction)  1.609         0.2007
##
##  Variable excluded at Step 5 is  SocialSupport
##
##  Step 6
##
##  Variables included in model = Function factor(Carrying) factor(Satisfaction)
##
##  Pooled model (Rubin's Rules)
##                           est std.err signif   lower  upper     OR   L.OR
## (Intercept)           -0.1003  0.8048 0.9008 -1.6781 1.4774 0.9046 0.1867
## Function              -0.0892  0.0476 0.0611 -0.1826 0.0042 0.9146 0.8331
## factor(Carrying)2      1.3455  0.5064 0.0079  0.3524 2.3386 3.8401 1.4225
## factor(Carrying)3      2.1530  0.5791 0.0002  1.0154 3.2905 8.6105 2.7606
## factor(Satisfaction)2 -0.5138  0.4796 0.2848 -1.4567 0.4292 0.5982 0.2330
## factor(Satisfaction)3 -1.0278  0.5725 0.0729 -2.1512 0.0956 0.3578 0.1163
##                          U.OR
## (Intercept)            4.3817
## Function               1.0042
## factor(Carrying)2     10.3668
## factor(Carrying)3     26.8568
## factor(Satisfaction)2  1.5361
## factor(Satisfaction)3  1.1003
##
##  Mixed Pooled p-values (D1 & RR)
##                      Chi.sq Pooled.P.value
## Function              3.508         0.0612
## factor(Carrying)      7.189         0.0008
## factor(Satisfaction)  1.558         0.2112
##
##  Variable excluded at Step 6 is  factor(Satisfaction)
##
##  Step 7
##
##  Variables included in model = Function factor(Carrying)
##
##  Pooled model (Rubin's Rules)
##                       est std.err signif   lower  upper     OR   L.OR
## (Intercept)       -0.6671  0.7230 0.3562 -2.0849 0.7507 0.5132 0.1243
## Function          -0.0758  0.0464 0.1021 -0.1667 0.0151 0.9270 0.8464
## factor(Carrying)2  1.3317  0.5008 0.0079  0.3495 2.3138 3.7875 1.4184
## factor(Carrying)3  1.9718  0.5548 0.0004  0.8825 3.0612 7.1839 2.4170
##                      U.OR
## (Intercept)        2.1184
## Function           1.0152
## factor(Carrying)2 10.1132
## factor(Carrying)3 21.3527
##
##  Mixed Pooled p-values (D1 & RR)
##                  Chi.sq Pooled.P.value
## Function          2.673         0.1022
## factor(Carrying)  6.542         0.0015
##
##  Variable excluded at Step 7 is  Function
##
##  Step 8
##
##  Variables included in model = factor(Carrying)
##
##  Pooled model (Rubin's Rules)
##                       est std.err signif   lower   upper      OR   L.OR
## (Intercept)       -1.6553  0.4023 0.0000 -2.4442 -0.8663  0.1910 0.0868
## factor(Carrying)2  1.4651  0.4906 0.0029  0.5029  2.4272  4.3278 1.6535
## factor(Carrying)3  2.3173  0.5105 0.0000  1.3152  3.3195 10.1485 3.7255
##                      U.OR
## (Intercept)        0.4205
## factor(Carrying)2 11.3276
## factor(Carrying)3 27.6455
##
##  Mixed Pooled p-values (D1 & RR)
##                  Chi.sq Pooled.P.value
## factor(Carrying) 10.542              0

See ?psfmi::psfmi_lr for other possibilities during variable selection with the psfmi_lrfunction like forcing variables in the model during variable selection or variable selection with interaction terms.

### 6.4.2 Variable Selection with Cox Regression models in R

Variable selection with Cox regression models can be applied with the psfmi package, that was introduced in paragraph 6.2 (see that paragraph of how to install the package). The package contains the function psfmi_coxr that was used in paragraph 6.3 to pool a multivariable Cox regression model. We can apply variable selection in that model by the next code using the “D1” method to obtain pooled p-values for categorical variables and Rubin’s Rules for dichotomous and continuous variables and using a p-value of 0.05 as selection criterium.

library(foreign)
library(psfmi)

psfmi_coxr(data=lbpmicox, nimp=10, impvar="Impnr", time="Time", status="Status",
"Previous", "Tampascale", "JobControl", "JobDemand", "Social"),
cat.predictors=c("Expect_cat"), p.crit = 0.05, method="D1", print.method = FALSE)
##
##  Step 1
##
##  Variables included in model = Duration Radiation Onset Function Age Previous Tampascale JobControl JobDemand Social factor(Expect_cat)
##
##  Pooled model (Rubin's Rules)
##                         est std.err signif   lower   upper     HR   L.HR
## Duration            -0.0078  0.0040 0.0513 -0.0156  0.0000 0.9922 0.9845
## Radiation           -0.0748  0.1534 0.6258 -0.3754  0.2258 0.9279 0.6870
## Onset               -0.0937  0.1759 0.5944 -0.4385  0.2511 0.9106 0.6450
## Function             0.0439  0.0169 0.0092  0.0109  0.0769 1.0449 1.0110
## Age                 -0.0089  0.0077 0.2485 -0.0240  0.0062 0.9911 0.9763
## Previous            -0.0995  0.1992 0.6176 -0.4899  0.2910 0.9053 0.6127
## Tampascale          -0.0235  0.0138 0.0893 -0.0506  0.0036 0.9768 0.9507
## JobControl          -0.0083  0.0083 0.3168 -0.0247  0.0080 0.9917 0.9756
## JobDemand           -0.0218  0.0155 0.1597 -0.0522  0.0086 0.9784 0.9491
## Social              -0.0515  0.0249 0.0387 -0.1002 -0.0027 0.9498 0.9047
## factor(Expect_cat)2  0.2435  0.2308 0.2914 -0.2088  0.6958 1.2757 0.8116
## factor(Expect_cat)3  0.2279  0.2002 0.2549 -0.1644  0.6203 1.2560 0.8484
##                       U.HR
## Duration            1.0000
## Onset               1.2854
## Function            1.0799
## Age                 1.0062
## Previous            1.3378
## Tampascale          1.0036
## JobControl          1.0080
## JobDemand           1.0086
## Social              0.9973
## factor(Expect_cat)2 2.0053
## factor(Expect_cat)3 1.8595
##
##  Pooled p-values (D1 & RR)
##                    Chi.sq Pooled.P.value
## Duration            3.797         0.0513
## Onset               0.283         0.5944
## Function            6.781         0.0092
## Age                 1.332         0.2485
## Previous            0.249         0.6176
## Tampascale          2.888         0.0893
## JobControl          1.002         0.3168
## JobDemand           1.977         0.1597
## Social              4.276         0.0387
## factor(Expect_cat)  0.728         0.4829
##
##  Variable excluded at Step 1 is  Radiation
##
##  Step 2
##
##  Variables included in model = Duration Onset Function Age Previous Tampascale JobControl JobDemand Social factor(Expect_cat)
##
##  Pooled model (Rubin's Rules)
##                         est std.err signif   lower   upper     HR   L.HR
## Duration            -0.0079  0.0040 0.0474 -0.0157 -0.0001 0.9921 0.9844
## Onset               -0.0907  0.1756 0.6057 -0.4349  0.2536 0.9133 0.6473
## Function             0.0456  0.0165 0.0056  0.0133  0.0778 1.0467 1.0134
## Age                 -0.0089  0.0077 0.2495 -0.0240  0.0062 0.9911 0.9763
## Previous            -0.0893  0.1981 0.6520 -0.4777  0.2990 0.9146 0.6202
## Tampascale          -0.0235  0.0138 0.0883 -0.0506  0.0035 0.9768 0.9507
## JobControl          -0.0087  0.0083 0.2947 -0.0248  0.0075 0.9913 0.9755
## JobDemand           -0.0216  0.0155 0.1627 -0.0519  0.0087 0.9786 0.9494
## Social              -0.0509  0.0248 0.0405 -0.0996 -0.0022 0.9504 0.9052
## factor(Expect_cat)2  0.2666  0.2262 0.2385 -0.1767  0.7099 1.3055 0.8380
## factor(Expect_cat)3  0.2440  0.1976 0.2168 -0.1432  0.6313 1.2763 0.8666
##                       U.HR
## Duration            0.9999
## Onset               1.2887
## Function            1.0809
## Age                 1.0062
## Previous            1.3485
## Tampascale          1.0035
## JobControl          1.0075
## JobDemand           1.0087
## Social              0.9978
## factor(Expect_cat)2 2.0338
## factor(Expect_cat)3 1.8801
##
##  Pooled p-values (D1 & RR)
##                    Chi.sq Pooled.P.value
## Duration            3.930         0.0474
## Onset               0.267         0.6057
## Function            7.672         0.0056
## Age                 1.326         0.2495
## Previous            0.203         0.6520
## Tampascale          2.905         0.0883
## JobControl          1.098         0.2947
## JobDemand           1.949         0.1627
## Social              4.197         0.0405
## factor(Expect_cat)  0.883         0.4135
##
##  Variable excluded at Step 2 is  Previous
##
##  Step 3
##
##  Variables included in model = Duration Onset Function Age Tampascale JobControl JobDemand Social factor(Expect_cat)
##
##  Pooled model (Rubin's Rules)
##                         est std.err signif   lower   upper     HR   L.HR
## Duration            -0.0077  0.0040 0.0524 -0.0154  0.0001 0.9923 0.9847
## Onset               -0.0859  0.1753 0.6239 -0.4294  0.2576 0.9177 0.6509
## Function             0.0454  0.0164 0.0057  0.0132  0.0776 1.0464 1.0133
## Age                 -0.0093  0.0077 0.2277 -0.0243  0.0058 0.9907 0.9760
## Tampascale          -0.0231  0.0138 0.0933 -0.0500  0.0039 0.9772 0.9512
## JobControl          -0.0091  0.0082 0.2685 -0.0252  0.0070 0.9909 0.9751
## JobDemand           -0.0221  0.0154 0.1530 -0.0523  0.0082 0.9781 0.9490
## Social              -0.0516  0.0248 0.0373 -0.1002 -0.0030 0.9497 0.9047
## factor(Expect_cat)2  0.2640  0.2258 0.2424 -0.1786  0.7066 1.3021 0.8364
## factor(Expect_cat)3  0.2530  0.1964 0.1979 -0.1321  0.6380 1.2879 0.8763
##                       U.HR
## Duration            1.0001
## Onset               1.2938
## Function            1.0807
## Age                 1.0058
## Tampascale          1.0039
## JobControl          1.0070
## JobDemand           1.0082
## Social              0.9970
## factor(Expect_cat)2 2.0271
## factor(Expect_cat)3 1.8927
##
##  Pooled p-values (D1 & RR)
##                    Chi.sq Pooled.P.value
## Duration            3.762         0.0524
## Onset               0.240         0.6239
## Function            7.654         0.0057
## Age                 1.455         0.2277
## Tampascale          2.818         0.0933
## JobControl          1.225         0.2685
## JobDemand           2.042         0.1530
## Social              4.336         0.0373
## factor(Expect_cat)  0.919         0.3989
##
##  Variable excluded at Step 3 is  Onset
##
##  Step 4
##
##  Variables included in model = Duration Function Age Tampascale JobControl JobDemand Social factor(Expect_cat)
##
##  Pooled model (Rubin's Rules)
##                         est std.err signif   lower   upper     HR   L.HR
## Duration            -0.0078  0.0039 0.0488 -0.0155  0.0000 0.9922 0.9846
## Function             0.0453  0.0164 0.0058  0.0131  0.0776 1.0463 1.0132
## Age                 -0.0090  0.0077 0.2406 -0.0241  0.0060 0.9910 0.9762
## Tampascale          -0.0233  0.0137 0.0892 -0.0502  0.0036 0.9770 0.9510
## JobControl          -0.0088  0.0082 0.2825 -0.0248  0.0072 0.9912 0.9755
## JobDemand           -0.0220  0.0154 0.1544 -0.0522  0.0083 0.9782 0.9491
## Social              -0.0525  0.0247 0.0334 -0.1008 -0.0041 0.9489 0.9041
## factor(Expect_cat)2  0.2655  0.2258 0.2396 -0.1771  0.7081 1.3041 0.8377
## factor(Expect_cat)3  0.2523  0.1963 0.1987 -0.1324  0.6370 1.2870 0.8760
##                       U.HR
## Duration            1.0000
## Function            1.0807
## Age                 1.0060
## Tampascale          1.0036
## JobControl          1.0072
## JobDemand           1.0083
## Social              0.9959
## factor(Expect_cat)2 2.0301
## factor(Expect_cat)3 1.8908
##
##  Pooled p-values (D1 & RR)
##                    Chi.sq Pooled.P.value
## Duration            3.884         0.0488
## Function            7.609         0.0058
## Age                 1.377         0.2406
## Tampascale          2.889         0.0892
## JobControl          1.155         0.2825
## JobDemand           2.028         0.1544
## Social              4.526         0.0334
## factor(Expect_cat)  0.921         0.3981
##
##  Variable excluded at Step 4 is  factor(Expect_cat)
##
##  Step 5
##
##  Variables included in model = Duration Function Age Tampascale JobControl JobDemand Social
##
##  Pooled model (Rubin's Rules)
##                est std.err signif   lower   upper     HR   L.HR   U.HR
## Duration   -0.0075  0.0039 0.0553 -0.0151  0.0002 0.9925 0.9850 1.0002
## Function    0.0476  0.0165 0.0039  0.0152  0.0799 1.0488 1.0153 1.0832
## Age        -0.0081  0.0076 0.2811 -0.0229  0.0067 0.9919 0.9774 1.0067
## Tampascale -0.0211  0.0136 0.1200 -0.0476  0.0055 0.9791 0.9535 1.0055
## JobControl -0.0083  0.0082 0.3074 -0.0243  0.0077 0.9917 0.9760 1.0077
## JobDemand  -0.0209  0.0154 0.1744 -0.0511  0.0093 0.9793 0.9502 1.0093
## Social     -0.0525  0.0244 0.0311 -0.1003 -0.0048 0.9489 0.9046 0.9952
##
##  Pooled p-values (D1 & RR)
##            Chi.sq Pooled.P.value
## Duration    3.672         0.0553
## Function    8.315         0.0039
## Age         1.162         0.2811
## Tampascale  2.418         0.1200
## JobControl  1.042         0.3074
## JobDemand   1.845         0.1744
## Social      4.648         0.0311
##
##  Variable excluded at Step 5 is  JobControl
##
##  Step 6
##
##  Variables included in model = Duration Function Age Tampascale JobDemand Social
##
##  Pooled model (Rubin's Rules)
##                est std.err signif   lower   upper     HR   L.HR   U.HR
## Duration   -0.0071  0.0039 0.0688 -0.0147  0.0005 0.9929 0.9854 1.0005
## Function    0.0475  0.0165 0.0040  0.0151  0.0799 1.0486 1.0152 1.0832
## Age        -0.0080  0.0076 0.2931 -0.0228  0.0069 0.9920 0.9775 1.0069
## Tampascale -0.0211  0.0136 0.1222 -0.0477  0.0056 0.9791 0.9534 1.0056
## JobDemand  -0.0192  0.0154 0.2112 -0.0493  0.0109 0.9810 0.9519 1.0110
## Social     -0.0609  0.0229 0.0078 -0.1057 -0.0161 0.9409 0.8997 0.9840
##
##  Pooled p-values (D1 & RR)
##            Chi.sq Pooled.P.value
## Duration    3.312         0.0688
## Function    8.262         0.0040
## Age         1.105         0.2932
## Tampascale  2.389         0.1223
## JobDemand   1.563         0.2112
## Social      7.089         0.0078
##
##  Variable excluded at Step 6 is  Age
##
##  Step 7
##
##  Variables included in model = Duration Function Tampascale JobDemand Social
##
##  Pooled model (Rubin's Rules)
##                est std.err signif   lower   upper     HR   L.HR   U.HR
## Duration   -0.0079  0.0038 0.0385 -0.0153 -0.0004 0.9921 0.9848 0.9996
## Function    0.0481  0.0166 0.0038  0.0155  0.0806 1.0493 1.0156 1.0839
## Tampascale -0.0185  0.0133 0.1647 -0.0446  0.0076 0.9817 0.9564 1.0076
## JobDemand  -0.0177  0.0153 0.2450 -0.0477  0.0122 0.9825 0.9534 1.0123
## Social     -0.0603  0.0229 0.0085 -0.1052 -0.0154 0.9415 0.9001 0.9847
##
##  Pooled p-values (D1 & RR)
##            Chi.sq Pooled.P.value
## Duration    4.285         0.0385
## Function    8.377         0.0038
## Tampascale  1.930         0.1648
## JobDemand   1.352         0.2449
## Social      6.929         0.0085
##
##  Variable excluded at Step 7 is  JobDemand
##
##  Step 8
##
##  Variables included in model = Duration Function Tampascale Social
##
##  Pooled model (Rubin's Rules)
##                est std.err signif   lower   upper     HR   L.HR   U.HR
## Duration   -0.0081  0.0038 0.0325 -0.0155 -0.0007 0.9919 0.9846 0.9993
## Function    0.0484  0.0164 0.0032  0.0162  0.0805 1.0496 1.0163 1.0838
## Tampascale -0.0211  0.0132 0.1095 -0.0470  0.0047 0.9791 0.9541 1.0047
## Social     -0.0565  0.0227 0.0128 -0.1010 -0.0120 0.9451 0.9039 0.9881
##
##  Pooled p-values (D1 & RR)
##            Chi.sq Pooled.P.value
## Duration    4.574         0.0325
## Function    8.707         0.0032
## Tampascale  2.562         0.1095
## Social      6.197         0.0128
##
##  Variable excluded at Step 8 is  Tampascale
##
##  Step 9
##
##  Variables included in model = Duration Function Social
##
##  Pooled model (Rubin's Rules)
##              est std.err signif   lower   upper     HR   L.HR   U.HR
## Duration -0.0076  0.0037 0.0421 -0.0149 -0.0003 0.9924 0.9852 0.9997
## Function  0.0565  0.0154 0.0002  0.0263  0.0867 1.0581 1.0266 1.0906
## Social   -0.0520  0.0226 0.0215 -0.0963 -0.0077 0.9493 0.9082 0.9923
##
##  Pooled p-values (D1 & RR)
##          Chi.sq Pooled.P.value
## Duration  4.129         0.0422
## Function 13.430         0.0002
## Social    5.287         0.0215

## 6.5 Interaction terms in model

When the analysis model contains an interaction term, and one of the variables that is part of the interaction term contains missing data, it is important that the imputation procedure takes into account that different effects exist with the outcome for different categories of the effect modifier that is part of the interaction term. Otherwise the imputation model may not be consistent with the analysis model and this results in incorrect imputed values and biased coefficients and standard error estimates (Bartlett et al. (2015)).

For example, we may be interested in the relationship between (body)weight and blood pressure, with gender as effect modifier. The main aim of this model is to study the relationship between (body)weight and blood pressure (SBP) and whether this relationship depends on gender, i.e. is stronger or less strong for males compared to females. The analysis model is:

$SBP = β{_0} + β{_1} * BodyWeight + β{_2} * Gender + β{_3} * BodyWeight * Gender$

Assume now that there is missing data in the bodyweight variable. As a result, there will also be missing data in the interaction term between bodyweight and Gender.

When the imputation model consists of the following formula:

$BodyWeight{_{mis}} = β{_0} + β{_1} * SBP + β{_2} * Gender$ The imputation model is not consistent with the analysis model and would therefore not be able to generate valid imputations. We have to use an imputation procedure that takes into account that the relationship between Bodyweight and Blood pressure depends on Gender. A solution is to impute the relationship between blood pressure and bodyweight separately for males and females. That way, you account for the possibility that the relationship between bodyweight and blood pressure differs between males and females and the imputed values can then differ too.

Imputation of interaction terms in SPSS

To impute the missing data in the Bodyweight variable to examine the relationship between bodyweight and blood pressure, in a model including the interaction between bodyweight and gender are explained in the next steps.

Step 1 Split the dataset by Gender.

Data -> Split File -> Compare Groups

Move the Gender variable to the window: Group based on, then click OK.

Step 2 Perform Multiple Imputation. Use all variables in the imputation model, except the Gender variable. The Gender variable was used as a split variable and cannot be included in the imputation model. MI is now separately performed for Males and Females.

Step 3 Subsequently, turn on the split on the variable Imputation_ in the dataset with the imputed values. This will automatically turn off the split on Gender.

Step 4 Compute the Interaction term between Bodyweight and Gender via:

Transform -> Compute Variable

Step 5 Fit the regression model to Obtain Pooled results for the main analysis model.

For logistic or Cox regression models the same steps can be followed. For these models at step 4 it is not necessary to compute an interaction term. The inclusion of interaction terms in the model can be activated by first selecting 1 variable that is part of the interaction term, than pressing the Ctrl key and selecting the other variable at the same time. The >a*b> key will than be lighted and after you click on that button the interaction term is included in the model.

### 6.5.1 Imputation of interaction terms in R

The split file procedure that we used in SPSS in the previous paragraph to impute missing data and to take interaction terms into account can also be applied in R. First you have to split the data, impute the missing values, merge the data again and generate the interaction term in the dataset . Then results of the analysis model can be pooled. There is also another procedure in R that can be used. This procedure is called Substantive Model Compatible Fully Conditional Specification (SMC-FCS). This is a fairly complex procedure that can generate valid imputations by taking into account interaction terms. More about this procedure can be found in the technical paper of Bartlett et al.( 2015). To apply the procedure we need the smcfcs function which is available in the smcfs package (Bartlett, 2015). To get pooled analysis results we also have to install the mitoolspackage.

We are currently working on this Chapter.

### References

Van Buuren, S. 2018. Flexible Imputation of Missing Data. Second Edition. Boca Raton, FL: Chapman & Hall/CRC.

Enders, Craig K. 2010. Applied Missing Data Analysis. Guilford Press.

Eekhout, I., M.A. van de Wiel, and M. W. Heymans. 2017. “Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis.” BMC Med Res Methodol 17 (1): 129.

Meng, X. L., and D. B. Rubin. 1992. “Performing Likelihood Ratio Tests with Multiply-Imputed Data Sets.” Biometrika 79 (1): 103–11.

Mistler, S.A. 2013. “A Sas Macro for Computing Pooled Likelihood Ratio Tests with Multiply Imputed Data.” Proceedings of the SAS Global Forum 2013, San Francisco, California: Contributed Paper (Statistics and Data Analysis), no. 438.

Marshall, Andrea, Douglas G Altman, Roger L Holder, and Patrick Royston. 2009. “Combining Estimates of Interest in Prognostic Modelling Studies After Multiple Imputation: Current Practice and Guidelines.” BMC Medical Research Methodology 9: 57.

Moons, K. G., D. G. Altman, J. B. Reitsma, J. P. Ioannidis, P. Macaskill, E. W. Steyerberg, A. J. Vickers, D. F. Ransohoff, and G. S. Collins. 2015. “Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration.” Ann. Intern. Med. 162 (1): 1–73.

Wood, A. M., I. R. White, and P. Royston. 2008. “How Should Variable Selection Be Performed with Multiply Imputed Data?” Statistics in Medicine 27 (17): 3227–46.

Bartlett, J. W., S. R. Seaman, I. R. White, and J. R. Carpenter. 2015. “Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model.” Stat Methods Med Res 24 (4): 462–87.