# Chapter 4 Multiple Imputation

In this chapter we discuss an advanced missing data handling method, Multiple Imputation (MI). With MI, each missing value is replaced by several different values, so that several different completed datasets are generated. The concept of MI is illustrated in Figure 4.1.

In the first step, the dataset with missing values (i.e. the incomplete dataset) is copied several times. In the second step, the missing values in each copy of the dataset are replaced with imputed values. Due to random variation, slightly different values are imputed in each copy, which results in multiple imputed datasets. In the third step, each imputed dataset is analyzed and the study results are pooled into the final study result. In this chapter, the first phase of multiple imputation, the imputation step, is the main topic. The analysis and pooling phases are discussed in the next chapter.
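The copy-and-fill idea behind the first two steps can be illustrated with a small, purely conceptual sketch in base R. This is not how the imputation models of this chapter work; the random draws here only mimic the between-copy variation:

```r
# Conceptual sketch of the first two MI steps: copy the incomplete
# variable m times and fill the holes with slightly different values.
set.seed(10)
y <- c(4, NA, 7, 5, NA, 6)   # a variable with two missing values
m <- 5                       # number of imputed datasets

completed <- lapply(1:m, function(i) {
  yi <- y
  # naive fill: random draws from the observed values (illustration only)
  yi[is.na(yi)] <- sample(yi[!is.na(yi)], sum(is.na(yi)), replace = TRUE)
  yi
})
length(completed)            # 5 completed copies, each without NAs
```

Each completed copy would then be analyzed separately, and the m results pooled (Chapter 5).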

## 4.1 Multivariate Imputation by Chained Equations

Multivariate imputation by chained equations (MICE) (Van Buuren 2018) is also known as Sequential Regression Imputation, Fully Conditional Specification or Gibbs sampling. In the MICE algorithm, a chain of regression equations is used to obtain imputations, which means that variables with missing data are imputed one by one. The regression models use information from all other variables in the model, i.e. they are conditional imputation models. To add sampling variability to the imputations, residual error is added when creating the imputed values. This residual error can either be added to the predicted values directly, which is essentially equivalent to repeating stochastic regression imputation over several imputation runs, or the residual variance can be added via the parameter estimates of the regression model, which is a Bayesian sampling method. The Bayesian method is the default in the mice package in R. The MICE procedure became available in SPSS in version 17.
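The chained-equations idea can be sketched in base R on a simulated toy dataset. This is an illustration only, not the mice implementation, and the variable names are made up; for brevity, residual error is added to the predictions directly (the stochastic variant), whereas the Bayesian variant would additionally draw the regression coefficients from their posterior:

```r
# Toy chained-equations loop: two incomplete variables imputed one by one.
set.seed(1)
n <- 50
pain  <- rnorm(n)
tampa <- 30 + 1.5 * pain + rnorm(n)
disab <- 5 + pain + 0.2 * tampa + rnorm(n)
mis_t <- sample(n, 10); mis_d <- sample(n, 8)   # introduce missing values
tampa[mis_t] <- NA; disab[mis_d] <- NA
dat <- data.frame(pain, tampa, disab)

# start: fill the missings with random draws from the observed values
dat$tampa[mis_t] <- sample(dat$tampa[-mis_t], length(mis_t), replace = TRUE)
dat$disab[mis_d] <- sample(dat$disab[-mis_d], length(mis_d), replace = TRUE)

for (iter in 1:10) {   # the "chain": re-impute each variable in turn
  # regress tampa on the other variables in the cases with observed tampa
  fit_t <- lm(tampa ~ pain + disab, data = dat[-mis_t, ])
  # stochastic variant: prediction plus a random residual
  dat$tampa[mis_t] <- predict(fit_t, newdata = dat[mis_t, ]) +
    rnorm(length(mis_t), 0, summary(fit_t)$sigma)
  # then the same for disab, using the freshly imputed tampa values
  fit_d <- lm(disab ~ pain + tampa, data = dat[-mis_d, ])
  dat$disab[mis_d] <- predict(fit_d, newdata = dat[mis_d, ]) +
    rnorm(length(mis_d), 0, summary(fit_d)$sigma)
}
```

Running this loop m times with different random draws would yield m imputed datasets.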

## 4.2 Multiple imputation in SPSS

The MI procedure in SPSS is based on the MICE algorithm that was developed in R. It is therefore no surprise that most options of the mice function in R are also available in SPSS. Note that before you start the MI procedure, it is important to set the measurement level of the variables with missing data in the Variable View window of your data. The measurement level determines the regression model that is used to estimate the missing values in a variable: for variables defined as scale, linear regression models are used; for categorical variables, logistic regression models are used.

We use as an example a dataset with 50 patients with low back pain. In these patients, information was measured about their Pain, Tampa scale, Disability and Radiation. The variables Tampa scale and Disability contain 26% and 18% missing values, respectively.

The multiple imputation procedure is started by navigating to

Analyze -> Multiple Imputation -> Impute Missing Data Values.

Then a window opens that consists of four tabs: a Variables, a Method, a Constraints and an Output tab. You have to visit these tabs to specify the imputation settings before you can start the imputation process by clicking the OK button.

The first window is the Variables tab (Figure 4.2). Here you can move the complete and incomplete variables that you want to include in the imputation model to the “Variables in Model” window. The variables are imputed sequentially in the order in which they are listed in the variables list; here these variables are Pain, Tampa scale, Disability and Radiation. In our example the Tampa scale variable is imputed before the Disability variable because the Tampa scale variable was listed first. Further, the number of imputed datasets can be defined in the “Imputations” box; we choose 5 here. Then you go to “Create a new Dataset” and choose a name for the dataset to which the imputed data values are saved, here called “LBP_Imp”. When you are finished, you visit the Method tab where you can define the imputation method.

In the Method tab (Figure 4.3) you choose the imputation algorithm. We choose “Custom” under Imputation Method and select Fully conditional specification (FCS). FCS is the Bayesian regression imputation method as explained in Chapter 3. You can also change the maximum number of iterations, which has a default setting of 10; it is recommended to increase that number to 50. Under Model type for scale (continuous) variables we choose Predictive Mean Matching (PMM) (see paragraph 4.8 for a more detailed explanation of PMM). PMM is the default procedure in the mice package to impute continuous variables. In SPSS the default is the linear regression procedure, and it is better to change it to PMM. Now visit the Constraints tab.

In the Constraints tab (Figure 4.4) the role of variables during the imputation process can be defined. For example, it is possible to restrict the range of imputed values of a scale variable when the Linear Regression model was chosen for scale variables in the Method tab. To obtain the current range of variable values you can click the “Scan” button. When the PMM method is selected in the Method tab, the Constraints tab can be skipped. You can also restrict the analysis to variables with less than a maximum percentage of missing values by selecting “Exclude variables with large amounts of missing data”. Finally, in the Output tab the generated output can be selected.

In the Output tab (Figure 4.5) descriptive statistics of the imputed variables can be requested by selecting the “Imputation model” and “Descriptive statistics for variables with imputed values” options. You can also request a dataset that contains the iteration history, which we name “Iter_Backpain”. This dataset contains the means and standard deviations of the imputed scale variables at each iteration. You can use these data to check for irregularities during imputation by making convergence plots, which are discussed in paragraph 4.5.

## 4.3 Random number generator

Before you start the multiple imputation procedure, it is possible to set the starting point of the random number generator in SPSS to a fixed value of 950 (in R we use a seed for this). In this way you are able to reproduce results exactly later on. It is also a good idea to store the multiple imputed datasets.

We set the random number generator in SPSS via

Transform -> Random Number Generators -> Set Starting point -> Fixed Value
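The R counterpart is set.seed, or the seed argument of the mice function, shown here with the same example value of 950:

```r
set.seed(950)   # fix R's random number generator for reproducibility

# mice also accepts the seed directly, so the imputations themselves
# can be reproduced exactly:
# imp <- mice(data, m = 5, maxit = 10, method = "pmm", seed = 950)
```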

## 4.4 The output for Multiple imputation in SPSS

### 4.4.1 The Imputed datasets

After multiple imputation, the multiple imputed datasets are stored in a new SPSS file, stacked on top of each other. A new variable called Imputation_ is added to the dataset and can be found in the first column. This Imputation_ variable is a nominal variable that separates the original dataset from the imputed datasets. This is also indicated in the lower right corner of the Data View and Variable View windows by the note “Split by imputation_”. You can compare the use of this variable with the Split File option in SPSS, where all analyses are done separately for the categories of the variable used to split the analyses. The difference is that with the Imputation_ variable you also obtain pooled estimates for the statistical analyses. When missing values are imputed with another software program and you read the imputed data into SPSS and add an Imputation_ variable yourself, the data are recognized by SPSS as multiple imputed data. The imputed values are marked yellow. By this marking SPSS recognizes the dataset as a (multiple) imputed dataset, which is important for further statistical analyses (see Chapter 5, paragraph 5.1).

Figure 4.7 shows an example of a multiple imputed dataset with imputed values marked yellow.

You can mark and unmark the imputed values by using the option “Mark Imputed Data” under the View menu in the Data View window (Figure 4.8).

View -> Mark Imputed data

This marking and unmarking can also be done in the Data View window via the button with yellow and white squares at the top right (Figure 4.9). If you click the button, a selection box appears with “Original data” selected, where you can easily move between the different imputed datasets.

### 4.4.2 Imputation history

The iteration history is stored in the Iter_Backpain dataset, as we defined in the Output tab. This dataset contains the means and standard deviations of the imputed values at each iteration. These values can be used to construct convergence plots, which are discussed in the next paragraph.

### 4.4.3 Output tables

Based on our settings, SPSS produces the following results in the output window. The Imputation Specifications table provides information on the imputation method used, the number of imputations, the model used for the scale variables, whether interactions were included in the imputation models, the setting for the maximum percentage of missing values, and the setting for the maximum number of parameters in the imputation model (Figure 4.11).

A second table, called Imputation Results, presents information about the imputation method, the fully conditional specification method settings, the variables that are and are not imputed, and the imputation sequence (Figure 4.12).

The Imputation Models table presents information about the imputation models used for the variables with missing data (Figure 4.13). Information is provided about the method of imputation (under the Type column), the effect estimates used to impute the missing values, and the number of missing and imputed values. For example, the Tampascale variable has 13 missing values, so m × 13 = 5 × 13 = 65 values were imputed.

The Descriptive Statistics tables display the descriptive information of the original, imputed and completed data of the Tampascale and Disability variables. In this way you can compare the completed data after MI with the original data.

## 4.5 Checking Convergence after Multiple imputation in SPSS

The dataset Iter_Backpain from the previous paragraph contains the means and standard deviations of the imputed values at each iteration and imputation round. This information is similar to the information in imp$chainMean in R. This dataset can be used to generate convergence plots, to check whether the imputed values show the expected variation between the iterations. The iteration history can be checked for the means and standard deviations separately. To obtain separate plots for these summary statistics, the Split File option in SPSS can be activated. After activating the Split File option, the Graphs menu in SPSS can be used to make the plots:

Graphs -> Chart Builder

Two windows open that can be used to build a chart: on the x-axis we put the iteration number variable and on the y-axis the variable for which we want to display the iteration history. The Imputation Number variable is dragged to “Set color” at the top right. As a result, two plots appear with the iteration history for each imputation run.

## 4.6 Multiple Imputation in R

In R, multiple imputation can be performed with the mice function from the mice package. As an example dataset to show MI in R we use the same dataset as in the previous paragraph, with 50 patients with low back pain. The variables Tampa scale and Disability contain missing values, and the Pain and Radiation variables are complete. The following default settings are used in the mice function to start MI: m=5, to generate 5 imputed datasets; maxit=10, to use 10 iterations for each imputed dataset; and method="pmm", to use predictive mean matching (see paragraph 4.8). For an elaborate explanation of all options within the mice function, see ?mice.
library(mice)
library(foreign)
data <- read.spss(file="Backpain50 MI missing.sav", to.data.frame=T)[, -1] # Read in dataset and exclude ID variable
## re-encoding from UTF-8

imp <- mice(data, m=5, maxit=10, method="pmm")
##
##  iter imp variable
##   1   1  Tampascale  Disability
##   1   2  Tampascale  Disability
##   1   3  Tampascale  Disability
##   1   4  Tampascale  Disability
##   1   5  Tampascale  Disability
##   2   1  Tampascale  Disability
##   2   2  Tampascale  Disability
##   2   3  Tampascale  Disability
##   2   4  Tampascale  Disability
##   2   5  Tampascale  Disability
##   3   1  Tampascale  Disability
##   3   2  Tampascale  Disability
##   3   3  Tampascale  Disability
##   3   4  Tampascale  Disability
##   3   5  Tampascale  Disability
##   4   1  Tampascale  Disability
##   4   2  Tampascale  Disability
##   4   3  Tampascale  Disability
##   4   4  Tampascale  Disability
##   4   5  Tampascale  Disability
##   5   1  Tampascale  Disability
##   5   2  Tampascale  Disability
##   5   3  Tampascale  Disability
##   5   4  Tampascale  Disability
##   5   5  Tampascale  Disability
##   6   1  Tampascale  Disability
##   6   2  Tampascale  Disability
##   6   3  Tampascale  Disability
##   6   4  Tampascale  Disability
##   6   5  Tampascale  Disability
##   7   1  Tampascale  Disability
##   7   2  Tampascale  Disability
##   7   3  Tampascale  Disability
##   7   4  Tampascale  Disability
##   7   5  Tampascale  Disability
##   8   1  Tampascale  Disability
##   8   2  Tampascale  Disability
##   8   3  Tampascale  Disability
##   8   4  Tampascale  Disability
##   8   5  Tampascale  Disability
##   9   1  Tampascale  Disability
##   9   2  Tampascale  Disability
##   9   3  Tampascale  Disability
##   9   4  Tampascale  Disability
##   9   5  Tampascale  Disability
##  10   1  Tampascale  Disability
##  10   2  Tampascale  Disability
##  10   3  Tampascale  Disability
##  10   4  Tampascale  Disability
##  10   5  Tampascale  Disability

By default, the mice function returns information about the iteration and imputation steps of the imputed variables under the columns named “iter”, “imp” and “variable”, respectively.
This information can be turned off by setting the mice function parameter printFlag = FALSE, which results in silent computation of the missing values. A summary of the imputation results can be obtained by calling the imp object.

imp
## Class: mids
## Number of multiple imputations:  5
## Imputation methods:
##       Pain Tampascale Disability  Radiation
##         ""      "pmm"      "pmm"         ""
## PredictorMatrix:
##            Pain Tampascale Disability Radiation
## Pain          0          1          1         1
## Tampascale    1          0          1         1
## Disability    1          1          0         1
## Radiation     1          1          1         0

The imp object returns information about the number of imputed datasets, the imputation methods for each variable, and the PredictorMatrix (used to customize the imputation model, see paragraph 4.6.2). The imputed datasets can be extracted with the complete function. The settings action = "long" and include = TRUE return a data.frame in which the imputed datasets are stacked under each other, with the original dataset (with missings) included on top.

complete(imp, action = "long", include = TRUE)
##     .imp .id Pain Tampascale Disability Radiation
## 1      0   1    9         45         20         1
## 2      0   2    6         NA         10         0
## 3      0   3    1         36          1         0
## 4      0   4    5         38         NA         0
## 5      0   5    6         44         14         1
## 6      0   6    7         NA         11         1
## 7      0   7    8         43         NA         0
## 8      0   8    6         43         11         1
## 9      0   9    2         NA         11         1
## 10     0  10    4         36         NA         0
## ...
## 51     1   1    9         45         20         1
## 52     1   2    6         43         10         0
## 53     1   3    1         36          1         0
## 54     1   4    5         38          6         0
## 55     1   5    6         44         14         1
## 56     1   6    7         40         11         1
## 57     1   7    8         43         11         0
## 58     1   8    6         43         11         1
## 59     1   9    2         31         11         1
## 60     1  10    4         36         12         0
## ...

The output continues in the same way for all five imputed datasets (300 rows in total: the 50 original rows followed by 5 × 50 imputed rows). In the imputed datasets two variables are added: an .id variable and an .imp variable, to distinguish the cases and the imputed datasets. To extract only the first imputed dataset, the setting action = 1 is needed in the complete function (see ?complete for more possibilities to extract the imputed datasets). The imputed datasets can further be used in mice to conduct pooled analyses, or they can be stored for further use.

### 4.6.1 The mice algorithm and iteration steps

During MI, each imputed dataset is generated after several iterations of the imputation algorithm.
The imputation algorithm consists of the chain of regression models that is used to impute the missing values. We explain how this works with the LBP data from the previous paragraph, with missing values in the Tampa scale and Disability variables.

**Iteration 0.** Per imputed dataset we start with iteration number 0. Values are randomly drawn from the observed values of the Tampa scale and Disability variables, and these are used to replace the missing values in these variables.

**Iteration 1.** At this step the Tampa scale values are set back to missing. Subsequently, a linear regression model is fitted on the available data (i.e. all subjects with observed Tampa scale values), using the Tampa scale as the dependent variable and the Pain, Disability and Radiation variables as independent variables. From this regression model the missing Tampa scale values are imputed. Note that for this regression model the imputed values for the Disability variable from the previous iteration step (iteration 0) are used. The Bayesian stochastic regression imputation method adds uncertainty to the imputed values via the error variance (residuals) and the regression coefficients. This regression model is defined as:

$$Tampa_{mis} = \beta_0 + \beta_1Pain + \beta_2Disability + \beta_3Radiation$$

The same procedure is repeated for the Disability variable. The Disability scores are first set back to missing; then the regression coefficients for the Pain, Tampa scale and Radiation variables are obtained from the subjects without missing Disability values. Note that now the freshly imputed values for the Tampa scale variable are used. The imputed values for Disability are estimated using (Bayesian) regression coefficients with additional error variance added via the residuals.
This regression model is defined as:

$$Disability_{mis} = \beta_0 + \beta_1Pain + \beta_2Tampa + \beta_3Radiation$$

**Iteration 2.** For iteration 2 the Tampa scale values are again set back to missing and new, updated regression coefficients for Pain, Disability and Radiation are obtained, using the imputed values for Disability from iteration 1. Accordingly, missing values are estimated from the regression model, again using Bayesian regression coefficients. The same holds for the Disability variable: the imputed values for the Disability variable are estimated from the regression model using the imputed values of the Tampa scale variable within the same iteration. This process is repeated in each following iteration, until the final iteration where the imputed values are used for the first imputed dataset. For the next imputed dataset, the entire process of iterations is repeated.

### 4.6.2 Customizing the Imputation model

In the mice imputation models, the variables Tampa scale and Disability are imputed with the help of the variables Pain and Radiation. The latter two variables are called auxiliary variables when they are not part of the main analysis model but help to impute the Tampa scale and Disability variables. By customizing the predictor matrix in the mice function, variables that are used to impute other variables can be switched on and off. To get information about the predictor matrix that was used in the mice function, use (note that the element name is case sensitive):

imp$predictorMatrix
##            Pain Tampascale Disability Radiation
## Pain          0          1          1         1
## Tampascale    1          0          1         1
## Disability    1          1          0         1
## Radiation     1          1          1         0

The predictor matrix is a matrix with the names of the variables in the dataset listed in the rows and the columns. The variables in the columns are used to impute the row variables. Accordingly, variables in the columns can be switched on or off to include them in, or exclude them from, the imputation model of the missing data in the row variable. In our example, the first and fourth rows contain only zeroes, because the Pain and Radiation variables do not have missing values and therefore do not need to be imputed. The variable in the second row, i.e. Tampa scale, contains missing values, and the 1's in this row mean that the column variables Pain, Disability and Radiation are included in its imputation model. For the Disability variable, the variables Pain, Tampa scale and Radiation are used. By default, all variables are included in the imputation model to predict missing values in other variables. The diagonal of the predictor matrix is always zero. The predictor matrix can be adapted when, for example, a variable that contains a high percentage of missing data should be excluded from the imputation model. For example, if we want to exclude the variable Disability from the imputation model of the Tampa scale variable, we can do the following:

pred <- imp$predictorMatrix
pred["Tampascale", "Disability"] <- 0
pred
imp <- mice(data, m=5, maxit=10, method="pmm", predictorMatrix = pred)

There are several guidelines that can be used to set the predictor matrix (L. M. Collins, Schafer, and Kam (2001); Van Buuren (2018); D. B. Rubin (1976)). To summarize:

1. Include all variables that are part of the analysis model, including the dependent (outcome) variable.
2. Include the variables in the imputation model in the same way as they appear in the analysis model (i.e. if interaction terms are in the analysis model, they also have to be included in the imputation model).
3. Include additional (auxiliary) variables that are related to the missingness or to the variables with missing values.

## 4.7 Output of the mice function

The mice function returns a mids (multiple imputed data set) object. In this object, imputation information is stored that can be extracted by typing imp$, followed by the type of information you want to obtain.

imp$m
## [1] 5

imp$nmis
##       Pain Tampascale Disability  Radiation
##          0         13          9          0
imp$seed
## [1] NA

imp$iteration
## [1] 10

The above objects contain the number of imputed datasets, the number of missing values in each variable, the specified seed value (NA here because we did not define one) and the number of iterations.

The original data can be found in:

imp$data
##    Pain Tampascale Disability Radiation
## 1     9         45         20         1
## 2     6         NA         10         0
## 3     1         36          1         0
## 4     5         38         NA         0
## 5     6         44         14         1
## 6     7         NA         11         1
## 7     8         43         NA         0
## 8     6         43         11         1
## 9     2         NA         11         1
## 10    4         36         NA         0
## 11    5         38         16         1
## 12    9         47         14         0
## 13    0         32          3         1
## 14    6         NA         12         0
## 15    3         34         13         0
## 16    6         42         NA         1
## 17    3         35         11         0
## 18    1         31          1         0
## 19    2         31          7         0
## 20    4         32          9         1
## 21    5         NA         13         0
## 22    5         39         12         0
## 23    4         34          8         1
## 24    8         47         13         1
## 25    5         NA          6         0
## 26    5         38         16         1
## 27    9         NA         23         1
## 28    3         36         NA         1
## 29    2         36          9         0
## 30    6         37         16         0
## 31   10         NA         21         1
## 32    4         37          8         0
## 33   10         42         20         1
## 34    2         37          3         0
## 35    6         NA         12         1
## 36    3         38          7         1
## 37    8         NA          8         0
## 38    3         38          6         1
## 39    3         39         NA         0
## 40    7         44         15         0
## 41    7         45         NA         0
## 42    6         40         12         1
## 43    7         40         16         1
## 44    1         NA          2         0
## 45    9         41         NA         0
## 46    5         41         17         0
## 47    6         43         11         0
## 48    3         39         NA         0
## 49    2         NA          6         1
## 50    8         NA         19         0

The imputed values for each variable in the imputed datasets can be found under:

imp$imp
## $Pain
## [1] 1 2 3 4 5
## <0 rows> (or 0-length row.names)
##
## $Tampascale
##     1  2  3  4  5
## 2  43 43 42 40 41
## 6  40 40 40 45 44
## 9  31 35 31 31 31
## 14 40 43 37 40 38
## 21 38 40 38 37 32
## 25 38 38 44 40 38
## 27 42 41 40 45 43
## 31 47 43 45 47 47
## 35 42 43 37 38 42
## 37 47 47 45 45 45
## 44 36 32 31 31 37
## 49 32 34 36 36 37
## 50 41 47 43 45 43
##
## $Disability
##     1  2  3  4  5
## 4   6 16 16  8  6
## 7  11 14 16 16 13
## 10 12 13 12 17 13
## 16 11 10  9 11 12
## 28 11 11 13 11  8
## 39  6  9 11  7  3
## 41 11 11 12  8 14
## 45 23 21 19 23 16
## 48  9  7  6  6  9
##
## $Radiation
## [1] 1 2 3 4 5
## <0 rows> (or 0-length row.names)

The imputation methods used:

imp$method
##       Pain Tampascale Disability  Radiation
##         ""      "pmm"      "pmm"         ""

The predictor matrix:

imp$predictorMatrix
##            Pain Tampascale Disability Radiation
## Pain          0          1          1         1
## Tampascale    1          0          1         1
## Disability    1          1          0         1
## Radiation     1          1          1         0

The sequence of the variables used in the imputation procedure:

imp$visitSequence
## [1] "Pain"       "Tampascale" "Disability" "Radiation"

### 4.7.1 Checking Convergence in R

The convergence of the imputation procedure can be evaluated. The means of the imputed values at each iteration can be extracted as chainMean.

imp$chainMean
## , , Chain 1
##
##                   1        2        3        4         5        6        7
## Pain            NaN      NaN      NaN      NaN       NaN      NaN      NaN
## Tampascale 42.15385 40.53846 40.84615 41.46154 40.538462 40.69231 40.07692
## Disability 13.22222 11.77778 11.11111 11.66667  9.777778 11.44444 11.33333
## Radiation       NaN      NaN      NaN      NaN       NaN      NaN      NaN
##                   8        9       10
## Pain            NaN      NaN      NaN
## Tampascale 41.07692 40.61538 39.76923
## Disability 10.33333 10.33333 11.11111
##
## , , Chain 2
##
##                   1        2        3        4        5        6        7
## Pain            NaN      NaN      NaN      NaN      NaN      NaN      NaN
## Tampascale 39.07692 40.23077 40.00000 41.07692 40.38462 42.00000 40.30769
## Disability 11.44444 13.66667 11.22222 11.11111 13.44444 12.11111 11.00000
## Radiation       NaN      NaN      NaN      NaN      NaN      NaN      NaN
##                   8        9       10
## Pain            NaN      NaN      NaN
## Tampascale 40.00000 41.00000 40.46154
## Disability 10.33333 10.88889 12.44444
##
## , , Chain 3
##
##                   1        2        3        4        5        6         7
## Pain            NaN      NaN      NaN      NaN      NaN      NaN       NaN
## Tampascale 39.00000 40.69231 40.76923 40.07692 41.07692 40.53846 41.461538
## Disability 13.44444 11.88889 11.77778 11.77778 10.55556 13.33333  9.333333
## Radiation       NaN      NaN      NaN      NaN      NaN      NaN       NaN
##                   8        9       10
## Pain            NaN      NaN      NaN
## Tampascale 41.69231 40.38462 39.15385
## Disability 10.00000 12.33333 12.66667
##
## , , Chain 4
##
##                   1        2        3        4        5        6        7
## Pain            NaN      NaN      NaN      NaN      NaN      NaN      NaN
## Tampascale 42.38462 40.92308 38.76923 39.84615 39.69231 40.76923 39.53846
## Disability 12.88889 12.00000 11.44444 12.00000 12.88889 13.22222 10.33333
## Radiation       NaN      NaN      NaN      NaN      NaN      NaN      NaN
##                   8        9       10
## Pain            NaN      NaN      NaN
## Tampascale 39.92308 39.38462 40.00000
## Disability 10.77778 12.44444 11.88889
## Radiation       NaN      NaN      NaN
##
## , , Chain 5
##
##                    1         2        3         4        5         6
## Pain             NaN       NaN      NaN       NaN      NaN       NaN
## Tampascale 40.461538 40.153846 40.76923 39.461538 40.38462 39.461538
## Disability  9.444444  9.222222 10.66667  9.444444 11.44444  9.555556
## Radiation        NaN       NaN      NaN       NaN      NaN       NaN
##                   7        8         9       10
## Pain            NaN      NaN       NaN      NaN
## Tampascale 40.07692 40.30769 39.846154 39.84615
## Disability 13.22222 12.44444  9.333333 10.44444
## Radiation       NaN      NaN       NaN      NaN

The number of chains is equal to the number of imputed datasets. A chain refers to the chain of regression models that is used to generate the imputed values. The length of each chain is equal to the number of iterations.

Convergence can be visualised by plotting the chain means in a convergence plot. For our example, the convergence plots are generated by the call below. In these plots you see that the variance between the imputation chains is almost equal to the variance within the chains, which indicates healthy convergence.

plot(imp)
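The "between versus within" comparison behind such a plot can be made concrete with a small sketch (plain Python on illustrative numbers, not output from mice): we compute the variance of the chain averages (between-chain) and the average variance within each chain, and check that the two are of similar size.

```python
import numpy as np

# Hypothetical chain means: 5 chains x 10 iterations of the Tampascale mean.
# The numbers mimic the output above but are simulated, not taken from mice.
rng = np.random.default_rng(1)
chain_means = 40 + rng.normal(scale=0.8, size=(5, 10))

# Average variance within each chain (over iterations)
within = chain_means.var(axis=1, ddof=1).mean()
# Variance between the chain averages
between = chain_means.mean(axis=1).var(ddof=1)

# Healthy convergence: the between-chain variance is not clearly
# larger than the within-chain variance
ratio = between / within
print(ratio < 2)
```

When the chains have not yet mixed, the chain averages drift apart and this ratio grows well above 1, which is exactly the pattern one looks for in the mice convergence plot.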

### 4.7.2 Imputation diagnostics in R

It can also be of interest to compare the imputed values with the observed values. For that, the stripplot function in mice can be used. This function visualises the observed and imputed values in one plot. By comparing the observed and imputed data points we get an idea of whether the imputed values are within the range of the observed data. If there are no large differences between the imputed and observed values, then we can conclude that the imputed values are plausible.

stripplot(imp)

## 4.8 Predictive Mean Matching or Regression imputation

Within the mice algorithm, continuous variables can be imputed by two methods: linear regression imputation or Predictive Mean Matching (PMM). PMM is an imputation method that first predicts values and then selects observed values to replace the missing values. We recommend using PMM during imputation; it is the default imputation procedure in the mice package (D. B. Rubin 1987). In SPSS, the default imputation procedure is linear regression.

### 4.8.1 Predictive Mean Matching, how does it work?

The Predictive Mean Matching algorithm takes place in several steps:

We take as an example a dataset with 10 cases and 3 missing values in the Tampa scale variable. These are shown as NA in the dataset below. The Pain variable is used to predict the missing Tampa scale values.

##    ID Pain Tampascale
## 1   1    5         40
## 2   2    6         NA
## 3   3    1         41
## 4   4    5         42
## 5   5    6         44
## 6   6    7         NA
## 7   7    8         43
## 8   8    6         40
## 9   9    2         NA
## 10 10    6         38

Step 1: Estimate a linear regression model

A linear regression model is estimated with the Tampa scale variable as the outcome and the Pain variable as the predictor. We define the regression coefficient for Pain as $$\hat{\beta}_{Pain}$$.

Step 2: Determine the Bayesian version of the regression coefficient

A Bayesian regression coefficient for the Pain variable is determined. We define this regression coefficient as $$\beta_{Pain}^*$$.

Step 3: Predict missing values

Observed Tampa scale values are predicted from the Pain regression coefficient $$\hat{\beta}_{Pain}$$ of step 1 and the Pain data. We call these values $$Tampa_{Obs}$$; they can be found in the table below.


Missing Tampa scale values are predicted from the regression coefficient $$\beta_{Pain}^*$$ of step 2 and the Pain data. We call these values $$Tampa_{Pred}$$.

These values for the three missing Tampa scale cases are: 43.594, 41.456 and 39.852.

Step 4: Find closest donor

Find the closest donor for the first missing value by taking the difference between the first $$Tampa_{Pred}$$ value of 43.594 and each predicted observed value in the $$Tampa_{Obs}$$ column. These differences are shown in the column Difference in the table below.

| ID | Pain | Tampascale | Tampa_Obs | Difference |
|---:|-----:|-----------:|----------:|-----------:|
|  1 |    5 |         40 |    42.020 |      1.574 |
|  2 |    6 |         NA |        NA |         NA |
|  3 |    1 |         41 |    41.624 |      1.970 |
|  4 |    5 |         42 |    41.426 |      2.168 |
|  5 |    6 |         44 |    41.228 |      2.366 |
|  6 |    7 |         NA |        NA |         NA |
|  7 |    8 |         43 |    40.832 |      2.762 |
|  8 |    6 |         40 |    40.634 |      2.960 |
|  9 |    2 |         NA |        NA |         NA |
| 10 |    6 |         38 |    40.238 |      3.356 |

The smallest differences are 1.574, 1.970 and 2.168, and these belong to the cases with observed Tampa scale values of 40, 41 and 42, respectively. Subsequently, a value is randomly drawn from these observed values and used to impute the first missing Tampa scale value. The other missing values are imputed by following the same procedure, i.e. now taking the difference between the second $$Tampa_{Pred}$$ value of 41.456 and all predicted observed values and finding the closest matches.
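The four steps can be sketched in a few lines (a minimal Python illustration of the idea, not the mice implementation; the Bayesian draw of step 2 is approximated here by simply perturbing the least-squares coefficient, and a donor pool of 3 is assumed):

```python
import random
import statistics

# Toy data from the example: Pain predicts Tampascale; None marks missing values.
pain  = [5, 6, 1, 5, 6, 7, 8, 6, 2, 6]
tampa = [40, None, 41, 42, 44, None, 43, 40, None, 38]

obs = [(p, t) for p, t in zip(pain, tampa) if t is not None]
xs, ys = zip(*obs)

# Step 1: least-squares fit of Tampascale on Pain gives beta_hat
mx, my = statistics.mean(xs), statistics.mean(ys)
beta = sum((x - mx) * (y - my) for x, y in obs) / sum((x - mx) ** 2 for x in xs)
alpha = my - beta * mx

# Step 2: a Bayesian draw beta* is approximated here by perturbing beta_hat
random.seed(1)
beta_star = beta + random.gauss(0, 0.1)
alpha_star = my - beta_star * mx

# Step 3: predict observed cases with beta_hat (Tampa_Obs) and
# missing cases with beta* (Tampa_Pred)
pred_obs = {i: alpha + beta * p
            for i, (p, t) in enumerate(zip(pain, tampa)) if t is not None}
pred_mis = {i: alpha_star + beta_star * p
            for i, (p, t) in enumerate(zip(pain, tampa)) if t is None}

# Step 4: for each missing case, find the 3 closest predicted observed values
# and randomly draw one of their *observed* Tampascale values as the imputation
imputed = {}
for i, pm in pred_mis.items():
    donors = sorted(pred_obs, key=lambda j: abs(pred_obs[j] - pm))[:3]
    imputed[i] = tampa[random.choice(donors)]
```

Because each imputed value is drawn from the observed donors, every imputation is a Tampa scale score that actually occurs in the data, which is the defining property of PMM.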

The strength of PMM is that missing data are replaced by values that are actually observed in the dataset, rather than by unrealistic values (such as negative Tampa scale scores). PMM is therefore better able to handle the imputation of variables with skewed distributions or non-linear relationships between variables.

## 4.9 Number of Imputed datasets and iterations

Researchers often assume that 3-5 imputations are sufficient to generate valid imputations. This idea was based on the work of Rubin (D. B. Rubin (1987)). He showed that the precision of a pooled parameter becomes lower when a finite number of multiply imputed datasets is used compared to an infinite number (finite means a limited number of imputed datasets, such as 5, and infinite means unlimited, recognizable by the mathematical symbol ∞). The precision of a parameter, for example a regression coefficient, is often represented by its sampling variance (or standard error (SE) estimate; the sampling variance is equal to $$SE^2$$). With multiple imputed datasets, precision is determined by the pooled sampling variance or pooled SE. A measure of this loss of precision (i.e. of the pooled sampling variance estimated from a finite compared to an infinite number of imputed datasets) is the relative efficiency ($$RE$$). The $$RE$$ approaches 1 when the number of imputations is high (and the loss of precision becomes smaller) and is defined as:

$RE= \frac{1}{1+ \frac{FMI}{m}}$

Here FMI is the fraction of missing information and m is the number of imputed datasets. In the simplest case of one variable with missing data, the FMI is roughly equal to the percentage of missing data. When there are more variables in the imputation model, and these variables are correlated with the variables with missing data, the FMI becomes lower.

The relationship between the $$RE$$ and the pooled sampling variance $$T_{Pooled}$$ can be written as (Van Buuren (2018)): $T_{Pooled,infinite}=RE×T_{Pooled,finite}$ which is equal to: $SE_{Pooled,infinite}^2=RE×SE_{Pooled,finite}^2$

This can be interpreted as follows: for FMI=0.4 and m=5 the $$RE$$ is 0.93, so $$T_{Pooled,finite}$$ is:

$T_{Pooled,finite}=T_{Pooled,infinite}/0.93$

Accordingly, when 5 imputed datasets are used, the standard error $$SE$$ is $$1/\sqrt{0.93}=1.04$$ times as large (i.e. 4% larger) than the $$SE$$ when an infinite number of imputed datasets is used. Graham (Graham, Olchowski, and Gilreath (2007)) also studied the loss of power when a limited number of imputed datasets is used. They recommended that at least 20 imputed datasets are needed to restrict the loss of power when testing a relationship between variables. Bodner (Bodner (2008)) proposed guidelines after a simulation study that used different values of the FMI to determine the number of imputed datasets. For FMIs of 0.05, 0.1, 0.2, 0.3 and 0.5, the following numbers of imputed datasets are needed: ≥3, 6, 12, 24 and 59, respectively. Following the study of Bodner (Bodner (2008)), White et al. (I. R. White, Royston, and Wood (2011)) proposed a rule of thumb, based on the idea that the FMI is frequently lower than the percentage of missing cases: the number of imputed datasets should be at least equal to the percentage of incomplete cases. This means that when 10% of the subjects have missing values, at least 10 imputed datasets should be generated.
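The arithmetic in this section can be verified with a few lines (a plain Python sketch; `relative_efficiency` is our own helper, not a mice function):

```python
import math

def relative_efficiency(fmi, m):
    """RE of m imputed datasets relative to infinitely many: 1 / (1 + FMI/m)."""
    return 1 / (1 + fmi / m)

re = relative_efficiency(fmi=0.4, m=5)   # 1 / 1.08 ≈ 0.926, i.e. about 0.93
se_inflation = 1 / math.sqrt(re)         # SE with m=5 is about 4% larger
print(round(re, 2), round(se_inflation, 2))  # 0.93 1.04
```

With Bodner's m = 59 for FMI = 0.5, for example, `relative_efficiency(0.5, 59)` climbs to about 0.99, so almost no precision is lost.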

**Iterations** Van Buuren (Van Buuren (2018)) states that the number of iterations may depend on the correlation between variables and the percentage of missing data in variables. He proposed that 5-20 iterations are enough to reach convergence. This number may be increased when the percentage of missing data is high. Nowadays computers are fast, so a higher number of iterations can easily be used.

### References

Bodner, T. E. 2008. “What Improves with Increased Missing Data Imputations?” Structural Equation Modeling 15 (4): 651–75.

Collins, L. M., J. L. Schafer, and C. M. Kam. 2001. “A Comparison of Inclusive and Restrictive Strategies in Modern Missing Data Procedures.” Psychological Methods 6 (3): 330–51.

Graham, J. W., A. E. Olchowski, and T. D. Gilreath. 2007. “How Many Imputations Are Really Needed? Some Practical Clarifications of Multiple Imputation Theory.” Prevention Science 8 (3): 206–13.

Rubin, D. B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–90.

Rubin, D. B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Van Buuren, S. 2018. Flexible Imputation of Missing Data. Second Edition. Boca Raton, FL: Chapman & Hall/CRC.

White, I. R., P. Royston, and A. M. Wood. 2011. “Multiple Imputation Using Chained Equations: Issues and Guidance for Practice.” Statistics in Medicine 30 (4): 377–99.