# Chapter 7 Multiple Imputation models for Multilevel data

## 7.1 Advanced Multiple Imputation models for Multilevel data

In this Chapter, we apply more advanced imputation models. By “advanced” we mean multiple imputation models for multilevel data, which are analyzed with multilevel models, also called mixed models. We start with a brief introduction to multilevel analyses and the structure of multilevel data, followed by a conceptual description of different methods of multilevel imputation. Subsequently, we briefly discuss the basic principles of mixed models and the levels of data that are used in these models. After that, we discuss the levels of missing data that you can encounter in a multilevel dataset and show some examples of how to apply multilevel imputation models.

## 7.2 Characteristics of Multilevel data

Multilevel data are also known as clustered data: the collected observations are clustered into groups. Examples are observations of patients within the same hospitals or of students within the same schools. We say that these data are clustered (or correlated) because assessments of patients or students within the same hospital or school (the cluster) are more alike than assessments of patients or students from different hospitals or schools (Twisk 2006). The data are called multilevel because they are assessed at different levels. For example, students are assessed individually, but data can also be assessed at the level of the school when we are interested in the school type, i.e. private or public. We then say that the data are assessed at two levels: the school level (highest level, or level 2) and the student level (lowest level, or level 1).

Another example of multilevel data is data that are repeatedly assessed within the same person over time, for example when blood parameters or variables such as body weight are measured repeatedly within the same individuals. Here the clusters are the individuals. This type of data is also called longitudinal data. Again, assessments within the same individual may be more alike than assessments between individuals. These data are also assessed at two levels: the individuals are now the highest level (level 2) and the time measurements are the lowest level (level 1). The different types of multilevel data are displayed graphically in Figure 7.1a and b. Multilevel data may also be assessed at more than two levels, e.g. students within classes within schools, or patients within hospitals within regions.
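
The idea that observations within a cluster are more alike than observations from different clusters can be illustrated with simulated data. The sketch below is not based on the AGGO data; the number of schools, the number of students and all standard deviations are assumptions chosen only for illustration.

```r
# Simulated two-level data: 20 hypothetical schools with 30 students each.
set.seed(123)
n_schools  <- 20
n_students <- 30
school <- rep(seq_len(n_schools), each = n_students)

# Each school has its own mean (level 2); students scatter around it (level 1).
school_effect <- rnorm(n_schools, mean = 0, sd = 2)
score <- 50 + school_effect[school] +
  rnorm(n_schools * n_students, mean = 0, sd = 1)

# Average spread of scores within schools versus spread over all students.
sd_within  <- mean(tapply(score, school, sd))
sd_overall <- sd(score)
c(within = sd_within, overall = sd_overall)
```

Because part of the total variation is due to differences between schools, the spread within schools is smaller than the spread over all students. This is what is meant by clustered (correlated) data.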

## 7.3 Multilevel data - from wide to long

In this Chapter we use an example dataset from the Amsterdam Growth and Health Longitudinal Study (AGGO). In this study, persons were repeatedly assessed over time, and growth, health and lifestyle factors were measured. Assessments are available for Gender, Fitness, Smoking, Hypercholestrolemia, Cholesterol level and Sum of Skinfolds. The dataset contains information on 147 persons who were assessed six times: once at baseline and at five follow-up measurement occasions. Usually, a dataset contains one row per subject, with the separate variables in the columns. When subjects are repeatedly assessed, additional variables (columns) are added for each new assessment. This is called a wide data format (Figure 7.2).

To apply multilevel analyses, we need the data in long format, so we first have to convert the dataset from wide to long; this is explained in paragraphs 7.6 (SPSS) and 7.6.1 (R). An example of a long dataset is presented in Figure 7.3. In long format, the repeated assessments within a subject are stacked under each other: each subject has multiple rows, one for each repeated measurement. The variable that separates the clusters is called the ID variable, and the variable that distinguishes the measurements at different time points is the Time variable.
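
To make the difference between the two formats concrete, the sketch below builds a very small wide data frame and the corresponding long data frame by hand. It uses only the first three Cholesterol measurements of the first two persons in the AGGO file; the object names `wide` and `long` are chosen for illustration.

```r
# Wide format: one row per person, one column per measurement occasion.
wide <- data.frame(
  ID           = c(1, 2),
  Cholesterol1 = c(4.2, 4.4),
  Cholesterol2 = c(3.9, 4.2),
  Cholesterol3 = c(3.9, 4.6)
)

# Long format: one row per person per time point.
# t() puts the occasions in rows, so c() reads the values person by person.
long <- data.frame(
  ID          = rep(wide$ID, each = 3),
  Time        = rep(1:3, times = 2),
  Cholesterol = c(t(as.matrix(wide[, -1])))
)
long

# Number of rows (repeated measurements) per person.
table(long$ID)
```

The ID variable identifies the cluster (the person) and the Time variable identifies the measurement within the cluster.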

## 7.5 Multilevel data - Clusters and Levels

In the previous paragraph we have seen an example of a long dataset that is needed for multilevel analyses. We can organize this kind of multilevel information by level of assessment in the following way (Figure 7.4):

1. Level 1 outcome variable: This is, for example, the Cholesterol variable that is repeatedly assessed over time within persons. Other examples are the math test scores of individual students in a class or their Intelligence Quotient (IQ) scores. In other words, Level 1 outcome information varies within a cluster; in longitudinal data, its value changes over time (i.e. it does not have a fixed value).

2. Level 1 independent variables: These are variables that also vary within a cluster, but are used as independent variables. Examples are the Time or Sum of Skinfolds measurements that are repeatedly assessed within persons over time, the hours that students in a class spend on their homework each week, or the level of education of their parents.

3. Level 2 independent variables: These variables do not vary within a cluster but only between clusters. Examples are the Gender and Fitness variables, which are only assessed at the start of the study, or, in the case of schools, whether the school is private or public.

4. The cluster variable: This is a special variable that distinguishes the clusters. Examples are the school identification number, which defines the blocks of measurements, or the identification number that distinguishes individuals with repeated information over time.
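
In a long dataset, the level of a variable can be checked empirically: a Level 1 variable takes more than one value within at least one cluster, whereas a Level 2 variable is constant within every cluster. The sketch below uses a small toy dataset with made-up values, and `varies_within()` is a hypothetical helper written for this illustration, not a function from an existing package. It does not treat missing values specially.

```r
# Toy long dataset (hypothetical values): 3 persons, 2 time points each.
long <- data.frame(
  ID          = rep(1:3, each = 2),
  Time        = rep(1:2, times = 3),
  Cholesterol = c(4.2, 3.9, 4.4, 4.2, 3.7, 4.0),
  Gender      = rep(c(1, 1, 2), each = 2)
)

# Hypothetical helper: TRUE if x takes more than one value within at least
# one cluster (Level 1), FALSE if x is constant within every cluster (Level 2).
varies_within <- function(x, cluster) {
  any(tapply(x, cluster, function(v) length(unique(v)) > 1))
}

# Apply the check to all variables except the cluster variable itself.
vars   <- setdiff(names(long), "ID")
level1 <- sapply(long[vars], varies_within, cluster = long$ID)
level1
```

In this toy dataset, Time and Cholesterol are flagged as varying within persons (Level 1), and Gender is flagged as constant within persons (Level 2).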

In the next paragraph the type of statistical model is defined that is used to analyze multilevel data.

## 7.6 Restructuring datasets from wide to long in SPSS

We start with the dataset in wide format, which is presented in Figure 7.5. In this dataset, the Cholesterol and Sum of Skinfolds variables are repeatedly assessed over time, and each assessment is stored in a separate column. The repeated assessments are distinguished by the numbers at the end of the variable names: the number 1 indicates the first measurement, the number 2 the second, etc. The Gender and Smoking variables are not repeatedly assessed over time. Each row represents a separate case.

Now click Data > Restructure and follow the next steps:

Step 1 A new window opens with three options (Figure 7.6). The default is Restructure selected variables into cases, which is exactly what we want when restructuring a data file from wide to long.

Step 2 Click Next and a new window opens (Figure 7.7). In this window, you choose how many variables you wish to restructure. This is the number of time-varying (level 1) variables that we wish to examine. In our example dataset, we have two such variables: Cholesterol and Sum of Skinfolds. Therefore, we click the option More than one and type 2.

Step 3 We click Next and a new window opens (Figure 7.8). In this window, we first define which variable should be used as the case group identifier. By default, SPSS creates a new variable for this named Id. You can also use the ID variable in your dataset (you usually have such a variable): click the arrow next to the Use case number option, select Use selected variable, and then move the ID variable in your dataset (here ID) to this pane. Subsequently, under Variables to be transposed, we define which variables in our wide dataset should be restructured into one new long variable. In our case this concerns two new variables. We rename trans1 to Cholesterol and select the six Cholesterol variables by holding down the Ctrl or Shift key. Next, we move these variables to the pane on the right and continue with the second variable (Figure 7.9). Now we rename trans2 to SumSkinfolds and repeat the procedure for the Sum of Skinfolds variables.

Step 4 We click Next and a new window opens (Figure 7.9). In this window, we can create so-called index variables. In longitudinal data analyses the index variable refers to the time points, so we want only one index variable. This is the default, so we can click Next again.

Step 5 A new window opens again (Figure 7.10). This window allows us to create the index (time) variable. The default is to use sequential numbers, which we also choose. In case of unequally spaced time points, you can redefine these numbers later in the long file with the Compute command in SPSS. We can change the name index1 by double-clicking on it; rename it to “Time”. In the same way, we can define a label for this variable.

Step 6 Click Next and a new window opens again (Figure 7.11). Here the only important thing is that we should choose what to do with the other variables in our dataset. We can either Drop them, meaning that we will not be able to use them in the subsequent analyses, or Keep them and treat them as fixed (time independent). In this case we choose this latter option.

Click on Next and the last window will open (Figure 7.12).

This is the final step. Click Finish (if we wish to paste the syntax, we can choose that option here). Be aware that the converted dataset now replaces your original dataset (Figure 7.13). To keep both datasets, use Save As in the File menu and choose another file name for the converted file.

Figure 7.14. Converted wide to long dataset.

### 7.6.1 Restructuring a dataset from wide to long in R

To convert a dataset from wide to long in R, you can use the reshape function. Before converting a wide dataset, it is a good idea to rearrange the columns so that the repeated measurements of each variable are placed next to each other, in the order of their names. It is then easier to apply the reshape function. The R code below shows an example in which all Cholesterol variables are neatly ordered.

library(foreign)  # provides read.spss()
dataset <- read.spss(file = "AGGO_wide.sav", to.data.frame = TRUE)
## re-encoding from UTF-8
head(dataset, 10)
##    ID Gender  Fitness Smoking Cholesterol1 Cholesterol2 Cholesterol3
## 1   1      1 2.151339       0          4.2          3.9          3.9
## 2   2      1 2.119814       0          4.4          4.2          4.6
## 3   3      1 2.472159       0          3.7          4.0          3.3
## 4   4      1 2.205836       0          4.3          4.1          3.8
## 5   5      1 2.393320       0          4.2          4.1          4.1
## 6   6      1 2.523713       0          4.1          3.8          3.6
## 7   7      1 2.260524       0          4.2          3.9          3.7
## 8   8      2 1.915354       0          3.2          3.7          3.6
## 9   9      2 2.278520       0          5.1          4.6          3.9
## 10 10      2 2.002636       1          5.7          5.4          5.3
##    Cholesterol4 Cholesterol5 Cholesterol6 SumSkinfolds1 SumSkinfolds2
## 1           3.6         3.92         4.13          2.51          2.10
## 2           4.1         5.18         5.50          2.48          2.34
## 3           3.6         3.48         3.85          2.17          2.39
## 4           3.4         4.35         4.06          2.58          2.68
## 5           4.5         3.81         4.52          2.33          2.02
## 6           3.7         3.57         4.24          2.05          2.09
## 7           3.6         4.32         5.08          2.42          2.93
## 8           3.2         3.82         4.42          3.44          3.78
## 9           4.5         4.22         5.27          2.43          2.72
## 10          4.9         5.24         5.34          4.12          4.83
##    SumSkinfolds3 SumSkinfolds4 SumSkinfolds5 SumSkinfolds6
## 1           2.16          2.26          2.33          2.56
## 2           2.45          2.57          3.97          5.08
## 3           2.44          2.42          2.80          4.48
## 4           2.84          3.00          2.85          2.77
## 5           2.00          2.27          1.94          2.16
## 6           2.25          2.24          2.66          2.73
## 7           3.14          3.35          3.90          4.01
## 8           4.78          4.86          5.49          6.25
## 9           3.06          3.72          5.28          5.11
## 10          6.21          6.35          5.99          3.70
##    Hypercholestrolemia
## 1                    0
## 2                    0
## 3                    0
## 4                    0
## 5                    0
## 6                    0
## 7                    0
## 8                    0
## 9                    1
## 10                   1

Now it is easy to convert the dataset with the following code. The result is stored in the object dataset_long:

# Reshape wide to long: columns 5:10 hold Cholesterol1-6, columns 11:16 hold SumSkinfolds1-6
dataset_long <- reshape(dataset, idvar = "ID",
                        varying = list(5:10, 11:16),
                        v.names = c("Cholesterol", "SumSkinfolds"),
                        timevar = "Time", direction = "long")
# dataset_long
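
Selecting the repeated measurements by column position (5:10 and 11:16) works only as long as the column order stays the same; if columns are added or moved, the wrong columns can be stacked without any warning. As a possible alternative, the sketch below selects the columns by name. It uses a toy dataset with the first two persons and the first three occasions from the output above; the regular expressions assume that the repeated variables are named as in the AGGO file (a variable name followed by the occasion number).

```r
# Toy wide dataset: two persons, three occasions (values from the AGGO output).
wide <- data.frame(
  ID = c(1, 2),
  Gender = c(1, 1),
  Cholesterol1  = c(4.2, 4.4),
  Cholesterol2  = c(3.9, 4.2),
  Cholesterol3  = c(3.9, 4.6),
  SumSkinfolds1 = c(2.51, 2.48),
  SumSkinfolds2 = c(2.10, 2.34),
  SumSkinfolds3 = c(2.16, 2.45)
)

# Select the repeated measurements by name pattern instead of by position.
chol_cols <- grep("^Cholesterol[0-9]+$", names(wide), value = TRUE)
skin_cols <- grep("^SumSkinfolds[0-9]+$", names(wide), value = TRUE)

# varying takes one vector of column names per long variable,
# in the same order as v.names.
long <- reshape(wide, idvar = "ID",
                varying = list(chol_cols, skin_cols),
                v.names = c("Cholesterol", "SumSkinfolds"),
                timevar = "Time", direction = "long")
long
```

The Gender variable is not listed in varying, so it is treated as fixed and repeated on every row of the same person.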

The long dataset is not yet ordered by ID and Time, because reshape stacks the data by time point. The rows can be sorted with the order function.

dataset_long <- dataset_long[order(dataset_long$ID, dataset_long$Time), ]
#dataset_long
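
After sorting, it is worth checking that the long dataset has the expected structure. The sketch below shows such checks on a toy dataset (the first two Cholesterol measurements of the first three persons); the object name `long` and the specific checks are suggestions, not part of the original analysis.

```r
# Toy long dataset in the stacked order that reshape() returns:
# all persons at time 1, followed by all persons at time 2.
long <- data.frame(
  ID          = c(1, 2, 3, 1, 2, 3),
  Time        = c(1, 1, 1, 2, 2, 2),
  Cholesterol = c(4.2, 4.4, 3.7, 3.9, 4.2, 4.0)
)

# Sort by person, then by time point within person.
long <- long[order(long$ID, long$Time), ]
rownames(long) <- NULL  # optional: reset the row names after sorting

# Check 1: every person has the same number of rows.
rows_per_id <- table(long$ID)
stopifnot(all(rows_per_id == 2))

# Check 2: no combination of ID and Time occurs more than once.
stopifnot(!any(duplicated(long[c("ID", "Time")])))
long
```

For the AGGO data, with 147 persons and six occasions, the same checks would expect six rows per person and 147 × 6 = 882 rows in total.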

Now that we have restructured the dataset, we turn to the question of how missing data in a multilevel dataset can be imputed.

## 7.7 Missing data at different levels

Missing data in multilevel studies can occur at the same levels at which variables are measured, as discussed above. In other words, missing data can occur at the level of:

1. The Level 1 outcome variable: Missing data are present in the Cholesterol variable, which is repeatedly assessed over time, or in the math scores of pupils within a class. Note that when mixed models are used and data are missing only in the outcome variable, imputation of missing values is not necessary. The full information maximum likelihood procedures that are used to estimate the parameters of a mixed model can be used to obtain estimates of regression coefficients and standard errors.

2. The Level 1 independent variables: Missing data occur in the independent variables that vary within a cluster. Examples are missing data in the Sum of Skinfolds variable or in the age or IQ scores of students within a class.

3. The Level 2 independent variables: Missing values are located in variables that have a constant value within a cluster, for example in the Fitness variable assessed at baseline or in the variable that indicates whether a school is private or public. Other examples are missing data in variables such as gender or educational level of persons who were repeatedly assessed over time.

4. The cluster variable: Missing data may be present in the cluster variable itself, for example if students did not fill in the name of their school or patients did not fill in the name of the hospital where they were treated.
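
Before choosing an imputation method, it helps to see at which level the missing values occur. The sketch below counts missing values in a toy long dataset; the values and the positions of the missing data are made up for illustration. For a Level 2 variable the missing values are counted per cluster rather than per row, because one missing value at Level 2 shows up on every row of that cluster.

```r
# Toy long dataset with missing values at different levels (hypothetical).
long <- data.frame(
  ID           = rep(1:3, each = 2),
  Time         = rep(1:2, times = 3),
  Cholesterol  = c(4.2, NA, 4.4, 4.2, 3.7, 4.0),       # Level 1 outcome
  SumSkinfolds = c(2.51, 2.10, NA, 2.34, 2.17, 2.39),  # Level 1 independent
  Fitness      = c(2.15, 2.15, 2.12, 2.12, NA, NA)     # Level 2 independent
)

# Number of missing values per variable, counted over rows.
n_missing <- colSums(is.na(long))
n_missing

# For a Level 2 variable, count the clusters in which the value is missing.
fitness_missing   <- tapply(long$Fitness, long$ID, function(v) all(is.na(v)))
n_clusters_missing <- sum(fitness_missing)
n_clusters_missing
```

In this toy dataset, Fitness is missing on two rows, but these rows belong to a single person, so only one Level 2 value is missing.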

We are currently working on this Chapter.

### References

Twisk, J. W. R. 2006. Applied Multilevel Analysis. Cambridge: CUP.