Chapter 13 Pooling Methods for Categorical Variables

13.1 The pooled sampling variance or D1 method

A combination of the pooled parameter estimates and the pooled sampling variances can be used to construct a test that resembles a multivariate Wald test (Marshall et al., 2009). This test pools the within- and between-imputation covariance matrices across the imputed datasets and corrects the total parameter covariance matrix of the multivariate Wald test by the average relative increase in variance, to account for the missing data.

The multivariate Wald statistic is calculated as (Enders, 2010; Marshall et al., 2009):

$\begin{equation} D_1 = \frac{ (\bar\theta - \theta_0)^{\top} V_T^{-1} (\bar\theta - \theta_0) } {k} \tag{13.1} \end{equation}$

where $\bar\theta$ and $\theta_0$ are the vector of pooled coefficients and the vector of their values under the null hypothesis (usually zero), $V_T$ is the total variance, and $k$ is the number of parameters. $V_T$ is:

$V_T = (1+r_{1})\bar V_W$

$r_1$ is the relative increase in variance due to nonresponse, which in this case is obtained by:

$\begin{equation} r_1 = \left(1+\frac{1}{m}\right)\mathrm{tr}(V_B\bar V_W^{-1})/k \tag{13.2} \end{equation}$

where $V_B$ is the between-imputation variance, $\bar V_W$ the within-imputation variance averaged over the imputed datasets, and $m$ the number of imputed datasets.

The p-value of the $D_1$ statistic is calculated by comparing its value to an F distribution with $k$ and $\nu_1$ degrees of freedom:

$p = \Pr[F_{k,\nu_1}>D_1]$

where the denominator degrees of freedom $\nu_1$ are:

$\begin{equation} \nu_1 = 4 + (t-4)[1+(1-2t^{-1})r_1^{-1}]^2 \tag{13.3} \end{equation}$

This equation is used when $t = k(m-1) > 4$; otherwise use:

$\begin{equation} \nu_1 = t(1+k^{-1})(1+r_1^{-1})^2/2 \tag{13.4} \end{equation}$
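As a minimal sketch of Equations 13.1 to 13.4 in base R, the function below computes the $D_1$ statistic from the coefficient vectors and covariance matrices obtained in each imputed dataset. The function name `pool_d1`, its arguments and the toy input values are illustrative assumptions and not part of any package. The between-imputation variance is assumed to be the sample covariance of the estimates across the imputed datasets.

```r
# Sketch of the D1 statistic (Equations 13.1-13.4); pool_d1 is a hypothetical helper.
# est_list:  list of m coefficient vectors, each of length k
# vcov_list: list of m (k x k) covariance matrices of those coefficients
pool_d1 <- function(est_list, vcov_list, theta0 = NULL) {
  m <- length(est_list)
  k <- length(est_list[[1]])
  if (is.null(theta0)) theta0 <- rep(0, k)       # null hypothesis: all zero

  est_mat   <- do.call(rbind, est_list)          # m x k matrix of estimates
  theta_bar <- colMeans(est_mat)                 # pooled coefficients
  V_W <- Reduce(`+`, vcov_list) / m              # average within-imputation variance
  V_B <- cov(est_mat)                            # between-imputation variance

  # Relative increase in variance (Equation 13.2) and total variance
  r1  <- (1 + 1 / m) * sum(diag(V_B %*% solve(V_W))) / k
  V_T <- (1 + r1) * V_W

  # Wald-type statistic (Equation 13.1)
  delta <- theta_bar - theta0
  D1 <- as.numeric(t(delta) %*% solve(V_T) %*% delta) / k

  # Denominator degrees of freedom (Equation 13.3 or 13.4)
  t_val <- k * (m - 1)
  nu1 <- if (t_val > 4) {
    4 + (t_val - 4) * (1 + (1 - 2 / t_val) / r1)^2
  } else {
    t_val * (1 + 1 / k) * (1 + 1 / r1)^2 / 2
  }

  list(D1 = D1, r1 = r1, df1 = k, df2 = nu1,
       p = pf(D1, k, nu1, lower.tail = FALSE))
}

# Toy input (made up for illustration): m = 3 imputed datasets, k = 2 parameters
est_list  <- list(c(0.5, 1.0), c(0.7, 0.8), c(0.6, 1.2))
vcov_list <- replicate(3, diag(c(0.04, 0.09)), simplify = FALSE)
res <- pool_d1(est_list, vcov_list)
res
```

In practice the inputs would be taken from `coef()` and `vcov()` of the model fitted in each imputed dataset, restricted to the parameters of the categorical variable being tested.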

13.2 Multiple parameter Wald test or D2 method

Another possibility is to pool the Chi-square values from the multiple parameter Wald or likelihood ratio tests with multiple degrees of freedom. This procedure is also called the D2 procedure (Enders 2012). We also used this procedure in the previous Chapter to obtain the pooled Chi-square values.

The following formula is used to pool the Chi-square values from a multiple parameter Wald test (Marshall et al., 2009):

$\begin{equation} D_2 = \frac{\frac{\bar\omega}{k}-\frac{m+1}{m-1}r_2}{1+r_2} \tag{13.5} \end{equation}$

where $\bar\omega$ is the mean of the Chi-square values over the imputed datasets, $k$ is the degrees of freedom of the Chi-square test statistic, $m$ is the number of imputed datasets and $r_2$ reflects the relative increase in variance due to nonresponse, which is obtained by the following formula:

$r_2 = \left(\frac{m+1}{m(m-1)}\right)\sum_{j=1}^m\left(\sqrt{\omega_j}-\overline{\sqrt{\omega}}\right)^2$

with $m$ as above, $j = 1, \ldots, m$ the index of each imputed dataset, $\omega_j$ the Chi-square value in imputed dataset $j$ and $\overline{\sqrt{\omega}}$ the mean of the square roots $\sqrt{\omega_j}$ over the imputed datasets. The p-value is calculated by comparing the $D_2$ statistic to an F distribution with $k$ and $\nu_2$ degrees of freedom as follows:

$p = \Pr[F_{k,\nu_2}>D_2]$

with:

$\nu_2 = k^{-3/m}(m-1)(1+r_2^{-1})^2$

The application of this pooling procedure is shown in the R code below, where the function micombine.chisquare from the miceadds package is applied. Input values for the function are the Chi-square values of each imputed dataset, which are shown in Table 6.1, and the degrees of freedom of the Chi-square test, which is 2 here.

# Pool the five Chi-square values (2 degrees of freedom) with the D2 rule
miceadds::micombine.chisquare(dk = c(1.815, 1.303, 2.826, 1.759, 3.634), df = 2)

## Combination of Chi Square Statistics for Multiply Imputed Data
## Using 5 Imputed Data Sets
## F(2, 254.38)=0.865     p=0.4221

This is the same pooling function from the miceadds package that was used in the previous Chapter.
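The pooled values can also be checked by hand. The base R sketch below applies the $D_2$ formula to the same five Chi-square values, assuming that the relative increase in variance is based on the mean of the square roots of the Chi-square values and that the denominator degrees of freedom are $k^{-3/m}(m-1)(1+r_2^{-1})^2$. The variable names are illustrative.

```r
# Hand calculation of the D2 statistic for the Chi-square values used above
chisq <- c(1.815, 1.303, 2.826, 1.759, 3.634)
k <- 2                    # degrees of freedom of each Chi-square test
m <- length(chisq)        # number of imputed datasets

# Relative increase in variance: (1 + 1/m) times the sample variance
# (divisor m - 1) of the square roots of the Chi-square values
r2 <- (1 + 1 / m) * var(sqrt(chisq))

# Pooled statistic (Equation 13.5)
D2 <- (mean(chisq) / k - (m + 1) / (m - 1) * r2) / (1 + r2)

# Denominator degrees of freedom (assumed expression, see the lead-in)
nu2 <- k^(-3 / m) * (m - 1) * (1 + 1 / r2)^2

# p-value from the F distribution with k and nu2 degrees of freedom
p <- pf(D2, k, nu2, lower.tail = FALSE)

round(c(D2 = D2, df2 = nu2, p = p), 4)
```

Up to rounding, these values should agree with the F(2, 254.38) = 0.865 and p = 0.4221 reported in the output above.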

Meng and Rubin pooling

Meng and Rubin proposed a method to test overall categorical variables indirectly based on the likelihood ratio test statistic (MR pooling) (Meng and Rubin (1992), Mistler (2013)). For each regression parameter, two nested models are fitted in each imputed dataset: one restricted model where the parameter is not included in the model and one full model where the parameter is included. The pooled likelihood ratio tests are then compared to obtain pooled p-values for each parameter. The MR pooling method requires fitting multiple models for each variable in the data, hence it is an indirect approach. This can be a very time-consuming process.

The Meng and Rubin pooling method (also called the $D_3$ method) works according to the following steps (Meng and Rubin (1992)):

1. For each regression parameter $\theta$ two nested models are fitted in each imputed dataset: one where $\theta$ is included (full model) and one where $\theta$ is not included (restricted model). Subsequently, these models are pooled to obtain $\bar\theta_{full}$ and $\bar\theta_{restricted}$.

2. The average likelihood ratio test statistic $\bar d_L$ over the imputed datasets, obtained by comparing the log likelihood values of these models, is calculated as:

$\bar d_L = \frac{1}{m} \sum_{j=1}^m 2(L_{full,j} - L_{restricted,j})$

where $L_{full,j}$ and $L_{restricted,j}$ represent the maximum log likelihood values with respect to $\theta$ in imputed dataset $j$.

3. The log likelihood values of the two models of step 2 are then re-calculated in each imputed dataset with the model parameters constrained to the pooled values $\bar\theta_{full}$ and $\bar\theta_{restricted}$ of step 1, and averaged:

$\bar d_{constrained} = \frac{1}{m} \sum_{j=1}^m 2(L_j(\bar\theta_{full}) - L_j(\bar\theta_{restricted}))$

4. The resulting test statistic $D_3$, required to obtain the pooled p-value, is calculated by incorporating the average increase in variance due to nonresponse $\bar r_L$ as follows:

$D_3 = \frac{\bar d_{constrained}}{k(1+\bar r_L)}$

$\bar r_L = \frac{m+1}{k(m-1)}(\bar d_L-\bar d_{constrained})$

where $k$ is the number of degrees of freedom in the complete data likelihood ratio test (Mistler (2013); Van Buuren (2018)). The p-value is calculated by comparing the $D_3$ statistic with an F distribution with $k$ and $\nu_L$ degrees of freedom (the latter being the degrees of freedom of the denominator) according to:

$p = \Pr[F_{k,\nu_L}>D_3]$

with:

$\nu_L = 4 + (km-k-4)\left[1+\left(1-\frac{2}{km-k}\right)\frac{1}{\bar r_L}\right]^2$
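The arithmetic of the $D_3$ statistic can be sketched in base R as follows. The sketch assumes that the log likelihood values of the full and restricted models are already available for each imputed dataset, both at their own maximum and at the pooled parameter estimates. The function name `pool_d3`, its arguments and the input values are made up for illustration. For $k(m-1) \le 4$ the sketch assumes the same alternative degrees of freedom as in Equation 13.4.

```r
# Sketch of the D3 (Meng and Rubin) statistic; pool_d3 is a hypothetical helper.
# ll_full, ll_restricted:               maximised log likelihoods per imputed dataset
# ll_full_pooled, ll_restricted_pooled: log likelihoods per imputed dataset,
#                                       evaluated at the pooled parameter estimates
# k: degrees of freedom of the complete data likelihood ratio test
pool_d3 <- function(ll_full, ll_restricted,
                    ll_full_pooled, ll_restricted_pooled, k) {
  m <- length(ll_full)

  # Average likelihood ratio statistics
  d_L   <- mean(2 * (ll_full - ll_restricted))
  d_con <- mean(2 * (ll_full_pooled - ll_restricted_pooled))

  # Average relative increase in variance and the pooled test statistic
  r_L <- (m + 1) / (k * (m - 1)) * (d_L - d_con)
  D3  <- d_con / (k * (1 + r_L))

  # Denominator degrees of freedom
  t_val <- k * (m - 1)
  nu_L <- if (t_val > 4) {
    4 + (t_val - 4) * (1 + (1 - 2 / t_val) / r_L)^2
  } else {
    # assumption: same small-t expression as Equation 13.4
    t_val * (1 + 1 / k) * (1 + 1 / r_L)^2 / 2
  }

  list(D3 = D3, r_L = r_L, df1 = k, df2 = nu_L,
       p = pf(D3, k, nu_L, lower.tail = FALSE))
}

# Illustrative (made-up) log likelihood values for m = 5 imputed datasets
res_d3 <- pool_d3(
  ll_full              = c(-100.2, -101.5,  -99.8, -100.9, -100.4),
  ll_restricted        = c(-103.1, -104.9, -102.0, -104.2, -103.6),
  ll_full_pooled       = c(-100.5, -101.8, -100.1, -101.2, -100.7),
  ll_restricted_pooled = c(-103.2, -105.0, -102.1, -104.3, -103.7),
  k = 2
)
res_d3
```

Obtaining the log likelihood values at the pooled estimates is the step that requires refitting or re-evaluating the models, which is why this method is more time-consuming than the other pooling methods in this Chapter.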

13.3 The Median P Rule

For the Median P Rule (MPR) one simply uses the median of the p-values of the significance tests conducted in each imputed dataset (MPR pooling). Hence, it depends on the p-values only and not on the parameter estimates. The MPR can be calculated by using the p-values from the likelihood ratio test for multiple parameters for the categorical variables in the multivariable model (Eekhout, van de Wiel, and Heymans (2017)). The Median P Rule is therefore very simple to apply. For the median p-value to be valid you have to run the MI procedure without the outcome variable in the imputation model.
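The rule amounts to a single line of R. The sketch below is for illustration only: as stand-ins for the likelihood ratio test results it reuses the five Chi-square values with 2 degrees of freedom from the D2 example, converts them to p-values and takes the median. The variable names are illustrative.

```r
# Median P Rule: take the median of the p-values from the imputed datasets.
# The Chi-square values are reused from the D2 example purely as stand-ins
# for the test results in each imputed dataset.
chisq <- c(1.815, 1.303, 2.826, 1.759, 3.634)

# p-value of each test (Chi-square distribution with 2 degrees of freedom)
p_values <- pchisq(chisq, df = 2, lower.tail = FALSE)

# Pooled p-value according to the Median P Rule
median_p <- median(p_values)
median_p
```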

References

Eekhout, I., M. A. van de Wiel, and M. W. Heymans. 2017. “Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis.” BMC Med Res Methodol 17 (1): 129.

Meng, X. L., and D. B. Rubin. 1992. “Performing Likelihood Ratio Tests with Multiply-Imputed Data Sets.” Biometrika 79 (1): 103–11.

Mistler, S. A. 2013. “A SAS Macro for Computing Pooled Likelihood Ratio Tests with Multiply Imputed Data.” Proceedings of the SAS Global Forum 2013, San Francisco, California: Contributed Paper (Statistics and Data Analysis), no. 438.

Van Buuren, S. 2018. Flexible Imputation of Missing Data. Second Edition. Boca Raton, FL: Chapman & Hall/CRC.