8.3 Multiple Meta-Regression
In the beginning of Chapter 8, we already introduced the basic idea behind meta-regression. We showed that, in general, a meta-regression has the following form:
\[\hat \theta_k = \theta + \beta_1x_{1k} + ... + \beta_nx_{nk} + \epsilon_k + \zeta_k\]
Where \(\hat \theta_k\) is the estimated effect size of a study \(k\), and the \(\beta\) terms are the regression coefficients. There are two parameters in this formula which cause our effect size estimate to deviate from the “true” overall effect size: \(\epsilon_k\), the sampling error, and \(\zeta_k\), which denotes that the true effect size of the study is only sampled from an overarching distribution of effect sizes. As we explained before, if \(\zeta_k>0\), the model we describe with this formula is a random-effects meta-regression model, also known as a mixed-effects model.
The idea behind multiple meta-regression
Previously, we only considered the scenario in which we use one predictor \(\beta_1x_1\) in our meta-regression (e.g., we want to check if the effect size a study reports depends on the year it was published). But imagine another scenario. Let us assume that we have found in our previous research that the effect sizes in a certain research field depend on the prestige of the journal in which a study was published (e.g., higher effects are reported in journals with a high reputation). On the other hand, it also seems a priori reasonable that journals with a good reputation publish studies of high quality, which may also be associated with higher effect sizes. So to check if journal reputation is indeed associated with higher effect sizes, we have to make sure that this relationship is not confounded by the fact that prestigious journals publish studies of higher quality, which may be the true reason we find this relationship between journal reputation and effect size. This means we have to control for study quality (e.g., rated on a scale from 0 to 10) when we examine the relationship between journal prestige (e.g., operationalized as the impact factor of a journal) and effect size.
This, and many other possible scenarios, can be dealt with using multiple meta-regression. In multiple meta-regression, we use several predictors (variables) to predict (differences in) effect sizes. When we look back at the general meta-regression formula we defined before, we see that the formula already provides this feature through the \(\beta_nx_{nk}\) part. Here, the parameter \(n\) denotes that we can include \(n\) more predictors/variables in our meta-regression, making it a multiple meta-regression.

While, statistically speaking, the \(n\) parameter allows us to include as many predictors in our multiple meta-regression model as we wish, things are a little more tricky in reality. In the following, we will discuss a few important pitfalls of multiple meta-regression and how we can build multiple meta-regression models which are robust and trustworthy. But first, let us cover another important feature of multiple meta-regression: interactions.
Interactions
So far, in our multiple meta-regression model, we only considered the case where we have multiple predictor variables \(x_1,x_2, ... x_n\) and, along with their coefficient estimates \(\beta_n\), add them together to calculate our estimate of the true effect size \(\hat \theta_k\) for each study \(k\). In multiple meta-regression models, however, we can not only model such additive relationships. We can also model so-called interactions. An interaction means that the relationship between one predictor variable (e.g., \(x_1\)) and the estimated effect size changes for different values of another predictor variable (e.g., \(x_2\)).
Imagine a scenario where we want to model two predictors and their relationship to the effect size: the publication year (\(x_1\)) of a study and the quality (\(x_2\)) of a study, which we rate like this:
\[\begin{equation} x_2 = \left \{\begin{array}{ll} 0: \text{bad} \\ 1: \text{moderate} \\ 2: \text{good} \end{array} \right. \end{equation}\]
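As a quick illustration (the variable and object names here are hypothetical, chosen only for this example), this coding scheme can be reproduced in R by converting a factor with the three quality labels into the numeric predictor \(x_2\):

```r
# Hypothetical illustration: turning the three quality labels into the
# numeric predictor x2 from the formula above (0 = bad, 1 = moderate, 2 = good)
quality <- factor(c("bad", "good", "moderate", "bad"),
                  levels = c("bad", "moderate", "good"))
x2 <- as.numeric(quality) - 1
x2
# [1] 0 2 1 0
```

Because the factor levels are ordered `bad < moderate < good`, subtracting 1 from the internal level codes yields exactly the 0/1/2 coding defined above.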
As we described before, we can now imagine a meta-regression model in which we combine these two predictors \(x_1\) and \(x_2\) and assume an additive relationship. We can do this by simply adding them:
\[\hat \theta_k = \theta + \beta_1x_{1k} + \beta_2x_{2k} + \epsilon_k + \zeta_k\]
Let us assume that, overall, a higher publication year (\(x_1\)) is associated with higher effect sizes (i.e., reported effect sizes have risen over the years). We could now ask ourselves if this positive relationship varies depending on the quality of the studies (\(x_2\)). For example, maybe this rise in effect sizes was strongest for high-quality studies, while effect sizes stayed mostly the same over the years for studies of lower quality? We can visualize the assumed relationship between effect size (\(\hat \theta_k\)), publication year (\(x_1\)) and study quality (\(x_2\)) the following way:
To examine such questions, we can add an interaction term to our meta-regression model. This interaction term lets the predictions based on \(x_1\) vary for different values of \(x_2\) (and vice versa). We can denote this additional interaction in our model by introducing a third regression coefficient, \(\beta_3\), which captures the interaction \(x_{1k}x_{2k}\) we want to test in our model:
\[\hat \theta_k = \theta + \beta_1x_{1k} + \beta_2x_{2k} + \beta_3x_{1k}x_{2k}+ \epsilon_k + \zeta_k\]
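To make the role of \(\beta_3\) concrete, here is a small sketch with entirely made-up coefficient values (they are assumptions for illustration, not estimates from any model): the slope of \(x_1\) under this formula is \(\beta_1 + \beta_3 x_2\), so it changes with study quality.

```r
# Sketch with invented coefficients: how the interaction term beta3 makes
# the slope of publication year (x1) depend on study quality (x2)
theta <- 0.2; b1 <- 0.05; b2 <- 0.10; b3 <- 0.08  # assumed values

pred  <- function(x1, x2) theta + b1 * x1 + b2 * x2 + b3 * x1 * x2
slope <- function(x2) b1 + b3 * x2  # slope of x1 at quality level x2

slope(0)  # bad quality:  0.05
slope(2)  # good quality: 0.21 (a steeper rise in effect sizes)
```

With \(\beta_3=0\), the slopes at all quality levels would coincide and we would be back at the purely additive model.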
8.3.1 Common pitfalls of multiple meta-regression models
As we have mentioned before, multiple meta-regression, while very useful when applied properly, comes with certain caveats we have to know and consider when fitting a model. Indeed, some argue that (multiple) meta-regression is often improperly used and interpreted in practice, leading to a low validity of many meta-regression models (Higgins and Thompson 2004). Thus, there are some points we have to keep in mind when fitting multiple meta-regression models, which we will describe in the following.
Overfitting: seeing a signal when there is none
To better understand the risks of (multiple) meta-regression models, we have to understand the concept of overfitting. Overfitting occurs when we build a statistical model which fits the data too closely. In essence, this means that we build a statistical model which can predict the data at hand very well, but performs badly at predicting future data it has never seen before. This happens if our model assumes that some variation in our data stems from a true “signal”, when in fact we only model random noise (Iniesta, Stahl, and McGuffin 2016). As a result, our statistical model produces false-positive results: it sees relationships where there are none.
Regression methods, which usually utilize minimization or maximization procedures such as Ordinary Least Squares or Maximum Likelihood estimation, can be prone to overfitting (Gigerenzer 2004, 2008). Unfortunately, the risk of building a non-robust model which produces false-positive results is even higher once we go from conventional regression to meta-regression (Higgins and Thompson 2004). There are several reasons for this:
- In meta-regression, our sample of data is usually small, as we can only use the synthesized data of all analyzed studies \(k\).
- As our meta-analysis aims to be a comprehensive overview of all evidence, we have no additional data on which we can “test” how well our regression model predicts high or low effect sizes.
- In meta-regressions, we have to deal with the potential presence of effect size heterogeneity (see Chapter 6). Imagine a case in which we have two studies with different effect sizes and non-overlapping confidence intervals. Every variable which has different values for the two studies might be a potential explanation for the effect size difference we find, while it seems straightforward that most of such explanations are spurious (Higgins and Thompson 2004).
- Meta-regression as such, and multiple meta-regression in particular, make it very easy to “play around” with predictors. We can test numerous meta-regression models, including many more predictors or removing them, in an attempt to explain the heterogeneity in our data. Such an approach is of course tempting, and often found in practice, because we, as meta-analysts, want to find a significant model which explains why effect sizes differ (Julian Higgins et al. 2002). However, such behavior has been shown to massively increase the risk of spurious findings (Higgins and Thompson 2004), because we can change parts of our model indefinitely until we find a significant model, which is then very likely to be overfitted (i.e., it mostly models statistical noise).
Some guidelines have been proposed to avoid an excessive false-positive rate when building meta-regression models:
- Minimize the number of investigated predictors. In multiple meta-regression, this translates to the concept of parsimony, or simplicity: when evaluating the fit of a meta-regression model, we prefer models which achieve a good fit with fewer predictors. Estimators such as the Akaike and Bayesian Information Criterion can help with such decisions. We will get to this topic again below.
- Predictor selection should be based on predefined scientific or theoretical questions we want to answer in our meta-regression. It is therefore crucial to define, in the protocol and before the start of the analysis, which predictors will be examined in the meta-regression.
- When the number of studies is low (which is very likely to be the case), and we want to compute the significance of a predictor, we should use the Knapp-Hartung adjustment to obtain more reliable estimates (Julian Higgins et al. 2002).
- We can use permutation to assess the robustness of our model in resampled data. We will describe the details of this method later.
Multicollinearity
Multicollinearity means that one or more predictors in our regression model can be (linearly) predicted from another predictor in our model with relatively high accuracy (Mansfield and Helms 1982). This basically means that we have two or more predictors in our model which are highly correlated. Most of the dangers of multicollinearity are associated with the problem of overfitting described above. High collinearity can cause our coefficient estimates \(b\) to behave erratically and change considerably with minor changes in our data. It also limits the amount of variance the model can explain, in our case the \(R^2\) analog.
Multicollinearity in meta-regression is common (Berlin and Antman 1992). Although multiple regression can handle lower degrees of collinearity, we should check and, if necessary, control for very highly correlated predictors. There is no consolidated yes-or-no rule for the presence of multicollinearity. A crude, but often effective way is to check for very high correlations (i.e., \(r\geq0.8\)) before fitting the model. Multicollinearity can then be reduced by either (1) removing one of the close-to-redundant predictors, or (2) trying to combine the predictors into one single predictor.
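A minimal sketch of this check in base R, using simulated data (all variable names and values here are invented for illustration): we compute a correlation matrix and flag every predictor pair with \(r\geq0.8\).

```r
# Sketch: flagging highly correlated predictor pairs (|r| >= 0.8) in
# simulated data; x2 is constructed to be close-to-redundant with x1
set.seed(42)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.1)  # nearly a copy of x1
x3 <- rnorm(50)

cors <- cor(cbind(x1, x2, x3))
# indices of predictor pairs exceeding the crude r >= 0.8 cutoff
which(abs(cors) >= 0.8 & upper.tri(cors), arr.ind = TRUE)
```

Here the check flags only the x1/x2 pair, which we could then handle by dropping one of the two or combining them into a single predictor.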
8.3.2 Model building methods
When building a multiple regression model, there are different approaches through which we can select and insert predictors into our model. Here, we discuss the most important ones along with their strengths and weaknesses:
- Forced entry. In forced entry methods, all relevant predictors are forced into the regression model simultaneously. In most functions in R, this is the default setting. Although this is a generally recommended procedure (Field, Miles, and Field 2012), keep in mind that the predictors used in forced entry should still be based on a predefined, theory-led decision.
- Hierarchical. Hierarchical multiple regression means including predictors into our regression model stepwise, but based on a clearly defined scientific rationale. First, only predictors which have been associated with effect size differences in previous research are included, in the order of their importance. After this step, novel predictors can be added to explore if these variables explain heterogeneity which has not yet been captured by the known predictors.
- Stepwise. Stepwise methods mean that variables/predictors are added to the model one after another. At first glance, this sounds a lot like hierarchical regression, but there is a crucial difference: stepwise regression methods select predictors based on a statistical criterion. In a procedure called forward selection, the variable explaining the largest amount of variability in the data is used as the first predictor. This process is then repeated for the remaining variables, each time selecting the variable which explains most of the remaining (unexplained) variability in the data as the next predictor, which is then added to the model. Furthermore, there is a procedure called backward selection, in which all variables are used as predictors in the model first, and then removed successively based on a predefined statistical criterion. There is an extensive literature discouraging the use of stepwise methods (Whittingham et al. 2006; Chatfield 1995), and if we recall the common pitfalls of multiple regression models presented above, it becomes clear that these methods have a high risk of producing spurious findings. Nevertheless, stepwise methods are still frequently used in practice, making it important to know that these procedures exist. If we use stepwise methods ourselves, however, it is advised to primarily do so in an exploratory fashion and to keep the limitations of this procedure in mind.
- Multimodel inference. Multimodel methods differ from stepwise methods in that they do not try to successively build the one “best” model explaining most of the variance. Instead, in this procedure, all possible combinations of a predefined selection of predictors are modeled and evaluated using a criterion such as Akaike’s Information Criterion, which rewards simpler models. This enables a full examination of all possible models and how they perform. A common finding using this procedure is that there are many different predictor combinations within a model which lead to a good fit. In multimodel inference, the estimated coefficients of predictors can then be synthesized across all possible models to infer how important certain predictors are overall.
8.3.3 Using metafor to compute Multiple Meta-Regressions

After all this input, it is now time to fit our first multiple meta-regression models. To do this, we will not use the `meta` package we have used so far. Instead, we will use the `metafor` package (Viechtbauer and others 2010). This package provides a vast array of advanced functionality for meta-analysis, along with great documentation. We will also use this package in a later chapter, where we focus on Multilevel Meta-Analysis, so it is worth getting accustomed to its inner workings. To begin, let us make sure we have `metafor` installed and loaded in our library.
library(metafor)
For our multiple meta-regression examples, I will use my `mvreg.data` dataset. This is a “toy” dataset, which we simulated for illustrative purposes. Let us have a look at the structure of the data:
## 'data.frame': 36 obs. of 6 variables:
## $ yi : num 0.0944 0.0998 0.1693 0.1751 0.273 ...
## $ sei : num 0.196 0.192 0.119 0.116 0.165 ...
## $ reputation: num 11 0 11 4 10 9 8 8 8 0 ...
## $ quality : num 6 9 5 9 2 10 6 3 10 3 ...
## $ pubyear : num 0.855 0.753 0.66 0.563 0.431 ...
## $ continent : chr "North America" "Europe" "North America" "Europe" ...
We see that there are 6 variables in our dataset. The `yi` and `sei` columns store the effect size and standard error of a particular study. Thus, these columns correspond to the `TE` and `seTE` columns we used before. We have named these variables this way because this is the standard notation that `metafor` uses: `yi` corresponds to the effect size \(y_i\) we want to predict in our meta-regression, while `sei` is \(SE_i\), the standard error. To designate the variance of an effect size, `metafor` uses `vi`, or \(v_i\) in mathematical notation, which we do not need here because `yi` and `sei` contain all the information we need.
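As a side note, if a dataset only contains standard errors, the sampling variances `vi` that `metafor` can alternatively work with are simply the squared standard errors. A quick sketch, using the first `sei` values from the output above:

```r
# The sampling variance vi of an effect size is the squared standard error sei
sei <- c(0.196, 0.192, 0.119)  # first sei values from the dataset shown above
vi  <- sei^2
vi
# [1] 0.038416 0.036864 0.014161
```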
The other four variables in our dataset are potential predictors for our meta-regression. We want to check if `reputation`, the (mean-centered) impact score of the journal the study was published in, `quality`, the quality of the study rated from 0 to 10, `pubyear`, the (standardized) publication year, and `continent`, the continent in which the study was performed, are associated with different effect sizes.
For `continent`, note that we store the information as a predictor with 2 labels, `Europe` and `North America`, meaning that this predictor is a dummy variable. Always remember that such dummy variables have to be converted from a `chr` to a `factor` vector before we can proceed.
mvreg.data$continent = factor(mvreg.data$continent)
8.3.3.1 Checking for multicollinearity
As we mentioned before, we have to check for multicollinearity of our predictors to make sure that our meta-regression coefficient estimates are robust. A quick way to check for high correlation is to calculate an intercorrelation matrix for all continuous variables with the following code:
cor(mvreg.data[,3:5])
## reputation quality pubyear
## reputation 1.0000000 0.3015694 0.3346594
## quality 0.3015694 1.0000000 0.1551123
## pubyear 0.3346594 0.1551123 1.0000000
The `PerformanceAnalytics` package provides the `chart.Correlation` function, which we can use to visualize the intercorrelations. Make sure to install the `PerformanceAnalytics` package first, and then use this code:
library(PerformanceAnalytics)
chart.Correlation(mvreg.data[,3:5])
We see that our variables are indeed correlated, but probably not to a degree that would warrant excluding one of them.
8.3.3.2 Fitting a meta-regression model without interaction terms
Now, let us fit our first meta-regression model, for now without assuming any interactions between predictors. Previously, we wanted to explore if a high reputation of the journal a study was published in predicts higher effect sizes. Let us also assume we already know very well from previous literature that the quality of a study is also associated with differences in effects. We can now perform a hierarchical regression where we first include our known predictor `quality`, and then check if `reputation` explains heterogeneity beyond that while taking `quality` into account.
To do this, we use the `rma` function in `metafor` to perform a random-effects meta-regression. This function has countless parameters, but we will only concentrate on the following (to see all parameters, type `?rma` into your console):
| Parameter | Function |
| --- | --- |
| `yi` | The column of our dataset in which the effect size for each study is stored. |
| `sei` | The column of our dataset in which the standard error of the effect size for each study is stored. |
| `data` | The dataset with our meta-analysis data. |
| `method` | The estimator for \(\tau^2\) we want to use. For meta-regression models, it is advised to use the maximum likelihood (`'ML'`) or restricted maximum likelihood (`'REML'`) estimator. |
| `mods` | This parameter defines our meta-regression model. First, we specify our model with `~` (a tilde). Then, we add the predictors we want to include, separating them with `+` (e.g., `variable1 + variable2`). Interactions between two variables are denoted by `variable1*variable2`. |
| `test` | The test we want to apply for our regression coefficients. We can choose from `'z'` (default) and `'knha'` (Knapp-Hartung method). |
First, let us perform a meta-regression using only `quality` as the predictor, along with the Knapp-Hartung adjustment. We save the results as `model1`, and then inspect the output.
model1 <- rma(yi = yi,
              sei = sei,
              data = mvreg.data,
              method = "ML",
              mods = ~ quality,
              test = "knha")
model1
##
## MixedEffects Model (k = 36; tau^2 estimator: ML)
##
## tau^2 (estimated amount of residual heterogeneity): 0.0667 (SE = 0.0275)
## tau (square root of estimated tau^2 value): 0.2583
## I^2 (residual heterogeneity / unaccounted variability): 60.04%
## H^2 (unaccounted variability / sampling variability): 2.50
## R^2 (amount of heterogeneity accounted for): 7.37%
##
## Test for Residual Heterogeneity:
## QE(df = 34) = 88.6130, pval < .0001
##
## Test of Moderators (coefficient 2):
## F(df1 = 1, df2 = 34) = 3.5330, pval = 0.0688
##
## Model Results:
##
## estimate se tval pval ci.lb ci.ub
## intrcpt 0.3429 0.1354 2.5318 0.0161 0.0677 0.6181 *
## quality 0.0356 0.0189 1.8796 0.0688 -0.0029 0.0740 .
##
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the output, we can see the results for our predictor `quality` under `Model Results`, along with the values for our regression intercept (`intrcpt`). We see that the \(p\)-value for our predictor is non-significant (\(p=0.0688\)), but that there is a trend toward an association, because \(p<0.1\). Under `Test of Moderators (coefficient 2)`, we can see the overall test results for our regression model (\(F_{1,34}=3.53, p=0.0688\)). Because we only included one predictor, the \(p\)-value reported there is identical to the one we saw before. In total, our model explains \(R^2=7.37\%\) of the heterogeneity in our data, which we can see next to the `R^2 (amount of heterogeneity accounted for)` line in our output.
Now, let us check if we can improve our model by including `reputation` as a predictor. We add `+ reputation` to our `mods` argument and save the output to `model2`.
model2 <- rma(yi = yi,
              sei = sei,
              data = mvreg.data,
              method = "ML",
              mods = ~ quality + reputation,
              test = "knha")
model2
##
## MixedEffects Model (k = 36; tau^2 estimator: ML)
##
## tau^2 (estimated amount of residual heterogeneity): 0.0238 (SE = 0.0161)
## tau (square root of estimated tau^2 value): 0.1543
## I^2 (residual heterogeneity / unaccounted variability): 34.62%
## H^2 (unaccounted variability / sampling variability): 1.53
## R^2 (amount of heterogeneity accounted for): 66.95%
##
## Test for Residual Heterogeneity:
## QE(df = 33) = 58.3042, pval = 0.0042
##
## Test of Moderators (coefficients 2:3):
## F(df1 = 2, df2 = 33) = 12.2476, pval = 0.0001
##
## Model Results:
##
## estimate se tval pval ci.lb ci.ub
## intrcpt 0.5005 0.1090 4.5927 <.0001 0.2788 0.7222 ***
## quality 0.0110 0.0151 0.7312 0.4698 -0.0197 0.0417
## reputation 0.0343 0.0075 4.5435 <.0001 0.0189 0.0496 ***
##
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We now see that a new line appears under `Model Results`, displaying the results for our `reputation` predictor. We see that a coefficient of \(b=0.0343\) was calculated for `reputation`. We also see that this predictor is highly significant (\(p<0.0001\)) and positively associated with the effect sizes. Furthermore, the meta-regression model itself is significant (\(F_{2,33}=12.25, p<0.0001\)), explaining \(R^2=66.95\%\) of the heterogeneity. This means that journal reputation is associated with higher effect sizes, even when controlling for study quality.
But is `model2` indeed better than our first model, `model1`? To assess this, we can use the `anova` function, feeding it with the two models we want to compare.
anova(model1, model2)
##
## df AIC BIC AICc logLik LRT pval QE tau^2
## Full 4 19.4816 25.8157 20.7720 -5.7408 58.3042 0.0238
## Reduced 3 36.9808 41.7314 37.7308 -15.4904 19.4992 <.0001 88.6130 0.0667
## R^2
## Full
## Reduced 64.3197%
This function performs a model test and provides us with several statistics to assess if `model2` has a better fit than `model1`. Here we compare our `Full` model, `model2`, which includes both `quality` and `reputation`, with our `Reduced` model, which only uses `quality` as a predictor.
The `anova` function performs a likelihood ratio test, the results of which we can see in the `LRT` column. The test is highly significant (\(\chi^2_1=19.50, p<0.001\)), which means that our full model indeed provides a better fit. Another important statistic is reported in the `AICc` column. This provides us with Akaike’s Information Criterion, corrected for small samples. As we mentioned before, AICc penalizes complex models with more predictors to avoid overfitting. It is important to note that lower values of AIC(c) mean that a model performs better. Interestingly, in our output, we see that the full model (\(AICc=20.77\)) has a better AICc value than our reduced model (\(AICc=37.73\)), despite having more predictors. All of this suggestsests that our multiple meta-regression model does indeed provide a good fit to our data.
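As a sketch of where these AICc values come from: AICc adds a small-sample penalty to the ordinary AIC, \(AICc = AIC + \frac{2k(k+1)}{n-k-1}\), with \(k\) model parameters (the `df` column) and \(n\) studies. Plugging in the AIC values from the output above reproduces the reported AICc values:

```r
# AICc = AIC + 2k(k + 1) / (n - k - 1); k = number of model parameters
# (the df column of the anova output), n = 36 studies in our dataset
aicc <- function(aic, k, n) aic + 2 * k * (k + 1) / (n - k - 1)

aicc(19.4816, k = 4, n = 36)  # full model:    ~20.7720
aicc(36.9808, k = 3, n = 36)  # reduced model: ~37.7308
```

The penalty term grows quickly as \(k\) approaches \(n\), which is exactly why AICc is preferable to plain AIC in meta-regression, where the number of studies is usually small.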
8.3.3.3 Modeling interaction terms
Let us say we want to model an interaction hypothesis with our additional predictors `pubyear` (publication year) and `continent`, where we assume that the relationship between publication year and effect size differs for Europe and North America. To model this in our `rma` function, we have to connect our predictors with `*` in the `mods` parameter. Because we do not want to compare the models directly using the `anova` function, we specify the \(\tau^2\) estimator to be `"REML"` (restricted maximum likelihood) this time:
interaction.model <- rma(yi = yi,
                         sei = sei,
                         data = mvreg.data,
                         method = "REML",
                         mods = ~ pubyear*continent,
                         test = "knha")
interaction.model
##
## MixedEffects Model (k = 36; tau^2 estimator: REML)
##
## tau^2 (estimated amount of residual heterogeneity): 0 (SE = 0.0098)
## tau (square root of estimated tau^2 value): 0
## I^2 (residual heterogeneity / unaccounted variability): 0.00%
## H^2 (unaccounted variability / sampling variability): 1.00
## R^2 (amount of heterogeneity accounted for): 100.00%
##
## Test for Residual Heterogeneity:
## QE(df = 32) = 24.8408, pval = 0.8124
##
## Test of Moderators (coefficients 2:4):
## F(df1 = 3, df2 = 32) = 28.7778, pval < .0001
##
## Model Results:
##
## estimate se tval pval ci.lb
## intrcpt 0.3892 0.0421 9.2472 <.0001 0.3035
## pubyear 0.1683 0.0834 2.0184 0.0520 -0.0015
## continentNorth America 0.3986 0.0658 6.0539 <.0001 0.2645
## pubyear:continentNorth America 0.6323 0.1271 4.9754 <.0001 0.3734
## ci.ub
## intrcpt 0.4750 ***
## pubyear 0.3380 .
## continentNorth America 0.5327 ***
## pubyear:continentNorth America 0.8911 ***
##
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The last line, `pubyear:continentNorth America`, contains the coefficient for our interaction term. Note that `metafor` automatically includes not only the interaction term, but also both `pubyear` and `continent` as “normal” lower-order predictors (as one should do). Also note that, because `continent` is a factor, `rma` detected that this is a dummy predictor and used our category `Europe` as the \(D=0\) dummy against which the `North America` category is compared. We see that our interaction term `pubyear:continentNorth America` has a positive coefficient (\(b=0.6323\)) and is highly significant (\(p<0.0001\)), meaning that the assumed interaction effect might in fact exist: there is an increase in effect sizes in recent years, but it is stronger for studies conducted in North America. We also see that the model we fitted explains \(R^2=100\%\) of our heterogeneity. This is because our data was simulated for illustrative purposes. In practice, you will hardly ever explain all of the heterogeneity in your data (in fact, one should rather be concerned to find such a result in real-life data, as it might mean the model is overfitted).
8.3.3.4 Testing the robustness of our model through a permutation test
The idea behind permutation
Permutation is a mathematical operation in which we take a set containing numbers or objects, and iteratively draw elements from this set to put them into a sequential order. When we already have an ordered set of numbers, this equals a process in which we rearrange, or shuffle, the order of our data. As an example, imagine we have a set \(S\) containing three numbers: \(S=\{1,2,3 \}\). One possible permutation of this set is \((2,1,3)\); another is \((3,2,1)\). We see that the permuted results both contain all three numbers from before, but in a different order.
Permutation can also be used to perform permutation tests, a specific form of resampling method. Broadly speaking, resampling methods are used to validate the robustness of a statistical model by providing it with (slightly) different data sampled from the same data source or generative process (Good 2013). This is a way to better assess if the coefficients in our model indeed capture a true pattern underlying our data, or if we overfitted our model, thereby falsely assuming patterns in our data when they are in fact statistical noise.
Permutation tests do not require us to have a spare “test” dataset on which we can evaluate how our meta-regression actually performs in predicting effect sizes. For this reason, among others, permutation tests have been recommended to assess the robustness of our meta-regression models (Higgins and Thompson 2004).
We will not go too much into the details of how a permutation test is performed for meta-regression models, but the most important part is that we recalculate the \(p\)-values of our overall meta-regression model and its coefficients based on the test statistics obtained across all possible, or many randomly selected, permutations of our original dataset (or, to be more precise, our meta-regression model matrix \(X_{ij}\)). The crucial indicator here is how often the test statistic we obtain from our permuted data is equal to or greater than our original test statistic. For example, if our test statistic is greater than or equal to the original test statistic in 50 of 1000 permuted datasets, we get a \(p\)-value of \(p=0.05\) (Viechtbauer et al. 2015).
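This core logic can be sketched in a few lines of base R. The data and the correlation statistic here are purely illustrative inventions, not what `metafor` computes internally:

```r
# Sketch of the permutation-test logic: the p-value is the share of
# permutations whose test statistic is at least as extreme as the observed one
set.seed(123)
x <- c(0.1, 0.4, 0.5, 0.9, 1.2)  # invented predictor values
y <- c(0.2, 0.3, 0.6, 0.8, 1.1)  # invented effect sizes

obs  <- abs(cor(x, y))                            # observed test statistic
perm <- replicate(2000, abs(cor(sample(x), y)))   # statistic under shuffling
mean(perm >= obs)                                 # permutation p-value
```

Because the true association between `x` and `y` is strong here, only a small fraction of shuffled datasets produce a statistic as extreme as the observed one, yielding a small permutation \(p\)-value.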
To perform a permutation test on our meta-regression model, we can use `metafor`’s in-built `permutest` function. As an example, I will now recalculate the results of the `model2` model we fitted above. We only have to supply the `permutest` function with the `rma` object, but be aware that the permutation test can be computationally intensive, especially for large datasets, so the function might need some time to run.
permutest(model2)
##
## Test of Moderators (coefficients 2:3):
## F(df1 = 2, df2 = 33) = 12.2476, pval* = 0.0010
##
## Model Results:
##
## estimate se tval pval* ci.lb ci.ub
## intrcpt 0.5005 0.1090 4.5927 0.1810 0.2788 0.7222
## quality 0.0110 0.0151 0.7312 0.4290 -0.0197 0.0417
## reputation 0.0343 0.0075 4.5435 0.0010 0.0189 0.0496 ***
##
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We again see our familiar output, including the results for all predictors. Looking at the `pval*` column, we see that the \(p\)-value for the `reputation` predictor has increased from \(p<0.0001\) to \(p^*=0.001\). As this \(p\)-value is still highly significant, this indicates that our model might indeed capture a real pattern underlying our data. It has been recommended to always use this permutation test on a meta-regression model before we report it to be significant in our research (Higgins and Thompson 2004).
Permutation tests in small datasets
Please note that when the number of studies \(k\) included in our model is small, conventionally used thresholds for statistical significance (i.e., \(p<0.05\)) cannot be reached. For meta-regression models, a permutation test using `permutest` will only be able to reach statistical significance if \(k \geq 5\). More details on the `permutest` function can be found here.
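One way to get an intuition for this threshold (a back-of-the-envelope sketch, not `permutest`'s exact computation): with \(k\) studies there are \(k!\) distinct permutations, so the smallest attainable two-sided \(p\)-value is roughly \(2/k!\).

```r
# Rough heuristic: smallest attainable two-sided p-value with k studies,
# assuming all k! permutations are enumerated (an assumption for intuition,
# not metafor's actual implementation)
min.p <- function(k) 2 / factorial(k)

min.p(4)  # ~0.083 -> p < 0.05 unreachable
min.p(5)  # ~0.017 -> significance becomes possible
```

Under this heuristic, \(k=4\) studies yield \(4!=24\) permutations and a floor of about \(0.083\), while \(k=5\) yields \(120\) permutations and a floor of about \(0.017\), matching the \(k\geq5\) rule stated above.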
8.3.3.5 Multimodel Inference
Previously, we already described that one can also try to model all possible predictor combinations in a procedure called multimodel inference to examine which possible predictor combinations provide the best fit, and which predictors are the most important ones overall.
To perform multimodel inference, you can use the `multimodel.inference` function we prepared for you. This function is part of the `dmetar` package. If you have the package installed already, you have to load it into your library first.
library(dmetar)
If you don’t want to use the `dmetar` package, you can find the source code for this function here. In this case, R doesn’t know this function yet, so we have to let R learn it by copying and pasting the code in its entirety into the console in the bottom-left pane of RStudio, and then hitting Enter ⏎. The function requires the `MuMIn`, `ggplot2` and `metafor` packages to work.
For this function, the following parameters need to be specified:
TE: The pre-calculated effect size for each study. Must be supplied as the name of the effect size column in the dataset (in quotation marks; e.g. TE = "effectsize").
seTE: The pre-calculated standard error for each study. Must be supplied as the name of the standard error column in the dataset (in quotation marks; e.g. seTE = "se").
data: A data.frame containing columns for the effect size, standard error and meta-regression predictors of each study/effect.
predictors: A concatenated array of characters specifying the predictors to be used for multimodel inference. Names of the predictors must be identical to the column names of the data.frame supplied to data.
method: Meta-analysis model to use for pooling effect sizes. Use "FE" for the fixed-effect model. Different random-effects estimators are available: "DL", "HE", "SJ", "ML", "REML", "EB", "HS", "GENQ". If "FE" is used, the test argument is automatically set to "z", as the Knapp-Hartung method is not meant to be used with fixed-effect models. Default is "REML", and it is strongly advised to stay with this option to use a standard (mixed-effects) meta-regression model.
test: Method to use for computing test statistics and confidence intervals. Default is "knha", which uses the Knapp-Hartung (Knapp and Hartung 2003) adjustment method. Conventional Wald-type tests and CIs are calculated by setting this argument to "z". When method = "FE", this argument is set to "z" automatically, as the Knapp-Hartung method is not meant to be used with fixed-effect models.
eval.criterion: Evaluation criterion to sort the multiple models by. Can be either "AICc" (default; corrected Akaike's Information Criterion), "AIC" (Akaike's Information Criterion) or "BIC" (Bayesian Information Criterion).
interaction: If set to FALSE (default), no interactions between predictors are considered. Setting this parameter to TRUE means that all interactions are modeled.
Now, let us perform multimodel inference for all possible predictor combinations using all predictors in the mvreg.data dataset, excluding all interaction terms. Be aware that this can take some time, especially if the number of predictors is large.
multimodel.inference(TE = "yi",
                     seTE = "sei",
                     data = mvreg.data,
                     predictors = c("pubyear", "quality", "reputation", "continent"),
                     interaction = FALSE)
##
##
##
## Multimodel Inference: Final Results
## 
##
##  Number of fitted models: 16
##  Full formula: ~ pubyear + quality + reputation + continent
##  Coefficient significance test: knha
##  Interactions modeled: no
##  Evaluation criterion: AICc
##
##
## Best 5 Models
## 
##
##
## Global model call: metafor::rma(yi = TE, sei = seTE, mods = form, data = glm.data,
## method = method, test = test)
## 
## Model selection table
## (Intrc) cntnn pubyr qulty rpttn df logLik AICc delta weight
## 12 + + 0.3533 0.02160 5 2.981 6.0 0.00 0.536
## 16 + + 0.4028 0.02210 0.01754 6 4.071 6.8 0.72 0.375
## 8 + + 0.4948 0.03574 5 0.646 10.7 4.67 0.052
## 11 + 0.2957 0.02725 4 -1.750 12.8 6.75 0.018
## 15 + 0.3547 0.02666 0.02296 5 -0.395 12.8 6.75 0.018
## Models ranked by AICc(x)
##
##
## Multimodel Inference Coefficients
## 
##
##
## Estimate Std. Error z value Pr(>|z|)
## intrcpt 0.38614661 0.106983583 3.6094006 0.0003069
## continent1 0.24743836 0.083113174 2.9771256 0.0029096
## pubyear 0.37816796 0.083045572 4.5537402 0.0000053
## reputation 0.01899347 0.007420427 2.5596198 0.0104787
## quality 0.01060060 0.014321158 0.7402055 0.4591753
##
##
## Predictor Importance
## 
##
##
## model importance
## 1 pubyear 0.9988339
## 2 continent 0.9621839
## 3 reputation 0.9428750
## 4 quality 0.4432826
Let’s go through the output one after another:

Multimodel Inference: Final Results. This part of the output provides us with details about the fitted models. We see that the total number of \(2^4=16\) possible models has been fitted. We also see that the function used the corrected AIC (AICc) to compare the models.

Best 5 Models. Displayed here are the five models with the lowest AICc, sorted from low to high. Predictors are displayed as columns of the table, and models as rows. A number (the estimated coefficient) or a + sign (for categorical predictors) indicates that a predictor was used in the model, while empty cells indicate that the predictor was omitted in this model. We see that TE ~ 1 + continent + pubyear + reputation shows the best fit (\(AICc=6.0\)). But we also see that models with other predictor combinations come very close to this value. Thus, it is hard to say which model really is the “best” model. Nevertheless, we see that all of the best five models contain the predictor pubyear, suggesting that this variable might be particularly important.

Multimodel Inference Coefficients. Here, we can see the coefficients of all predictors we used for our models, aggregated over all models in which they appear. We see that the Estimate is largest for pubyear (\(b=0.378\)), which corroborates our finding from before. Approximate confidence intervals can be obtained by subtracting and adding the value stored in Std. Error, multiplied by 1.96, from/to the Estimate.

Model-averaged importance of terms plot. In the plot, we see the averaged importance of each predictor across all models displayed as a bar chart. We again see that pubyear is the most important predictor, followed by continent, reputation, and quality.
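The AICc, delta and weight columns of the “Best 5 Models” table can be reproduced by hand from the logLik and df columns. The sketch below does this in base R; the logLik values are read off the table above (note that the last two must be negative for the printed AICc values to add up), and \(n=36\) is the number of studies in mvreg.data (consistent with \(df2 = 33 = 36 - 2 - 1\) in the permutation test output earlier).

```r
# logLik and df of the five best models, read off the model selection table;
# the last two logLik values must be negative for the AICc values to add up
loglik <- c(2.981, 4.071, 0.646, -1.750, -0.395)
df     <- c(5, 6, 5, 4, 5)
n      <- 36  # number of studies in mvreg.data

# Small-sample corrected AIC: AICc = -2*logLik + 2*df + 2*df*(df+1)/(n-df-1)
aicc  <- -2 * loglik + 2 * df + (2 * df * (df + 1)) / (n - df - 1)
delta <- aicc - min(aicc)

# Akaike weights: exp(-delta/2), normalized to sum to one. Here they are
# normalized over the five models shown only, which is an approximation
# (the omitted eleven models contribute almost nothing in this example).
weight <- exp(-delta / 2) / sum(exp(-delta / 2))

round(cbind(AICc = aicc, delta = delta, weight = weight), 2)
```

This recovers the delta values (0.00, 0.72, 4.67, 6.75, 6.75) and weights (0.536, 0.375, 0.052, 0.018, 0.018) shown in the output, which makes clear that the weight column is simply a renormalized transformation of the AICc differences.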
This example should make clear that multimodel inference can be a useful way to obtain a comprehensive look at which predictors are more or less important for predicting differences in effect sizes. Despite avoiding some of the problems of stepwise regression methods, it should be noted that this method should still be seen as exploratory, and may be used when we have no prior knowledge on how our predictors are related to effect sizes in the research field we meta-analyze. If you decide to build a meta-regression model using the information from multimodel inference, it is essential to report that you have used this procedure before building the model, because your model will then not be based on an a priori hypothesis, but on statistical properties of your data.
The output of the multimodel.inference function displayed after running the function is only a small part of all results returned by the function. If we want to access all results, we can save them to an object (e.g. results <- multimodel.inference(...)), and then open them using the ‘dollar’ operator. The following results are returned, among other things:
all.models: the AICc statistics for all fitted models, ordered from best to worst.
top5.models: the top 5 models by AICc we already saw in the output of the function before.
multimodel.coef: the averaged multimodel coefficient estimates we already saw before.
predictor.importance: the predictor importance table we already saw before.
formula: the full formula used to conduct the multimodel analysis.
References
Higgins, Julian PT, and Simon G Thompson. 2004. “Controlling the Risk of Spurious Findings from Meta-Regression.” Statistics in Medicine 23 (11). Wiley Online Library: 1663–82.
Iniesta, R, D Stahl, and P McGuffin. 2016. “Machine Learning, Statistical Learning and the Future of Biological Research in Psychiatry.” Psychological Medicine 46 (12). Cambridge University Press: 2455–65.
Gigerenzer, Gerd. 2004. “Mindless Statistics.” The Journal of Socio-Economics 33 (5). Elsevier: 587–606.
Gigerenzer, Gerd. 2008. Rationality for Mortals: How People Cope with Uncertainty. Oxford University Press.
Higgins, Julian, Simon Thompson, Jonathan Deeks, and Douglas Altman. 2002. “Statistical Heterogeneity in Systematic Reviews of Clinical Trials: A Critical Appraisal of Guidelines and Practice.” Journal of Health Services Research & Policy 7 (1). SAGE Publications Sage UK: London, England: 51–61.
Mansfield, Edward R, and Billy P Helms. 1982. “Detecting Multicollinearity.” The American Statistician 36 (3a). Taylor & Francis: 158–60.
Berlin, Jesse A, and Elliott M Antman. 1992. “Advantages and Limitations of Meta-Analytic Regressions of Clinical Trials Data.” Controlled Clinical Trials 13 (5). Elsevier: 422.
Field, Andy, Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R. Sage Publications.
Whittingham, Mark J, Philip A Stephens, Richard B Bradbury, and Robert P Freckleton. 2006. “Why Do We Still Use Stepwise Modelling in Ecology and Behaviour?” Journal of Animal Ecology 75 (5). Wiley Online Library: 1182–9.
Chatfield, Chris. 1995. “Model Uncertainty, Data Mining and Statistical Inference.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 158 (3). Wiley Online Library: 419–44.
Viechtbauer, Wolfgang, and others. 2010. “Conducting Meta-Analyses in R with the metafor Package.” Journal of Statistical Software 36 (3): 1–48.
Good, Phillip. 2013. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer Science & Business Media.
Viechtbauer, Wolfgang, José Antonio López-López, Julio Sánchez-Meca, and Fulgencio Marín-Martínez. 2015. “A Comparison of Procedures to Test for Moderators in Mixed-Effects Meta-Regression Models.” Psychological Methods 20 (3). American Psychological Association: 360.