8.3 Multiple Meta-Regression

In the beginning of Chapter 8, we already introduced the basic idea behind meta-regression. We showed that, in general, a meta-regression has the following form:

\[\hat \theta_k = \theta + \beta_1x_{1k} + ... + \beta_nx_{nk} + \epsilon_k + \zeta_k\]

Where \(\hat \theta_k\) is the effect size estimate of a study \(k\), and the \(\beta\)s are the regression coefficients. There are two parameters in this formula which cause our effect size estimate to deviate from the “true” overall effect size: \(\epsilon_k\), the sampling error, and \(\zeta_k\), which denotes that the true effect size of the study is itself only sampled from an overarching distribution of true effect sizes. As we explained before, if we include \(\zeta_k\) in our model (i.e., if we assume that its variance \(\tau^2\) is greater than zero), the model we describe with this formula is a random-effects meta-regression model, also known as a mixed-effects model.

The idea behind multiple meta-regression

Previously, we only considered the scenario in which we use one predictor \(\beta_1x_1\) in our meta-regression (e.g., we want to check if the effect size a study reports depends on the year it was published). But imagine another scenario. Assume we have observed in our previous research that the effect sizes found in a certain research field depend on the prestige of the journal in which a study was published (e.g., higher effects are reported in journals with a high reputation). On the other hand, it also seems a priori reasonable that journals with a good reputation publish studies of high quality, which may also be associated with higher effect sizes. So to check if journal reputation is indeed associated with higher effect sizes, we have to make sure that this relationship is not confounded by the fact that prestigious journals publish studies of higher quality, which may be the true reason behind the relationship between journal reputation and effect size. This means we have to control for study quality (e.g., rated on a scale from 0 to 10) when we examine the relationship between journal prestige (e.g., operationalized as the impact factor of a journal) and effect size.

This, and many other possible scenarios, can be dealt with using multiple meta-regression. In multiple meta-regression, we use several predictors (variables) to predict (differences in) effect sizes. When we look back at the general meta-regression formula we defined before, we see that it already provides us with this feature through the \(\beta_nx_{nk}\) part. Here, the parameter \(n\) denotes that we can include \(n\) predictors/variables in our meta-regression, making it a multiple meta-regression.

While, statistically speaking, the \(n\) parameter gives us the possibility to include as many predictors into our multiple meta-regression model as we wish to, things are a little more tricky in reality. In the following, we will discuss a few important pitfalls in multiple meta-regression and how we can build multiple meta-regression models which are robust and trustworthy. But first, let us cover another important feature of multiple meta-regression: interactions.


Interactions

So far, in our multiple meta-regression model, we only considered the case where we have multiple predictor variables \(x_1,x_2, ... x_n\), and along with their predictor estimates \(\beta_n\), add them together to calculate our estimate of the true effect size \(\hat \theta_k\) for each study \(k\). In multiple meta-regression models, however, we can not only model such additive relationships. We can also model so-called interactions. Interactions mean that the relationship between one predictor variable (e.g., \(x_1\)) and the estimated effect size changes for different values of another predictor variable (e.g., \(x_2\)).

Imagine a scenario where we want to model two predictors and their relationship to the effect size: the publication year (\(x_1\)) of a study and the quality (\(x_2\)) of a study, which we rate like this:

\[\begin{equation} x_2 = \left \{\begin{array}{ll} 0: bad \\1: moderate \\2: good \end{array} \right. \end{equation}\]

As we described before, we can now imagine a meta-regression model in which we combine these two predictors \(x_1\) and \(x_2\) and assume an additive relationship. We can do this by simply adding them:

\[\hat \theta_k = \theta + \beta_1x_{1k} + \beta_2x_{2k} + \epsilon_k + \zeta_k\]

Let us assume that, overall, higher publication year (\(x_1\)) is associated with higher effect sizes (i.e., reported effect sizes have risen over the years). We could now ask ourselves if this positive relationship varies depending on the quality of the studies (\(x_2\)). For example, maybe this rise in effect sizes was strongest for high-quality studies, while effect sizes stayed mostly the same over the years for studies of lower quality? We can visualize our assumed relationship between effect size (\(\hat \theta_k\)), publication year (\(x_1\)) and study quality (\(x_2\)) the following way:

To examine such questions, we can add an interaction term to our meta-regression model. This interaction term lets the predictions based on \(x_1\) vary for different values of \(x_2\) (and vice versa). We can denote this additional relationship in our model by introducing a third regression coefficient, \(\beta_3\), which quantifies the interaction \(x_{1k}x_{2k}\) we want to test in our model:

\[\hat \theta_k = \theta + \beta_1x_{1k} + \beta_2x_{2k} + \beta_3x_{1k}x_{2k}+ \epsilon_k + \zeta_k\]
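To see what the interaction term does, we can rearrange the formula with respect to \(x_{1k}\); the slope of the publication year predictor then depends on the value of the quality rating:

```latex
\hat \theta_k = \theta + \underbrace{(\beta_1 + \beta_3 x_{2k})}_{\text{slope of } x_{1k}}\, x_{1k} + \beta_2 x_{2k} + \epsilon_k + \zeta_k
```

Using the quality coding from before, the slope of \(x_{1k}\) is \(\beta_1\) for bad studies (\(x_2=0\)), \(\beta_1+\beta_3\) for moderate studies (\(x_2=1\)), and \(\beta_1+2\beta_3\) for good studies (\(x_2=2\)).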

8.3.1 Common pitfalls of multiple meta-regression models

As we have mentioned before, multiple meta-regression, while very useful when applied properly, comes with certain caveats we have to know and consider when fitting a model. Indeed, some argue that (multiple) meta-regression is often improperly used and interpreted in practice, leading to a low validity of many meta-regression models (Higgins and Thompson 2004). Thus, there are some points we have to keep in mind when fitting multiple meta-regression models, which we will describe in the following.

Overfitting: seeing a signal when there is none

To better understand the risks of (multiple) meta-regression models, we have to understand the concept of overfitting. Overfitting occurs when we build a statistical model which fits the data too closely. In essence, this means that we build a statistical model which can predict the data at hand very well, but performs badly at predicting future data it has never seen before. This happens when our model assumes that some variation in our data stems from a true “signal”, when in fact we are only modeling random noise (Iniesta, Stahl, and McGuffin 2016). As a result, our statistical model produces false-positive results: it sees relationships where there are none.

Illustration of an overfitted model vs. model with a good fit


Regression methods, which usually utilize minimization or maximization procedures such as Ordinary Least Squares or Maximum Likelihood estimation, can be prone to overfitting (Gigerenzer 2004, 2008). Unfortunately, the risk of building a non-robust model which produces false-positive results is even higher once we go from conventional regression to meta-regression (Higgins and Thompson 2004). There are several reasons for this:

  1. In meta-regression, our sample is usually small, as we can only use the synthesized data of all \(k\) analyzed studies.
  2. As our meta-analysis aims to be a comprehensive overview of all evidence, we have no additional data on which we can “test” how well our regression model can predict high or low effect sizes.
  3. In meta-regressions, we have to deal with the potential presence of effect size heterogeneity (see Chapter 6). Imagine a case in which we have two studies with different effect sizes and non-overlapping confidence intervals. Every variable which has different values for these two studies might be a potential explanation for the effect size difference we find, while it seems straightforward that most such explanations are spurious (Higgins and Thompson 2004).
  4. Meta-regression as such, and multiple meta-regression in particular, make it very easy to “play around” with predictors. We can test numerous meta-regression models, include many more predictors or remove them in an attempt to explain the heterogeneity in our data. Such an approach is of course tempting, and often found in practice, because we, as meta-analysts, want to find a significant model which explains why effect sizes differ (Julian Higgins et al. 2002). However, such behavior has been shown to massively increase the risk of spurious findings (Higgins and Thompson 2004), because we can change parts of our model indefinitely until we find a significant model, which is then very likely to be overfitted (i.e., it mostly models statistical noise).

Some guidelines have been proposed to avoid an excessive false positive rate when building meta-regression models:

  • Minimize the number of investigated predictors. In multiple meta-regression, this translates to the concept of parsimony, or simplicity: when evaluating the fit of a meta-regression model, we prefer models which achieve a good fit with fewer predictors. Estimators such as the Akaike and Bayesian Information Criterion can help with such decisions. We will get back to this topic below.
  • Predictor selection should be based on predefined scientific or theoretical questions we want to answer in our meta-regression. It is therefore crucial to define in the protocol, before the start of the study, which predictors will be examined in the meta-regression analyses.
  • When the number of studies is low (which is very likely to be the case), and we want to compute the significance of a predictor, we should use the Knapp-Hartung adjustment to obtain more reliable estimates (Julian Higgins et al. 2002).
  • We can use permutation to assess the robustness of our model in resampled data. We will describe the details of this method later.


Multicollinearity

Multicollinearity means that one or more predictors in our regression model can be (linearly) predicted from the other predictors in our model with relatively high accuracy (Mansfield and Helms 1982). This basically means that we have two or more predictors in our model which are highly correlated. Most of the dangers of multicollinearity are associated with the problem of overfitting described above. High collinearity can cause our predictor coefficient estimates \(b\) to behave erratically and change considerably with minor changes in our data. It also limits the amount of variance the model can explain (in our case, the \(R^2\) analog).

Multicollinearity in meta-regression is common (Berlin and Antman 1992). Although multiple regression can handle lower degrees of collinearity, we should check and, if necessary, control for very highly correlated predictors. There is no consolidated yes/no rule for the presence of multicollinearity. A crude, but often effective way is to check for very high correlations (i.e., \(r\geq0.8\)) before fitting the model. Multicollinearity can then be reduced by either (1) removing one of the close-to-redundant predictors, or (2) trying to combine the predictors into one single predictor.

8.3.2 Model building methods

When building a multiple regression model, there are different approaches through which we can select and insert predictors into our model. Here, we discuss the most important ones along with their strengths and weaknesses:

  • Forced entry. In forced entry methods, all relevant predictors are forced into the regression model simultaneously. In most functions in R, this is the default setting. Although this is a generally recommended procedure (Field, Miles, and Field 2012), keep in mind that all predictors to use in forced entry should still be based on a predefined, theory-led decision.
  • Hierarchical. Hierarchical multiple regression means including predictors into our regression model stepwise, but based on a clearly defined scientific rationale. First, only predictors which have been associated with effect size differences in previous research are included, in the order of their importance. After this step, novel predictors can be added to explore if these variables explain heterogeneity which has not yet been captured by the known predictors.
  • Stepwise. Stepwise methods mean that variables/predictors are added to the model one after another. At first glance, this sounds a lot like hierarchical regression, but there is a crucial difference: stepwise regression methods select predictors based on a statistical criterion. In a procedure called forward selection, the variable explaining the largest amount of variability in the data is used as the first predictor. This process is then repeated for the remaining variables, each time selecting the variable which explains most of the remaining (unexplained) variability in the data as the next predictor, which is then added to the model. Furthermore, there is a procedure called backward selection, in which all variables are used as predictors in the model first, and then removed successively based on a predefined statistical criterion. There is an extensive literature discouraging the usage of stepwise methods (Whittingham et al. 2006; Chatfield 1995), and if we recall the common pitfalls of multiple regression models we presented above, it becomes clear that these methods have a high risk of producing spurious findings. Nevertheless, stepwise methods are still frequently used in practice, making it important to know that these procedures exist. If we use stepwise methods ourselves, however, it is advised to primarily do so in an exploratory fashion and to keep the limitations of this procedure in mind.
  • Multimodel inference. Multimodel methods differ from stepwise methods as they do not try to successively build the one “best” model explaining most of the variance. Instead, in this procedure, all possible combinations of a predefined selection of predictors are modeled and evaluated using a criterion such as Akaike’s Information Criterion, which rewards simpler models. This enables a full examination of all possible models and how they perform. A common finding using this procedure is that there are many different kinds of predictor combinations within a model which lead to a good fit. In multimodel inference, the estimated coefficients of predictors can then be synthesized across all possible models to infer how important certain predictors are overall.

8.3.3 Using metafor to compute Multiple Meta-Regressions

After all this input, it is now time to fit our first multiple regression models. To do this, we will not use the meta package we have used so far. Instead, we will use the metafor package (Viechtbauer and others 2010). This package provides a vast array of advanced functionalities for meta-analysis, along with great documentation. We will also use this package in a later chapter where we focus on Multilevel Meta-Analysis, so it is worth getting accustomed to the inner workings of this package. So, to begin, let us make sure we have metafor installed and loaded in our library.
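If metafor is not installed on your machine yet, one way to set it up looks like this:

```r
# Install (if necessary) and load the metafor package
install.packages("metafor")
library(metafor)
```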


For our multiple meta-regression examples, we will use the mvreg.data dataset. This is a “toy” dataset, which we simulated for illustrative purposes. Let’s have a look at the structure of the data:
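The output shown below can be produced with R’s str function (assuming the dataset is already loaded into your environment):

```r
# Inspect the structure of the simulated dataset
str(mvreg.data)
```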

## 'data.frame':    36 obs. of  6 variables:
##  $ yi        : num  0.0944 0.0998 0.1693 0.1751 0.273 ...
##  $ sei       : num  0.196 0.192 0.119 0.116 0.165 ...
##  $ reputation: num  -11 0 -11 4 -10 -9 -8 -8 -8 0 ...
##  $ quality   : num  6 9 5 9 2 10 6 3 10 3 ...
##  $ pubyear   : num  -0.855 -0.753 -0.66 -0.563 -0.431 ...
##  $ continent : chr  "North America" "Europe" "North America" "Europe" ...

We see that there are 6 variables in our dataset. The yi and sei columns store the effect size and standard error of a particular study. Thus, these columns correspond to the TE and seTE columns we used before. We have named these variables this way because this is the standard notation that metafor uses: yi corresponds to the effect size \(y_i\) we want to predict in our meta-regression, while sei is \(SE_i\), the standard error. To designate the variance of an effect size, metafor uses vi, or \(v_i\) in mathematical notation, which we do not need here because yi and sei contain all the information we need.

The other four variables we have in our dataset are potential predictors for our meta-regression. We want to check if reputation, the (mean-centered) impact score of the journal the study was published in, quality, the quality of the study rated from 0 to 10, pubyear, the (standardized) publication year, and continent, the continent in which the study was performed, are associated with different effect sizes.

For continent, note that we store information as a predictor with 2 labels: Europe and North America, meaning that this predictor is a dummy variable. Always remember that such dummy variables have to be converted from a chr to a factor vector before we can proceed.

mvreg.data$continent = factor(mvreg.data$continent)

Checking for multicollinearity

As we mentioned before, we have to check for multicollinearity of our predictors to make sure that our meta-regression model coefficient estimates are robust. A quick way to check for high correlation is to calculate an intercorrelation matrix for all continuous variables with the following code:
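A minimal way to produce such a matrix, assuming the three continuous predictors are stored under the column names shown before, is base R’s cor function:

```r
# Intercorrelation matrix of the continuous predictors
cor(mvreg.data[, c("reputation", "quality", "pubyear")])
```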

##            reputation    quality    pubyear
## reputation  1.0000000  0.3015694  0.3346594
## quality     0.3015694  1.0000000 -0.1551123
## pubyear     0.3346594 -0.1551123  1.0000000

The PerformanceAnalytics package provides the chart.Correlation function which we can use to visualize the intercorrelations. Make sure to install the PerformanceAnalytics package first, and then use this code:
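A sketch of the call (the column selection is our assumption; any subset of the continuous variables works):

```r
library(PerformanceAnalytics)

# Scatterplots, histograms and correlation coefficients in one chart
chart.Correlation(mvreg.data[, c("reputation", "quality", "pubyear")])
```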


We see that our variables are indeed correlated, but probably not to a degree that would warrant excluding one of them.

Fitting a meta-regression model without interaction terms

Now, let us fit our first meta-regression model, for now without assuming any interactions between predictors. Previously, we wanted to explore if a high reputation of the journal a study was published in predicts higher effect sizes. Let us also assume we already know very well from previous literature that the quality of a study is also associated with differences in effects. We can now perform a hierarchical regression where we first include our known predictor quality, and then check if reputation explains heterogeneity beyond that while taking quality into account.

To do this, we use the rma function in metafor to perform a random-effects meta-regression. This function has countless parameters, but we will only concentrate on the following (to see all parameters, type ?rma into your Console):

Parameter Function
yi The column of our dataset in which the effect size for each study is stored.
sei The column of our dataset in which the standard error of the effect size for each study is stored.
data The dataset with our meta-analysis data.
method The estimator for tau-squared we want to use. For meta-regression models, it is advised to use the maximum likelihood (‘ML’) or restricted maximum likelihood (‘REML’) estimator.
mods This parameter defines our meta-regression model. First, we specify our model with ‘~’ (a tilde). Then, we add the predictors we want to include, separating them with ‘+’ (e.g., ‘variable1 + variable2’). Interactions between two variables are denoted by ’variable1*variable2’.
test The test we want to apply for our regression coefficients. We can choose from ‘z’ (default) and ‘knha’ (Knapp-Hartung method).

First, let us perform a meta-regression using only quality as the predictor along with Knapp-Hartung adjustments. We save the results as model1, and then inspect the output.

model1 <- rma(yi = yi, 
              sei = sei, 
              data = mvreg.data, 
              method = "ML", 
              mods = ~ quality, 
              test = "knha")
## Mixed-Effects Model (k = 36; tau^2 estimator: ML)
## tau^2 (estimated amount of residual heterogeneity):     0.0667 (SE = 0.0275)
## tau (square root of estimated tau^2 value):             0.2583
## I^2 (residual heterogeneity / unaccounted variability): 60.04%
## H^2 (unaccounted variability / sampling variability):   2.50
## R^2 (amount of heterogeneity accounted for):            7.37%
## Test for Residual Heterogeneity:
## QE(df = 34) = 88.6130, p-val < .0001
## Test of Moderators (coefficient 2):
## F(df1 = 1, df2 = 34) = 3.5330, p-val = 0.0688
## Model Results:
##          estimate      se    tval    pval    ci.lb   ci.ub 
## intrcpt    0.3429  0.1354  2.5318  0.0161   0.0677  0.6181  * 
## quality    0.0356  0.0189  1.8796  0.0688  -0.0029  0.0740  . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output, we can see the results for our predictor quality under Model Results, along with the values for our regression intercept (intrcpt). We see that the \(p\)-value for our predictor is non-significant (\(p=0.0688\)), but that there is a trend for an association, because \(p<0.1\). Under Test of Moderators (coefficient 2), we can see the overall test results for our regression model (\(F_{1,34}=3.53, p=0.0688\)). Because we only included one predictor, the \(p\)-value reported there is identical to the one we saw before. In total, our model explains \(R^2=7.37\%\) of the heterogeneity in our data, which we can see in the R^2 (amount of heterogeneity accounted for) line of our output.

Now, let us check if we can improve our model by including reputation as predictor. We add + reputation in our argument to mods and save the output to model2.

model2 <- rma(yi = yi, 
              sei = sei, 
              data = mvreg.data, 
              method = "ML", 
              mods = ~ quality + reputation, 
              test = "knha")
## Mixed-Effects Model (k = 36; tau^2 estimator: ML)
## tau^2 (estimated amount of residual heterogeneity):     0.0238 (SE = 0.0161)
## tau (square root of estimated tau^2 value):             0.1543
## I^2 (residual heterogeneity / unaccounted variability): 34.62%
## H^2 (unaccounted variability / sampling variability):   1.53
## R^2 (amount of heterogeneity accounted for):            66.95%
## Test for Residual Heterogeneity:
## QE(df = 33) = 58.3042, p-val = 0.0042
## Test of Moderators (coefficients 2:3):
## F(df1 = 2, df2 = 33) = 12.2476, p-val = 0.0001
## Model Results:
##             estimate      se    tval    pval    ci.lb   ci.ub 
## intrcpt       0.5005  0.1090  4.5927  <.0001   0.2788  0.7222  *** 
## quality       0.0110  0.0151  0.7312  0.4698  -0.0197  0.0417      
## reputation    0.0343  0.0075  4.5435  <.0001   0.0189  0.0496  *** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We now see that a new line appears under Model Results displaying the results for our reputation predictor. We see that a coefficient of \(b=0.0343\) was calculated for reputation. We also see that this predictor is highly significant (\(p<0.0001\)) and positively associated with the effect sizes. We also see that the meta-regression model itself is significant (\(F_{2,33}=12.25, p<0.0001\)), explaining \(R^2=66.95\%\) of the heterogeneity. This means that journal reputation is associated with higher effect sizes, even when controlling for study quality.

But is model2 indeed better than our first model, model1? To assess this, we can use the anova function, feeding it with the two models we want to compare.

anova(model1, model2)
##         df     AIC     BIC    AICc   logLik     LRT   pval      QE  tau^2 
## Full     4 19.4816 25.8157 20.7720  -5.7408                58.3042 0.0238 
## Reduced  3 36.9808 41.7314 37.7308 -15.4904 19.4992 <.0001 88.6130 0.0667 
##              R^2 
## Full 
## Reduced 64.3197%

This function performs a model test and provides us with several statistics to assess if model2 has a better fit than model1. Here we compare our full model, model2, which includes both quality and reputation, with our Reduced model, which only uses quality as a predictor.

The anova function performs a Likelihood Ratio Test, the results of which we can see in the LRT column. The test is highly significant (\(\chi^2_1=19.50, p<0.001\)), which means that our full model indeed provides a better fit. Another important statistic is reported in the AICc column. This provides us with Akaike’s Information Criterion, corrected for small samples. As we mentioned before, AICc penalizes complex models with more predictors to avoid overfitting. It is important to note that lower AIC(c) values mean that a model performs better. Interestingly, in our output, we see that the full model (\(AICc=20.77\)) has a better AIC value than the reduced model (\(AICc=37.73\)), despite having more predictors. All of this suggestsests that our multiple meta-regression model does indeed provide a good fit to our data.

Modeling interaction terms

Let us say we want to model an interaction hypothesis with our additional predictors pubyear (publication year) and continent, where we assume that the relationship between publication year and effect size differs for Europe and North America. To model this in our rma function, we have to connect our predictors with * in the mods parameter. Because we do not want to compare the models directly using the anova function, we specify the \(\tau^2\) estimator to be "REML" (restricted maximum likelihood) this time:

interaction.model <- rma(yi = yi,
                         sei = sei,
                         data = mvreg.data,
                         method = "REML", 
                         mods = ~ pubyear*continent, 
                         test = "knha")
## Mixed-Effects Model (k = 36; tau^2 estimator: REML)
## tau^2 (estimated amount of residual heterogeneity):     0 (SE = 0.0098)
## tau (square root of estimated tau^2 value):             0
## I^2 (residual heterogeneity / unaccounted variability): 0.00%
## H^2 (unaccounted variability / sampling variability):   1.00
## R^2 (amount of heterogeneity accounted for):            100.00%
## Test for Residual Heterogeneity:
## QE(df = 32) = 24.8408, p-val = 0.8124
## Test of Moderators (coefficients 2:4):
## F(df1 = 3, df2 = 32) = 28.7778, p-val < .0001
## Model Results:
##                                 estimate      se    tval    pval    ci.lb 
## intrcpt                           0.3892  0.0421  9.2472  <.0001   0.3035 
## pubyear                           0.1683  0.0834  2.0184  0.0520  -0.0015 
## continentNorth America            0.3986  0.0658  6.0539  <.0001   0.2645 
## pubyear:continentNorth America    0.6323  0.1271  4.9754  <.0001   0.3734 
##                                  ci.ub 
## intrcpt                         0.4750  *** 
## pubyear                         0.3380    . 
## continentNorth America          0.5327  *** 
## pubyear:continentNorth America  0.8911  *** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The last line, pubyear:continentNorth America, contains the coefficient for our interaction term. Note that metafor automatically includes not only the interaction term, but also both pubyear and continent as “normal” lower-order predictors (as one should do). Also note that, as continent is a factor, rma detected that this is a dummy predictor, and used our category Europe as the \(D=0\) dummy against which the North America category is compared. We see that our interaction term pubyear:continentNorth America has a positive coefficient (\(b=0.6323\)), and that it is highly significant (\(p<0.0001\)), meaning that the assumed interaction effect might in fact exist: there is an increase in effect sizes in recent years, and it is stronger for studies conducted in North America. We also see that the model we fitted explains \(R^2=100\%\) of our heterogeneity. This is because our data was simulated for illustrative purposes. In practice, you will hardly ever explain all of the heterogeneity in your data (in fact, one should rather be concerned when finding such results in real-life data, as this might mean the model is overfitted).

Testing the robustness of our model through a permutation test

The idea behind permutation

Permutation is a mathematical operation in which we take a set containing numbers or objects, and iteratively draw elements from this set to put them into a sequential order. When we already have an ordered set of numbers, this equals a process in which we rearrange, or shuffle, the order of our data. As an example, imagine we have a set \(S\) containing three numbers: \(S=\{1,2,3 \}\). One possible permutation of this set is \((2,1,3)\); another is \((3,2,1)\). We see that both permuted results contain all three numbers from before, but in a different order.

Permutation can also be used to perform permutation tests, a specific type of resampling method. Broadly speaking, resampling methods are used to validate the robustness of a statistical model by providing it with (slightly) different data sampled from the same data source or generative process (Good 2013). This is a way to better assess if the coefficients in our model indeed capture a true pattern underlying our data, or if we overfitted our model, thereby falsely assuming patterns in our data when they are in fact statistical noise.

Permutation tests do not require us to have a spare “test” dataset on which we can evaluate how our meta-regression actually performs in predicting effect sizes. For this reason, among others, permutation tests have been recommended to assess the robustness of our meta-regression models (Higgins and Thompson 2004).

We will not go too much into the details of how a permutation test is performed for meta-regression models, but the most important part is that we re-calculate the \(p\)-values of our overall meta-regression model and its coefficients based on the test statistics obtained across all possible, or many randomly selected, permutations of our original dataset (or, to be more precise, of our meta-regression model matrix \(X_{ij}\)). The crucial indicator here is how often the test statistic we obtain from the permuted data is equal to or greater than our original test statistic. For example, if our test statistic is greater than or equal to the original test statistic in 50 of 1000 permuted datasets, we get a p-value of \(p=0.05\) (Viechtbauer et al. 2015).
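The core of this logic can be sketched in base R with a simple (non-meta-analytic) toy example; all numbers below are made up purely for illustration:

```r
# Toy sketch of a permutation test: shuffle the predictor, refit a simple
# linear model each time, and count how often the permuted test statistic
# is at least as extreme as the original one.
set.seed(123)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0.2, 0.3, 0.5, 0.4, 0.7, 0.6, 0.9, 0.8)

# Absolute t statistic of the predictor in the original data
t.orig <- abs(summary(lm(y ~ x))$coefficients["x", "t value"])

n.perm <- 1000
t.perm <- replicate(n.perm, {
  x.shuffled <- sample(x)  # permute the order of the predictor values
  abs(summary(lm(y ~ x.shuffled))$coefficients["x.shuffled", "t value"])
})

# Permutation p-value: proportion of permuted statistics >= the original
p.perm <- mean(t.perm >= t.orig)
p.perm
```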

To perform a permutation test on our meta-regression model, we can use metafor’s in-built permutest function. As an example, we will now recalculate the results of the model2 model we fitted above. We only have to supply the permutest function with the rma object, but be aware that the permutation test can be computationally intensive, especially for large datasets, so the function might need some time to run.
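Assuming model2 is still in our environment, the call is simply:

```r
# Permutation test for the model2 meta-regression (may take a moment)
permutest(model2)
```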

## Test of Moderators (coefficients 2:3):
## F(df1 = 2, df2 = 33) = 12.2476, p-val* = 0.0010
## Model Results:
##             estimate      se    tval   pval*    ci.lb   ci.ub 
## intrcpt       0.5005  0.1090  4.5927  0.1810   0.2788  0.7222      
## quality       0.0110  0.0151  0.7312  0.4290  -0.0197  0.0417      
## reputation    0.0343  0.0075  4.5435  0.0010   0.0189  0.0496  *** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We again see our familiar output, including the results for all predictors. Looking at the pval* column, we see that the \(p\)-value for the reputation predictor has increased from \(p<0.0001\) to \(p^*=0.001\). As this \(p\)-value is still highly significant, this indicates that our model might indeed capture a real pattern underlying our data. It has been recommended to always run this permutation test on a meta-regression model before reporting it as significant in our research (Higgins and Thompson 2004).

Permutation tests in small datasets

Please note that when the number of studies \(k\) included in our model is small, conventionally used thresholds for statistical significance (i.e., \(p<0.05\)) cannot be reached. For meta-regression models, a permutation test using permutest will only be able to reach statistical significance if \(k \geq 5\). More details on the permutest function can be found here.

Multimodel Inference

Previously, we already described that one can also try to model all possible predictor combinations in a procedure called multimodel inference to examine which possible predictor combinations provide the best fit, and which predictors are the most important ones overall.
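The combinatorics behind this procedure are straightforward: with \(p\) candidate predictors, each predictor is either included in or excluded from a model, giving \(2^p\) possible predictor combinations. A minimal base R sketch (the predictor names simply mirror the ones used in this chapter):

```r
# With 4 candidate predictors, each either included (TRUE) or excluded
# (FALSE), there are 2^4 = 16 candidate models to fit and compare.
predictors <- c("pubyear", "quality", "reputation", "continent")
combos <- expand.grid(rep(list(c(FALSE, TRUE)), length(predictors)))
names(combos) <- predictors
nrow(combos)  # 16 combinations, including the intercept-only model
```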

To perform multimodel inference, you can use the multimodel.inference function we prepared for you. This function is part of the dmetar package. If you have the package installed already, you have to load it from your library first.


If you don’t want to use the dmetar package, you can find the source code for this function here. In this case, R doesn’t know this function yet, so we have to define it first by copying and pasting the code in its entirety into the console in the bottom left pane of RStudio, and then hitting Enter ⏎. The function requires the MuMIn, ggplot2 and metafor packages to work.

For this function, the following parameters need to be specified:

Parameter Function
TE The precalculated effect size for each study. Must be supplied as the name of the effect size column in the dataset (in quotation marks; e.g. TE = ‘effectsize’).
seTE The precalculated standard error for each study. Must be supplied as the name of the standard error column in the dataset (in quotation marks; e.g. seTE = ‘se’).
data A data.frame containing columns for the effect size, standard error and meta-regression predictors of each study/effect.
predictors A concatenated array of characters specifying the predictors to be used for multimodel inference. Names of the predictors must be identical to the column names of the data.frame supplied to data.
method Meta-analysis model to use for pooling effect sizes. Use ‘FE’ for the fixed-effect model. Different estimators for random-effects models are available: ‘DL’, ‘HE’, ‘SJ’, ‘ML’, ‘REML’, ‘EB’, ‘HS’, ‘GENQ’. If ‘FE’ is used, the test argument is automatically set to ‘z’, as the Knapp-Hartung method is not meant to be used with fixed-effect models. Default is ‘REML’, and it is strongly advised to remain with this option to use a standard (mixed-effects) meta-regression model.
test Method to use for computing test statistics and confidence intervals. Default is ‘knha’ which uses the Knapp-Hartung (Knapp & Hartung, 2003) adjustment method. ‘Conventional’ Wald-type tests and CIs are calculated by setting this argument to ‘z’. When method=‘FE’, this argument is set to ‘z’ automatically as the Knapp-Hartung method was not meant to be used with fixed-effect models.
eval.criterion Evaluation criterion to sort the multiple models by. Can be either ‘AICc’ (default; corrected Akaike’s Information Criterion), ‘AIC’ (Akaike’s Information Criterion) or ‘BIC’ (Bayesian Information Criterion).
interaction If set to FALSE (default), no interactions between predictors are considered. Setting this parameter to TRUE means that all interactions are modeled.
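As a side note, the AICc used as the default evaluation criterion is simply the AIC with a small-sample correction, which matters here because meta-analyses often include only few studies. A sketch of the correction, with \(k\) model parameters and \(n\) observations:

```r
# AICc = AIC + 2k(k + 1) / (n - k - 1); the extra penalty shrinks toward
# zero as the number of observations n grows.
aicc <- function(aic, k, n) aic + (2 * k * (k + 1)) / (n - k - 1)

aicc(aic = 10, k = 3, n = 30)   # 10 + 24/26, roughly 10.92
aicc(aic = 10, k = 3, n = 3000) # nearly identical to the plain AIC
```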

Now, let us perform multimodel inference for all possible predictor combinations using all predictors in the mvreg.data dataset, excluding interaction terms. Be aware that this can take some time, especially if the number of predictors is large.

multimodel.inference(TE = "yi", 
                     seTE = "sei",
                     data = mvreg.data,
                     predictors = c("pubyear", "quality", "reputation", "continent"),
                     interaction = FALSE)
## Multimodel Inference: Final Results
## --------------------------
##  - Number of fitted models: 16
##  - Full formula: ~ pubyear + quality + reputation + continent
##  - Coefficient significance test: knha
##  - Interactions modeled: no
##  - Evaluation criterion: AICc 
## Best 5 Models
## --------------------------
## Global model call: metafor::rma(yi = TE, sei = seTE, mods = form, data = glm.data, 
##     method = method, test = test)
## ---
## Model selection table 
##    (Intrc) cntnn  pubyr   qulty   rpttn df logLik AICc delta weight
## 12       +     + 0.3533         0.02160  5  2.981  6.0  0.00  0.536
## 16       +     + 0.4028 0.02210 0.01754  6  4.071  6.8  0.72  0.375
## 8        +     + 0.4948 0.03574          5  0.646 10.7  4.67  0.052
## 11       +       0.2957         0.02725  4 -1.750 12.8  6.75  0.018
## 15       +       0.3547 0.02666 0.02296  5 -0.395 12.8  6.75  0.018
## Models ranked by AICc(x) 
## Multimodel Inference Coefficients
## --------------------------
##              Estimate  Std. Error   z value  Pr(>|z|)
## intrcpt    0.38614661 0.106983583 3.6094006 0.0003069
## continent1 0.24743836 0.083113174 2.9771256 0.0029096
## pubyear    0.37816796 0.083045572 4.5537402 0.0000053
## reputation 0.01899347 0.007420427 2.5596198 0.0104787
## quality    0.01060060 0.014321158 0.7402055 0.4591753
## Predictor Importance
## --------------------------
##        model importance
## 1    pubyear  0.9988339
## 2  continent  0.9621839
## 3 reputation  0.9428750
## 4    quality  0.4432826

Let’s go through the output one after another:

  • Multimodel Inference: Final Results. This part of the output provides us with details about the fitted models. We see that all \(2^4=16\) possible models have been fitted. We also see that the function used the corrected AIC (AICc) to compare the models.
  • Best 5 Models. Displayed here are the five models with the lowest AICc, sorted from low to high. Predictors are displayed as columns of the table, and models as rows. A number (the estimated coefficient, for continuous predictors) or a + sign (for categorical predictors) indicates that a predictor was included in the model, while empty cells indicate that the predictor was omitted in this model. We see that TE ~ 1 + continent + pubyear + reputation shows the best fit (\(AICc=6.0\)). But we also see that the other models with other predictor combinations come very close to this value. Thus, it is hard to say which model really is the “best” model. Nevertheless, we see that all of the best five models contain the predictor pubyear, suggesting that this variable might be particularly important.
  • Multimodel Inference Coefficients. Here, we can see the coefficients of all predictors we used for our models aggregated over all models in which they appear. We see that the coefficient Estimate is largest for pubyear (\(b=0.378\)), which corroborates our finding from before. Approximate confidence intervals can be obtained by subtracting and adding the value stored in Std.Error, multiplied by 1.96, from/to Estimate.
  • Predictor Importance. In this table (and in the model-averaged importance plot the function generates), we see the averaged importance of each predictor across all models. We again see that pubyear is the most important predictor, followed by continent, reputation, and quality.
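For example, an approximate 95% confidence interval for the model-averaged pubyear coefficient can be computed directly from the Estimate and Std. Error values reported above:

```r
# Values taken from the 'Multimodel Inference Coefficients' output above.
estimate  <- 0.37816796
std_error <- 0.08304557

# Approximate Wald-type 95% CI: Estimate +/- 1.96 * Std. Error
ci <- estimate + c(-1.96, 1.96) * std_error
round(ci, 3)  # c(0.215, 0.541): the interval excludes zero
```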

This example should make clear that multimodel inference can be a useful way to obtain a comprehensive look at which predictors are more or less important for predicting differences in effect sizes. Although it avoids some of the problems of stepwise regression methods, it should be noted that this method should still be seen as exploratory, and may be used when we have no prior knowledge on how our predictors are related to effect sizes in the research field we meta-analyze. If you decide to build a meta-regression model using the information from multimodel inference, it is essential to report that you have used this procedure before building the model, because your model will then not be based on an a priori hypothesis, but on statistical properties.

The output of the multimodel.inference function displayed after running it is only a small part of all results returned by the function. If we want to access all results, we can save them to an object (e.g. results <- multimodel.inference(…)), and then open them using the ‘dollar’ operator. The following results are returned, among other things:

  • all.models: the AICc statistics for all fitted models, ordered from best to worst.
  • top5.models: the top 5 models by AICc we already saw in the output of the function before.
  • multimodel.coef: the averaged multimodel coefficient estimates we already saw before.
  • predictor.importance: The predictor importance table we already saw before.
  • formula: The full formula used to conduct the multimodel analysis.


Higgins, Julian PT, and Simon G Thompson. 2004. “Controlling the Risk of Spurious Findings from Meta-Regression.” Statistics in Medicine 23 (11). Wiley Online Library: 1663–82.

Iniesta, R, D Stahl, and P McGuffin. 2016. “Machine Learning, Statistical Learning and the Future of Biological Research in Psychiatry.” Psychological Medicine 46 (12). Cambridge University Press: 2455–65.

Gigerenzer, Gerd. 2004. “Mindless Statistics.” The Journal of Socio-Economics 33 (5). Elsevier: 587–606.

Gigerenzer, Gerd. 2008. Rationality for Mortals: How People Cope with Uncertainty. Oxford University Press.

Higgins, Julian, Simon Thompson, Jonathan Deeks, and Douglas Altman. 2002. “Statistical Heterogeneity in Systematic Reviews of Clinical Trials: A Critical Appraisal of Guidelines and Practice.” Journal of Health Services Research & Policy 7 (1). SAGE Publications Sage UK: London, England: 51–61.

Mansfield, Edward R, and Billy P Helms. 1982. “Detecting Multicollinearity.” The American Statistician 36 (3a). Taylor & Francis: 158–60.

Berlin, Jesse A, and Elliott M Antman. 1992. “Advantages and Limitations of Meta-Analytic Regressions of Clinical Trials Data.” Controlled Clinical Trials 13 (5). Elsevier: 422.

Field, Andy, Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R. Sage publications.

Whittingham, Mark J, Philip A Stephens, Richard B Bradbury, and Robert P Freckleton. 2006. “Why Do We Still Use Stepwise Modelling in Ecology and Behaviour?” Journal of Animal Ecology 75 (5). Wiley Online Library: 1182–9.

Chatfield, Chris. 1995. “Model Uncertainty, Data Mining and Statistical Inference.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 158 (3). Wiley Online Library: 419–44.

Viechtbauer, Wolfgang, and others. 2010. “Conducting Meta-Analyses in R with the Metafor Package.” Journal of Statistical Software 36 (3): 1–48.

Good, Phillip. 2013. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer Science & Business Media.

Viechtbauer, Wolfgang, José Antonio López-López, Julio Sánchez-Meca, and Fulgencio Marín-Martínez. 2015. “A Comparison of Procedures to Test for Moderators in Mixed-Effects Meta-Regression Models.” Psychological Methods 20 (3). American Psychological Association: 360.