## 2.5 Assessing Relative and Absolute Model Fit

For a given phylodynamic study, we typically wish (or need) to consider several candidate discrete-geographic models (where alternative models might specify a/symmetric rate matrices, different priors on the average dispersal rate and/or number of dispersal routes, etc.).
Comparing the fit of competing phylodynamic models to our study data offers two important benefits.
First, model-based inference—including phylodynamic inference—assumes that our inference model provides a reasonable description of the process that generated our study data; otherwise, our inferences—including estimates of relative and/or average dispersal rates and any summaries based on those parameter estimates (ancestral areas, dispersal histories, number of dispersal events, etc.)—are apt to be unreliable.
Second, comparing alternative discrete-geographic models provides a means to objectively test hypotheses regarding the history of dispersal (*i.e.*, by assessing the relative fit of our data to competing models that are specified to include/exclude a parameter relevant to the hypothesis under consideration).

To this end, `PrioriTree` implements methods to help you assess both the *relative* and *absolute* fit of discrete-geographic models to an empirical dataset.
In this section, we provide a brief description of the theory underlying these methods and their implementation in `PrioriTree`.

### 2.5.1 Marginal-Likelihood Estimation

We assess the relative fit of two or more candidate discrete-geographic models to a given dataset by computing Bayes factors, which are based on comparing the average fit (*i.e.*, the *marginal likelihood*) of competing models to that dataset.
The apparent simplicity of Bayes factors belies some rather challenging conceptual and computational issues.
Here, we begin by describing a relevant probability concept (*marginal likelihood*), then detail the numerical methods that we use to estimate marginal likelihoods (*stepping-stone simulation*), and then describe how to compute (and interpret the results of) *Bayes factors*.
Finally, we describe how to assess (and improve) the reliability of our marginal-likelihood estimates.

*Marginal likelihood*—.
The marginal likelihood can be a challenging concept.
In the simplest terms, it is the *average fit* of a model to a dataset.
More precisely, the marginal likelihood is the probability of observing the data (*i.e.*, the likelihood) averaged over all values for every parameter in the model, *weighted* by the prior probability of those parameter values (*i.e.*, it is the likelihood averaged over the joint prior probability distribution of the model parameters).
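To make this idea concrete, here is a minimal `Python` sketch (not part of `PrioriTree`; a toy binomial-likelihood, Beta-prior model rather than a discrete-geographic model) that approximates the marginal likelihood by simple Monte Carlo: averaging the likelihood over draws from the prior, and comparing the result to the known analytic answer for this conjugate model:

```python
import math
import random

random.seed(1)

x, n = 6, 10     # observed successes out of n Bernoulli trials (made-up data)
a, b = 1.0, 1.0  # Beta(a, b) prior on theta (uniform here)

def likelihood(theta):
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

# Marginal likelihood = likelihood averaged over the prior:
# P(x) = E_prior[ P(x | theta) ], approximated by simple Monte Carlo.
m = 200_000
ml_mc = sum(likelihood(random.betavariate(a, b)) for _ in range(m)) / m

# Analytic answer for this conjugate toy model: C(n, x) * B(a+x, b+n-x) / B(a, b)
beta_fn = lambda p, q: math.gamma(p) * math.gamma(q) / math.gamma(p + q)
ml_exact = math.comb(n, x) * beta_fn(a + x, b + n - x) / beta_fn(a, b)

print(ml_mc, ml_exact)  # both close to 1/11 ~ 0.0909
```

Note that this brute-force averaging only works for toy models with a handful of parameters; for realistic phylodynamic models, the prior mass rarely overlaps the likelihood, which is why the stepping-stone methods described below are needed.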

Recall that we’ve previously encountered this probability term (lurking in the denominator of Bayes theorem):

\[\begin{align*} \overbrace{P(\mathbf{Q}, \mu \mid G, \Psi)}^\text{posterior distribution} = \frac{ \overbrace{P(G \mid \mathbf{Q}, \mu, \Psi)}^\text{likelihood} \overbrace{P(\mathbf{Q}) P(\mu)}^\text{prior distribution} }{ \underbrace{P(G \mid \Psi)}_\textbf{marginal likelihood} }. \end{align*}\]

[Note that the marginal likelihood for model \(M_i\) is conditional on the model, *i.e.*, \(P(G \mid M_i)\).
By convention, however, this dependence is often suppressed or ignored, as in the equation above.]

You may also recall that the MCMC algorithms that we use to approximate the joint posterior probability density, \(P(\mathbf{Q}, \mu \mid G, \Psi)\), involve simulating a Markov chain that samples states—where each state, \(\theta\), is a fully specified model, \(\theta = \{ \Psi, \mathbf{Q}, \mu \}\)—based on their relative posterior probabilities:
\[\begin{align*}
R \propto \Bigg[
\underbrace{\frac{f(G \mid \theta^{\prime})}{f(G \mid \theta)}}_\text{likelihood ratio} \cdot
\underbrace{\frac{f(\theta^{\prime})}{f(\theta)}}_\text{prior ratio} \cdot
\underbrace{\frac{f(\theta\mid\theta^{\prime})}{f(\theta^{\prime}\mid\theta)}}_\text{proposal ratio}
\Bigg ]
= \underbrace{\frac{f(\theta^{\prime} \mid G)}{f(\theta \mid G)}}_{\text{posterior ratio}}.
%= \frac{\text{posterior probability of proposed state}}{\text{posterior probability of current state}}
\end{align*}\]
In other words, the MCMC algorithms that we use to estimate posterior probabilities of discrete-geographic model parameters from our data completely (and deliberately) avoid calculating the denominator of Bayes theorem; *i.e.*, the marginal likelihood that we need to compute Bayes factors!
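The following toy `Python` sketch (again, not part of `PrioriTree`; the same made-up binomial-likelihood, Beta-prior model as above) illustrates this point: a Metropolis–Hastings sampler only ever evaluates the *unnormalized* posterior inside a ratio, so the marginal likelihood cancels and is never computed:

```python
import math
import random

random.seed(2)

x, n = 6, 10
a, b = 1.0, 1.0

def log_unnorm_post(theta):
    # log likelihood + log prior, WITHOUT the marginal-likelihood denominator
    if not 0 < theta < 1:
        return -math.inf
    return (x + a - 1) * math.log(theta) + (n - x + b - 1) * math.log(1 - theta)

theta, samples = 0.5, []
for it in range(60_000):
    prop = theta + random.gauss(0, 0.1)  # symmetric proposal: proposal ratio = 1
    log_r = log_unnorm_post(prop) - log_unnorm_post(theta)  # posterior ratio only
    if random.random() < math.exp(min(0.0, log_r)):
        theta = prop
    if it >= 10_000:  # discard burn-in
        samples.append(theta)

post_mean = sum(samples) / len(samples)
print(post_mean)  # analytic posterior mean is (a + x) / (a + b + n) = 7/12 ~ 0.583
```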

*Estimating marginal likelihoods*—.
In order to estimate marginal likelihoods, we must resort to alternative numerical methods (*i.e.*, other than MCMC).
These methods are variously referred to as “stepping-stone” sampling or “power-posterior” sampling algorithms.
These algorithms essentially involve running a series of MCMC simulations over a sequence of “stones” that allow us to step from the joint posterior probability distribution to the joint prior probability distribution.
For each stone, \(i\), we raise the likelihood to a power, \(\beta_i\), such that the MCMC for that stone targets the distribution:

\[\begin{align*} P(\theta \mid G, \beta_i) = \frac{P(G \mid \theta)^{\beta_i} P(\theta)}{\int P(G \mid \theta)^{\beta_i} P(\theta) \, \mathrm{d}\theta}. \end{align*}\]

When \(\beta = 1\), the MCMC samples from the joint posterior probability distribution, and when \(\beta = 0\), it samples from the joint prior probability distribution. For intermediate values of \(\beta\) between 1 and 0, the MCMC samples from increasingly distorted (“heated”) versions of the posterior distribution.

To perform a stepping-stone simulation, we first need to specify the number of stones, \(k\), that we will use to span the posterior and prior distributions (we usually specify a relatively large number, *e.g.*, \(k \geq 32\)).
Next, we need to decide how to space our \(k\) stones.
The most common approach spaces the stones as \(k\) quantiles of a Beta probability distribution (where the quantiles divide the distribution into \(k-1\) intervals with equal probability).
The Beta distribution has two shape parameters; by convention, the second is set to one (as we only need one degree of freedom for this distribution).[^1]
We might, for example, also set the first shape parameter, \(\alpha\), to one, which specifies a uniform probability distribution (a special case of the Beta), such that the \(k\) quantiles would be uniformly spaced between the posterior and prior.
However, as we move from the posterior toward the prior, even though the difference between \(\beta_{i+1}\) and \(\beta_i\) stays constant, the overlap between consecutive power-posterior distributions, \(P(\theta \mid G, \beta_i)\) and \(P(\theta \mid G, \beta_{i+1})\), becomes increasingly small; the approximation performs poorly when this overlap is too small.
Following the `BEAST` default, `PrioriTree` specifies the sequence of \(\beta\) values as evenly spaced quantiles of a Beta\((0.3, 1.0)\) distribution (*i.e.*, \(\alpha = 0.3\)), so that more values of \(\beta\) are placed near 0 than near 1 (as originally recommended by Xie et al. (2011)).
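As an illustration of both the Beta\((0.3, 1.0)\) spacing and the stepping-stone idea, the `Python` sketch below (not part of `PrioriTree`) estimates the log marginal likelihood of the same toy binomial-likelihood, Beta-prior model used above. Because this toy model is conjugate, each power posterior is itself a Beta distribution and can be sampled directly, standing in for the per-stone MCMC that a real analysis would require:

```python
import math
import random

random.seed(3)

x, n = 6, 10
a, b = 1.0, 1.0

def log_lik(theta):
    return math.log(math.comb(n, x)) + x * math.log(theta) + (n - x) * math.log(1 - theta)

# Stones: evenly spaced quantiles of Beta(0.3, 1). Its CDF is p**0.3, so the
# quantile function is p**(1/0.3), which crowds the beta values near 0 (the
# prior end), as recommended by Xie et al. (2011).
k = 32
betas = [(i / (k - 1)) ** (1 / 0.3) for i in range(k)]  # 0 = prior ... 1 = posterior

# For this conjugate toy model, the power posterior P(theta | x, beta) is itself
# a Beta(a + beta*x, b + beta*(n - x)) distribution, so we sample it directly.
m = 20_000
log_ml = 0.0
for b_lo, b_hi in zip(betas[:-1], betas[1:]):
    draws = [random.betavariate(a + b_lo * x, b + b_lo * (n - x)) for _ in range(m)]
    # One "stepping stone": the average of likelihood^(delta beta) under the
    # power posterior at b_lo estimates the ratio of normalizing constants.
    terms = [math.exp((b_hi - b_lo) * log_lik(t)) for t in draws]
    log_ml += math.log(sum(terms) / m)

beta_fn = lambda p, q: math.gamma(p) * math.gamma(q) / math.gamma(p + q)
log_ml_exact = math.log(math.comb(n, x) * beta_fn(a + x, b + n - x) / beta_fn(a, b))
print(log_ml, log_ml_exact)  # both near ln(1/11) ~ -2.398
```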

*Computing Bayes factors*—.
Often, we compare two competing models—models \(M_0\) and \(M_1\), for example—by computing the *Bayes factor*:

\[\begin{align*} \text{BF}_{01} = \frac{P(G \mid M_0)}{P(G \mid M_1)} \end{align*}\]

Bayes factors greater than 1 reflect support for the model in the numerator, whereas Bayes factors less than 1 reflect support for the model in the denominator; Bayes factors near 1 indicate that the two models fit the data about equally well. When comparing more than two models, we simply compute the Bayes factor between each pair of models and rank the models accordingly. Because we estimate log marginal likelihoods, it is convenient to express the Bayes factor as: \[\begin{align*} 2 \ln \text{BF}_{01} = 2 \left( \ln P(G \mid M_0) - \ln P(G \mid M_1) \right), \end{align*}\] where the factor of two is simply conventional.
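For example, given (made-up) log marginal likelihoods for two competing models, the Bayes factor on this scale is computed as:

```python
import math

# Hypothetical log marginal likelihoods for two competing models;
# these numbers are made up purely for illustration.
ln_ml_m0 = -1023.4
ln_ml_m1 = -1027.1

two_ln_bf = 2 * (ln_ml_m0 - ln_ml_m1)  # 2 ln BF_01
bf = math.exp(ln_ml_m0 - ln_ml_m1)     # BF_01 on the raw scale
print(two_ln_bf, bf)  # 2 ln BF ~ 7.4: "strong" support for M0 on the scale below
```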

*Interpreting Bayes factors*—.
Kass and Raftery (1995) provide *rough* guidelines for interpreting the strength of support indicated by Bayes factors:

| \(\text{BF}_{01}\) | \(2 \ln \text{BF}_{01}\) | Support for model \(M_0\) |
|---|---|---|
| 1 to 3 | 0 to 2 | Equivocal |
| 3 to 20 | 2 to 6 | Positive |
| 20 to 150 | 6 to 10 | Strong |
| > 150 | > 10 | Decisive |

### 2.5.2 Posterior-Predictive Checking

We may wish to assess how close our inferred process (*i.e.*, the assumed model and the posterior parameter estimates under that model) is to the true process that gave rise to the observed data.
One way to achieve this is to simulate datasets under the assumed model and posterior estimates, and then compare them with the observed data.
When the simulated datasets closely resemble the observed data, we conclude that the assumed model provides an adequate fit to the data in an absolute sense (*i.e.*, without reference to any competing model).
This model-adequacy assessment approach is referred to as *posterior-predictive checking* (Gelman, Meng, and Stern 1996; Bollback 2002).

Each individual posterior-predictive simulation is performed by drawing a vector of parameters, \({\theta_i = \{\Psi_i, \boldsymbol{r_i}, \boldsymbol{\delta_i}, \mu_i \}}\), at random from the MCMC samples approximating the joint posterior distribution, and then simulating a predictive dataset, \(G^\text{sim}_i\), conditional on those parameters. Repeating this simulation procedure \(m\) times, we obtain \(m\) predictive datasets.

A difference statistic can then be calculated for the \(i^\text{th}\) simulated dataset as: \[\begin{align*} D_i = T(G^\text{sim}_i \mid \theta_i) - T(G^\text{obs} \mid \theta_i), \end{align*}\] where \(G^\text{obs}\) is the observed biogeographic dataset, and \(T( \cdot \mid \theta_i)\) is a summary statistic (detailed below).

For the \(m\) predictive datasets, the posterior-predictive \(p\)-value is calculated as:
\[\begin{align*}
P = \frac{1}{m} \sum_{i=1}^m \mathbf{1}(D_i \geq 0),
\end{align*}\]
where \(\mathbf{1}(\cdot)\) is the indicator function; values between \(0.025\) and \(0.975\) indicate that the model is adequate and cannot be rejected (*i.e.*, the observed statistic falls within the 95% posterior-predictive interval).
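As a small illustration (with made-up difference statistics standing in for the \(m\) posterior-predictive simulations), the \(p\)-value is simply the fraction of simulations in which \(D_i \geq 0\):

```python
# Hypothetical difference statistics D_i = T(sim_i) - T(obs), one per predictive
# dataset; in a real analysis these come from the m posterior-predictive simulations.
D = [-1.2, 0.4, 2.1, -0.3, 0.9, 1.5, -0.8, 0.2, 1.1, -0.6]

m = len(D)
p_value = sum(1 for d in D if d >= 0) / m  # fraction of simulations with D_i >= 0
adequate = 0.025 <= p_value <= 0.975       # two-tailed adequacy check
print(p_value, adequate)
```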

Two summary statistics can be used to assess model adequacy: (1) the parsimony statistic, and (2) the tip-wise multinomial statistic.
For the parsimony statistic, we simply calculate the parsimony score of the given simulated or observed dataset, conditional on the sampled tree, \(\Psi_i\) (achieved in `PrioriTree` by calling the `parsimony` function in the `R` package `phangorn`; Schliep (2011)).

The tip-wise multinomial statistic (Bollback 2002) treats the states at the tips of the tree for the single geographic character (*i.e.*, site) as the outcomes of a multinomial trial.
For this statistic, we calculate:
\[\begin{align*}
T(G \mid \theta_i) = \sum_{j=1}^k n_j \ln(n_j / n),
\end{align*}\]
where \(k\) is the number of states, \(n\) is the number of tips, and \(n_j\) is the number of tips in state \(j\).
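As an illustration (in `Python` rather than the `R` used by `PrioriTree`), this statistic can be computed directly from the tip states alone:

```python
import math
from collections import Counter

def tipwise_multinomial(tip_states):
    """T(G) = sum over states j of n_j * ln(n_j / n), from the tip-state counts."""
    n = len(tip_states)
    return sum(n_j * math.log(n_j / n) for n_j in Counter(tip_states).values())

# Toy example: 8 tips sampled from 3 areas
states = ["A", "A", "A", "B", "B", "C", "C", "C"]
print(tipwise_multinomial(states))
```

The statistic is maximized (at zero) when all tips share one state, and grows more negative as the tip states become more evenly spread across areas.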

[^1]: This is admittedly confusing. We perform a series of MCMC simulations across a sequence of \(k\) stones, where the likelihood for the \(i^{\text{th}}\) stone is raised to a power, \(\beta_i\). We space these stones as quantiles of a Beta distribution, which has shape parameters \(\alpha\) and \(\beta\). The fact that the stones (powers) are called \(\beta\) is unrelated to the fact that we use a Beta distribution to specify their spacing.

### References

Bollback, J. P. 2002. “Bayesian Model Adequacy and Choice in Phylogenetics.” *Molecular Biology and Evolution* 19 (7): 1171–80.

Gelman, A., X.-L. Meng, and H. Stern. 1996. “Posterior Predictive Assessment of Model Fitness via Realized Discrepancies.” *Statistica Sinica* 6: 733–60.

Kass, R. E., and A. E. Raftery. 1995. “Bayes Factors.” *Journal of the American Statistical Association* 90 (430): 773–95.

Schliep, K. P. 2011. “phangorn: Phylogenetic Analysis in R.” *Bioinformatics* 27 (4): 592–93.

Xie, W., P. O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. 2011. “Improving Marginal Likelihood Estimation for Bayesian Phylogenetic Model Selection.” *Systematic Biology* 60 (2): 150–60.