## 2.2 Prior Specification and Related Theory

### 2.2.1 Inherent Prior Sensitivity of Biogeographic Inference

We estimate the parameters of our discrete discrete-geographic models—including the vector of instantaneous dispersal rates between each pair of areas, \(\mathbf{Q}\), and the average dispersal rate across all geographic areas, \(\mu\)—from our observations (including the geographic data, \(G\), and the previously inferred phylogeny, \(\Psi\)). Specifically, we estimate the joint posterior probability distribution of our discrete-geographic model parameters conditional on the data using Bayes theorem:

\[\begin{align*} \overbrace{P(\mathbf{Q}, \mu \mid G, \Psi)}^\text{posterior distribution} = \frac{ \overbrace{P(G \mid \mathbf{Q}, \mu, \Psi)}^\text{likelihood} \overbrace{P(\mathbf{Q}) P(\mu)}^\text{prior distribution} }{ \underbrace{P(G)}_\text{marginal likelihood} }. \end{align*}\]

The joint posterior probability distribution, \(P(\mathbf{Q}, \mu \mid G, \Psi)\), reflects our beliefs about the parameter values *after* evaluating our data.
The posterior is an updated version of our joint prior probability distribution, \(P(\mathbf{Q}) P(\mu)\), which reflects our beliefs about the parameter values *before* evaluating our data.
Our prior is updated by the information in our data via the likelihood function, \(P(G \mid \mathbf{Q}, \mu, \Psi)\), which is the probability of observing our geographic data under the discrete-geographic model.

In many cases, Bayesian inference is robust to the choice of prior: posterior estimates are dominated by the information in the data, allowing us to safely ignore the issue of prior choice.
In other cases, however, posterior estimates will be strongly influenced by our choice of prior: specifically, when our data contain limited information about a parameter in our inference model, the posterior probability distribution inferred for that parameter will closely resemble the assumed prior probability distribution.
This phenomenon—referred to as *prior sensitivity*—is an inherent feature of inference under discrete discrete-geographic models.

To illustrate this issue, contrast typical inferences under discrete-geographic models and substitution models. Both models describe the evolution of discrete states (geographic areas or nucleotide bases) from the root, along the branches, to the tips of our study phylogeny as a continuous-time Markov chain (CTMC). For a process with \(k\) discrete states, the process is completely described by a \(k \times k\) matrix of instantaneous rates, \(\mathbf{Q}\). Each element of this matrix, \(q_{ij}\), describes the instantaneous rate of change between two states, \(i\) and \(j\). The most complex time-reversible substitution model, the GTR model, has \(k = 4\) states \(\{A,C,G,T\}\), and six instantaneous-rate parameters that are typically inferred from a sequence alignment with hundreds or thousands of sites.

By contrast, discrete discrete-geographic models typically have many more parameters and much less data.
For example, an inference problem with \(k = 10\) discrete geographic areas, the symmetric discrete-geographic model has \({k \choose 2} = 45\) instantaneous-rate parameters, and the asymmetric discrete-geographic model has \(k \times (k - 1) = 90\) instantaneous-rate parameters.
Moreover, the parameters of these discrete-geographic models must be estimated from an ‘alignment’ with a single ‘site’; *i.e.*, the geographic dataset includes a single observation (the area occupied by each tip in the tree).

There are several strategies to deal with prior sensitivity:

- specify biologically-motivated/informed priors (
*i.e.*, where the prior probability is focussed on ‘reasonable’ values); - specify diffuse/uninformative priors (
*i.e.*, where the prior probability is spread evenly over a wide range of ‘plausible’ values), or; - assess prior sensitivity (
*e.g.*, perform replicate inferences under a range of candidate priors, and/or assess the relative or absolute fit of alternative prior models to our data).

Unfortunately, the priors on discrete-geographic model parameters implemented as the defaults in `BEAST` are highly (mis)informative: *i.e.*, the default priors reflect extremely strong and biologically unrealistic assumptions about the underlying biogeographic process (Gao et al. 2022).
Worse still, these default priors have been used in the vast majority (~93%) of published studies that have used `BEAST` to infer biogeographic history, and these default priors have been shown to strongly (and adversely) distort central conclusions of biogeographic studies (Gao et al. 2022).

### 2.2.2 Prior on the Number of Dispersal Routes

Recall that—when using BSSVS in `BEAST`—each element of the instantaneous-rate matrix, \(\mathbf{Q}\), is specified as:
\[\begin{align*}
q_{ij}= r_{ij}\delta_{ij},
\end{align*}\]
where \(r_{ij}\) is the relative rate of dispersal between areas \(i\) and \(j\), and \(\delta_{ij}\) is an indicator variable that takes one of two states (0 or 1).
Each vector, \(\boldsymbol{\delta}\), specifies a unique configuration of dispersal routes, which corresponds to a unique discrete-geographic model.
The total number of dispersal routes for a given discrete-geographic model is denoted \(\Delta\).
For a given value of \(\Delta\), there may be multiple distinct discrete-geographic models.
For example, a dataset with \(k = 3\) geographic areas has the vector of relative rates \(\boldsymbol{r}= \{r_{1,2}, r_{1,3}, r_{2,3}\}\), such the space of symmetric discrete-geographic models includes: three models with \({\Delta = 1}\) dispersal route, \(\boldsymbol{\delta}= \{(1,0,0), (0,1,0), (0,0,1)\}\); three models with \({\Delta = 2}\) dispersal routes, \(\boldsymbol{\delta}= \{(1,1,0), (0,1,1), (1,0,1)\}\), and; a single model with \({\Delta = 3}\) dispersal routes, \({\boldsymbol{\delta}= \{(1,1,1)\}}\).
Lemey and colleagues (Lemey et al. 2009) impose a prior on the space of discrete-geographic models by:

- placing a prior on the total number of dispersal routes, \(\Delta\), and;
- assuming that all discrete-geographic models with a given value of \(\Delta\) are equiprobable.

Together, these assumptions induce a prior probability that a given dispersal route between areas \(i\) and \(j\) exists, *i.e.*, the probability that \({\delta_{ij}= 1}\).

`PrioriTree` allows you to specify three alternative prior probability distributions on the number of dispersal routes, \(\Delta\).

#### Poisson prior

The default prior on \(\Delta\) implemented in `BEAST` is a Poisson probability distribution.
The specific parameterization of the Poisson prior depends on the a/symmetry of the discrete-geographic model.
For **symmetric discrete-geographic models**, the default prior on \(\Delta\) is an *offset* Poisson distribution.
Specifically, for a dataset with \(k\) discrete geographic areas, the Poisson prior on \(\Delta\) is offset by \(({k - 1})\), where all discrete-geographic models for which \(\Delta < ({k - 1})\) have zero prior probability (*i.e.*, such models are disallowed *a priori*).
The motivation for this offset is mathematical: a dataset with \(k\) geographic areas cannot be realized under a CTMC with fewer than \((k - 1)\) non-zero \(q_{ij}\) values (*i.e.*, dispersal routes).
[The real constraint on the geographic model is that it must be *irreducible*; *i.e.*, it must be possible to reach each area from every other area either directly or indirectly.
A model with fewer than \((k - 1)\) dispersal routes cannot be irreducible; however, a model with at least \((k - 1)\) dispersal routes is not guaranteed to be irreducible.]
The prior probability of models for which \(\Delta \geq ({k - 1})\) is Poisson with rate parameter, \({\lambda = ln(2)}\).
As specified, this Poisson prior places approximately 50 percent of the prior probability on discrete-geographic models with the absolute minimum number of dispersal routes.

For asymmetric discrete-geographic models, `BEAST` specifies a default prior on \(\Delta\) with rate parameter \({\lambda = (k - 1)}\).
(Note that this prior does not enforce a minimum number of dispersal routes, *i.e.*, the Poisson prior distribution under the asymmetric discrete-geographic model is not offset.)

The default Poisson prior on \(\Delta\) in `BEAST` reflects a very strong preference (*i.e.*, a very informative prior) for discrete-geographic models with the minimal number of dispersal routes.
When the default Poisson prior on \(\Delta\) is specified using `BEAUti`, it is not possible for users to adjust the parameterization (*i.e.*, to make the prior less informative by changing the value of the Poisson-rate parameter, \(\lambda\)).
By contrast, when the Poisson prior on \(\Delta\) is specified using `PrioriTree`, users are able to adjust it.
By default, `PrioriTree` specifies a more diffuse Poisson prior on \(\Delta\), where the expected number of dispersal routes is about half the maximum number.
[As in `BEAST`, the Poisson prior for symmetric discrete-geographic models is also offset by \(({k - 1})\).]
`PrioriTree` allows users to adjust the Poisson-rate parameter, \(\lambda\), to specify the desired shape for the Poisson prior on \(\Delta\).
Note that both the mean and the variance of the Poisson prior distribution scale linearly with \(\lambda\).

#### Beta-Binomial prior

We can specify an alternative prior on the number of dispersal routes, \(\Delta\), by treating each dispersal-route indicator, \(\delta_{ij}\), as a Bernoulli random variable, such that the total number of dispersal routes follows a Binomial distribution.
The Binomial probability distribution has two parameters: \(n\) is the number of trials (equal to the maximum number of dispersal routes for a dataset with \({k}\) areas), and \(p\) is the success probability (equal to the probability that each dispersal routes exists; *i.e.*, that \(\delta_{ij}= 1\)).
We treat \(p\) as a random variable to be estimated from the data.
Specifically, we specify a Beta prior probability distribution on \(p\).
The Beta prior has two hyperparamters—the shape parameters \(\alpha\) and \(\beta\)—that can be interactively modified by the user so that the resulting Beta-Binomial prior on \(\Delta\) has the desired mean and \(95\%\) prior interval.
Note that the expected (mean) number of dispersal routes increases as \(\alpha / \beta\) increases.

#### Uniform prior

Finally, we can specify an alternative prior on the number of dispersal routes, \(\Delta\), by assuming that all possible values (*i.e.*, between zero and the maximum number of dispersal routes) are equiprobable.
This uniform prior on \(\Delta\) is a special case of the Beta-Binomial distribution described above, which is specified when the values of both shape parameters of the Beta prior—\(\alpha\) and \(\beta\)—are set to one.
Under the uniform prior on \(\Delta\), the prior probability that each dispersal route exists is uniformly distributed between 0 and 1.

### 2.2.3 Prior on the Average Dispersal Rate

Recall that the rate matrix, \(\mathbf{Q}\), is rescaled such that the average rate of dispersal between all areas is \(\mu\).
For a tree of length \(T\) (*i.e.*, the sum of all branch durations in the dated phylogeny), the expected number of dispersal events is \({\mu \times T}\).
Therefore, the prior on the average dispersal rate, \(\mu\), represents our prior belief about the number of dispersal events over the tree.

`PrioriTree` allows you to specify three alternative prior probability distributions on the average dispersal rate, \(\mu\).

#### CTMC-rate reference prior

The default prior on \(\mu\) in `BEAST` is a Gamma probability distribution with shape parameter \({\alpha = 0.5}\) and rate parameter \({\beta = T}\).
(Note that this Gamma prior is erroneously labelled as a `CTMC-rate reference prior`

in the `BEAST` utility, `BEAUTi`.)
The Gamma distribution has a mean of \({\alpha / \beta}\); therefore, this prior expresses the belief that the average rate of dispersal is \({0.5 / T}\).

The default Gamma prior on \(\mu\) in `BEAST` reflects a very strong preference (*i.e.*, a very informative prior) for biogeographic histories with an implausibly small number of dispersal events.
A dataset with \(k\) geographic areas requires a biogeographic history with at least \((k - 1)\) dispersal events.
The expected number of dispersal events is \({\mu \times T}\).
Accordingly, the number of dispersal events expected *a priori* under the default Gamma prior on \(\mu\) is \(0.5\), independent of the duration of the entire biogeographic history (*i.e.*, the tree length, \(T\)), or the number of areas, \(k\), involved in the geographic history.
Similarly, the prior distribution on the number of dispersal events is independent of \(T\) and \(k\): the \(95\%\) prior interval is \({[0, 3]}\) dispersal events, which implies that we would be very surprised if a biogeographic history of any duration with any number of areas involved more than three dispersal events.

#### Hierarchical-exponential prior

`PrioriTree` allows users to specify an exponential prior on the average dispersal rate, \(\mu\); this exponential prior has a single hyperparameter, the exponential-rate parameter, \(\lambda\).
The mean of this exponential prior is \({1 / \lambda}\), which we treat as a random variable to be estimated from the data.
Specifically, we specify a Gamma hyperprior on \(\lambda\).
The Gamma distribution has two parameters—\(\alpha\) and \(\beta\)—where the mean of the Gamma hyperprior is \(\alpha/\beta\) and the variance is \(\alpha/\beta^{2}\).
This Gamma hyperprior on the exponential-rate parameter, \(\lambda\), is constrained such that \(\alpha = \beta\): this constraint ensures that the resulting hierarchical-exponential prior is proper (*i.e.*, that it integrates to one, and so obeys the law of total probability).
This hierarchical prior distribution is known as the \(K\) distribution (Jakeman and Pusey 1978).

Under this prior on the average dispersal rate, the prior distribution on the number of dispersal events (sensibly) scales with \(T\); *i.e.*, the expected number of dispersal events increases with the duration of the biogeographic history.
`PrioriTree` allows users to simultaneously modify the shape/rate parameters.
Note that the mean of this hierarchical prior is always one (*i.e.*, independent of the value of the shape/rate parameter), but its variance scales inversely with the shape/rate parameter.

#### Empirical-exponential prior

Finally, we can specify an alternative exponential prior on the average dispersal rate, \(\mu\), that adopts an ‘empirical-Bayesian’ approach for specifying the hyperprior on the exponential-rate parameter, \(\lambda\).
In Bayesian inference, we specify a prior distribution for a given parameter that reflects our beliefs about its parameter values *before* we evaluate our study data.
Empirical-Bayesian inference, by contrast, effectively entails some ‘double dipping’ of the study data; that is, we first estimate the (hyper)parameters of our inference model from our study data, and then use those (hyper)parameter estimates to specify one or more of the corresponding (hyper)priors of our inference model.

Our empirical-Bayesian approach for specifying the prior on the average dispersal rate, \(\mu\), involves computing the parsimony score (the minimum number of dispersal events) required to explain our observed geographic data (*i.e.*, the distribution of areas across the tips of our study tree).
This parsimony score represents a minimum bound on the true number of dispersal events in the biogeographic history that gave rise to our observations.
We can therefore leverage these parsimony scores to inform our prior on \(\mu.\)
Specifically, we specify a value for the hyperprior on the exponential-rate parameter, \(\lambda\), such that the resulting prior on the number of dispersal events is focused on values greater than the parsimony score.

By default, `PrioriTree` sets the parsimony score at the lower quartile (*i.e.*, \(25\%\) quantile) of the prior distribution on the number of dispersal events.
Users can adjust the mean of this resulting prior distribution (while using the parsimony score as a reference) according to their biological beliefs about the dispersal intensity.

### References

*bioRxiv*.

*Physical Review Letters*40 (9): 546.

*PLoS Computational Biology*5 (9): e1000520.