1.3 Specify BEAST Analyses and Summarize Results to Assess Relative or Absolute Model Fit

Model-based inference, including phylodynamic inference, assumes that the inference model provides a reasonable description of the process that generated the study data. Otherwise, our inferences are apt to be unreliable; this includes estimates of relative and/or average dispersal rates and any summaries based on those parameter estimates (ancestral areas, dispersal histories, number of dispersal events, etc.). Additionally, comparing alternative discrete-geographic models provides a means to objectively test hypotheses regarding the history of dispersal (i.e., by assessing the relative fit to our data of competing models that are specified to include or exclude a parameter relevant to the hypothesis under consideration). PrioriTree implements functions to help you assess both the relative fit and the absolute fit of discrete-geographic models to an empirical dataset.

1.3.1 Compare the relative fit of competing models

PrioriTree sets up the marginal-likelihood BEAST analysis by appending a marginalLikelihoodEstimator section to the XML, after the configuration section for the MCMC analysis that approximates the joint posterior distribution. This section allows you to estimate the marginal likelihood using both thermodynamic integration (Lartillot and Philippe 2006) and stepping-stone sampling (Xie et al. 2011; Baele et al. 2012).

Both the number of powers and the number of MCMC generations run under each power may affect the accuracy of the estimates; the default values are likely to be sufficient for most empirical datasets and models. The most straightforward way to check the convergence of marginal-likelihood estimates is to run multiple replicates of the analysis and confirm that the estimates are stable across replicates. If the estimates differ substantially among replicates (say, by more than a few log-likelihood units, and especially if the difference among replicates is equal to or greater than the difference between the log marginal-likelihood estimates under the competing models), consider increasing the number of powers and/or the MCMC chain length under each power.

Figure 1.28: Marginal likelihood estimation analyses.
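The replicate check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of PrioriTree: the function names and the log marginal-likelihood values are hypothetical, and averaging the replicate estimates before comparing models is a simplifying assumption.

```python
from statistics import mean


def replicate_range(log_mls):
    """Spread (max - min) of replicate log marginal-likelihood estimates."""
    return max(log_mls) - min(log_mls)


def compare_models(log_mls_a, log_mls_b):
    """Compare two models using replicate log marginal-likelihood estimates.

    Returns the log Bayes factor (model A over model B, computed from the
    replicate means), the largest within-model spread, and a flag that is
    True when that spread is smaller than the between-model difference.
    """
    spread = max(replicate_range(log_mls_a), replicate_range(log_mls_b))
    log_bf = mean(log_mls_a) - mean(log_mls_b)
    stable = spread < abs(log_bf)
    return log_bf, spread, stable


# Hypothetical estimates from three replicates under each model.
model_a = [-1234.6, -1235.1, -1234.9]
model_b = [-1241.2, -1240.7, -1241.5]

log_bf, spread, stable = compare_models(model_a, model_b)
print(f"log BF = {log_bf:.2f}, replicate spread = {spread:.2f}, stable = {stable}")
```

If `stable` is False, the replicates disagree by at least as much as the models do, which is the situation in which increasing the number of powers and/or the chain length under each power is worth considering.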

1.3.2 Posterior-predictive checking

To perform posterior-predictive checking, PrioriTree requires the observed data as well as the estimates inferred from those data (the log and tree files produced by BEAST). PrioriTree assumes that these output files were generated by BEAST using XML scripts produced by PrioriTree; only in that case can PrioriTree reliably parse the log file and determine the exact discrete-geographic model used in the inference, so that it can simulate data under that model. Once all the required input files are provided, you can start the simulation in PrioriTree, which will then generate plots showing the posterior-predictive distributions for each replicate analysis (and under each prior, if more than one was used).

Step 1: Import files

Discrete-geographic data file

Figure 1.29: Import discrete-geographic data.

The discrete-geographic data file must be a .csv or .tsv file that contains two or more columns. The header (first row) of the file contains the column names. By default, PrioriTree assumes that the first column contains the taxon names and the second column contains the geographic area of each taxon. If the columns in your data file are ordered differently, select the columns containing the taxon names and the geographic data from the drop-down menus after uploading the file into PrioriTree (the other columns are ignored). Check the Load example discrete-geography file box to read in an example discrete-geographic data file.
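To make the expected layout concrete, the following Python sketch reads a file of this shape into a taxon-to-area mapping. It only illustrates the format described above and is not how PrioriTree itself parses the file; the function name, the column names, and the example rows are hypothetical.

```python
import csv
import io


def read_geography(handle, delimiter=",", taxon_col=0, area_col=1):
    """Read a delimited geography file with a header row.

    taxon_col and area_col are zero-based column indices; by default the
    first column holds taxon names and the second holds areas. Any other
    columns are ignored.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)  # first row holds the column names
    areas = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        areas[row[taxon_col]] = row[area_col]
    return header, areas


# Hypothetical .csv content with one extra column that is ignored.
example = io.StringIO("taxon,area,date\nseq1,Asia,2019\nseq2,Europe,2020\n")
header, areas = read_geography(example)
print(header)
print(areas)
```

For a .tsv file, pass `delimiter="\t"`; for a file with a different column order, set `taxon_col` and `area_col` accordingly.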

BEAST analysis output log and tree files

Upload one or more BEAST log/tree files (multiple files correspond to analysis replicates) that contain parameter estimates under a given model and prior combination to each input field; different input fields correspond to different priors. Note that you need to upload not only the log file(s) but also the associated tree file(s): PrioriTree needs the tree that was sampled together with each parameter-estimate sample both to simulate the dataset and to compute the parsimony statistic.

Figure 1.30: Upload BEAST output files.

PrioriTree assumes that each sample in the log file matches the corresponding sample in the associated tree file; i.e., both were sampled at the same iteration of a BEAST analysis (the default behavior when XML scripts generated by PrioriTree are used). If you have combined replicate analyses or thinned their output files, identical operations must be applied to both the log file and the associated tree file. Check the Load example log and tree files box to read in an example set of log and tree files.
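If you have post-processed your files, a quick check along the following lines can confirm that the log and tree samples still line up. This is a rough sketch under stated assumptions: it assumes a tab-delimited log file whose sample rows begin with an integer state number, and a tree file whose trees are named `STATE_<number>`. The function names are hypothetical, and this is not the validation that PrioriTree performs.

```python
import re


def log_states(lines):
    """Collect state numbers from the sample rows of a log file.

    Comment lines and the header row are skipped because their first
    tab-delimited field is not an integer.
    """
    states = []
    for line in lines:
        first = line.strip().split("\t")[0]
        if first.isdigit():
            states.append(int(first))
    return states


def tree_states(lines):
    """Collect state numbers from tree lines named STATE_<number> (assumed naming)."""
    pattern = re.compile(r"^\s*tree\s+STATE_(\d+)", re.IGNORECASE)
    return [int(m.group(1)) for m in map(pattern.match, lines) if m]


def samples_match(log_lines, tree_lines):
    """True when both files contain the same states in the same order."""
    return log_states(log_lines) == tree_states(tree_lines)


# Hypothetical, heavily abbreviated file contents.
log_lines = ["# comment", "state\tposterior", "0\t-100.0", "1000\t-99.5"]
tree_lines = ["#NEXUS", "Begin trees;",
              "tree STATE_0 = (a:1,b:1);",
              "tree STATE_1000 = (a:1,b:1);", "End;"]
print(samples_match(log_lines, tree_lines))
```

With real files, pass the open file handles (or their lines) in place of the lists. A mismatch typically means that only one of the two files was thinned or combined.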

Perform posterior-predictive simulations

Once all the input files (the geographic-data file, the parameter-estimate log file(s), and the associated tree file(s)) are uploaded and validated (both in terms of their own format and the match between them), the Perform Simulations tab will be enabled. In the Perform Simulations panel, you can either simulate one dataset for each sample in the uploaded distribution by checking the Perform one simulation for each sample box, or specify the desired number of simulated datasets by unchecking the box and then editing the simulation-number field. Once the simulation is configured, click the Start posterior-predictive simulation button to initiate the computationally demanding log-parsing and forward-simulation steps. The simulation may take a noticeable amount of time to complete; the run time scales with the number of sequences and the number of geographic areas in the dataset, as well as with the number of simulations specified. If you later change any of the uploaded files or the number of simulations to perform, click the Start posterior-predictive simulation button again to re-execute the simulation step.

Figure 1.31: Start posterior-predictive simulations.

Step 2: Configure post-processing settings

Once PrioriTree finishes the posterior-predictive simulations, two panels will be enabled automatically: the processing-setting panel (left) and the result-visualization panel (right). All the operations in this section should be computationally inexpensive, so changes to the figure and/or table should appear immediately.

Figure 1.32: Configure post-processing settings.

Output processing settings

First, you can choose to visualize the posterior-predictive distributions under one of the two available statistics: (1) the parsimony statistic and (2) the tip-wise multinomial statistic (see the theoretical-background section for details). Switch between them using the radio buttons at the top of the post-processing settings panel.

Figure 1.33: Posterior-predictive distributions of the tip-wise multinomial likelihood statistic.

Below the radio buttons is a checkbox that you can click to combine all the replicate analyses (i.e., estimates under an identical model and prior specification) once you have confirmed that the replicate MCMCs have converged.

Figure 1.34: Combine replicates.

Below the checkbox, a separate scrollable, collapsible subpanel is displayed for each prior model; each repeated chunk within that subpanel contains the settings you can adjust for one log file (headed by the log file name). The first item allows you to exclude some log files without having to re-execute the log-parsing step. The second item allows you to adjust the burn-in proportion of each analysis independently using the slider.

Figure 1.35: Exclude some log files.
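The effect of the burn-in proportion can be illustrated with a short Python sketch. The function name is hypothetical, and rounding the number of discarded samples down is an assumption; PrioriTree may round differently.

```python
def discard_burnin(samples, burnin_proportion):
    """Drop the leading fraction of MCMC samples as burn-in.

    burnin_proportion must lie in [0, 1); the number of discarded samples
    is rounded down (an assumption made for this sketch).
    """
    if not 0 <= burnin_proportion < 1:
        raise ValueError("burnin_proportion must be in [0, 1)")
    n_discard = int(len(samples) * burnin_proportion)
    return samples[n_discard:]


# Ten hypothetical samples with a 20% burn-in: the first two are dropped.
samples = list(range(10))
kept = discard_burnin(samples, 0.2)
print(kept)
```

Because burn-in is applied to each log file independently, replicates that took longer to reach stationarity can be given a larger proportion than the others.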

Figure edits

Figure 1.36: Edit figure.

You may make further cosmetic edits to the figure and/or table before moving on to the next step to save the output. The fields in the figure-edit panel should give you flexibility in modifying the appearance of the figure. You can also view the exact posterior-predictive p-values of each analysis for both statistics together under the table tab.

Figure 1.37: Posterior-predictive p-values table.
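As a rough guide to reading the table, a posterior-predictive p-value is commonly computed as the fraction of simulated datasets whose summary statistic is at least as large as the statistic of the observed data. The Python sketch below follows that common convention; the choice of tail is an assumption and may differ from the one PrioriTree uses, and the function name and values are hypothetical.

```python
def posterior_predictive_p_value(simulated_stats, observed_stat):
    """Fraction of simulated statistics that are >= the observed statistic.

    The upper-tail convention used here is an assumption for this sketch.
    """
    n_extreme = sum(1 for s in simulated_stats if s >= observed_stat)
    return n_extreme / len(simulated_stats)


# Hypothetical parsimony scores for eight simulated datasets.
simulated = [12, 15, 14, 18, 13, 16, 17, 15]
observed = 15
p_value = posterior_predictive_p_value(simulated, observed)
print(p_value)
```

Values close to 0 or 1 indicate that the observed statistic falls in a tail of the posterior-predictive distribution, which suggests that the model describes the data poorly with respect to that statistic.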

Step 3: Save output

Save figure or table output

Figure 1.38: Download figure and/or table.

Finally, when all the settings are complete, download the figure and/or the table in the desired format. The default figure/table name generated by PrioriTree indicates the type of analysis and the selected posterior-predictive statistic (the latter only for the figure, as the table contains both statistics).

Save the simulated datasets

Figure 1.39: Download the simulated datasets.

You can also download the posterior-predictive simulated datasets and summarize them in other ways (e.g., using summary statistics other than the two provided by PrioriTree). For each analysis, PrioriTree simulates datasets (each of which contains the state of every tip in the tree) and writes them out as a single .tsv file, in which each column corresponds to a tip (the first row contains the tip names as column names) and each row contains one simulated dataset (so the number of rows after the header row is identical to the number of post-burn-in samples of the corresponding analysis). When there are multiple analyses (replicates and/or different prior models), PrioriTree produces a zipped folder that contains all the .tsv files (each containing the simulated datasets for the corresponding analysis). The name of each .tsv file includes the prior model name and the replicate ID, so that you can match the files with the uploaded analysis files. An example zipped folder that contains the simulated datasets is available here.
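As one example of an alternative summary, the Python sketch below reads a simulated-dataset file in the layout described above and counts how many tips occupy each area in every simulated dataset. The function names, tip names, and area names are hypothetical, and the per-area tip count is merely one possible statistic, not one provided by PrioriTree.

```python
import csv
import io
from collections import Counter


def read_simulated(handle):
    """Read a .tsv file whose header row holds tip names and whose
    remaining rows each hold one simulated dataset (one state per tip)."""
    reader = csv.reader(handle, delimiter="\t")
    tips = next(reader)
    datasets = [dict(zip(tips, row)) for row in reader if row]
    return tips, datasets


def area_counts(dataset):
    """Number of tips assigned to each area in one simulated dataset."""
    return Counter(dataset.values())


# Hypothetical file with three tips and two simulated datasets.
example = io.StringIO(
    "tipA\ttipB\ttipC\n"
    "Asia\tAsia\tEurope\n"
    "Europe\tAsia\tEurope\n"
)
tips, datasets = read_simulated(example)
counts = [area_counts(d) for d in datasets]
print(counts)
```

To process a downloaded file, replace the `io.StringIO` object with an open file handle, e.g. `open(path, newline="")`, where `path` points to one of the extracted .tsv files.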

References

Baele, Guy, Philippe Lemey, Trevor Bedford, Andrew Rambaut, Marc A Suchard, and Alexander V Alekseyenko. 2012. “Improving the Accuracy of Demographic and Molecular Clock Model Comparison While Accommodating Phylogenetic Uncertainty.” Molecular Biology and Evolution 29 (9): 2157–67.
Lartillot, N., and H. Philippe. 2006. “Computing Bayes Factors Using Thermodynamic Integration.” Systematic Biology 55: 195–207.
Xie, W., P. O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. 2011. “Improving Marginal Likelihood Estimation for Bayesian Phylogenetic Model Selection.” Systematic Biology 60: 150–160.