Chapter 3 Getting data and trees into R

3.1 Data and tree object types

In R, there are many kinds of objects. These kinds are called “classes”. For example, take the text string “Darwin”.

## [1] "character"

It is a character class. pi is a defined constant in R:

## [1] 3.141593
## [1] "numeric"

Its class is numeric [and note that its value is stored with more precision than is printed on screen].

Objects can sometimes be converted from one class to another, often using an as.* function:

example.1 <- "6"
## [1] "6"
## [1] "character"
example.2 <- as.numeric(example.1)
## [1] 6
## [1] "numeric"
example.2 * 7
## [1] 42

Trying to multiply example.1 by seven results in an error: you are trying to multiply a character string by a number, and R does not automatically convert classes. Classes have many uses in R; for example, one can write a different plot() function for each class, so that a tree is plotted one way, while a result from a regression model is plotted a different way, but users just have to call plot() on each and R knows what to do.

In phylogenetics, we mostly care about classes for trees, for data, and for things to hold trees and data.

3.1.1 Tree classes

The main tree class in R is phylo and is defined in the ape package. Let’s look at one in the wild:

phy <- ape::rcoal(5) #to make a random five taxon tree
## Phylogenetic tree with 5 tips and 4 internal nodes.
## Tip labels:
##   t2, t3, t5, t1, t4
## Rooted; includes branch lengths.
## List of 4
##  $ edge       : int [1:8, 1:2] 6 8 9 9 8 6 7 7 8 9 ...
##  $ edge.length: num [1:8] 1.6235 0.1432 0.0806 0.0806 0.2238 ...
##  $ tip.label  : chr [1:5] "t2" "t3" "t5" "t1" ...
##  $ Nnode      : int 4
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"

This is the one used in most packages. However, it has some technical disadvantages (sensitivity to internal structure, no checking of objects) that has led to the phylo4 format for trees and phylo4d for trees plus data in the phylobase package. Other packages add on to the phylobase format (i.e., phytool’s simmap format) but these are typically not shared across packages.

3.2 Sequence data

The BioConductor community has put a lot of effort into making R work with high throughput data, though python is still likely more popular (just remember that python 3.x is the only currently supported version – python 2.x should not be used any longer). The seqinr package has some useful functionality for handling sequences. ape and phangorn have functions for handling data, including reading FASTA and NEXUS files in ape.

3.3 Other character data

This can include data such as discrete traits (has wings / wingless), continuous traits (body mass), geographic traits (latitudes and longitudes, which continents they occur on), and many more. These are typically loaded either as csv files from some other source or directly from an R package. This is a rapidly developing field, but many of the most useful packages are supported by rOpenSci – there are packages for getting information from GBIF, eBIRD, iNaturalist, NCBI, and many more sources.

3.4 Phylogenies

The most common way to load trees is to use ape’s functions:

phy <- ape::read.tree(file='treefile.phy')

To get a tree in Newick format (sometimes called Phylip format): essentially a series of parenthetical statements. An example (from ape’s documentation) is ((Strix_aluco:4.2,Asio_otus:4.2):3.1,Athene_noctua:7.3);. The format name comes from the name of the lobster house where several major phylogenetic software developers met to agree on a tree format.

You can use the same function to enter tree strings directly, changing the argument from the file containing the tree to text containing the tree string:

phy <- ape::read.tree(text = '((Strix_aluco:4.2,Asio_otus:4.2):3.1,Athene_noctua:7.3);')

Note the trailing semicolon.

One thing that can trip users up with read.tree() (and the function, below) is that the output class depends on the input. If you read from a file with one tree, the returned output is a single tree object with class phylo. You can then use plot() on this object to draw the tree, pass this object into a comparative methods package to estimate rates, and so forth. If the file has more than one tree, the returned class is multiphylo: plot() will automatically cycle through plots as you type return, most comparative method implementations will fail (they are written to expect one tree of class phylo, not a vector of trees in a different class). read.tree() has an optional keep.multi function: if set to TRUE, the class is always multiphylo, and you can always get the first tree by getting the first element in the returned object:

phy.multi <- ape::read.tree(file='treefile.phy', keep.multi = TRUE)
phy <- phy.multi[[1]]

For NEXUS formatted files (Maddison et al., 2007), ape’s function can pull in the trees (and its function can pull in data from a NEXUS file). NEXUS is a very flexible format, and there are valid NEXUS files that still cause errors with ape’s function. A more robust function to read in NEXUS trees is the package phylobase’s readNexus() function (note the lack of a period and different capitalization of Nexus from ape’s similar function). phylobase uses a different structure to store trees than ape does.

tidytree is still fairly new, but it is a popular way for dealing with trees for those who have been converted to the tidyverse.

3.4.1 Great scientists steal

Scientists have been creating trait-based phylogenetic trees for decades. These scientists are often experts in their group, in potential problems in their data, in how to use relevant software. In other words, their trees are likely to be better than any you make. Traditionally, these trees are published as a figure in a paper, largely unavailable for reuse. This hurts reproducibility, makes it less likely for the work to be cited, and stymies scientific progress in general. However, the field is increasingly moving to more frequent deposition of trees in reusable form: sometimes based on author initiative, sometimes based on journal requirements. The main repository for this is TreeBase: if you are reading a paper, and want to use its tree, that’s the first place to look. You can also use their website to search for taxa. The trees can be downloaded and loaded into R using phylobase (the NEXUS format used by TreeBase is hard for ape to load).

Another approach that is growing in importance is Open Tree of Life. It seeks to synthesize thousands of trees to create a single tree of life. The rotl package can download this synthetic tree or components of it (the tree for a particular genus, for example). For most groups, however, the synthetic tree is largely based on taxonomy, so it is not very resolved. This is improving as the database of trees available for Open Tree’s synthesis grows (to add to it, go to ___________________), but for most scientific studies, I wouldn’t currently suggest using the synthetic tree (but for getting a sense for a group, making a tree for a class, it can be useful; also see the Phylotastic project for ways to use trees in teaching or other purposes). However, the Open Tree project also has a cache of thousands of trees that have been hand curated (taxonomic names resolved, ingroups specified, tree type recorded, etc.). The rotl package lets you download these, too. For most analyses, you want trees with branch lengths, and so you can download just chronograms. For example,


Two important notes about reusing trees:

  • Give credit: If your entire paper is based on the tree from one other paper, you must cite that paper (and also the ways you got the tree, including the packages and/or repositories). If it’s based on trees from around a dozen papers, you should cite them, too. If you’re getting into the hundreds, many editors will object to properly citing them all, but one compromise approach until a better way of giving credit appears is to have supplemental info or an appendix with citations for all the relevant papers (including DOIs to make these easier to parse later)
  • Tree quality matters: As you will see in later sections, many comparative methods are based on using branch lengths: look at different rates of character evolution, looking at diversification rates over time, etc. If your starting tree is wrong, even if the topology is perfect but the branch lengths are wrong, later downstream analyses are also likely to be wrong. Some methods (like independent contrasts) are fairly robust to this (__________________), but the field has not tested many others yet, and most should be far more sensitive than contrasts. This matters less if you are testing dull hypotheses (see Chapter __________) but for folks working on biology where understanding processes, especially using parameter estimates, is the point, just taking a tree and making up branch lengths is often a bad idea.

3.5 Reconciling datasets

We use scientific names to communicate clearly. In the picture below ________________________, “Look at the robin!” will have an American glance at the bird on the left, and a Brit look at the bird on the right, but both, if trained sufficiently will know which to look at if told to look at _____Scientific name_______. We thus use scientific names in writing. However, the correct scientific name for a specimen can change for various reasons:

  • A species is split into two species: some individual specimens remain in the original species, others are given a new species names (rules of taxonomy allow this, and give constraints on how the new species can be named and described)
  • Two species are lumped into one species: some individual specimens thus have their names changed (and which name persists after the merge is specified by the rules of taxonomy)
  • A higher level group is changed. For example, _________ proposed to split the Anolis genus into eight genera. Thus the genus name for some species changes, and sometimes the species name itself changes to match the genus names: ______ becomes ____________. This can be a merge or a split. This is often motivaed by a new discovery (the group known as acacias are not a clade (an ancestor and all its descendants) and since we only want to name clades, one of the groups needs a new name).
  • An error is fixed. For example, it could be discovered that there was an earlier name for a species in the literature, and so the species name must be changed based on the rules of priority.

Importantly, for all but the last point, it is perfectly valid based on the rules of taxonomy for different scientists to use the names before and after the change.