3 Getting data and trees into R
3.1 Data and tree object types
In R, there are many kinds of objects. These kinds are called “classes”. For example, take the text string “Darwin”.
##  "character"
It is a
pi is a defined constant in R:
##  3.141593
##  "numeric"
Its class is
numeric [and note that its value is stored with more precision than is printed on screen].
Objects can sometimes be converted from one class to another, often using an
##  "6"
##  "character"
##  6
##  "numeric"
##  42
Trying to multiply example.1 by seven results in an error: you are trying to multiply a character string by a number, and R does not automatically convert classes. Classes have many uses in R; for example, one can write a different
plot() function for each class, so that a tree is plotted one way, while a result from a regression model is plotted a different way, but users just have to call
plot() on each and R knows what to do.
In phylogenetics, we mostly care about classes for trees, for data, and for things to hold trees and data.
3.1.1 Tree classes
The main tree class in R is
phylo and is defined in the
ape package. Let’s look at one in the wild:
## ## Phylogenetic tree with 5 tips and 4 internal nodes. ## ## Tip labels: ##  "t2" "t3" "t5" "t1" "t4" ## ## Rooted; includes branch lengths.
## List of 4 ## $ edge : int [1:8, 1:2] 6 8 9 9 8 6 7 7 8 9 ... ## $ edge.length: num [1:8] 1.6235 0.1432 0.0806 0.0806 0.2238 ... ## $ tip.label : chr [1:5] "t2" "t3" "t5" "t1" ... ## $ Nnode : int 4 ## - attr(*, "class")= chr "phylo" ## - attr(*, "order")= chr "cladewise"
This is the one used in most packages. However, it has some technical disadvantages (sensitivity to internal structure, no checking of objects) that has led to the
phylo4 format for trees and
phylo4d for trees plus data in the
phylobase package. Other packages add on to the
phylobase format (i.e.,
simmap format) but these are typically not shared across packages.
3.2 Sequence data
The BioConductor community has put a lot of effort into making R work with high throughput data, though python is still likely more popular (just remember that python 3.x is the only currently supported version – python 2.x should not be used any longer). The seqinr package has some useful functionality for handling sequences. ape and phangorn have functions for handling data, including reading FASTA and NEXUS files in ape.
3.3 Other character data
This can include data such as discrete traits (has wings / wingless), continuous traits (body mass), geographic traits (latitudes and longitudes, which continents they occur on), and many more. These are typically loaded either as csv files from some other source or directly from an R package. This is a rapidly developing field, but many of the most useful packages are supported by rOpenSci – there are packages for getting information from GBIF, eBIRD, iNaturalist, NCBI, and many more sources.
The most common way to load trees is to use
phy <- ape::read.tree(file='treefile.phy')
To get a tree in Newick format (sometimes called Phylip format): essentially a series of parenthetical statements. An example (from
ape’s documentation) is
((Strix_aluco:4.2,Asio_otus:4.2):3.1,Athene_noctua:7.3);. The format name comes from the name of the lobster house where several major phylogenetic software developers met to agree on a tree format.
You can use the same function to enter tree strings directly, changing the argument from the
file containing the tree to
text containing the tree string:
Note the trailing semicolon.
One thing that can trip users up with
read.tree() (and the
read.nexus() function, below) is that the output class depends on the input. If you read from a file with one tree, the returned output is a single tree object with class
phylo. You can then use
plot() on this object to draw the tree, pass this object into a comparative methods package to estimate rates, and so forth. If the file has more than one tree, the returned class is
plot() will automatically cycle through plots as you type return, most comparative method implementations will fail (they are written to expect one tree of class
phylo, not a vector of trees in a different class).
read.tree() has an optional
keep.multi function: if set to TRUE, the class is always
multiphylo, and you can always get the first tree by getting the first element in the returned object:
phy.multi <- ape::read.tree(file='treefile.phy', keep.multi = TRUE) phy <- phy.multi[]
For NEXUS formatted files (Maddison et al., 2007),
read.nexus() function can pull in the trees (and its
read.nexus.data() function can pull in data from a NEXUS file). NEXUS is a very flexible format, and there are valid NEXUS files that still cause errors with
ape’s function. A more robust function to read in NEXUS trees is the package
readNexus() function (note the lack of a period and different capitalization of Nexus from
ape’s similar function).
phylobase uses a different structure to store trees than
3.4.1 Great scientists steal
Scientists have been creating trait-based phylogenetic trees for decades. These scientists are often experts in their group, in potential problems in their data, in how to use relevant software. In other words, their trees are likely to be better than any you make. Traditionally, these trees are published as a figure in a paper, largely unavailable for reuse. This hurts reproducibility, makes it less likely for the work to be cited, and stymies scientific progress in general. However, the field is increasingly moving to more frequent deposition of trees in reusable form: sometimes based on author initiative, sometimes based on journal requirements. The main repository for this is TreeBase: if you are reading a paper, and want to use its tree, that’s the first place to look. You can also use their website to search for taxa. The trees can be downloaded and loaded into R using
phylobase (the NEXUS format used by TreeBase is hard for
ape to load).
Another approach that is growing in importance is Open Tree of Life. It seeks to synthesize thousands of trees to create a single tree of life. The
rotl package can download this synthetic tree or components of it (the tree for a particular genus, for example). For most groups, however, the synthetic tree is largely based on taxonomy, so it is not very resolved. This is improving as the database of trees available for Open Tree’s synthesis grows (to add to it, go to ___________________), but for most scientific studies, I wouldn’t currently suggest using the synthetic tree (but for getting a sense for a group, making a tree for a class, it can be useful; also see the Phylotastic project for ways to use trees in teaching or other purposes). However, the Open Tree project also has a cache of thousands of trees that have been hand curated (taxonomic names resolved, ingroups specified, tree type recorded, etc.). The
rotl package lets you download these, too. For most analyses, you want trees with branch lengths, and so you can download just chronograms. For example,
Two important notes about reusing trees:
- Give credit: If your entire paper is based on the tree from one other paper, you must cite that paper (and also the ways you got the tree, including the packages and/or repositories). If it’s based on trees from around a dozen papers, you should cite them, too. If you’re getting into the hundreds, many editors will object to properly citing them all, but one compromise approach until a better way of giving credit appears is to have supplemental info or an appendix with citations for all the relevant papers (including DOIs to make these easier to parse later)
- Tree quality matters: As you will see in later sections, many comparative methods are based on using branch lengths: look at different rates of character evolution, looking at diversification rates over time, etc. If your starting tree is wrong, even if the topology is perfect but the branch lengths are wrong, later downstream analyses are also likely to be wrong. Some methods (like independent contrasts) are fairly robust to this (__________________), but the field has not tested many others yet, and most should be far more sensitive than contrasts. This matters less if you are testing dull hypotheses (see Chapter __________) but for folks working on biology where understanding processes, especially using parameter estimates, is the point, just taking a tree and making up branch lengths is often a bad idea.
3.5 Reconciling datasets
We use scientific names to communicate clearly. In the picture below ________________________, “Look at the robin!” will have an American glance at the bird on the left, and a Brit look at the bird on the right, but both, if trained sufficiently will know which to look at if told to look at _____Scientific name_______. We thus use scientific names in writing. However, the correct scientific name for a specimen can change for various reasons:
- A species is split into two species: some individual specimens remain in the original species, others are given a new species names (rules of taxonomy allow this, and give constraints on how the new species can be named and described)
- Two species are lumped into one species: some individual specimens thus have their names changed (and which name persists after the merge is specified by the rules of taxonomy)
- A higher level group is changed. For example, _________ proposed to split the Anolis genus into eight genera. Thus the genus name for some species changes, and sometimes the species name itself changes to match the genus names: ______ becomes ____________. This can be a merge or a split. This is often motivaed by a new discovery (the group known as acacias are not a clade (an ancestor and all its descendants) and since we only want to name clades, one of the groups needs a new name).
- An error is fixed. For example, it could be discovered that there was an earlier name for a species in the literature, and so the species name must be changed based on the rules of priority.
Importantly, for all but the last point, it is perfectly valid based on the rules of taxonomy for different scientists to use the names before and after the change.