Chapter 10. Phylogenetic Tree Reconstruction: Geometric Approaches

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 10

Phylogenetic Tree Reconstruction: Geometric Approaches

David Haws^*, Terrell L. Hodge^† and Ruriko Yoshida^‡, ^*IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, ^†Department of Mathematics and College of Arts and Sciences Dean’s Office, Western Michigan University, Kalamazoo, MI, USA, ^‡Department of Statistics, University of Kentucky, Lexington, KY, USA, [email protected]

10.1 Introduction

In this section, we give a brief, high-level overview that encompasses the main topics to be covered in this chapter, and the relationships between them. As will be seen, phylogenetic tree reconstruction is a multi-layered process, and covering all aspects of it would take far more than a single chapter. In the next section, we begin in earnest our investigations into a subcollection of the concepts raised in the Introduction that we intend to treat in more detail (and from scratch, with definitions). For comprehensive mathematical treatments of terms, methods, and details that we do not cover, see, e.g., [1,2] for graph-theoretical/combinatorial approaches; see also [3] for an algebraic geometry and statistics perspective on many of these topics. For a less mathematical, more biologically oriented text, see, e.g., [4].

Through the ages, the diversity of species on earth and the sources of that diversity have always featured prominently in human thought, as evidenced by Aurignacian cave paintings, religious scriptures, and modern science. A prevailing biological concept is that the diversification of life forms is related to the separation of gene pools. If geographical or other barriers separate gene pools, the process of gene flow can be insufficient to counteract genetic drift, and genetic or behavioral barriers emerge against future gene flow (even after the removal of physical barriers) [5]. Of course, genetic isolation alone is insufficient to explain diversity, which further requires the raw material of genetic mutation, inevitably acted upon by natural selection.

A phylogenetic tree (or phylogeny) is a diagram/graph which represents relations of evolutionary descent of different species, organisms, or genes from a common ancestor. A phylogenetic tree is a useful tool to understand and organize information on biological diversity, structuring classifications, and providing insight into events that occurred during evolution. An excellent preliminary guide to the biological perspective, upon which we build a more mathematical approach in this chapter, is [6]; see http://www.nature.com/scitable/topicpage/reading-a-phylogenetic-tree-the-meaning-of-41956.

One may reconstruct a phylogenetic tree among distinct groups or species from morphological data obtained by measuring and quantifying the phenotypic properties of representative organisms via, for example, parsimony. However, recent phylogenetic analysis uses nucleotide sequences encoding genes or amino acid sequences encoding proteins as the basis for classification. Therefore in this chapter we focus on phylogenetic tree reconstruction methods based on sequenced genes or genomic data sets. Through evolutionary history, a character (e.g., a nucleotide) of the sequence might be changed to another one, deleted or inserted. Thus, before reconstructing a phylogenetic tree, we have to align an input sequence data set, e.g., identify sequences of characters (DNA bases, amino acids, etc.) which are thought to be representing the same (“homologous”) regions of genes (from different species or different gene families, etc.) and then line up the sequences so that nucleotides in differing sequences can be compared site-by-site, where one site may vary from another by mutation, i.e., insertions, deletions, or substitutions of characters. Such a line-up of two sequences is an alignment; if we have multiple sequences then the result is properly called a multiple alignment, although we will often abuse terminology and use “alignment” for both. Aligning multiple sequences is generally known to be an NP-hard problem [7,8], and there are scores of approaches to creating alignments heuristically. Here we assume that we have a perfectly aligned sequence data set and focus on the reconstruction of phylogenetic trees from a multiple alignment.

There are several methods to infer a phylogenetic tree from a given alignment, including the maximum-likelihood (ML) method, distance-based methods, parsimony-based methods, Bayesian inference methods, and so on (see [1] for more details). In this chapter we will focus on distance-based methods.

Distance-based methods build phylogenetic trees from aligned sequences by making use of pairwise “distances” between the sequences. These distances arise from a model of sequence evolution, or evolutionary model, which encodes certain hypotheses about how sequences evolve, often including the probabilities that one character at a given site will transition to another, as well as assumptions about how sequences as a whole then transform, via evolution, into one another. Models of sequence evolution are needed to address the problem that observed sequences may have experienced much more change over time that their elements alone might show; by way of analogy, just as an observation that a light switch is currently off does not alone completely specify the number of times previously that it had been flipped on and off again. Many common models of sequence evolution are, in the lingo of mathematicians, given by a continuous-time Markov model with a substitution rate matrix whose entries are the probabilities of characters changing to one another.

Each such evolutionary model given by continuous-time Markov chains corresponds to a phylogenetic tree for which each node is a sequence, and the (directed) edges link sequences which evolve from one to another by data given by substitution rate matrices. In this way, the phylogenetic tree summarizes the relationships between the species (or other organisms represented by the sequences) in terms of common ancestors (nodes) and evolutionary changes via edge lengths (e.g., times since divergence, number of substitution events, etc.).

Under the assumption that a phylogenetic tree describes evolution via a continuous-time Markov process, the main problem of phylogenetic tree reconstruction is to find a tree when only the character sequences, such as DNA sequences or protein sequences, etc., are given. In the lingo of trees and Markov processes, one assumes that the only sequence data that observed is at the tips, i.e., in the leaves, while other information on the phylogenetic tree, the particulars of the substitution events, and edge lengths are missing. In a distance-based approach, distances between sequences (i.e., what should be sums of edge lengths on the path in the phylogenetic tree between the nodes for the sequences) are derived directly from the sequences by using the evolutionary model to compute the most likely distance between each pair of sequences. This is formalized in mathematical and statistical terms by so-called maximum likelihood estimates.¹ Computing these estimates will not be our focus; using them will be.

Pairwise distances, along with a distance-based method for reconstructing phylogenetic trees from the set of all pairwise distances, can be used to reconstruct a particular phylogenetic tree that relates the sequences. Usually the set of all pairwise distances computed from an alignment, collectively often called a distance matrix (or an example of a dissimilarity map), does not give a tree metric, which is a distance matrix realizing a phylogenetic tree. Thus a distance-based method tries to find a tree metric closest to the given set of pairwise distances computed from the alignment under some criteria.

To date, of all the tree reconstruction methods, distance-based methods for phylogeny reconstruction have been seen to be the best hope for accurately building phylogenies on very large sets of taxa such as the data sets for tree of life for the insects Hymenoptera [2,9]. More precisely, distance-based methods have been shown to be statistically consistent in all settings (such as the long branch attraction problem; see [1] for details), in contrast with other methods, such as parsimony methods, e.g., [10–12]. Distance-based methods also have a huge speed advantage over parsimony and likelihood methods, and hence enable the reconstruction of trees on greater numbers of taxa. However, a distance-based method is not a perfect method for reconstructing a phylogenetic tree from a given sequence data set. This is because, in computing pairwise distances, one ignores both the interior nodes and the overall tree topology and so there is a concern that one loses some information from the input data sets. Therefore, it is important to understand how a distance-based method works and how robust it is with noisy data sets. (However, it is noteworthy that from an information-theoretic point of view, a recent article [13] argues that at least some distance-based methods can be proved to preserve more information than may be obtainable from maximum likelihood methods for tree reconstruction, contrary to commonly expressed concerns in the mathematical and biological literature.)

A distance-based method is related to geometry and combinatorics. In fact, one can describe the space of phylogenetic trees, i.e., a set of all tree metrics over the set of all distance matrices, as points in a high-dimensional space which form a union of so-called polyhedral cones. In Section 10.3, we will take an elementary approach to realizing trees as points, and pave the way for further study of the space of phylogenetic trees from the view of polyhedral geometry.

There has been much work to understand distance-based methods, such as the balanced minimum evolution method and neighbor-joining method. In 2002, Desper and Gascuel introduced a balanced minimum evolution (BME) principle, based on a branch length estimation scheme of Pauplin [14]. The guiding principle of minimum evolution tree reconstruction methods is to return a tree whose total length (sum of branch lengths) is minimal, given an input dissimilarity map. The BME method is a special case of these distance-based methods wherein branch lengths are estimated by a weighted least-squares method (in terms of the input dissimilarity map and the tree in question) that puts more emphasis on shorter distances than longer ones. Each labeled tree topology gives rise to a vector, called herein the BME vector, which is obtained from Pauplin’s formula.

Implementing, exploring, and better understanding the BME method have been focal points of several recent works. The software FastME, developed by Desper and Gascuel, heuristically optimizes the BME principle using nearest-neighbor interchanges (NNI) [15]. In simulations, FastME gives superior trees compared to other distance-based methods, including one of biologists’ most popular distance-based methods, the Neighbor-Joining (NJ) Algorithm, developed by Saitou and Nei [16]. In 2000, Pauplin showed that the BME method is equivalent to optimizing a linear function, the dissimilarity map, over the BME representations of binary trees, given by the BME vectors [14]. Eickmeyer et. al. defined the nth BME polytope as the convex hull of the BME vectors for all binary trees on a fixed number n of taxa. Hence the BME method is equivalent to optimizing the input dissimilarity map (a linear function), over a BME polytope. In 2010, Cueto and Matsen [17] studied how the BME method works when the addition of an extra taxon to a data set alters the structure of the optimal phylogenetic tree. They characterized the behavior of the BME phylogenetics on such data sets, using the BME polytopes and the BME cones, i.e., the normal cones of the BME polytope. We will discuss some details of BME method and NJ method from the view of polyhedral geometry in Section 10.4, again opening the doors to further reading and review.

10.2 Basics on Trees and Phylogenetic Trees

Trees are ubiquitous objects in any theory in which branching processes occur. Most generally, as a graph, a tree is a directed, connected graph consisting of a nonempty set of vertices (or nodes), V, and set of edges, , in which there are no cycles or loops. We will assume all our trees are finite, that is, V is a finite set. Each edge identifies with a single ordered pair wherein, by definition, a is the initial vertex of e and b is the terminal vertex of e, so that e is drawn as the arrow , and we also often say e starts at a and ends at b. Given a vertex , the number of edges ending in v is the indegree of v, while the number of edges starting in v is the outdegree of v. The leaves of a tree T are those vertices for which the outdegree of v is 0 (also sometimes called terminal vertices). A tree may be rooted or unrooted; by definition, if T is rooted, then there is a vertex which has , and otherwise T is unrooted. An interior vertex is one which is not a node, although if T is rooted and , the root is usually not regarded as an interior vertex or as a leaf. A pair of two leaves is a cherry, if i and j have a common parent node,² say, a, i.e., there are edges starting at a and ending at i, respectfully, j. More generally, any leaves sharing a common parent node are called sisters.

Alternatively, when working with unrooted trees, as we will most often do below for reconstruction methods, one can dispense with directionality (orientation) and start by defining trees simply as connected graphs (on nonempty vertex sets) with no cycles or loops. In this case, edges are defined by pairs of (distinct) elements in V. It is common to represent an edge simply as a line segment (or ). One simply defines the degree of to be the number of edges incident to . Leaves are vertices with , and is an interior vertex if . Binary trees are then trees for which for all interior vertices. Cherries are then pairs of leaves joined by two edges that meet in an interior vertex , with (and is still the parent, or ancestor). Sisters are then defined analogously. An unrooted tree can be rooted by picking a leaf and adding an orientation directing every edge away from it; more commonly for biological purposes, a node is added along some interior edge, the two new edges split from it, as well as the remaining edges are, again, directed so that the node has indegree zero. If T is a directed tree in the earlier sense, then upon “forgetting” the directions of the edges, one gets an (undirected) tree in this new sense, and for any vertex v of . For unrooted trees, it is convenient to dispense with orientation, but helpful to view unrooted trees as graphs that could be directed and rooted as needed. We will assume unrooted trees are just graphs, rather than directed graphs, and rooted trees are directed graphs. We may use the term “directed tree” where there might be need or confusion.

There are many ways of representing trees, and many types of software available for drawing and viewing trees. In the following exercise, we explore the Newick format for representing rooted phylogenetic trees, along with some of the tree-drawing programs using Newick-style input to output rooted and unrooted trees. We do so through free access at the website, Phylogeny.fr at http://www.phylogeny.fr. This website is specifically geared toward providing phylogenetic tools for nonspecialists. First, we note that a compact representation for a rooted tree is to group sisters by parentheses, e.g., (A,(B,C)); represents the tree with cherry and then a root joining this cherry to A, as in Example 10.1. Such a format is the Newick format or Newick representation of the tree. (Note that the end semicolon in (A,(B,C)); is part of the standard Newick format, and not meant to be punctuation in regular English.)

Example 10.1

The rooted tree given by (A,(B,C)); see Figure 10.1.

Figure 10.1 (A,(B,C)) ;

Note that neither the root nor the parent node of the cherry is shown explicitly; instead, vertical lines are used to indicate that a branching event occurs. Shrinking these vertical edges to points would produce the nodes that would be commonly shown in a mathematical graphical representation of a tree. This is a common difference between rooted trees as often represented in biology, and trees in mathematical graph theory. Also, if a tree is unrooted, it is common to represent it in Newick format by arbitrarily picking a root. In this case, still, drawing programs may neglect to display interior vertices; they are to be inferred at a branching point. (This will be seen in Exercise 10.1 below.)

Exercise 10.1

1. Obtain the graph in Example 10.1 yourself, as follows:

a. Go to the link “TreeDyn” in the list “Tree Viewers” under the heading “Online Phylogeny Programs” at the bottom of the homepage of Phylogeny.fr at http://www.phylogeny.fr/, or similarly from the pop-up menu for “TreeDyn” that appears under “Tree Viewers” when mousing over the heading “Online Programs” in the bar across the top of the Phylogeny.fr homepage.

b. In the available textbox at the top of the “TreeDyn” page, type (A,(B,C)); click the “Submit” button, and await the output.

c. Once a picture of a tree is displayed, scroll down the page to “Tree Style,” and change from the default option “Phylogram” to the option “Cladogram.” The program will automatically redraw the graph.

2. Compare the result of part (1) with the original drawing (produced by “Phylogram”), and with the figure in Example 10.1 provided above. How do they differ?

3. Replace (A,(B,C)); by (A,(C,B)); in parts (1)–(2) with the “Cladogram” setting as above. Describe how the output differs from Example 10.1. Note that you can restart “TreeDyn,” or more easily, after completing any drawing, you can click on the “Data and Settings” tab at the top of the page to return to the data you originally input into the textbox, and edit it.

4. Draw a rooted tree on three leaves with cherry .

5. Draw an unrooted tree by inputing (A,(B,C)); but selecting “Radial (by Drawtree)” under the “Tree Style” option. What, if anything, changes when the letters , and C are permuted in the Newick input?

6. Use the Newick input (A,(B,(C,D))); and the “Cladogram” and “Radial (by Drawtree)” tree style options as in parts (1)–(5) above to create graphs of rooted and unrooted trees. Examine also the outcome of permuting the leaves in the Newick format.

7. Repeat part (6) of this exercise, but with a Newick representation of a tree with two pairs of cherries, and , with the rooted version having these two pairs join in a common parent node.

8. Apply the setting “Radial (by Drawtree)” (or just select the “Drawtree” program directly) with the Newick input (,(,)); and compare with part (5) of this exercise.

Exercise 10.2

Create other examples of a tree that is rooted and another that is unrooted. List the indegrees, outdegrees, and degrees for the vertices in your trees, as appropriate.

Exercise 10.3

Given an arbitrary tree T, show that if a root of T exists, it is unique. (If there is a root, there is only one root.)

An important means for comparing trees is that of an isomorphism. Directed trees and are isomorphic if they are isomorphic as graphs, meaning that there is a bijection which extends to a bijection on the edges by setting , given . Without loss of generality, we write for this isomorphism. Likewise, one may define isomorphisms of undirected trees and as bijections for with , for . Informally, the notion of an isomorphism captures the idea that trees are like mobiles connecting a set of objects, with edges that are springs. No matter how they are suspended by a string from the ceiling, and no matter how they are stretched or spun about, if they can be turned and their edges can be elongated or compressed so that the only difference between mobiles is the particular choices of the objects dangling from the ends (but not how many objects there are or the ways in which those objects are hooked together), then the trees the mobiles represent are isomorphic. Accordingly, isomorphisms of directed trees preserve the indegree and outdegree of each vertex, so if T and are isomorphic as directed trees, then T is rooted if and only if is.

As we will focus on applications and representations of trees in phylogenetics, we will be particularly interested in relationships between trees and their sets of leaves. In this context, for a finite set X, a phylogenetic X-tree (rooted or unrooted) is a tree for which is the set of all leaves of T, and, by definition, its leaves are labeled by X. That is, when pictured, the leaves are visibly labeled by the elements of X, while interior nodes are left unlabeled. An alternative, more formal mathematical formulation is to say that a phylogenetic X-tree is a pair consisting of a tree and a leaf-labeling , which, by definition, is an injection , but the informal idea will suffice. Although the interior vertices of a phylogenetic X-tree are, by definition, unlabeled, in what follows, it will be helpful in several scenarios to add labelings to interior nodes so as to describe or compare phylogenetic X-trees that underlie these more fully labeled trees.

Example 10.2

1. The trees in parts (1)–(5) of Exercise 10.1 were phylogenetic X-trees with the leaves labeled by , whereas the interior vertices were not labeled. The tree in part (8) of Exercise 10.1 had no labeled leaves and hence is not a phylogenetic X-tree, but does represent the “underlying tree” or “tree topology” of an unrooted phylogenetic X-tree on three leaves, as will be defined a little later.

2. An approximation to the evolutionary history of all organisms alive today would be a phylogenetic X-tree wherein the leaves X consist of all species currently in existence; in this case, . The Tree of Life project http:// tolweb.org/tree/ , while noting that a tree is an imperfect model, uses phylogenetic X-trees as an organizational framework, and the use there is representative of hundreds of more specialized scientific articles employing phylogenetic trees. The reader is encouraged to explore this website for current versions and many visualizations of phylogenetic trees on multiple organizational levels of species and other biological taxa.

A tree may simply be called phylogenetic if it is a phylogenetic X-tree for some (finite) set X. Without loss of generality, it is possible to use in place of X for . A binary tree that is a phylogenetic tree is called a binary phylogenetic tree. Binary phylogenetic X-trees form the most commonly used group of trees in biology, since branching events that arise from evolutionary events, like mutations or speciation, correspond to a splitting of one lineage into two, rather than more than two simultaneously. However, non-binary trees do arise when ancestry is uncertain (e.g., the order of speciation or other branching events is uncertain), and in this case, there may be sisters that are not just cherries. An extreme example is a star tree on n leaves, which has nodes and n edges, with the single interior vertex joined to each and every leaf.

Exercise 10.4

1. Show any rooted binary tree on leaves always has a cherry, and hence conclude any unrooted binary tree on leaves does, too.

2. Explain why the total number of vertices of any unrooted binary tree on leaves is .

3. Argue that the number of edges of any unrooted binary tree on leaves is .

4. Suppose . Using the previous part of this exercise, argue that the number of unrooted binary phylogenetic X-trees for a set X of n leaves is the double factorial . As a hint, consider how to build an unrooted binary phylogenetic X- tree on n leaves from any one on leaves by adding a new edge ending in the new leaf to an edge on the tree with one less leaf. (By definition of the double factorial, , and, generally, . (Alternatively, one can show that , for .)

Suppose is a phylogenetic X-tree and is a phylogenetic -tree. Since T and are trees with additional properties, and since isomorphisms of directed trees preserve indegrees and outdegrees, respectfully, preserves degrees for undirected trees, a function can be an isomorphism of the phylogenetic trees X and only if restricts to a bijection on the leaf sets. Necessarily, then, . A tree isomorphism for which the restriction of is in fact an identity map (so and for all ), is, by definition, an isomorphism of phylogenetic X-trees. In our biological context, the notion of isomorphism makes explicit those ways in which pictures of phylogenetic trees might differ, but still represent the same evolutionary relationships among the leaves.

Example 10.3

For , let be the unrooted binary phylogenetic X-tree ((A,B),(C,D)); where the cherry has ancestor u and the cherry has ancestor v. Let be the unrooted binary phylogenetic tree ((A,C),(B,D)); where cherry has ancestor s and cherry has ancestor t. Then setting given by , and creates an isomorphism of with as phylogenetic X-trees. (Note that the lengths of edges play no part here, only the connectivity relationships.)

Exercise 10.5

1. For , let be the unrooted binary phylogenetic X-tree ((A,B),(C,D)); and let be the unrooted binary phylogenetic tree ((A,C),(B,D)); and let be as in as in Example 10.3.

a. Explain why although is an isomorphism of trees, it is not an isomorphism of phylogenetic X-trees.

b. Find another distinct isomorphism of and just as trees.

2. Explain why trees (A,(B,C)); and (A,(C,B)); are isomorphic as phylogenetic trees, but (A,(B,C)); and (B,(A,C)); are not.

3. Explain why if (A,(B,C)); and (2,(1,3)); are isomorphic as phylogenetic trees, then (A,(B,C)); and (1,(2,3)); are not.

Formally, then, the unlabeled or underlying tree associated to a phylogenetic X-tree T is an equivalence class of all trees isomorphic to T. Informally, it is represented by any tree isomorphic to T, and is said to have the same topology or tree topology as T.

Example 10.4

There is only one tree topology for unrooted phylogenetic X-trees on a set X of four leaves; it is given in Newick form by ((,),(,)); (see Figure 10.2).

Figure 10.2 Quartet ((,),(,));

If there is an isomorphism of rooted phylogenetic X-trees, then T and are said to be equivalent trees, and likewise for two unrooted phylogenetic X-trees.

Exercise 10.6

1. (Quartets) There are three distinct unrooted binary phylogenetic X-trees on a set of four leaves X. That is, up to isomorphisms of binary phylogenetic X-trees, there are only three distinct such non-equivalent trees—every other unrooted binary phylogenetic X-tree on four leaves will be an unrooted binary phylogenetic X-tree equivalent to one of these. Draw (representatives of) these three trees.

2. Up to equivalence, how many distinct rooted binary phylogenetic X-trees on a fixed set X of three leaves are there? Draw representatives of each.

3. For , how many distinct tree topologies are there underlying the rooted phylogenetic X-trees? What if these X-trees are unrooted?

More formally, there is an equivalence relation on the set of all rooted phylogenetic trees (resp., unrooted phylogenetic trees) given by setting two such trees to be related if they are equivalent phylogenetic X-trees. Just as with equivalence relations in general, one often picks a representative of each equivalence class and identifies the whole class with any representative (e.g., just as a fraction in lowest common terms represents all fractions which reduce to it). Consequently, from now on, for any fixed X, when we speak of the set of “all rooted phylogenetic X-trees” or “all unrooted phylogenetic X-trees” we understand that any particular such tree can be represented by one on with the same underlying tree topology. For (rooted) phylogenetic X-trees, this formalizes the notion in biology of a “cladogram,” so when drawn in the plane, neither horizontal axis nor vertical axis has any particular meaning, e.g., as in [4, p. 15].

From now on, for any set X of n elements, we will let denote the set of all unrooted binary phylogenetic X-trees T. For example, by Exercise 10.4, consists of three distinct such trees (again, up to equivalence), and by Exercise 10.5, all these leaf-labeled trees in have the same underlying tree topology.

Exercise 10.7

a. Identify the subsets of trees below which are isomorphic to one another simply as trees (see Figure 10.3).

Figure 10.3 Figure for Exercise 10.7

b. For the same collection of trees, identify which are isomorphic as phylogenetic X-trees, for .

c. How many distinct tree topologies are there for unrooted binary trees with , and 6 leaves?

d. How many distinct (unrooted) binary phylogenetic X-trees are there in , if , for

10.3 Tree Space

10.3.1 Trees as Points

In the last section, we discussed trees, phylogenetic X-trees (and more so, binary ones) in particular, as graphs, and developed a bit of intuition supporting their interpretation in some biological situations. In this section, we move beyond the visual appeal of trees, as graphical models, in order to interpret them in a geometric framework that better sets the stage for implementing algorithmic and computational methods to reconstruct the phylogeny of a set of species (or genes, or other organisms) and for describing geometrically the advantages and pitfalls of those methods.

By definition, phylogenetic X-trees have leaves that are labeled, but for phylogenetic tree reconstruction, one may also want to label the edges. Thinking of the vertices V, of a phylogenetic X-tree , as species for the moment, then labels on the edges can, depending on the biological context, be regarded as providing some attribute of the sequences, often a measure of evolutionary change from one species to the next. In graph-theoretic terms, a labeling of the edges E of T is an edge-weighting, a function that assigns a real number , say, to every edge . In many cases, such edge-weightings are assumed to take nonnegative values, but for phylogenetic tree reconstruction algorithms, it is useful to allow for more general edge-weightings. In phylogenetics, the graphical notion of an edge-weighting corresponds to an evolutionary distance map. Determination of evolutionary distances is a process anchored in the choice of a so-called model of evolution, which describes how (and how frequently or with what probabilities) sequences change, e.g., via substitutions, insertions, or deletions in their character strings. This is a very large and significant area of research for which there are a number of nice treatments in both biology and biomathematics texts; see the references in the Introduction.

We also do not discuss in any detail here the many ways by which evolutionary distances are defined and determined, we do give two examples in Exercise 10.8 below.

Trees , together with weightings , will be the outcomes of the distance-based reconstruction methods we examine. We let denote the set of all ordered pairs of unrooted binary phylogenetic X-trees T together with positive edge weightings on T. Some distance-based reconstruction methods may produce edge weightings with negative values or zero, so it may be useful to extend to include these weightings.

Exercise 10.8

1. The Hamming distance, , is defined to be the number of characters at which two sequences x and y differ. Suppose we have two sequences over ; for example:

Compute the Hamming distance between .

2. The Jukes-Cantor correction dJC to the Hamming distance is defined as

where f is the frequency of the different sites between two sequences. For example, suppose we have two sequences of length 10, for which the Hamming distance between is . Then .

a. Compute the Jukes-Cantor correction for the sequences as in part (1) of this exercise.

b. Observe that for any two sequences of the same length, (and the same is true for dH). What can you say about (respectfully, ) if x and y are the same?

The Hamming distance represents an easy, but rather crude measure of difference between sequences. For example, it fails to take into account the possibility that sequences could have characters change over time, and then change back. Also, there is no accounting for well-known biochemical phenomenon such as the fact that the probability that one DNA character might change to another is not generally uniform, but likely differs on the particular DNA bases themselves, or how they are arranged or grouped along the sequence. So-called “evolutionary models” describe special sets of additional assumptions that are made to account for these kinds of issues, and methods for determining the related evolutionary distances between any two given leaves , that are represented by two aligned sequences (of DNA, RNA, proteins, etc.) , generally depend on the choice of particular models of evolution. For example, the model of evolution that eventually gives rise to the Jukes-Cantor correction in Exercise 10.8 above is obtained from the Jukes-Cantor model of evolution; see e.g., [4] or [1], for more details on this model and [3] a derivation of this distance from the model by an algebraic geometry approach.

The evolutionary model for Hamming distance is very simple; it assumes all bases have equal probability of changing into one another, all sites in the string of DNA are independent, and that the only change that has occurred over time is that which is observed. As with the Hamming distance or the Jukes-Cantor model and Jukes-Cantor correction, in general, evolutionary distances incorporate information differences or distinctions between sequences and , and hence give rise to a so-called dissimilarity map, to be defined precisely further below. Sequences which are the same provide no new information; it’s the sense in which they are dissimilar that shows change and hence evolution.

10.3.2 Fitting Trees: A Distance-Based Approach

With the previous comments in mind, one approach (a “distance-based approach”) to solving the central problem of phylogenetic tree reconstruction is to

1. start with the set of (aligned) sequences ,

2. use an evolutionary model to derive a measure of evolutionary distance between x and y on the basis of the information encoded in and , and then

3. use the numerical data , to try to reconstruct a phylogenetic tree T on X which is consistent with this data.

Each of the three steps above is highly significant; for this chapter, we will be focusing on the last step. In particular, since real sequence data contains “noise,” it is not usually possible to find a tree which “fits” the data exactly, so in the last step above it becomes necessary to look for a tree T on X which “best fits” the data , and we will be exploring some ways to do this in much greater detail. As preparation, we want first to return to some mathematical formulations that will be instrumental in what follows. We start with the definition of a dissimilarity measure (also called a dissimilarity map).

Mathematically, for any finite set X, a dissimilarity map on X is simply a real-valued ( for all ) symmetric ( for all ) square matrix with diagonal entries zero ( for all ).

Exercise 10.9

Assign values to the as-yet unspecified entries of the matrix below so as to make D a dissimilarity map on .

Exercise 10.10

(Just for fun; the result will not be used or needed in the rest of this chapter.) For those familiar with the linear algebraic interpretation of a real-valued n by n square matrix A as a linear transformation , via the matrix product , for represented as an n by 1 column vector, what is the effect of applying for D a dissimilarity map, to the standard bases vectors

As matrices, dissimilarity maps show up frequently in mathematics. For example, by the polar decomposition theorem, every real-valued square matrix M can be expressed as a sum of a symmetric and a skew-symmetric matrix via , so if M begins as a zero-diagonal matrix, then the symmetric component, , will be a dissimilarity map. As another example, a map is called a metric if , and for all . The last inequality is the well-known triangle inequality, familiar from Euclidean geometry; think of points , and z as lying on the vertices of a triangle. Recall that, for is the distance from x to 0 on the real line, and is the distance between any two real numbers on the number line. Likewise, for two points in the plane , setting gives the usual distance between the two points (length of line segment joining x and y). For two points and in three-space , by reducing to a picture with two triangles, each in their own plane, it is not hard to generalize the Pythagorean theorem and show that setting gives the standard distance between the two points (length of line segment joining x and y) in . Distances can take on other forms, but going back to general metrics, we have, by definition, that they are dissimilarity maps with the triangle inequality as an additional property. Metrics are prominent in the study of mathematical real analysis, as they generalize distance measures like the absolute value, as we have just seen.

Exercise 10.11

a. Give an argument to show that the standard distance map for truly is a metric.

b. Given the comments above, generalize the distance given by the Pythagorean theorem for and its analog in to points with four coordinates, and then five. It is a very important point that the notion of the Cartesian coordinate plane generalizes to spaces , and so on for , for any integer , where these spaces exist as collections of all the ordered n-tuples with n entries drawn from .

c. If you have not (or not recently) seen a proof of the Pythagorean theorem for points in , look for one in any linear algebra book, multivariable calculus text, or find an online source. Generalizing to for any positive integer is a common exercise in courses on linear algebra and advanced calculus. For a suggestive picture, see http://en.wikipedia.org/wiki /File:Cube_diagonals.svg.

Turning back to biology, starting with a labeled X-tree (T rooted or unrooted) with edge-weighting , a dissimilarity map arises naturally. As demonstrated in the example below, simply sum up the edge-weightings , over all distinct edges e on the path in T joining any two leaves in X, to get each value .

Example 10.5

For the quartet tree T with leaves and the edge-weighting indicated above (see Figure 10.4),

(10.1)

Figure 10.4 Edge-weighted quartet tree on for Example 10.5

Exercise 10.12

a. For the tree T and edge-weighting as in Example 10.5, for each set of three elements , calculate and compare it to . For example, for , and , while . How do these compare? What happens for the other combinations, including when two or more of are equal? Interpret your observations graphically.

b. Starting again with the notation of Example 10.5, for the fixed choices , calculate the quantities . What do you notice? Does your conclusion change if you take other choices of

c. For any quartet tree (that is, any unrooted tree with four leaves) for which i and j are sisters (cherries, here), use a graphical interpretation of relevant paths along the tree to argue that

and

for any positive edge-weighting on T.

The four-point condition for a dissimilarity map D on a set X is satisfied if

for any , where by definition the right-hand side of the inequality is the largest (“max”) of the two sums indicated.

Exercise 10.13

1. Show that any quartet tree satisfies the four-point condition.

2. If T is a phylogenetic X-tree, for any set X with , and with any positive edge-weighting and induced dissimilarity map , show that the four-point condition holds for T.

Returning to an arbitrary labeled X-tree with edge-weighting , we have, by assumption, that there’s no edge between a leaf and itself, so for any . This is consistent with notions of evolutionary distance: there’s zero dissimilarity between any sequence and itself. It is also presumed that the information encoded by evolutionary distance measures on pairs of sequences is independent of the ordering of the two (i.e., it does not matter which is “first” or “second”). From Exercise 10.8, one can see explicitly that the evolutionary distances dJ and dH are dissimilarity maps.

We have seen that if one begins with an edge-weighted X-tree T, using the tree and the weighting, one can use the edge labels to find a value for each pair of leaves , that has an interpretation as a “distance,” namely, the path length between x and y along the tree T, if the “length” of an edge e is taken to be its edge weight . The resulting matrix of all values is a dissimilarity map. Again, the fundamental biological problem of phylogenetics is to start the other way around—from a relevant set X (species, genes, etc. or sequences standing in for the species, genes, and so on) and a collection of relevant “distances,” find a tree T and an edge-weighting for which the natural dissimilarity map fits the data “well.” This is what is meant by “reconstructing a phylogenetic tree” from the given data, using a distance-based approach. In the most ideal case of “fitting well,” fits D exactly, that is, they agree as functions: for all . If one begins with a phylogenetic X-tree T, and a nonnegative edge-weighting for T, and takes as data D by setting , then trivially fits D exactly. This raises the issue of consistency in tree reconstruction methods, that is, whether a tree reconstruction method applied to data D that is derived from a tree T (perhaps weighted, with weight ) actually outputs this tree T (and the corresponding weights ). One can also speak more generally of statistical consistency of a tree reconstruction method, namely, the probability that the method outputs the correct tree given sufficient data about the input.

Exercise 10.14

1. For the quartet tree T with cherry having parent node u and cherry having parent node v, and for the representative sequences on the leaves below, if , can you find an edge-weighting for T so that

2. In the basic parsimony method of tree reconstruction, cherries are selected by linking nodes for which is minimal. For the same quartet tree T with cherry having parent node u and cherry having parent node v, suppose , and , with corresponding edge weight . (That is, if , is an edge, we let .)

a. Would the parsimony method applied to T with reconstruct this tree? Explain your answer.

b. At least how many characters long would the sequences , have to be to create a tree T with the given edge-weightings?

3. If D is a dissimilarity map on X for , is D necessarily equal to for some choice of quartet T and some edge-weighting

10.3.3 Tree Metrics and Tree Space

Dissimilarity maps D for which there is some labeled tree T and some nonnegative edge-weighting for T such that are called tree metrics.

Exercise 10.15

a. Show that a tree metric is necessarily a metric.

b. Show that every tree metric satisfies the four-point condition.

These results lead, in part, to a fundamental characterization theorem of tree metrics:

Theorem 10.1

1. A dissimilarity map D on a set X is a tree metric if and only if D satisfies the four-point condition.

2. If a dissimilarity map D on a set X is a tree metric for some edge-weighted tree on X, and likewise for an X-tree and edge-weighting on T, then T and are isomorphic phylogenetic trees. Moreover, for an isomorphism, for any edge ,if , then .

Part (2) of Exercise 10.15 gives one half of part (1) of Theorem 10.1, showing how to recognize tree metrics out of all possible dissimilarity maps; for a proof, see, e.g., [2, Theorem 7.2.6]. Part (2) of Theorem 10.1 shows that any dissimilarity map which is a tree metric for a phylogenetic X-tree not only comes from such a tree, but corresponds to essentially one and only one pair of a phylogenetic X-tree and edge-weighting on T; see [2, Theorem 7.1.8] and its proof. Part (2) of Theorem 10.1 implies that tree metrics on X correspond exactly to edge-weighted phylogenetic trees on X.

If a given dissimilarity map D on a set X is in fact a tree metric, then by definition, there is a tree and an edge weighting that fits D exactly. As previously noted, in general, evolutionary distance data D from real sequence data for species, genes, etc., is noisy, and in our new language, although D will be a dissimilarity map, D will usually not already be a tree metric. Consequently, one looks for ways to find a phylogenetic X-tree T with edge-weighting for which and D will be “close” in some specified way, and in this sense, say a tree with a “best fit” has been found. To discuss “closeness,” it would be helpful to have a notion of distance that could be used to compare dissimilarity maps D coming from sequence data with induced tree metrics . Having discussed previously how to find distances between two points on a line , in the plane , or in three space , and more generally in for by means of a standard distance (via Pythagorean theorem analog), we might be motivated to seek a way to regard both D and as points, and then seek a distance between those points that we would then try to minimize in order to find an edge-weighted tree T that “best fits” the original data. Taking that step is our next focal point.

Since any matrix is really just a nicely ordered list of elements (i.e., listed by row and cross-listed by column), there is already an immediate way to regard matrices as points (vectors): just start writing out the entries of the matrix across row 1; when you get to the end, start adding the elements of row 2, and so on, until all elements are listed. For example, if M is a matrix, then there is a correspondence of M and a vector with entries:

(10.2)

Other orderings of the elements of M to create a 16-tuple could be taken, of course, but as long as we are consistent this has no effect on our definition of distance.

Given the symmetry and existence of all zero-diagonal elements of any dissimilarity map , there is a lot of redundancy in the entries; in fact, if , then for the non-zero entries of D, there are at most distinct entries in the n by n matrix D. Starting with D, it is straightforward to create a streamlined vector in carrying the same information as D, by choosing an ordering for reading off, and then listing in that order, the elements of D above the diagonal. This is illustrated for one choice of such an ordering below, for D as in Eq. (10.1):

Exercise 10.16

We have just seen how to take any n by n dissimilarity map D and, having fixed an ordering of elements in D, turn it into a point (vector) , for . Starting from any vector in , explain how to reverse this process to obtain an n by n dissimilarity map . Apply this to the particular case . Is a tree metric?

With this adjustment, it is now possible to represent any two n by n dissimilarity maps by vectors , where . One can use the standard Euclidean distance measure to measure a distance between D and in . In particular, suppose D is a dissimilarity map coming from evolutionary distances between elements in a set X (or their representative sequences). Suppose for an edge-weighted phylogenetic X-tree T (with ). Then one can phrase the problem of finding a ( edge-weighted) phylogenetic tree T that “best fits” the observed data D on X as one of finding a pair so that is as small as possible. Equivalently, one may take a least-squares approach, since minimizing is the same as minimizing . When there is an exact fit, .

To summarize:

For a set , each positive edge-weighted phylogenetic X-tree gives rise to a tree metric .

For a fixed ordering of reading off upper diagonal elements in the matrix one obtains a vector .

There are one-to-one correspondences between

– pairs of trees and positive edge-weightings,

– the n by n dissimilarity maps D that are tree metrics , and

– the vectors .

In this way, edge-weighted phylogenetic X-trees become points in , and matching such an edge-weighted tree to a dissimilarity map D arising from evolutionary distances on sequences corresponding to a set X of species (or genes, etc.) becomes a problem of minimizing distances of the vectors and in .

Regarding positively (or nonnegatively) edge-weighted trees with as a subset of points , the goal is, when presented with a point , to seek one of these points as close to D in as is possible.

In this way, a fundamental problem of biology is turned into a geometric problem. In particular, one wants to understand better the set regarded as a set of points . The set is often called “tree space,” though there is one for each n.

Example 10.6

If , then there is only one tree topology, and there is only one (unrooted binary) phylogenetic tree , but infinitely many edge-weightings for any choice of T. More precisely, for each such T, there are three edges, and for each edge can take all positive real values , independent of the other edges. The edge-weighting on T is completely described by the ordered triple . In this way, geometrically, the significance of a choice of edge-weighting for T is that describes a point in the positive orthant of associated to T. Taking all possible positive weightings on on T corresponds precisely to the full positive orthant for T. For , for any , there are edges and phylogenetic X-trees . The pairs correspond to points in three copies of . (For those not so familiar with high-dimensional geometry, this is analogous to the way in which, for every point in time, there is a three-dimensional copy of space at that point in time, only here the finitely many trees play the role of selecting out just finitely many points in time to consider, and the are subsets of -dimensional space, rather than 3-dimensional space.) In more mathematical terms, lives nicely in a product vector space , and looks like three copies of (the positive orthants of) . More generally, for is the product set , of many copies of , where and , both as per Exercise 10.4.

The treatment of the tree space is handled somewhat differently by different authors, to take advantage of simplifications in representing trees (and to use rooted trees in some cases, unrooted in others). For instance, if one extends to include nonnegative edge-weightings (that is, some values can be zero), then one may include positively weighted, (unrooted) nonbinary trees on n leaves as part of tree space, and then compare how edge-weighted binary n-trees are related to one another through the collapsing of one or more edges. For an original definition of tree space and helpful pictures of this sort in the rooted tree case, as well as a discussion of how the positive orthants are “glued” together along spaces corresponding to (nonbinary) trees with collapsed edges, see [18]; for another analogous but different perspective on the unrooted case, also with nice pictures, see [19]. Using another variation on tree space, there has been a lot of work done to examine distances, particularly geodesics, in tree space; see [18,20,21]. Regardless of the precise formulation of tree space, as the example and brief treatment above may geometrically suggest, it can be useful to group points by their phylogenetic trees T or even by their underlying topologies. This is, again, the perspective that for each there is the “same” copy of attached to it, so to determine a point in , it is often useful to focus on determining just the tree alone, and then think of the associated weighting , even though the two are linked. (Again, this is analogous to thinking of a point in our everyday four-dimensional space-time lives first by its time coordinate, then as a point in space at that time.)

In any case, for practical purposes, however, the notion of standard distance in may not be the best distance measure for comparing trees and fitting trees to dissimilarity maps. In fact, it is shown in [22] that taking a least-squares approach to picking an edge-weighted tree T with weighting and induced metric to best fit a given dissimilarity map is an NP-complete problem. The same holds true for minimizing (the f-statistic) rather than the corresponding sum of squares.

In the search for heuristics to carry out tree reconstruction, many fitting algorithms proceed instead by wandering among the trees (points) in tree space, and rather than use a least-squares or f-statistic approach, use metrics that record points as being “close” when their trees (edge-weighted by , resp., ) can be transformed into one another explicitly by means of a small number of iterations of a few well-known operations on trees as graphs. In the case of Robinson-Foulds (RF) topological distance, the two relevant operations are a form of merging nodes and then splitting nodes on the underlying tree topologies (ignoring weightings), as illustrated in the picture below for two trees of RF-distance apart (see Figure 10.5).

Figure 10.5 Trees and of RF distance one apart

Other well-known operations in tree space include the nearest-neighbor interchange (NNI move) the subtree-pruning-and-regrafting move (SPR move), and the tree bisection and reconnection move (TBR move). A generic NNI move is illustrated in Figure 10.6 above, where each triangle represents a clade, that is, a subtree arising from a node and all its descendants and edges connecting them; here it could be a single leaf. For clades , an NNI move is essentially a move on a quartet, exchanging the neighbor of one cherry for a neighbor of the other cherry (see Figure 10.6). The SPR moves and TBR moves are more general but can be similarly pictured; see, e.g., [2, Section 2.6] for definitions and representative pictures, or [23, Section 3] for a definition and picture of an SPR move relevant to our later discussion.

Figure 10.6 Generic NNI moves; each tree is related to the other by a single NNI move

10.4 Neighbor-Joining and BME

10.4.1 Neighbor-Joining

The Neighbor-Joining (NJ) Algorithm [16] takes as input a dissimilarity map and builds an edge-weighted unrooted binary phylogenetic X-tree , by iteratively picking cherries out of X. There are three essential steps in the NJ Algorithm:

1. Pick a cherry , for leaves .

2. Create a node joining leaves i and j.

3. Compute the distances from other nodes (including and all other leaves in X) to the new node (and so produce a new dissimilarity map).

If , one then continues the procedure by eliminating from the set of leaves X (i.e., replace by in the set X) and seeking the next cherry in the remaining set of leaves to join, until the total number of remaining leaves n is 3. The result is an edge-weighted unrooted binary phylogenetic X-tree. Visually, one can proceed by picturing a star tree on -leaves and seeing the NJ Algorithm as a set of instructions for successively splitting off leaves onto their own branches (see Figure 10.7).

Figure 10.7 Pictorial representation of successive steps in the Neighbor-Joining Algorithm. From http://en.wikipedia.org/wiki/File:Neighbor-joining_7_taxa_start_to_finish.png.

The key component of the NJ Algorithm is the method for picking cherries, accomplished by the Q-criterion, described in Theorem 10.2 below. The formulation of the cherry-picking step of the NJ Algorithm, as a problem of minimizing the Q-values, comes from a reworking of [16] by [24].

Theorem 10.2

(Cherry-picking criterion (also known as Q-criterion) [16,24])Let be a tree metric for a tree with leaf set X, and define the -matrix with entries:

(10.3)

Then any pair of leaves , for which is minimal, is a cherry in the tree T.

The NJ Algorithm is predicated upon the Q-criterion in the sense that if one starts with a tree metric, then the minimum value of Q gives a cherry, so if one is looking to build a tree from an arbitrary dissimilarity map, one might likewise try calculating the Q-values, and hope it works. As it turns out, this is practically and mathematically effective, yielding an algorithm with running time [24], but in its current formulation, the rationale for the Q-criterion is nonobvious. The next exercise helps shed some light on the cherry-picking criterion, when .

Exercise 10.17

For the quartet ((A,B),(C,D)); as in Example 10.3, there are five edges . No matter what values an edge-weighting actually assigns to these five edges, if we use the tree metric in place of in the cherry-picking criteria, then, using the first expression for Q in Theorem 10.2, the summands in the expression for the Q-values are basically just sums of path lengths along T. For example, the first term in is , or twice the length of the path from A to B. The next summand subtracted off is the sum of three terms: the length of the path from A to B, the length of the path from A to C, and the length of the path from A to D. The last summand subtracted off is the sum of three terms just like the last, but starting from B and going to A, to C, and then to D. Thus the path lengths from A to B cancel out completely, and in what remains, each length from a leaf to its parent node (i.e., , and ) appears twice, while the length from u to appears four times, and all are then multiplied by to give . (By tracing over the graph, one can even just graphically represent all the path pieces described above and visually see the cancellations/reduction to this expression for .) To complete this exercise:

1. Do a similar analysis for and explain why .

2. Explain why likewise .

3. Observe by symmetric arguments that only two possible things can happen: for all other pairs in X OR , and otherwise both are less than when or .

Exercise 10.17 shows that the cherry-picking criteria at least makes sense for unrooted binary quartet trees, since there is only one topology to consider. There is a much stronger connection between quartets and the NJ Algorithm, as is given by the next theorem, shown, e.g., in [25].

Theorem 10.3

If and D is a dissimilarity map on X, then the NJ Algorithm returns a tree that satisfies the four-point condition.

As suggested by the last part of Exercise 10.17, in general, while the NJ Algorithm will pick out the pair of cherries with minimal distance between them first, if there are two or more pairs of cherries with the same distance between them, any of these may be chosen first. Thus, more generally, the order of choosing cherries in the NJ Algorithm may not be unique (i.e., when multiple pairs of leaves have the same Q-values). If the NJ Algorithm selects taxa as a cherry, and is the new node joining , then the new dissimilarity map is defined to be

Exercise 10.18

1. Apply the NJ Algorithm to the dissimilarity map D specified by the information given below. (Recall that any element marked can be inferred from the data given.) Take note of the order in which cherries are picked, and when more than one choice of cherry could be made.³

2. The software package BIONJ is an improved version of the NJ Algorithm that generally produces the same trees when the number of leaves is small. Access BIONJ at http://www.phylogeny.fr, e.g., from the homepage under “Phylogeny” pop-up, located under the heading “Online Programs” in the bar near the top of the page.

a. First, click on the option to “load an example of a distance matrix.” Scroll down and click the “Submit” button to see the output. Click on the “Data and Settings” tab above the text box to return to the input dissimilarity map.

b. Based on that experience, now replace the example data with the corresponding distance matrix data for D in part (1) of this problem, adjusted accordingly, and apply BIONJ. Use the leaf names “A, B, C, D, E, F,” in that order, with A corresponding to the first row/column of the dissimilarity map D, and so on. If Java is supported, the button “Visualize your tree with ATV” can also be used for another view of the tree.

c. For more practice with real species:

i. Go to http://www.atgc-montpellier.fr/fastme/fastme_ex.txtThree dissimilarity maps appear there. Try loading the first one into BIONJ. (Note: For implementation at http://www.phylogeny.fryou will need to put the species name and the corresponding numerical data for its row together on the same line.)

ii. Predict, from the output, the first cherry that would be picked by the NJ Algorithm, and then verify your guess by computing the Q-criterion for the dissimilarity map (i.e., carry out the first step in the NJ Algorithm).

iii. What species are being related, here? You may wish to make use of the search feature at http://tolweb.org/tree/.

Recall our geometric perspective that, for a given a fixed , and a dissimilarity map (distance matrix) D, to “fit” a tree to D is to find a pair so that and D are “close.” The NJ method is at least consistent: If a tree metric is given as the input D, then will be returned as the output.

Exercise 10.19

Suppose D is an dissimilarity map () and , for some positive constant , that is, for all .

1. Explain why, in the first iteration of the Q-criterion for D and , there will be the same choices of cherries to pick, i.e., if is a cherry for D then it is also one for , and vice versa.

2. Continuing, suppose that if there is more than one choice of cherry to pick, then one picks the same cherry for both D and . Now explain why the distance from the new node joining this cherry for is times that for D.

3. If one always continues to pick the same cherries (whenever a choice is offered), explain why the trees returned by applying the NJ Algorithm to D and to will have the same underlying leaf-labeled tree topology.

The NJ Algorithm constructs a weighted (binary) phylogenetic X-tree from a dissimilarity map D on X. As a tree, T has vertices V, with , and the NJ Algorithm proceeds by constructing the set V, edges E, and edge-weights from the data D on X. An order of picking cherries in the NJ Algorithm when outputting the pair from D could be defined as a sequence of pairs of elements , for which is the kth cherry to be picked. The first cherry is a subset of X, but the remaining pairs may involve interior vertices. For fixed, no matter what phylogenetic X-tree T is constructed, by Exercise 10.4, the number of vertices is always the same. Without loss of generality (that is, by relabeling the elements of V), it makes sense to speak of an ordering of cherries or cherry-picking order as a sequence of pairs of elements of (with, say, the elements of corresponding to ), for which the NJ Algorithm produces some binary phylogenetic X-tree T from some input dissimilarity map , by picking cherries according to .

For a cherry-picking order on V, with , and , for , take to be the set of all dissimilarity maps D in for which the NJ Algorithm applied to D proceeds using . (The NJ Algorithm also outputs a weighting on T, but that’s not important for the definition of .) Applying Exercise 10.19, if then for any positive multiple . Thus, is a cone in . This cone is called a neighbor-joining (NJ) cone. The entire space is the union of NJ cones, running over all cherry-picking orderings . Take to be the set of all dissimilarity maps D in for which the NJ Algorithm outputs the unrooted binary phylogenetic X-tree with some weighting . For each (binary) phylogenetic X-tree , one has for cherry-picking orders that are compatible with T. In some cases, a tree can arise from more than one ordering, and in that case, the boundaries of the corresponding NJ cones intersect. Neighbor-joining cones describe the NJ Algorithm geometrically—the output of the NJ on a point is a weighted tree with underlying topology T if and only if D is in the union of NJ cones .

Example 10.7

For an unrooted binary phylogenetic X-tree on a set of five leaves, one has , and the orders of picking cherries in the NJ Algorithm applied to any dissimilarity map will consist of sequences of pairs from V. Necessarily, T has two pairs of cherries, say and , one other leaf b, and three interior nodes . In the NJ Algorithm, for any order of picking cherries that yields T (with some weighting) the first cherry picked will be either or . After this, if, say, u is the created ancestor node for , then either is a cherry to be picked, or a cherry to be picked. Once either cherry is chosen, the other is completely determined for the next step, since there is only one (binary, unrooted) tree topology for . Thus, for and cherries and , the NJ cones , and are the same, for starting with the subsequence and for starting with or , where is the ancestor of . Switching the order of and in and does yield different orders and cones. Thus, for any binary phylogenetic X tree T on leaves, there are two essentially distinct NJ cones which encompass all dissimilarity maps D that output T; in [26] these are labelled , and . Consequently, there are total NJ cones for . For more details, see [26].

Neighbor-joining cones are defined and studied in detail for small trees () in [26]. By studying and comparing the volumes of the NJ cones, [26] can geometrically formulate results about the behavior of the NJ Algorithm. These include how likely the NJ Algorithm is to return a particular tree topology given a random input vector (dissimilarity map), and the robustness of the NJ Algorithm. More complete information about these NJ cones, including information about “dimensions” of the cones, rays of the cones, faces, and other results, also appears in [27]. In both [27,23], geometric information is used to gain further insight into a comparison between the NJ Algorithm as a phylogenetic tree reconstruction method and another distance-based method, the balanced minimum evolution (BME) method, a topic we now explore.

10.4.2 Balanced Minimum Evolution

Recall that for any set with , and any edge-weighted phylogenetic X-tree , with , there is a naturally associated tree metric . One can set the total length of the tree to be the total sum of all the edge weights: . The minimum evolution principle for tree reconstruction uses the idea of minimizing tree length in order to find the best-fitting tree to a given dissimilarity map . Specifically, given a dissimilarity map as input, a minimum evolution method seeks to locate as output an edge-weighted tree so that for all , and is as small as possible. Biologically, the minimum evolution principle is driven by the idea that evolution is efficient. Since branch lengths represent amount of evolutionary change leading from one, say, species to the next, the total tree length is a measure of that efficiency across the whole tree, and hence should be minimal under a minimum evolution perspective.

There are many different minimum evolution methods; for all, it is necessary to start from D and come up with not only with the phylogenetic tree T, but also the edge weighting (branch lengths) . Since there are finitely many possible phylogenetic X-trees, one approach is, starting with any dissimilarity map and starting with any tree , to come up with a branch length estimation scheme for T based on the data D. One then computes the resulting tree length and, looking over all possible T, seeks T so that in minimized.

One approach for producing from D, the Balanced Minimum Evolution (BME) method, uses an approach of Pauplin’s [14]. Pauplin’s scheme for branch length estimation from D, applied to the case of just a quartet, should now be very familiar to us from our previous exercises on the four-point condition and the NJ Algorithm. Specifically, if is the interior edge joining u and in the (unrooted) quartet tree ((A,B), (C,D)); as in Example 10.3, then is given by setting . This idea is extended in [14] to the case when are replaced by entire subtrees in an arbitrary tree . The result gives an iterative formula for for any interior edge in T that accounts for the relative size (say, number of leaves) in each subtree. (This is the “balance” of the BME method.) The (easier) case of edges that are external is also handled there. Rather than say more about these formulas, however, we instead make use of a very nice result by which the total tree length can be calculated more easily, without first finding all the branch lengths .

Starting from , define to be one divided by the product of for every interior node x encountered on the path in T from i to j. Equivalently, set to be the number of edges between leaves i and j on the path . In this case, we can see that . Now set

Notice that only depends on the topology of T, so we may call it the BME vector of T.

Example 10.8

For shown in Figure 10.8, below, one obtains the BME vector

Figure 10.8 An example of ; see Example 10.8

Exercise 10.20

Find the BME vectors for each .

Using the BME vectors, Pauplin’s[14]formula for the balanced tree length estimation (i.e., estimated BME length) , starting with the dissimilarity map D, is given by

(10.4)

Equivalently, in the standard language of innner products of vectors, upon rewriting D as a vector d using the ordering from before, one gets . The BME method proceeds by seeking so that in Eq. 10.4 is minimal.

Exercise 10.21

Using the results of Exercise 10.20 and the dissimilarity map from Example 10.5, use the BME method to find a tree that best fits D under Pauplin’s branch length estimation scheme. Also find for the interior edge e of T in this case as well.

In [28], the authors in fact defined terms for any phylogenetic X-tree T (not necessarily binary) in terms of certain cyclic permutations of (“circular orderings”) of X that respect the structure of T. In the case of edge-weighted binary X-trees, one recovers the expression for in the BME vector and Pauplin’s formula. In [28] this perspective was used to establish the consistency of the balanced tree length estimation. The consistency (and statistical consistency) of the BME method as a whole was established in [29].

Like all methods that would require a complete search over all trees , combinatorial explosion is a problem. The large number of trees to consider quickly overwhelms computer capacity, so that in practice, one must come up with a heuristic. In [29,15], the theoretical underpinnings for the BME method were further examined, and a heuristic proposed and implemented for the BME method via the program FastME. This program essentially uses only nearest-neighbor interchange (NNI) moves among trees to seek out a tree T/BME vector that is a good candidate for having minimal . For more details, see [29,15].

Exercise 10.22

To access FastME, go to http://www.atgc-montpellier.fr/fastme/

1. Use the example file (under input data) to explore the option “Balanced GME.” Note that one can click on “file” in “example file” to see the data, some of which will be familiar from previous exercises. Results will be e-mailed back to you in Newick format.

2. Review [15] for a discussion of the success of this implementation versus the NJ Algorithm.

Options in FastME also include implementations of the NJ Algorithm and its improved version, the BioNJ Algorithm. There is a special reason why. Although the NJ Algorithm was long a favorite among molecular and other biologists for its effectiveness in practice, after many years of investigation, reservations about the NJ Algorithm remained, due in part to an ongoing sense of uncertainty about just what the algorithm was doing (and hence why it could be successful), from a more theoretical perspective. Illumination came in the form of a result by [29] linking the BME method and NJ Algorithm by showing that the NJ Algorithm is a so-called greedy algorithm for implementing the BME principle. Since the BME method is well grounded in standard mathematical approaches (minimization problems), this link shed light on the “true” nature of the NJ Algorithm. As a greedy algorithm, the NJ Algorithm acts locally, however, and like any greedy algorithm, does not necessarily actually produce a tree T which is optimal for the BME principle (i.e., has total length minimal). The BME method for phylogenetic tree reconstruction can be reformulated still further in a neat geometric way. By definition, a convex set B (say ) is one for which any point that is on any line joining any two points is also in B, i.e., z on implies . It is an easy exercise to see that the intersection of any set of convex sets is also convex. Let be the convex hull in of all the BME vectors for ; by definition, the convex hull of a set of points is the intersection of all convex sets containing these points. This produces a polytope (analog of a polygon in higher dimensional space), which we call the nth BME polytope , inside . As a consequence of the definition of the BME vectors and the BME method, the vertices of the BME polytope are actually the BME vectors . Using the generalized notion of BME vectors as in [2], the BME vector of the star phylogeny lies in the interior of the BME polytope , and all other BME vectors lie on the boundary of the BME polytope.

Exercise 10.23

Explain why is just a triangle in .

By Exercise 10.23, the BME polytope for is a familiar two-dimensional object. More generally, the dimension of . If , then thinking not of the full polytope in , but considering only vertices and edges of alone, there is a graph isomorphism between and the complete graph on n vertices.

Example 10.9

For , so is graph isomorphic to the complete graph on 15 vertices, shown below (see Figure 10.9).

Figure 10.9 A representation of the BME polytope

These and many other facts about BME polytopes appear in [27,23]. For , [27] observed that in the actual polytope , the graph isomorphism with the complete graph on vertices no longer holds: there are pairs of trees for which the BME vectors are not connected by an edge of the polytope. That is, the line joining and is not itself an edge of the polytope, but must sit somewhere inside , since is convex. See [p. 5][27] for a picture and description of these pairs, and other information about faces, facets, and more for small n. For , computational complexity prevents the hands-on geometric methods explored in [27] from being exploited for higher n.

Describing more fully the edges and other features of the the BME polytope could have practical implications for phylogenetic tree reconstruction. In the language of the well-known mathematical methods of linear optimization, the BME method becomes a linear programming problem: using the correspondence between and , carrying out the BME method for D is the same as minimizing the linear objective d over the (convex) BME polytope . This framing of the problem motivated further theoretical studies of the BME polytope in [23], where a new version of a cherry-picking algorithm, this time for the BME method, was developed and used to show that both NNI moves and SPR (subtree-pruning-and-regrafting) moves between trees give rise to edges in the BME polytope. Since FastME uses NNI moves to heuristically implement the BME method, Haws et al. [23] implies that FastME is an edge-walk along the BME polytope. With better knowledge of the BME polytope, it might be possible to find an even faster heuristic. Results in [23] show that clades (subtrees obtained by taking all children of a parent node and the edges that join them) correspond to “faces” in the BME polytope and identify themselves with BME polytopes values inside .

Finally, as it was for NJ cones, it is possible to define BME cones. For any , a BME cone is simply the set of all for which as in Eq. 10.4 is minimal. Again, geometrically, the BME cone describes the BME method, for the cone for T consists of all D for which the BME method applied to D produces T (with weighting ). As before, is partitioned into the BME cones, running over all , although a given D may lie in more than one BME cone. A consequence in [23] of this analysis, in combination with previous work on the NJ cones, is a further refinement of the relationship between the relationship of NJ as a greedy algorithm for the BME principle. In mathematical lingo, for any cherry-picking order , the intersection of the NJ cone and the BME cone for T has positive measure. In particular, this shows that for any order of joining nodes to form a tree, there is a dissimilarity map so that the NJ Algorithm returns the BME tree, a useful phylogenetic result. The performance of the NJ Algorithm and BME method can be further analyzed through studying these cones, and the implications of their differing geometries are explored further in [27,23]. A complete characterization of edges of the BME polytope in terms of tree operations remains an open question, however.

Also, in terms of further work, [17] studied how the BME method works when the addition of an extra taxon (e.g., species) to a data set alters the structure of the optimal phylogenetic tree. They characterized the behavior of the BME phylogenetics on such data sets, using the BME polytopes and the BME cones which are the normal cones of the BME polytope, as noted in [27]. A related study for another tree reconstruction method, UPGMA, was carried out in [30].

10.5 Summary

Many open questions in the treatment given in this chapter of some distance-based methods for phylogenetic tree reconstruction, via geometric and computational perspectives, remain. For example, giving a complete characterization of edges of the BME polytope in terms of tree operations has yet to be done. Other opportunities for further research include expanding from trees to networks, since it is known that trees do not capture well several aspects of molecular biology, including hybridization. Currently, distance-based methods have been superseded (at least in common use by many biologists) by other approaches. These other approaches include maximum likelihood methods and Bayesian approaches, the latter of which produce not just a single tree, but a distribution of possible trees.

Still, there remain good reasons for using distance-based methods. We have, quite misleadingly used trees with small examples of leaves in this chapter in order to make the underlying ideas accessible, and because much of the underlying mathematical theory can be reduced to small trees, like quartets. However, in the “real world” of biology, the most obvious reason for continuing interest in distance-based methods is due to their effectiveness via good polynomial time algorithms in the face of combinatorial explosion and computational complexity when the number of leaves is large. Even can overwhelm many other methods, but we have (also rather scandalously), with the exclusion of consistency, not gone into details in this chapter about comparing the success, via additional factors such as efficiency, robustness, power, and falsifiability, of distance-based algorithms and their competitors. (See [4, Section 6.1.4] for a brief but comprehensive treatment, up to its publication date.)

The link between the NJ Algorithm and the BME method described above, but not as well known as it might be, also addresses some issues which have perhaps mistakenly led biologists to pull back from using the NJ Algorithm or its variants. A very recent paper [13] also provides some thought-provoking arguments against another commonly voiced concern among biologists, that in providing only information between leaves, distance-based methods may lose (too much) information (e.g., see [4, Section 6.2.4] for a sample of such concerns in a biology-friendly text). Looking at covariances in the dissimilarity maps reveals more information that has commonly disregarded, given calculations like that for the Q-criterion which explicitly use only the values for an input dissimilarity map D. Still, we hope this chapter’s treatment will open the doors to these questions and issues, and to further readings in the articles and books referenced herein.

References

1. Felsenstein J. Inferring phylogenies. Sinauer Associates Inc. 2004.

2. Semple C, Steel M. Phylogenetics Oxford lecture series in mathematics and its applications. vol. 24 Oxford: Oxford University Press; 2003.

3. Pachter L, Sturmfels B editors. Algebraic statistics for computational biology. Cambridge University Press 2005.

4. Page RDM, Holmes EC. Molecular evolution: a phylogenetic approach. Blackwell Science Ltd 1998.

5. Turelli M, Barton NH, Coyne JA. Theory and speciation. Trends Ecol Evol. 2001;16:330–343.

6. Baum D. Reading a phylogenetic tree: the meaning of monophyletic groups. Nature Educ 2008;1.

7. Wang L, Jiang T. On the complexity of multiple sequence alignment. J Comput Biol. 1994;1:337–348.

8. Elias I. Settling the intractability of multiple alignment. J Comput Biol. 2006;13:1323–1339.

9. Desper R, Gascuel O. Theoretical foundation of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting. Mol Biol Evo. 2004;2:587–598.

10. Felsenstein J. Cases in which parsimony and compatibility methods will be positively misleading. Syst Zool. 1978;27:401–410.

11. DeBry RW. The consistency of several phylogeny-inference methods under varying evolutionary rates. Mol Biol Evol. 1992;9:537–551.

12. Denis F, Gascuel O. On the consistency of the minimum evolution principle of phylogenetic inference. Discrete Appl Math. 2003;127:63–77.

13. Roch S. Toward extracting all phylogenetic information from matrices of evolutionary distances. Science. 2010;327:1376–1379.

14. Pauplin Y. Direct calculation of a tree length using a distance matrix. J Mol Evol. 2000;51:41–47.

15. Desper R, Gascuel O. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J Comp Bio. 2002;9:687–705.

16. Saitou N, Nei M. The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–425.

17. Cueto MA, Matsen FA. Polyhedral geometry of phylogenetic rogue taxa; 2010. Preprint <arXiv:1001.5241>.

18. Billera LP, Holmes SJ, Vogtman K. The geometry of space of phylogenetic trees. Adv Appl Math. 2001;27:733–767.

19. Nye T. Principle component analysis in the space of phylogenetic trees. Ann Stat. 2011;39:2716–2739.

20. Owen M. Computing geodesic distances in tree space. SIAM J Discrete Math. 2011;25:1506–1529.

21. Owen M, Provan JS. A fast algorithm for computing geodesics in tree space. IEEE/ACM Trans Comput Biol Bioinform. 2011;8:2–13.

22. Day W. Computational complexity of inferring phylogenies from dissimilarity matrices. Bull Math Biol. 1987;49:461–467.

23. Haws D, Hodge T, Yoshida R. Optimality of the neighbor joining algorithm and faces of the balanced minimum evolution polytope. Bull Math Biol 2011. doi 10.1007/s11538-011-9640-x.

24. Studier JA, Keppler KJ. A note on the neighbor-joining method of Saitou and Nei. Mol Biol Evol. 1988;5:729–731.

25. Mihaescu R, Levy D, Pachter L. Why neighbor-joining works Algorithmica. 2009;54:1–24.

26. Eickmeyer K, Yoshida R. The geometry of the neighbor-joining algorithm for small trees. vol. 1547 Lecture notes in computer science 2008; p. 81–85.

27. Eickmeyer K, Huggins P, Pachter L, Yoshida R. On the optimality of the neighbor-joining algorithm. Algorithms Mol Biol. 2008;3 In: http://www.almob.org/content/3/1/5; 2008.

28. Steel M, Semple C. Cyclic permutations and evolutionary trees. Adv Appl Math. 2004;32:669–680.

29. Gascuel O, Steel M. Neighbor-joining revealed. Mol Biol Evol. 2006;23:1997–2000.

30. Davidson R, Sullivant S. Polyhedral combinatorics of UPGMA cones; 2012. Preprint. <arxiv.org/abs/1206.1621>.

¹Separate from the ML approach to tree reconstruction.

²We use this terminology although in a biological interpretation, interior nodes may not be viewed as ancestors, but only as demarcation points for speciation.

³Though tedious, carrying out these computations will provide the experience necessary to make sense of the algorithm. A tutorial for the first step, using an equivalent arithmetic formulation of the Q-criterion presented here, is given at http://www.icp.ucl.ac.be/∼opperd/private/neighbor.html.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 10. Phylogenetic Tree Reconstruction: Geometric Approaches

Create new playlist

Sign In

Sign Up