Chapter 10

Phylogenetic Tree Reconstruction: Geometric Approaches

David Haws*, Terrell L. Hodge and Ruriko Yoshida, *IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, Department of Mathematics and College of Arts and Sciences Dean’s Office, Western Michigan University, Kalamazoo, MI, USA, Department of Statistics, University of Kentucky, Lexington, KY, USA, [email protected]

10.1 Introduction

In this section, we give a brief, high-level overview that encompasses the main topics to be covered in this chapter, and the relationships between them. As will be seen, phylogenetic tree reconstruction is a multi-layered process, and covering all aspects of it would take far more than a single chapter. In the next section, we begin in earnest our investigations into a subcollection of the concepts raised in the Introduction that we intend to treat in more detail (and from scratch, with definitions). For comprehensive mathematical treatments of terms, methods, and details that we do not cover, see, e.g., [1,2] for graph-theoretical/combinatorial approaches; see also [3] for an algebraic geometry and statistics perspective on many of these topics. For a less mathematical, more biologically oriented text, see, e.g., [4].

Through the ages, the diversity of species on earth and the sources of that diversity have always featured prominently in human thought, as evidenced by Aurignacian cave paintings, religious scriptures, and modern science. A prevailing biological concept is that the diversification of life forms is related to the separation of gene pools. If geographical or other barriers separate gene pools, the process of gene flow can be insufficient to counteract genetic drift, and genetic or behavioral barriers emerge against future gene flow (even after the removal of physical barriers) [5]. Of course, genetic isolation alone is insufficient to explain diversity, which further requires the raw material of genetic mutation, inevitably acted upon by natural selection.

A phylogenetic tree (or phylogeny) is a diagram/graph which represents relations of evolutionary descent of different species, organisms, or genes from a common ancestor. A phylogenetic tree is a useful tool to understand and organize information on biological diversity, structuring classifications, and providing insight into events that occurred during evolution. An excellent preliminary guide to the biological perspective, upon which we build a more mathematical approach in this chapter, is [6]; see http://www.nature.com/scitable/topicpage/reading-a-phylogenetic-tree-the-meaning-of-41956.

One may reconstruct a phylogenetic tree among distinct groups or species from morphological data obtained by measuring and quantifying the phenotypic properties of representative organisms via, for example, parsimony. However, recent phylogenetic analysis uses nucleotide sequences encoding genes or amino acid sequences encoding proteins as the basis for classification. Therefore in this chapter we focus on phylogenetic tree reconstruction methods based on sequenced genes or genomic data sets. Through evolutionary history, a character (e.g., a nucleotide) of the sequence might be changed to another one, deleted or inserted. Thus, before reconstructing a phylogenetic tree, we have to align an input sequence data set, e.g., identify sequences of characters (DNA bases, amino acids, etc.) which are thought to be representing the same (“homologous”) regions of genes (from different species or different gene families, etc.) and then line up the sequences so that nucleotides in differing sequences can be compared site-by-site, where one site may vary from another by mutation, i.e., insertions, deletions, or substitutions of characters. Such a line-up of two sequences is an alignment; if we have multiple sequences then the result is properly called a multiple alignment, although we will often abuse terminology and use “alignment” for both. Aligning multiple sequences is generally known to be an NP-hard problem [7,8], and there are scores of approaches to creating alignments heuristically. Here we assume that we have a perfectly aligned sequence data set and focus on the reconstruction of phylogenetic trees from a multiple alignment.

There are several methods to infer a phylogenetic tree from a given alignment, including the maximum-likelihood (ML) method, distance-based methods, parsimony-based methods, Bayesian inference methods, and so on (see [1] for more details). In this chapter we will focus on distance-based methods.

Distance-based methods build phylogenetic trees from aligned sequences by making use of pairwise “distances” between the sequences. These distances arise from a model of sequence evolution, or evolutionary model, which encodes certain hypotheses about how sequences evolve, often including the probabilities that one character at a given site will transition to another, as well as assumptions about how sequences as a whole then transform, via evolution, into one another. Models of sequence evolution are needed to address the problem that observed sequences may have experienced much more change over time that their elements alone might show; by way of analogy, just as an observation that a light switch is currently off does not alone completely specify the number of times previously that it had been flipped on and off again. Many common models of sequence evolution are, in the lingo of mathematicians, given by a continuous-time Markov model with a substitution rate matrix whose entries are the probabilities of characters changing to one another.

Each such evolutionary model given by continuous-time Markov chains corresponds to a phylogenetic tree for which each node is a sequence, and the (directed) edges link sequences which evolve from one to another by data given by substitution rate matrices. In this way, the phylogenetic tree summarizes the relationships between the species (or other organisms represented by the sequences) in terms of common ancestors (nodes) and evolutionary changes via edge lengths (e.g., times since divergence, number of substitution events, etc.).

Under the assumption that a phylogenetic tree describes evolution via a continuous-time Markov process, the main problem of phylogenetic tree reconstruction is to find a tree when only the character sequences, such as DNA sequences or protein sequences, etc., are given. In the lingo of trees and Markov processes, one assumes that the only sequence data that observed is at the tips, i.e., in the leaves, while other information on the phylogenetic tree, the particulars of the substitution events, and edge lengths are missing. In a distance-based approach, distances between sequences (i.e., what should be sums of edge lengths on the path in the phylogenetic tree between the nodes for the sequences) are derived directly from the sequences by using the evolutionary model to compute the most likely distance between each pair of sequences. This is formalized in mathematical and statistical terms by so-called maximum likelihood estimates.1 Computing these estimates will not be our focus; using them will be.

Pairwise distances, along with a distance-based method for reconstructing phylogenetic trees from the set of all pairwise distances, can be used to reconstruct a particular phylogenetic tree that relates the sequences. Usually the set of all pairwise distances computed from an alignment, collectively often called a distance matrix (or an example of a dissimilarity map), does not give a tree metric, which is a distance matrix realizing a phylogenetic tree. Thus a distance-based method tries to find a tree metric closest to the given set of pairwise distances computed from the alignment under some criteria.

To date, of all the tree reconstruction methods, distance-based methods for phylogeny reconstruction have been seen to be the best hope for accurately building phylogenies on very large sets of taxa such as the data sets for tree of life for the insects Hymenoptera [2,9]. More precisely, distance-based methods have been shown to be statistically consistent in all settings (such as the long branch attraction problem; see [1] for details), in contrast with other methods, such as parsimony methods, e.g., [1012]. Distance-based methods also have a huge speed advantage over parsimony and likelihood methods, and hence enable the reconstruction of trees on greater numbers of taxa. However, a distance-based method is not a perfect method for reconstructing a phylogenetic tree from a given sequence data set. This is because, in computing pairwise distances, one ignores both the interior nodes and the overall tree topology and so there is a concern that one loses some information from the input data sets. Therefore, it is important to understand how a distance-based method works and how robust it is with noisy data sets. (However, it is noteworthy that from an information-theoretic point of view, a recent article [13] argues that at least some distance-based methods can be proved to preserve more information than may be obtainable from maximum likelihood methods for tree reconstruction, contrary to commonly expressed concerns in the mathematical and biological literature.)

A distance-based method is related to geometry and combinatorics. In fact, one can describe the space of phylogenetic trees, i.e., a set of all tree metrics over the set of all distance matrices, as points in a high-dimensional space which form a union of so-called polyhedral cones. In Section 10.3, we will take an elementary approach to realizing trees as points, and pave the way for further study of the space of phylogenetic trees from the view of polyhedral geometry.

There has been much work to understand distance-based methods, such as the balanced minimum evolution method and neighbor-joining method. In 2002, Desper and Gascuel introduced a balanced minimum evolution (BME) principle, based on a branch length estimation scheme of Pauplin [14]. The guiding principle of minimum evolution tree reconstruction methods is to return a tree whose total length (sum of branch lengths) is minimal, given an input dissimilarity map. The BME method is a special case of these distance-based methods wherein branch lengths are estimated by a weighted least-squares method (in terms of the input dissimilarity map and the tree in question) that puts more emphasis on shorter distances than longer ones. Each labeled tree topology gives rise to a vector, called herein the BME vector, which is obtained from Pauplin’s formula.

Implementing, exploring, and better understanding the BME method have been focal points of several recent works. The software FastME, developed by Desper and Gascuel, heuristically optimizes the BME principle using nearest-neighbor interchanges (NNI) [15]. In simulations, FastME gives superior trees compared to other distance-based methods, including one of biologists’ most popular distance-based methods, the Neighbor-Joining (NJ) Algorithm, developed by Saitou and Nei [16]. In 2000, Pauplin showed that the BME method is equivalent to optimizing a linear function, the dissimilarity map, over the BME representations of binary trees, given by the BME vectors [14]. Eickmeyer et. al. defined the nth BME polytope as the convex hull of the BME vectors for all binary trees on a fixed number n of taxa. Hence the BME method is equivalent to optimizing the input dissimilarity map (a linear function), over a BME polytope. In 2010, Cueto and Matsen [17] studied how the BME method works when the addition of an extra taxon to a data set alters the structure of the optimal phylogenetic tree. They characterized the behavior of the BME phylogenetics on such data sets, using the BME polytopes and the BME cones, i.e., the normal cones of the BME polytope. We will discuss some details of BME method and NJ method from the view of polyhedral geometry in Section 10.4, again opening the doors to further reading and review.

10.2 Basics on Trees and Phylogenetic Trees

Trees are ubiquitous objects in any theory in which branching processes occur. Most generally, as a graph, a tree image is a directed, connected graph consisting of a nonempty set of vertices (or nodes), V, and set of edges, image, in which there are no cycles or loops. We will assume all our trees are finite, that is, V is a finite set. Each edge image identifies with a single ordered pair image wherein, by definition, a is the initial vertex of e and b is the terminal vertex of e, so that e is drawn as the arrow image, and we also often say e starts at a and ends at b. Given a vertex image, the number of edges ending in v is the indegree image of v, while the number of edges starting in v is the outdegree image of v. The leaves of a tree T are those vertices imagefor which the outdegree of v is 0 (also sometimes called terminal vertices). A tree may be rooted or unrooted; by definition, if T is rooted, then there is a vertex image which has image, and otherwise T is unrooted. An interior vertex is one which is not a node, although if T is rooted and image, the root is usually not regarded as an interior vertex or as a leaf. A pair of two leaves image is a cherry, if i and j have a common parent node,2 say, a, i.e., there are edges starting at a and ending at i, respectfully, j. More generally, any leaves sharing a common parent node are called sisters.

Alternatively, when working with unrooted trees, as we will most often do below for reconstruction methods, one can dispense with directionality (orientation) and start by defining trees simply as connected graphs (on nonempty vertex sets) with no cycles or loops. In this case, edges are defined by pairs image of (distinct) elements in V. It is common to represent an edge image simply as a line segment image (or image). One simply defines the degree image of image to be the number of edges incident to image. Leaves are vertices image with image, and image is an interior vertex if image. Binary trees are then trees for which image for all interior vertices. Cherries are then pairs of leaves joined by two edges that meet in an interior vertex image, with image (and image is still the parent, or ancestor). Sisters are then defined analogously. An unrooted tree can be rooted by picking a leaf and adding an orientation directing every edge away from it; more commonly for biological purposes, a node is added along some interior edge, the two new edges split from it, as well as the remaining edges are, again, directed so that the node has indegree zero. If T is a directed tree in the earlier sense, then upon “forgetting” the directions of the edges, one gets an (undirected) tree in this new sense, and for any vertex v of image. For unrooted trees, it is convenient to dispense with orientation, but helpful to view unrooted trees as graphs that could be directed and rooted as needed. We will assume unrooted trees are just graphs, rather than directed graphs, and rooted trees are directed graphs. We may use the term “directed tree” where there might be need or confusion.

There are many ways of representing trees, and many types of software available for drawing and viewing trees. In the following exercise, we explore the Newick format for representing rooted phylogenetic trees, along with some of the tree-drawing programs using Newick-style input to output rooted and unrooted trees. We do so through free access at the website, Phylogeny.fr at http://www.phylogeny.fr. This website is specifically geared toward providing phylogenetic tools for nonspecialists. First, we note that a compact representation for a rooted tree is to group sisters by parentheses, e.g., (A,(B,C)); represents the tree with cherry image and then a root joining this cherry to A, as in Example 10.1. Such a format is the Newick format or Newick representation of the tree. (Note that the end semicolon in (A,(B,C)); is part of the standard Newick format, and not meant to be punctuation in regular English.)

Example 10.1

The rooted tree given by (A,(B,C)); see Figure 10.1.

image

Figure 10.1 (A,(B,C)) ;

Note that neither the root nor the parent node of the cherry image is shown explicitly; instead, vertical lines are used to indicate that a branching event occurs. Shrinking these vertical edges to points would produce the nodes that would be commonly shown in a mathematical graphical representation of a tree. This is a common difference between rooted trees as often represented in biology, and trees in mathematical graph theory. Also, if a tree is unrooted, it is common to represent it in Newick format by arbitrarily picking a root. In this case, still, drawing programs may neglect to display interior vertices; they are to be inferred at a branching point. (This will be seen in Exercise 10.1 below.)

Exercise 10.1

1. Obtain the graph in Example 10.1 yourself, as follows:

a. Go to the link “TreeDyn” in the list “Tree Viewers” under the heading “Online Phylogeny Programs” at the bottom of the homepage of Phylogeny.fr at http://www.phylogeny.fr/, or similarly from the pop-up menu for “TreeDyn” that appears under “Tree Viewers” when mousing over the heading “Online Programs” in the bar across the top of the Phylogeny.fr homepage.

b. In the available textbox at the top of the “TreeDyn” page, type (A,(B,C)); click the “Submit” button, and await the output.

c. Once a picture of a tree is displayed, scroll down the page to “Tree Style,” and change from the default option “Phylogram” to the option “Cladogram.” The program will automatically redraw the graph.

2. Compare the result of part (1) with the original drawing (produced by “Phylogram”), and with the figure in Example 10.1 provided above. How do they differ?

3. Replace (A,(B,C)); by (A,(C,B)); in parts (1)–(2) with the “Cladogram” setting as above. Describe how the output differs from Example 10.1. Note that you can restart “TreeDyn,” or more easily, after completing any drawing, you can click on the “Data and Settings” tab at the top of the page to return to the data you originally input into the textbox, and edit it.

4. Draw a rooted tree on three leaves image with cherry image.

5. Draw an unrooted tree by inputing (A,(B,C)); but selecting “Radial (by Drawtree)” under the “Tree Style” option. What, if anything, changes when the letters image, and C are permuted in the Newick input?

6. Use the Newick input (A,(B,(C,D))); and the “Cladogram” and “Radial (by Drawtree)” tree style options as in parts (1)–(5) above to create graphs of rooted and unrooted trees. Examine also the outcome of permuting the leaves in the Newick format.

7. Repeat part (6) of this exercise, but with a Newick representation of a tree with two pairs of cherries, image and image, with the rooted version having these two pairs join in a common parent node.

8. Apply the setting “Radial (by Drawtree)” (or just select the “Drawtree” program directly) with the Newick input (,(,)); and compare with part (5) of this exercise. ent

Exercise 10.2

Create other examples of a tree that is rooted and another that is unrooted. List the indegrees, outdegrees, and degrees for the vertices in your trees, as appropriate. ent

Exercise 10.3

Given an arbitrary tree T, show that if a root of T exists, it is unique. (If there is a root, there is only one root.) ent

An important means for comparing trees is that of an isomorphism. Directed trees image and image are isomorphic if they are isomorphic as graphs, meaning that there is a bijection image which extends to a bijection on the edges by setting image, given image. Without loss of generality, we write image for this isomorphism. Likewise, one may define isomorphisms of undirected trees image and image as bijections for image with image, for image. Informally, the notion of an isomorphism captures the idea that trees are like mobiles connecting a set of objects, with edges that are springs. No matter how they are suspended by a string from the ceiling, and no matter how they are stretched or spun about, if they can be turned and their edges can be elongated or compressed so that the only difference between mobiles is the particular choices of the objects dangling from the ends (but not how many objects there are or the ways in which those objects are hooked together), then the trees the mobiles represent are isomorphic. Accordingly, isomorphisms of directed trees preserve the indegree and outdegree of each vertex, so if T and image are isomorphic as directed trees, then T is rooted if and only if image is.

As we will focus on applications and representations of trees in phylogenetics, we will be particularly interested in relationships between trees and their sets of leaves. In this context, for a finite set X, a phylogenetic X-tree (rooted or unrooted) is a tree image for which image is the set of all leaves of T, and, by definition, its leaves are labeled by X. That is, when pictured, the leaves are visibly labeled by the elements of X, while interior nodes are left unlabeled. An alternative, more formal mathematical formulation is to say that a phylogenetic X-tree is a pair image consisting of a tree image and a leaf-labeling image, which, by definition, is an injection image, but the informal idea will suffice. Although the interior vertices of a phylogenetic X-tree are, by definition, unlabeled, in what follows, it will be helpful in several scenarios to add labelings to interior nodes so as to describe or compare phylogenetic X-trees that underlie these more fully labeled trees.

Example 10.2

1. The trees in parts (1)–(5) of Exercise 10.1 were phylogenetic X-trees with the leaves labeled by image, whereas the interior vertices were not labeled. The tree in part (8) of Exercise 10.1 had no labeled leaves and hence is not a phylogenetic X-tree, but does represent the “underlying tree” or “tree topology” of an unrooted phylogenetic X-tree on three leaves, as will be defined a little later.

2. An approximation to the evolutionary history of all organisms alive today would be a phylogenetic X-tree wherein the leaves X consist of all species currently in existence; in this case, image. The Tree of Life project http:// tolweb.org/tree/ , while noting that a tree is an imperfect model, uses phylogenetic X-trees as an organizational framework, and the use there is representative of hundreds of more specialized scientific articles employing phylogenetic trees. The reader is encouraged to explore this website for current versions and many visualizations of phylogenetic trees on multiple organizational levels of species and other biological taxa.

A tree may simply be called phylogenetic if it is a phylogenetic X-tree for some (finite) set X. Without loss of generality, it is possible to use image in place of X for image. A binary tree that is a phylogenetic tree is called a binary phylogenetic tree. Binary phylogenetic X-trees form the most commonly used group of trees in biology, since branching events that arise from evolutionary events, like mutations or speciation, correspond to a splitting of one lineage into two, rather than more than two simultaneously. However, non-binary trees do arise when ancestry is uncertain (e.g., the order of speciation or other branching events is uncertain), and in this case, there may be sisters that are not just cherries. An extreme example is a star tree on n leaves, which has image nodes and n edges, with the single interior vertex joined to each and every leaf.

Exercise 10.4

1. Show any rooted binary tree on image leaves always has a cherry, and hence conclude any unrooted binary tree on image leaves does, too.

2. Explain why the total number of vertices of any unrooted binary tree on image leaves is image.

3. Argue that the number of edges of any unrooted binary tree on image leaves is image.

4. Suppose image. Using the previous part of this exercise, argue that the number of unrooted binary phylogenetic X-trees for a set X of n leaves is the double factorial image. As a hint, consider how to build an unrooted binary phylogenetic X- tree on n leaves from any one on image leaves by adding a new edge ending in the new leaf to an edge on the tree with one less leaf. (By definition of the double factorial, image, and, generally, image. (Alternatively, one can show that image, for image.) ent

Suppose image is a phylogenetic X-tree and image is a phylogenetic image-tree. Since T and image are trees with additional properties, and since isomorphisms of directed trees preserve indegrees and outdegrees, respectfully, preserves degrees for undirected trees, a function image can be an isomorphism of the phylogenetic trees X and image only if image restricts to a bijection image on the leaf sets. Necessarily, then, image. A tree isomorphism image for which the restriction image of image is in fact an identity map (so image and image for all image), is, by definition, an isomorphism of phylogenetic X-trees. In our biological context, the notion of isomorphism makes explicit those ways in which pictures of phylogenetic trees might differ, but still represent the same evolutionary relationships among the leaves.

Example 10.3

For image, let image be the unrooted binary phylogenetic X-tree ((A,B),(C,D)); where the cherry image has ancestor u and the cherry image has ancestor v. Let image be the unrooted binary phylogenetic tree ((A,C),(B,D)); where cherry image has ancestor s and cherry image has ancestor t. Then setting image given by image, and image creates an isomorphism of image with image as phylogenetic X-trees. (Note that the lengths of edges play no part here, only the connectivity relationships.)

Exercise 10.5

1. For image, let image be the unrooted binary phylogenetic X-tree ((A,B),(C,D)); and let image be the unrooted binary phylogenetic tree ((A,C),(B,D)); and let image be as in as in Example 10.3.

a. Explain why although image is an isomorphism of trees, it is not an isomorphism of phylogenetic X-trees.

b. Find another distinct isomorphism of image and image just as trees.

2. Explain why trees (A,(B,C)); and (A,(C,B)); are isomorphic as phylogenetic image trees, but (A,(B,C)); and (B,(A,C)); are not.

3. Explain why if (A,(B,C)); and (2,(1,3)); are isomorphic as phylogenetic trees, then (A,(B,C)); and (1,(2,3)); are not. ent

Formally, then, the unlabeled or underlying tree associated to a phylogenetic X-tree T is an equivalence class of all trees image isomorphic to T. Informally, it is represented by any tree image isomorphic to T, and image is said to have the same topology or tree topology as T.

Example 10.4

There is only one tree topology for unrooted phylogenetic X-trees on a set X of four leaves; it is given in Newick form by ((,),(,)); (see Figure 10.2).

image

Figure 10.2 Quartet ((,),(,));

If there is an isomorphism image of rooted phylogenetic X-trees, then T and image are said to be equivalent trees, and likewise for two unrooted phylogenetic X-trees.

Exercise 10.6

1. (Quartets) There are three distinct unrooted binary phylogenetic X-trees on a set of four leaves X. That is, up to isomorphisms of binary phylogenetic X-trees, there are only three distinct such non-equivalent trees—every other unrooted binary phylogenetic X-tree on four leaves image will be an unrooted binary phylogenetic X-tree equivalent to one of these. Draw (representatives of) these three trees.

2. Up to equivalence, how many distinct rooted binary phylogenetic X-trees on a fixed set X of three leaves are there? Draw representatives of each.

3. For image, how many distinct tree topologies are there underlying the rooted phylogenetic X-trees? What if these X-trees are unrooted?

 ent

More formally, there is an equivalence relation on the set of all rooted phylogenetic trees (resp., unrooted phylogenetic trees) given by setting two such trees to be related if they are equivalent phylogenetic X-trees. Just as with equivalence relations in general, one often picks a representative of each equivalence class and identifies the whole class with any representative (e.g., just as a fraction in lowest common terms represents all fractions which reduce to it). Consequently, from now on, for any fixed X, when we speak of the set of “all rooted phylogenetic X-trees” or “all unrooted phylogenetic X-trees” we understand that any particular such tree can be represented by one on image with the same underlying tree topology. For (rooted) phylogenetic X-trees, this formalizes the notion in biology of a “cladogram,” so when drawn in the plane, neither horizontal axis nor vertical axis has any particular meaning, e.g., as in [4, p. 15].

From now on, for any set X of n elements, we will let imagedenote the set of all unrooted binary phylogenetic X-trees T. For example, by Exercise 10.4, image consists of three distinct such trees (again, up to equivalence), and by Exercise 10.5, all these leaf-labeled trees in image have the same underlying tree topology.

Exercise 10.7

a. Identify the subsets of trees below which are isomorphic to one another simply as trees (see Figure 10.3).

image

Figure 10.3 Figure for Exercise 10.7

b. For the same collection of trees, identify which are isomorphic as phylogenetic X-trees, for image.

c. How many distinct tree topologies are there for unrooted binary trees with image, and 6 leaves?

d. How many distinct (unrooted) binary phylogenetic X-trees are there in image, if image, for image

 ent

10.3 Tree Space

10.3.1 Trees as Points

In the last section, we discussed trees, phylogenetic X-trees (and more so, binary ones) in particular, as graphs, and developed a bit of intuition supporting their interpretation in some biological situations. In this section, we move beyond the visual appeal of trees, as graphical models, in order to interpret them in a geometric framework that better sets the stage for implementing algorithmic and computational methods to reconstruct the phylogeny of a set of species (or genes, or other organisms) and for describing geometrically the advantages and pitfalls of those methods.

By definition, phylogenetic X-trees have leaves that are labeled, but for phylogenetic tree reconstruction, one may also want to label the edges. Thinking of the vertices V, of a phylogenetic X-tree image, as species for the moment, then labels on the edges can, depending on the biological context, be regarded as providing some attribute of the sequences, often a measure of evolutionary change from one species to the next. In graph-theoretic terms, a labeling of the edges E of T is an edge-weighting, a function image that assigns a real number image, say, to every edge image. In many cases, such edge-weightings image are assumed to take nonnegative values, but for phylogenetic tree reconstruction algorithms, it is useful to allow for more general edge-weightings. In phylogenetics, the graphical notion of an edge-weighting corresponds to an evolutionary distance map. Determination of evolutionary distances is a process anchored in the choice of a so-called model of evolution, which describes how (and how frequently or with what probabilities) sequences change, e.g., via substitutions, insertions, or deletions in their character strings. This is a very large and significant area of research for which there are a number of nice treatments in both biology and biomathematics texts; see the references in the Introduction.

We also do not discuss in any detail here the many ways by which evolutionary distances are defined and determined, we do give two examples in Exercise 10.8 below.

Trees image, together with weightings image, will be the outcomes of the distance-based reconstruction methods we examine. We let imagedenote the set of all ordered pairs of unrooted binary phylogenetic X-trees T together with positive edge weightings imageon T. Some distance-based reconstruction methods may produce edge weightings with negative values or zero, so it may be useful to extend image to include these weightings.

Exercise 10.8

1. The Hamming distance, image, is defined to be the number of characters at which two sequences x and y differ. Suppose we have two sequences image over image; for example:

image

Compute the Hamming distance image between image.

2. The Jukes-Cantor correction dJC to the Hamming distance is defined as

image

where f is the frequency of the different sites between two sequences. For example, suppose we have two sequences image of length 10, for which the Hamming distance between image is image. Then image.

a. Compute the Jukes-Cantor correction for the sequences image as in part (1) of this exercise.

b. Observe that for any two sequences image of the same length, image (and the same is true for dH). What can you say about image (respectfully, image) if x and y are the same?

 ent

The Hamming distance represents an easy, but rather crude measure of difference between sequences. For example, it fails to take into account the possibility that sequences could have characters change over time, and then change back. Also, there is no accounting for well-known biochemical phenomenon such as the fact that the probability that one DNA character might change to another is not generally uniform, but likely differs on the particular DNA bases themselves, or how they are arranged or grouped along the sequence. So-called “evolutionary models” describe special sets of additional assumptions that are made to account for these kinds of issues, and methods for determining the related evolutionary distances between any two given leaves image, that are represented by two aligned sequences (of DNA, RNA, proteins, etc.) image, generally depend on the choice of particular models of evolution. For example, the model of evolution that eventually gives rise to the Jukes-Cantor correction in Exercise 10.8 above is obtained from the Jukes-Cantor model of evolution; see e.g., [4] or [1], for more details on this model and [3] a derivation of this distance from the model by an algebraic geometry approach.

The evolutionary model for Hamming distance is very simple; it assumes all bases have equal probability of changing into one another, all sites in the string of DNA are independent, and that the only change that has occurred over time is that which is observed. As with the Hamming distance or the Jukes-Cantor model and Jukes-Cantor correction, in general, evolutionary distances incorporate information differences or distinctions between sequences image and image, and hence give rise to a so-called dissimilarity map, to be defined precisely further below. Sequences which are the same provide no new information; it’s the sense in which they are dissimilar that shows change and hence evolution.

10.3.2 Fitting Trees: A Distance-Based Approach

With the previous comments in mind, one approach (a “distance-based approach”) to solving the central problem of phylogenetic tree reconstruction is to

1. start with the set of (aligned) sequences image,

2. use an evolutionary model to derive a measure of evolutionary distance image between x and y on the basis of the information encoded in image and image, and then

3. use the numerical data image, to try to reconstruct a phylogenetic tree T on X which is consistent with this data.

Each of the three steps above is highly significant; for this chapter, we will be focusing on the last step. In particular, since real sequence data contains “noise,” it is not usually possible to find a tree which “fits” the data exactly, so in the last step above it becomes necessary to look for a tree T on X which “best fits” the data image, and we will be exploring some ways to do this in much greater detail. As preparation, we want first to return to some mathematical formulations that will be instrumental in what follows. We start with the definition of a dissimilarity measure (also called a dissimilarity map).

Mathematically, for any finite set X, a dissimilarity map image on X is simply a real-valued (image for all image) symmetric (image for all image) square matrix with diagonal entries zero (image for all image).

Exercise 10.9

Assign values to the as-yet unspecified entries image of the matrix image below so as to make D a dissimilarity map on image.

image

 ent

Exercise 10.10

(Just for fun; the result will not be used or needed in the rest of this chapter.) For those familiar with the linear algebraic interpretation of a real-valued n by n square matrix A as a linear transformation image, via the matrix product image, for image represented as an n by 1 column vector, what is the effect of applying image for D a dissimilarity map, to the standard bases vectors image ent

As matrices, dissimilarity maps show up frequently in mathematics. For example, by the polar decomposition theorem, every real-valued square matrix M can be expressed as a sum of a symmetric and a skew-symmetric matrix via image, so if M begins as a zero-diagonal matrix, then the symmetric component, image, will be a dissimilarity map. As another example, a map image is called a metric if image, and image for all image. The last inequality is the well-known triangle inequality, familiar from Euclidean geometry; think of points image, and z as lying on the vertices of a triangle. Recall that, for image is the distance from x to 0 on the real line, and image is the distance between any two real numbers on the number line. Likewise, for two points image in the plane image, setting image gives the usual distance between the two points (length of line segment joining x and y). For two points image and image in three-space image, by reducing to a picture with two triangles, each in their own plane, it is not hard to generalize the Pythagorean theorem and show that setting image gives the standard distance between the two points (length of line segment joining x and y) in image. Distances can take on other forms, but going back to general metrics, we have, by definition, that they are dissimilarity maps with the triangle inequality as an additional property. Metrics are prominent in the study of mathematical real analysis, as they generalize distance measures like the absolute value, as we have just seen.

Exercise 10.11

a. Give an argument to show that the standard distance map image for image truly is a metric.

b. Given the comments above, generalize the distance image given by the Pythagorean theorem for image and its analog in image to points image with four coordinates, and then five. It is a very important point that the notion of the Cartesian coordinate plane image generalizes to spaces image, and so on for image, for any integer image, where these spaces exist as collections of all the ordered n-tuples with n entries drawn from image.

c. If you have not (or not recently) seen a proof of the Pythagorean theorem for points in image, look for one in any linear algebra book, multivariable calculus text, or find an online source. Generalizing to image for any positive integer image is a common exercise in courses on linear algebra and advanced calculus. For a suggestive picture, see http://en.wikipedia.org/wiki /File:Cube_diagonals.svg.  

 ent

Turning back to biology, starting with a labeled X-tree image (T rooted or unrooted) with edge-weighting image, a dissimilarity map image arises naturally. As demonstrated in the example below, simply sum up the edge-weightings image, over all distinct edges e on the path image in T joining any two leaves image in X, to get each value image.

Example 10.5

For the quartet tree T with leaves image and the edge-weighting image indicated above (see Figure 10.4),

image (10.1)

image

Figure 10.4 Edge-weighted quartet tree image on image for Example 10.5

Exercise 10.12

a. For the tree T and edge-weighting image as in Example 10.5, for each set of three elements image, calculate image and compare it to image. For example, for image, and image, while image. How do these compare? What happens for the other combinations, including when two or more of image are equal? Interpret your observations graphically.

b. Starting again with the notation of Example 10.5, for the fixed choices image, calculate the quantities image. What do you notice? Does your conclusion change if you take other choices of image

c. For any quartet tree (that is, any unrooted tree with four leaves) for which i and j are sisters (cherries, here), use a graphical interpretation of relevant paths along the tree to argue that

image

and

image

for any positive edge-weighting image on T.

 ent

The four-point condition for a dissimilarity map D on a set X is satisfied if

image

for any image, where by definition the right-hand side of the inequality is the largest (“max”) of the two sums indicated.

Exercise 10.13

1. Show that any quartet tree satisfies the four-point condition.

2. If T is a phylogenetic X-tree, for any set X with image, and with any positive edge-weighting image and induced dissimilarity map image, show that the four-point condition holds for T.

 ent

Returning to an arbitrary labeled X-tree image with edge-weighting image, we have, by assumption, that there’s no edge between a leaf and itself, so image for any image. This is consistent with notions of evolutionary distance: there’s zero dissimilarity between any sequence image and itself. It is also presumed that the information encoded by evolutionary distance measures on pairs of sequences image is independent of the ordering of the two (i.e., it does not matter which is “first” or “second”). From Exercise 10.8, one can see explicitly that the evolutionary distances dJ and dH are dissimilarity maps.

We have seen that if one begins with an edge-weighted X-tree T, using the tree and the weighting, one can use the edge labels to find a value image for each pair of leaves image, that has an interpretation as a “distance,” namely, the path length between x and y along the tree T, if the “length” of an edge e is taken to be its edge weight image. The resulting matrix of all values image is a dissimilarity map. Again, the fundamental biological problem of phylogenetics is to start the other way around—from a relevant set X (species, genes, etc. or sequences standing in for the species, genes, and so on) and a collection image of relevant “distances,” find a tree T and an edge-weighting image for which the natural dissimilarity map image fits the data image “well.” This is what is meant by “reconstructing a phylogenetic tree” from the given data, using a distance-based approach. In the most ideal case of “fitting well,” image fits D exactly, that is, they agree as functions: image for all image. If one begins with a phylogenetic X-tree T, and a nonnegative edge-weighting image for T, and takes as data D by setting image, then trivially image fits D exactly. This raises the issue of consistency in tree reconstruction methods, that is, whether a tree reconstruction method applied to data D that is derived from a tree T (perhaps weighted, with weight image) actually outputs this tree T (and the corresponding weights image). One can also speak more generally of statistical consistency of a tree reconstruction method, namely, the probability that the method outputs the correct tree given sufficient data about the input.

Exercise 10.14

1. For the quartet tree T with cherry image having parent node u and cherry image having parent node v, and for the representative sequences on the leaves image below, if image, can you find an edge-weighting image for T so that image

image

2. In the basic parsimony method of tree reconstruction, cherries are selected by linking nodes for which image is minimal. For the same quartet tree T with cherry image having parent node u and cherry image having parent node v, suppose image, and image, with corresponding edge weight image. (That is, if image, is an edge, we let image.)

a. Would the parsimony method applied to T with image reconstruct this tree? Explain your answer.

b. At least how many characters long would the sequences image, have to be to create a tree T with the given edge-weightings?

3. If D is a dissimilarity map on X for image, is D necessarily equal to image for some choice of quartet T and some edge-weighting image ent

10.3.3 Tree Metrics and Tree Space

Dissimilarity maps D for which there is some labeled tree T and some nonnegative edge-weighting image for T such that image are called tree metrics.

Exercise 10.15

a. Show that a tree metric is necessarily a metric.

b. Show that every tree metric satisfies the four-point condition. ent

These results lead, in part, to a fundamental characterization theorem of tree metrics:

Theorem 10.1

1. A dissimilarity map D on a set X is a tree metric if and only if D satisfies the four-point condition.

2. If a dissimilarity map D on a set X is a tree metric for some imageedge-weighted tree imageon X, and likewise for an X-tree imageand edge-weighting imageon T, then T and imageare isomorphic phylogenetic trees. Moreover, for imagean isomorphism, for any edge image,if image, then image.

Part (2) of Exercise 10.15 gives one half of part (1) of Theorem 10.1, showing how to recognize tree metrics out of all possible dissimilarity maps; for a proof, see, e.g., [2, Theorem 7.2.6]. Part (2) of Theorem 10.1 shows that any dissimilarity map which is a tree metric for a phylogenetic X-tree not only comes from such a tree, but corresponds to essentially one and only one pair image of a phylogenetic X-tree and edge-weighting image on T; see [2, Theorem 7.1.8] and its proof. Part (2) of Theorem 10.1 implies that tree metrics on X correspond exactly to edge-weighted phylogenetic trees on X.

If a given dissimilarity map D on a set X is in fact a tree metric, then by definition, there is a tree and an edge weighting that fits D exactly. As previously noted, in general, evolutionary distance data D from real sequence data for species, genes, etc., is noisy, and in our new language, although D will be a dissimilarity map, D will usually not already be a tree metric. Consequently, one looks for ways to find a phylogenetic X-tree T with edge-weighting image for which image and D will be “close” in some specified way, and in this sense, say a tree with a “best fit” has been found. To discuss “closeness,” it would be helpful to have a notion of distance that could be used to compare dissimilarity maps D coming from sequence data with induced tree metrics image. Having discussed previously how to find distances between two points on a line image, in the plane image, or in three space image, and more generally in image for image by means of a standard distance (via Pythagorean theorem analog), we might be motivated to seek a way to regard both D and image as points, and then seek a distance between those points that we would then try to minimize in order to find an edge-weighted tree T that “best fits” the original data. Taking that step is our next focal point.

Since any matrix image is really just a nicely ordered list of elements (i.e., listed by row and cross-listed by column), there is already an immediate way to regard matrices as points (vectors): just start writing out the entries of the matrix across row 1; when you get to the end, start adding the elements of row 2, and so on, until all elements are listed. For example, if M is a image matrix, then there is a correspondence of M and a vector with image entries:

image (10.2)

Other orderings of the elements of M to create a 16-tuple could be taken, of course, but as long as we are consistent this has no effect on our definition of distance.

Given the symmetry and existence of all zero-diagonal elements of any dissimilarity map image, there is a lot of redundancy in the entries; in fact, if image, then for the non-zero entries of D, there are at most image distinct entries image in the n by n matrix D. Starting with D, it is straightforward to create a streamlined vector image in image carrying the same information as D, by choosing an ordering for reading off, and then listing in that order, the elements of D above the diagonal. This is illustrated for one choice of such an ordering below, for D as in Eq. (10.1):

image

Exercise 10.16

We have just seen how to take any n by n dissimilarity map D and, having fixed an ordering of elements in D, turn it into a point (vector) image, for image. Starting from any vector image in image, explain how to reverse this process to obtain an n by n dissimilarity map image. Apply this to the particular case image. Is image a tree metric? ent

With this adjustment, it is now possible to represent any two n by n dissimilarity maps image by vectors image, where image. One can use the standard Euclidean distance measure image to measure a distance between D and image in image. In particular, suppose D is a dissimilarity map coming from evolutionary distances between elements in a set X (or their representative sequences). Suppose image for an image edge-weighted phylogenetic X-tree T (with image). Then one can phrase the problem of finding a (image edge-weighted) phylogenetic tree T that “best fits” the observed data D on X as one of finding a pair image so that image is as small as possible. Equivalently, one may take a least-squares approach, since minimizing image is the same as minimizing image. When there is an exact fit, image.

To summarize:

image For a set image, each positive edge-weighted phylogenetic X-tree image gives rise to a tree metric image.

image For a fixed ordering of reading off upper diagonal elements in the matrix image one obtains a vector image.

image There are one-to-one correspondences between

– pairs image of trees image and positive edge-weightings,

– the n by n dissimilarity maps D that are tree metrics image, and

– the vectors image.

image In this way, edge-weighted phylogenetic X-trees become points in image, and matching such an edge-weighted tree to a dissimilarity map D arising from evolutionary distances on sequences corresponding to a set X of species (or genes, etc.) becomes a problem of minimizing distances of the vectors image and image in image.

image Regarding positively (or nonnegatively) edge-weighted trees image with image as a subset of points image, the goal is, when presented with a point image, to seek one of these points image as close to D in image as is possible.

In this way, a fundamental problem of biology is turned into a geometric problem. In particular, one wants to understand better the set image regarded as a set of points image. The set image is often called “tree space,” though there is one for each n.

Example 10.6

If image, then there is only one tree topology, and there is only one (unrooted binary) phylogenetic tree image, but infinitely many edge-weightings image for any choice of T. More precisely, for each such T, there are three edges, and for each edge image can take all positive real values image, independent of the other edges. The edge-weighting image on T is completely described by the ordered triple image. In this way, geometrically, the significance of a choice of edge-weighting image for T is that image describes a point in the positive orthant image of image associated to T. Taking all possible positive weightings on image on T corresponds precisely to the full positive orthant image for T. For image, for any image, there are image edges and image phylogenetic X-trees image. The pairs image correspond to points in three copies of image. (For those not so familiar with high-dimensional geometry, this is analogous to the way in which, for every point in time, there is a three-dimensional copy of space at that point in time, only here the finitely many trees image play the role of selecting out just finitely many points in time to consider, and the image are subsets of -dimensional space, rather than 3-dimensional space.) In more mathematical terms, image lives nicely in a product vector space image, and looks like three copies of (the positive orthants of) image. More generally, for image is the product set image, of image many copies of image, where image and image, both as per Exercise 10.4.

The treatment of the tree space image is handled somewhat differently by different authors, to take advantage of simplifications in representing trees (and to use rooted trees in some cases, unrooted in others). For instance, if one extends to include nonnegative edge-weightings (that is, some values image can be zero), then one may include positively weighted, (unrooted) nonbinary trees on n leaves as part of tree space, and then compare how edge-weighted binary n-trees are related to one another through the collapsing of one or more edges. For an original definition of tree space and helpful pictures of this sort in the rooted tree case, as well as a discussion of how the positive orthants are “glued” together along spaces corresponding to (nonbinary) trees with collapsed edges, see [18]; for another analogous but different perspective on the unrooted case, also with nice pictures, see [19]. Using another variation on tree space, there has been a lot of work done to examine distances, particularly geodesics, in tree space; see [18,20,21]. Regardless of the precise formulation of tree space, as the example and brief treatment above may geometrically suggest, it can be useful to group points image by their phylogenetic trees T or even by their underlying topologies. This is, again, the perspective that for each image there is the “same” copy of image attached to it, so to determine a point in image, it is often useful to focus on determining just the tree image alone, and then think of the associated weighting image, even though the two are linked. (Again, this is analogous to thinking of a point in our everyday four-dimensional space-time lives first by its time coordinate, then as a point in space at that time.)

In any case, for practical purposes, however, the notion of standard distance in image may not be the best distance measure for comparing trees and fitting trees to dissimilarity maps. In fact, it is shown in [22] that taking a least-squares approach to picking an edge-weighted tree T with weighting image and induced metric image to best fit a given dissimilarity map image is an NP-complete problem. The same holds true for minimizing image (the f-statistic) rather than the corresponding sum of squares.

In the search for heuristics to carry out tree reconstruction, many fitting algorithms proceed instead by wandering among the trees (points) in tree space, and rather than use a least-squares or f-statistic approach, use metrics that record points image as being “close” when their trees image (edge-weighted by image, resp., image) can be transformed into one another explicitly by means of a small number of iterations of a few well-known operations on trees as graphs. In the case of Robinson-Foulds (RF) topological distance, the two relevant operations are a form of merging nodes and then splitting nodes on the underlying tree topologies (ignoring weightings), as illustrated in the picture below for two trees of RF-distance apart (see Figure 10.5).

image

Figure 10.5 Trees image and image of RF distance one apart

Other well-known operations in tree space image include the nearest-neighbor interchange (NNI move) the subtree-pruning-and-regrafting move (SPR move), and the tree bisection and reconnection move (TBR move). A generic NNI move is illustrated in Figure 10.6 above, where each triangle represents a clade, that is, a subtree arising from a node and all its descendants and edges connecting them; here it could be a single leaf. For clades image, an NNI move is essentially a move on a quartet, exchanging the neighbor of one cherry for a neighbor of the other cherry (see Figure 10.6). The SPR moves and TBR moves are more general but can be similarly pictured; see, e.g., [2, Section 2.6] for definitions and representative pictures, or [23, Section 3] for a definition and picture of an SPR move relevant to our later discussion.

image

Figure 10.6 Generic NNI moves; each tree is related to the other by a single NNI move

10.4 Neighbor-Joining and BME

10.4.1 Neighbor-Joining

The Neighbor-Joining (NJ) Algorithm [16] takes as input a dissimilarity map image and builds an edge-weighted unrooted binary phylogenetic X-tree image, by iteratively picking cherries out of X. There are three essential steps in the NJ Algorithm:

1. Pick a cherry image, for leaves image.

2. Create a node image joining leaves i and j.

3. Compute the distances from other nodes (including image and all other leaves in X) to the new node image (and so produce a new dissimilarity map).

If image, one then continues the procedure by eliminating image from the set of leaves X (i.e., replace image by image in the set X) and seeking the next cherry in the remaining set of leaves to join, until the total number of remaining leaves n is 3. The result is an edge-weighted unrooted binary phylogenetic X-tree. Visually, one can proceed by picturing a star tree on image-leaves and seeing the NJ Algorithm as a set of instructions for successively splitting off leaves onto their own branches (see Figure 10.7).

image

Figure 10.7 Pictorial representation of successive steps in the Neighbor-Joining Algorithm. From http://en.wikipedia.org/wiki/File:Neighbor-joining_7_taxa_start_to_finish.png.

The key component of the NJ Algorithm is the method for picking cherries, accomplished by the Q-criterion, described in Theorem 10.2 below. The formulation of the cherry-picking step of the NJ Algorithm, as a problem of minimizing the Q-values, comes from a reworking of [16] by [24].

Theorem 10.2

(Cherry-picking criterion (also known as Q-criterion) [16,24])Let imagebe a tree metric for a tree imagewith leaf set X, and define the image-matrix imagewith entries:

image (10.3)

Then any pair of leaves image, for which imageis minimal, is a cherry in the tree T.

The NJ Algorithm is predicated upon the Q-criterion in the sense that if one starts with a tree metric, then the minimum value of Q gives a cherry, so if one is looking to build a tree from an arbitrary dissimilarity map, one might likewise try calculating the Q-values, and hope it works. As it turns out, this is practically and mathematically effective, yielding an algorithm with running time image[24], but in its current formulation, the rationale for the Q-criterion is nonobvious. The next exercise helps shed some light on the cherry-picking criterion, when image.

Exercise 10.17

For the quartet image((A,B),(C,D)); as in Example 10.3, there are five edges image. No matter what values an edge-weighting image actually assigns to these five edges, if we use the tree metric image in place of image in the cherry-picking criteria, then, using the first expression for Q in Theorem 10.2, the summands in the expression for the Q-values are basically just sums of path lengths along T. For example, the first term in image is image, or twice the length of the path from A to B. The next summand subtracted off is the sum of three terms: the length of the path from A to B, the length of the path from A to C, and the length of the path from A to D. The last summand subtracted off is the sum of three terms just like the last, but starting from B and going to A, to C, and then to D. Thus the path lengths from A to B cancel out completely, and in what remains, each length from a leaf to its parent node (i.e., image, and image) appears twice, while the length image from u to image appears four times, and all are then multiplied by image to give image. (By tracing over the graph, one can even just graphically represent all the path pieces described above and visually see the cancellations/reduction to this expression for image.) To complete this exercise:

1. Do a similar analysis for image and explain why image.

2. Explain why likewise image.

3. Observe by symmetric arguments that only two possible things can happen: image for all other pairs image in X OR image, and otherwise both are less than image when image or imageent

Exercise 10.17 shows that the cherry-picking criteria at least makes sense for unrooted binary quartet trees, since there is only one topology to consider. There is a much stronger connection between quartets and the NJ Algorithm, as is given by the next theorem, shown, e.g., in [25].

Theorem 10.3

If imageand D is a dissimilarity map on X, then the NJ Algorithm returns a tree that satisfies the four-point condition.

As suggested by the last part of Exercise 10.17, in general, while the NJ Algorithm will pick out the pair of cherries with minimal distance between them first, if there are two or more pairs of cherries with the same distance between them, any of these may be chosen first. Thus, more generally, the order of choosing cherries in the NJ Algorithm may not be unique (i.e., when multiple pairs of leaves have the same Q-values). If the NJ Algorithm selects taxa image as a cherry, and image is the new node joining image, then the new dissimilarity map image is defined to be

image

Exercise 10.18

1. Apply the NJ Algorithm to the dissimilarity map D specified by the information given below. (Recall that any element marked image can be inferred from the data given.) Take note of the order in which cherries are picked, and when more than one choice of cherry could be made.3

image

2. The software package BIONJ is an improved version of the NJ Algorithm that generally produces the same trees when the number of leaves is small. Access BIONJ at http://www.phylogeny.fr, e.g., from the homepage under “Phylogeny” pop-up, located under the heading “Online Programs” in the bar near the top of the page.

a. First, click on the option to “load an example of a distance matrix.” Scroll down and click the “Submit” button to see the output. Click on the “Data and Settings” tab above the text box to return to the input dissimilarity map.

b. Based on that experience, now replace the example data with the corresponding distance matrix data for D in part (1) of this problem, adjusted accordingly, and apply BIONJ. Use the leaf names “A, B, C, D, E, F,” in that order, with A corresponding to the first row/column of the dissimilarity map D, and so on. If Java is supported, the button “Visualize your tree with ATV” can also be used for another view of the tree.

c. For more practice with real species:

i. Go to http://www.atgc-montpellier.fr/fastme/fastme_ex.txtThree dissimilarity maps appear there. Try loading the first one into BIONJ. (Note: For implementation at http://www.phylogeny.fryou will need to put the species name and the corresponding numerical data for its row together on the same line.)

ii. Predict, from the output, the first cherry that would be picked by the NJ Algorithm, and then verify your guess by computing the Q-criterion for the dissimilarity map (i.e., carry out the first step in the NJ Algorithm).

iii. What species are being related, here? You may wish to make use of the search feature at http://tolweb.org/tree/ent

Recall our geometric perspective that, for a given a fixed image, and a dissimilarity map (distance matrix) D, to “fit” a tree to D is to find a pair image so that image and D are “close.” The NJ method is at least consistent: If a tree metric image is given as the input D, then image will be returned as the output.

Exercise 10.19

Suppose D is an image dissimilarity map (image) and image, for some positive constant image, that is, image for all image.

1. Explain why, in the first iteration of the Q-criterion for D and image, there will be the same choices of cherries to pick, i.e., if image is a cherry for D then it is also one for image, and vice versa.

2. Continuing, suppose that if there is more than one choice of cherry to pick, then one picks the same cherry for both D and image. Now explain why the distance from the new node joining this cherry for image is image times that for D.

3. If one always continues to pick the same cherries (whenever a choice is offered), explain why the trees returned by applying the NJ Algorithm to D and to image will have the same underlying leaf-labeled tree topology.

The NJ Algorithm constructs a weighted (binary) phylogenetic X-tree image from a dissimilarity map D on X. As a tree, T has vertices V, with image, and the NJ Algorithm proceeds by constructing the set V, edges E, and edge-weights image from the data D on X. An order imageof picking cherries in the NJ Algorithm when outputting the pair image from D could be defined as a sequence of pairs image of elements image, for which image is the kth cherry to be picked. The first cherry image is a subset of X, but the remaining pairs may involve interior vertices. For image fixed, no matter what phylogenetic X-tree T is constructed, by Exercise 10.4, the number of vertices image is always the same. Without loss of generality (that is, by relabeling the elements of V), it makes sense to speak of an ordering of cherries or cherry-picking order as a sequence of pairs of elements of image (with, say, the elements of image corresponding to image), for which the NJ Algorithm produces some binary phylogenetic X-tree T from some input dissimilarity map image, by picking cherries according to image.

For a cherry-picking order image on V, with image, and image, for image, take image to be the set of all dissimilarity maps D in image for which the NJ Algorithm applied to D proceeds using image. (The NJ Algorithm also outputs a weighting image on T, but that’s not important for the definition of image.) Applying Exercise 10.19, if image then image for any positive multiple image. Thus, image is a cone in image. This cone is called a neighbor-joining (NJ) cone. The entire space image is the union of NJ cones, running over all cherry-picking orderings image. Take image to be the set of all dissimilarity maps D in image for which the NJ Algorithm outputs the unrooted binary phylogenetic X-tree image with some weighting image. For each (binary) phylogenetic X-tree image, one has image for cherry-picking orders image that are compatible with T. In some cases, a tree image can arise from more than one ordering, and in that case, the boundaries of the corresponding NJ cones intersect. Neighbor-joining cones describe the NJ Algorithm geometrically—the output of the NJ on a point image is a weighted tree with underlying topology T if and only if D is in the union of NJ cones image.

Example 10.7

For an unrooted binary phylogenetic X-tree image on a set image of five leaves, one has image, and the orders of picking cherries in the NJ Algorithm applied to any dissimilarity map image will consist of sequences of pairs from V. Necessarily, T has two pairs of cherries, say image and image, one other leaf b, and three interior nodes image. In the NJ Algorithm, for any order of picking cherries image that yields T (with some weighting) the first cherry picked will be either image or image. After this, if, say, u is the created ancestor node for image, then either image is a cherry to be picked, or image a cherry to be picked. Once either cherry is chosen, the other is completely determined for the next step, since there is only one (binary, unrooted) tree topology for image. Thus, for image and cherries image and image, the NJ cones image, and image are the same, for image starting with the subsequence image and for image starting with image or image, where image is the ancestor of image. Switching the order of image and image in image and image does yield different orders and cones. Thus, for any binary phylogenetic X tree T on image leaves, there are two essentially distinct NJ cones which encompass all dissimilarity maps D that output T; in [26] these are labelled image, and image. Consequently, there are image total NJ cones for image. For more details, see [26].

Neighbor-joining cones are defined and studied in detail for small trees (image) in [26]. By studying and comparing the volumes of the NJ cones, [26] can geometrically formulate results about the behavior of the NJ Algorithm. These include how likely the NJ Algorithm is to return a particular tree topology given a random input vector (dissimilarity map), and the robustness of the NJ Algorithm. More complete information about these NJ cones, including information about “dimensions” of the cones, rays of the cones, faces, and other results, also appears in [27]. In both [27,23], geometric information is used to gain further insight into a comparison between the NJ Algorithm as a phylogenetic tree reconstruction method and another distance-based method, the balanced minimum evolution (BME) method, a topic we now explore.

10.4.2 Balanced Minimum Evolution

Recall that for any set image with image, and any edge-weighted phylogenetic X-tree image, with image, there is a naturally associated tree metric image. One can set the total length of the tree to be the total sum of all the edge weights: image. The minimum evolution principle for tree reconstruction uses the idea of minimizing tree length in order to find the best-fitting tree image to a given dissimilarity map image. Specifically, given a dissimilarity map image as input, a minimum evolution method seeks to locate as output an edge-weighted tree image so that image for all image, and image is as small as possible. Biologically, the minimum evolution principle is driven by the idea that evolution is efficient. Since branch lengths represent amount of evolutionary change leading from one, say, species to the next, the total tree length image is a measure of that efficiency across the whole tree, and hence should be minimal under a minimum evolution perspective.

There are many different minimum evolution methods; for all, it is necessary to start from D and come up with not only with the phylogenetic tree T, but also the edge weighting (branch lengths) image. Since there are finitely many possible phylogenetic X-trees, one approach is, starting with any dissimilarity map image and starting with any tree image, to come up with a branch length estimation scheme image for T based on the data D. One then computes the resulting tree length image and, looking over all possible T, seeks T so that image in minimized.

One approach for producing image from D, the Balanced Minimum Evolution (BME) method, uses an approach of Pauplin’s [14]. Pauplin’s scheme for branch length estimation from D, applied to the case of just a quartet, should now be very familiar to us from our previous exercises on the four-point condition and the NJ Algorithm. Specifically, if image is the interior edge joining u and image in the (unrooted) quartet tree ((A,B), (C,D)); as in Example 10.3, then image is given by setting image. This idea is extended in [14] to the case when image are replaced by entire subtrees in an arbitrary tree image. The result gives an iterative formula for image for any interior edge in T that accounts for the relative size (say, number of leaves) in each subtree. (This is the “balance” of the BME method.) The (easier) case of edges that are external is also handled there. Rather than say more about these formulas, however, we instead make use of a very nice result by which the total tree length image can be calculated more easily, without first finding all the branch lengths image.

Starting from image, define image to be one divided by the product of image for every interior node x encountered on the path image in T from i to j. Equivalently, set image to be the number of edges between leaves i and j on the path image. In this case, we can see that image. Now set

image

Notice that image only depends on the topology of T, so we may call it the BME vector of T.

Example 10.8

For image shown in Figure 10.8, below, one obtains the BME vector

image

image

Figure 10.8 An example of image; see Example 10.8

Exercise 10.20

Find the BME vectors image for each imageent

Using the BME vectors, Pauplin’s[14]formula for the balanced tree length estimation (i.e., estimated BME length) image, starting with the dissimilarity map D, is given by

image (10.4)

Equivalently, in the standard language of innner products of vectors, upon rewriting D as a vector d using the ordering from before, one gets image. The BME method proceeds by seeking image so that image in Eq. 10.4 is minimal.

Exercise 10.21

Using the results of Exercise 10.20 and the dissimilarity map from Example 10.5, use the BME method to find a tree image that best fits D under Pauplin’s branch length estimation scheme. Also find image for the interior edge e of T in this case as well. ent

In [28], the authors in fact defined terms image for any phylogenetic X-tree T (not necessarily binary) in terms of certain cyclic permutations of (“circular orderings”) of X that respect the structure of T. In the case of edge-weighted binary X-trees, one recovers the expression for image in the BME vector and Pauplin’s formula. In [28] this perspective was used to establish the consistency of the balanced tree length estimation. The consistency (and statistical consistency) of the BME method as a whole was established in [29].

Like all methods that would require a complete search over all trees image, combinatorial explosion is a problem. The large number of trees to consider quickly overwhelms computer capacity, so that in practice, one must come up with a heuristic. In [29,15], the theoretical underpinnings for the BME method were further examined, and a heuristic proposed and implemented for the BME method via the program FastME. This program essentially uses only nearest-neighbor interchange (NNI) moves among trees to seek out a tree T/BME vector image that is a good candidate for having minimal image. For more details, see [29,15].

Exercise 10.22

To access FastME, go to http://www.atgc-montpellier.fr/fastme/

1. Use the example file (under input data) to explore the option “Balanced GME.” Note that one can click on “file” in “example file” to see the data, some of which will be familiar from previous exercises. Results will be e-mailed back to you in Newick format.

2. Review [15] for a discussion of the success of this implementation versus the NJ Algorithm. ent

Options in FastME also include implementations of the NJ Algorithm and its improved version, the BioNJ Algorithm. There is a special reason why. Although the NJ Algorithm was long a favorite among molecular and other biologists for its effectiveness in practice, after many years of investigation, reservations about the NJ Algorithm remained, due in part to an ongoing sense of uncertainty about just what the algorithm was doing (and hence why it could be successful), from a more theoretical perspective. Illumination came in the form of a result by [29] linking the BME method and NJ Algorithm by showing that the NJ Algorithm is a so-called greedy algorithm for implementing the BME principle. Since the BME method is well grounded in standard mathematical approaches (minimization problems), this link shed light on the “true” nature of the NJ Algorithm. As a greedy algorithm, the NJ Algorithm acts locally, however, and like any greedy algorithm, does not necessarily actually produce a tree T which is optimal for the BME principle (i.e., has total length image minimal). The BME method for phylogenetic tree reconstruction can be reformulated still further in a neat geometric way. By definition, a convex set B (say image) is one for which any point image that is on any line image joining any two points image is also in B, i.e., z on image implies image. It is an easy exercise to see that the intersection of any set of convex sets is also convex. Let image be the convex hull in image of all the BME vectors image for image; by definition, the convex hull of a set of points is the intersection of all convex sets containing these points. This produces a polytope (analog of a polygon in higher dimensional space), which we call the nth BME polytope image, inside image. As a consequence of the definition of the BME vectors and the BME method, the vertices of the BME polytope are actually the BME vectors image. Using the generalized notion of BME vectors as in [2], the BME vector of the star phylogeny lies in the interior of the BME polytope image, and all other BME vectors lie on the boundary of the BME polytope.

Exercise 10.23

Explain why image is just a triangle in imageent

By Exercise 10.23, the BME polytope for image is a familiar two-dimensional object. More generally, the dimension of image. If image, then thinking not of the full polytope in image, but considering only vertices and edges of image alone, there is a graph isomorphism between image and the complete graph on n vertices.

Example 10.9

For image, so image is graph isomorphic to the complete graph on 15 vertices, shown below (see Figure 10.9).

image

Figure 10.9 A representation of the BME polytope image

These and many other facts about BME polytopes appear in [27,23]. For image, [27] observed that in the actual polytope image, the graph isomorphism with the complete graph on image vertices no longer holds: there are pairs of trees image for which the BME vectors are not connected by an edge of the polytope. That is, the line joining image and image is not itself an edge of the polytope, but must sit somewhere inside image, since image is convex. See [p. 5][27] for a picture and description of these pairs, and other information about faces, facets, and more for small n. For image, computational complexity prevents the hands-on geometric methods explored in [27] from being exploited for higher n.

Describing more fully the edges and other features of the the BME polytope could have practical implications for phylogenetic tree reconstruction. In the language of the well-known mathematical methods of linear optimization, the BME method becomes a linear programming problem: using the correspondence between image and image, carrying out the BME method for D is the same as minimizing the linear objective d over the (convex) BME polytope image. This framing of the problem motivated further theoretical studies of the BME polytope in [23], where a new version of a cherry-picking algorithm, this time for the BME method, was developed and used to show that both NNI moves and SPR (subtree-pruning-and-regrafting) moves between trees image give rise to edges in the BME polytope. Since FastME uses NNI moves to heuristically implement the BME method, Haws et al. [23] implies that FastME is an edge-walk along the BME polytope. With better knowledge of the BME polytope, it might be possible to find an even faster heuristic. Results in [23] show that clades (subtrees obtained by taking all children of a parent node and the edges that join them) correspond to “faces” in the BME polytope and identify themselves with BME polytopes image values image inside image.

Finally, as it was for NJ cones, it is possible to define BME cones. For any image, a BME cone is simply the set of all image for which image as in Eq. 10.4 is minimal. Again, geometrically, the BME cone describes the BME method, for the cone for T consists of all D for which the BME method applied to D produces T (with weighting image). As before, image is partitioned into the BME cones, running over all image, although a given D may lie in more than one BME cone. A consequence in [23] of this analysis, in combination with previous work on the NJ cones, is a further refinement of the relationship between the relationship of NJ as a greedy algorithm for the BME principle. In mathematical lingo, for any cherry-picking order image, the intersection of the NJ cone image and the BME cone for T has positive measure. In particular, this shows that for any order of joining nodes to form a tree, there is a dissimilarity map so that the NJ Algorithm returns the BME tree, a useful phylogenetic result. The performance of the NJ Algorithm and BME method can be further analyzed through studying these cones, and the implications of their differing geometries are explored further in [27,23]. A complete characterization of edges of the BME polytope in terms of tree operations remains an open question, however.

Also, in terms of further work, [17] studied how the BME method works when the addition of an extra taxon (e.g., species) to a data set alters the structure of the optimal phylogenetic tree. They characterized the behavior of the BME phylogenetics on such data sets, using the BME polytopes and the BME cones which are the normal cones of the BME polytope, as noted in [27]. A related study for another tree reconstruction method, UPGMA, was carried out in [30].

10.5 Summary

Many open questions in the treatment given in this chapter of some distance-based methods for phylogenetic tree reconstruction, via geometric and computational perspectives, remain. For example, giving a complete characterization of edges of the BME polytope in terms of tree operations has yet to be done. Other opportunities for further research include expanding from trees to networks, since it is known that trees do not capture well several aspects of molecular biology, including hybridization. Currently, distance-based methods have been superseded (at least in common use by many biologists) by other approaches. These other approaches include maximum likelihood methods and Bayesian approaches, the latter of which produce not just a single tree, but a distribution of possible trees.

Still, there remain good reasons for using distance-based methods. We have, quite misleadingly used trees with small examples of leaves in this chapter in order to make the underlying ideas accessible, and because much of the underlying mathematical theory can be reduced to small trees, like quartets. However, in the “real world” of biology, the most obvious reason for continuing interest in distance-based methods is due to their effectiveness via good polynomial time algorithms in the face of combinatorial explosion and computational complexity when the number of leaves is large. Even image can overwhelm many other methods, but we have (also rather scandalously), with the exclusion of consistency, not gone into details in this chapter about comparing the success, via additional factors such as efficiency, robustness, power, and falsifiability, of distance-based algorithms and their competitors. (See [4, Section 6.1.4] for a brief but comprehensive treatment, up to its publication date.)

The link between the NJ Algorithm and the BME method described above, but not as well known as it might be, also addresses some issues which have perhaps mistakenly led biologists to pull back from using the NJ Algorithm or its variants. A very recent paper [13] also provides some thought-provoking arguments against another commonly voiced concern among biologists, that in providing only information between leaves, distance-based methods may lose (too much) information (e.g., see [4, Section 6.2.4] for a sample of such concerns in a biology-friendly text). Looking at covariances in the dissimilarity maps reveals more information that has commonly disregarded, given calculations like that for the Q-criterion which explicitly use only the values image for an input dissimilarity map D. Still, we hope this chapter’s treatment will open the doors to these questions and issues, and to further readings in the articles and books referenced herein.

References

1. Felsenstein J. Inferring phylogenies. Sinauer Associates Inc. 2004.

2. Semple C, Steel M. Phylogenetics Oxford lecture series in mathematics and its applications. vol. 24 Oxford: Oxford University Press; 2003.

3. Pachter L, Sturmfels B editors. Algebraic statistics for computational biology. Cambridge University Press 2005.

4. Page RDM, Holmes EC. Molecular evolution: a phylogenetic approach. Blackwell Science Ltd 1998.

5. Turelli M, Barton NH, Coyne JA. Theory and speciation. Trends Ecol Evol. 2001;16:330–343.

6. Baum D. Reading a phylogenetic tree: the meaning of monophyletic groups. Nature Educ 2008;1.

7. Wang L, Jiang T. On the complexity of multiple sequence alignment. J Comput Biol. 1994;1:337–348.

8. Elias I. Settling the intractability of multiple alignment. J Comput Biol. 2006;13:1323–1339.

9. Desper R, Gascuel O. Theoretical foundation of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting. Mol Biol Evo. 2004;2:587–598.

10. Felsenstein J. Cases in which parsimony and compatibility methods will be positively misleading. Syst Zool. 1978;27:401–410.

11. DeBry RW. The consistency of several phylogeny-inference methods under varying evolutionary rates. Mol Biol Evol. 1992;9:537–551.

12. Denis F, Gascuel O. On the consistency of the minimum evolution principle of phylogenetic inference. Discrete Appl Math. 2003;127:63–77.

13. Roch S. Toward extracting all phylogenetic information from matrices of evolutionary distances. Science. 2010;327:1376–1379.

14. Pauplin Y. Direct calculation of a tree length using a distance matrix. J Mol Evol. 2000;51:41–47.

15. Desper R, Gascuel O. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J Comp Bio. 2002;9:687–705.

16. Saitou N, Nei M. The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–425.

17. Cueto MA, Matsen FA. Polyhedral geometry of phylogenetic rogue taxa; 2010. Preprint <arXiv:1001.5241>.

18. Billera LP, Holmes SJ, Vogtman K. The geometry of space of phylogenetic trees. Adv Appl Math. 2001;27:733–767.

19. Nye T. Principle component analysis in the space of phylogenetic trees. Ann Stat. 2011;39:2716–2739.

20. Owen M. Computing geodesic distances in tree space. SIAM J Discrete Math. 2011;25:1506–1529.

21. Owen M, Provan JS. A fast algorithm for computing geodesics in tree space. IEEE/ACM Trans Comput Biol Bioinform. 2011;8:2–13.

22. Day W. Computational complexity of inferring phylogenies from dissimilarity matrices. Bull Math Biol. 1987;49:461–467.

23. Haws D, Hodge T, Yoshida R. Optimality of the neighbor joining algorithm and faces of the balanced minimum evolution polytope. Bull Math Biol 2011. doi 10.1007/s11538-011-9640-x.

24. Studier JA, Keppler KJ. A note on the neighbor-joining method of Saitou and Nei. Mol Biol Evol. 1988;5:729–731.

25. Mihaescu R, Levy D, Pachter L. Why neighbor-joining works Algorithmica. 2009;54:1–24.

26. Eickmeyer K, Yoshida R. The geometry of the neighbor-joining algorithm for small trees. vol. 1547 Lecture notes in computer science 2008; p. 81–85.

27. Eickmeyer K, Huggins P, Pachter L, Yoshida R. On the optimality of the neighbor-joining algorithm. Algorithms Mol Biol. 2008;3 In: http://www.almob.org/content/3/1/5; 2008.

28. Steel M, Semple C. Cyclic permutations and evolutionary trees. Adv Appl Math. 2004;32:669–680.

29. Gascuel O, Steel M. Neighbor-joining revealed. Mol Biol Evol. 2006;23:1997–2000.

30. Davidson R, Sullivant S. Polyhedral combinatorics of UPGMA cones; 2012. Preprint. <arxiv.org/abs/1206.1621>.


1Separate from the ML approach to tree reconstruction.

2We use this terminology although in a biological interpretation, interior nodes may not be viewed as ancestors, but only as demarcation points for speciation.

3Though tedious, carrying out these computations will provide the experience necessary to make sense of the algorithm. A tutorial for the first step, using an equivalent arithmetic formulation of the Q-criterion presented here, is given at http://www.icp.ucl.ac.be/∼opperd/private/neighbor.html.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset