Chapter 8. Multiple Sequence Alignments, Trees, and Profiles

In Chapter 7, we introduced the idea of using sequence alignment to find and compare pairs of related sequences. Biologically interesting problems, however, often involve comparing more than two sequences at once. For example, a BLAST or FASTA search can yield a large number of sequences that match the query. How do you compare all these resulting sequences with each other? In other words, how can you examine these sequences to understand how they are related to one another?

One approach is to perform pairwise alignments of all pairs of sequences, then study these pairwise alignments individually. It's more efficient (and easier to comprehend), however, if you compare all the sequences at once, then examine the resulting ensemble alignment. This process is known as multiple sequence alignment. Multiple sequence alignments can be used to study groups of related genes or proteins, to infer evolutionary relationships between genes, and to discover patterns that are shared among groups of functionally or structurally related sequences. In this chapter, we introduce some tools for creating and interpreting multiple sequence alignments and describe some of their applications, including phylogenetic inference and motif discovery. Phylogenetic inference and motif discovery are rooted in evolutionary theory, so before we dive into a discussion of that area of bioinformatics, let's take a minute to review the history and theory of evolution.

The Morphological to the Molecular

In order to ground our discussion of the details of multiple sequence alignment, let's take another brief look at evolution. One of the goals of biology has been the creation of a taxonomy for living things, a method of organizing species in terms of their relationships to one another. Early biologists classified species solely according to their morphology—the physical appearance of the organism—and later, as dissection became a more common practice, their anatomy.

Naturalists also discovered fossils of creatures that didn't resemble anything alive at the time, but were thought to have once been living things. This evidence introduced the possibility that life on Earth had changed over time. It also suggested that the interrelationship between species isn't static, but rather is the result of an evolutionary process. As understanding of the geophysical processes of the planet improved, the amount of time required for such changes to occur became clearer. It is now widely accepted by scientists that life on Earth is approximately 3.5 billion years old. Fossil records of single-celled organisms resembling bacteria, with an estimated age of 3.5 billion years, have been found and catalogued.

The evolutionary theory that was eventually accepted by most biologists was that of Charles Darwin. Darwin proposed that every generation of living creatures has some variability. The individuals whose variations predispose them to survive in their environment are the ones who reproduce most successfully and pass on their traits in greater numbers. In light of this theory, it has been hypothesized that the diversity of life forms on Earth is due to divergence, perhaps even from one common ancestral unicellular organism, to fill various biological niches.

Molecular evolution extends the concept of evolution to the level of DNA and protein sequences. Although the replication of DNA sequence is a very accurate process, small replication errors accumulate over time, along with radiation damage and other mutations or alterations of the genomic sequence. Instead of evolutionary pressure selecting organisms based on morphological traits, selection occurs at the level of mutations. Consequently, the only mutations observed in genes from healthy organisms are those that don't result in the organisms' death.

Because these changes between gene sequences are incremental, we can take homologous genes—genes with common evolutionary origin and related function—from a number of divergent organisms and compare them by aligning identical or similar residues. This comparison of multiple sequences shows which regions of a gene (or its derived protein) are sensitive to mutation and which are tolerant of having one residue replaced by another. Thus, we can develop hypotheses about the molecular events underlying the evolution of these sequences. Many bioinformatics methods, including pairwise sequence comparison and sequence database searching, are based on this observation that homologous genes have similar sequences.

In considering sequence similarity, there is one additional wrinkle to bear in mind: the difference between orthologs and paralogs. The chemical processes of molecular evolution are responsible for more than just giving rise to species differences. Evolutionary change can occur within the genome of a single species as well. Orthologs are genes that are evolutionarily related, share a function, and have diverged by speciation. Paralogs , on the other hand, have a common ancestor but have diverged by gene duplication and no longer have a common functional role. In other words, orthologs have the same function but occur in different species, while paralogs exist in the same genome but have different functions. A sequence database search may return both orthologs and paralogs. Depending on the objectives of your study, you probably will not want to treat them all as members of the same set.

Multiple Sequence Alignment

Multiple sequence alignment techniques are most commonly applied to protein sequences; ideally they are a statement of both evolutionary and structural similarity among the proteins encoded by each sequence in the alignment. We know that proteins with closely related functions are similar in both sequence and structure from organism to organism, and that sequence tends to change more rapidly than structure in the course of evolution. In multiple alignments generated from sequence data alone, regions that are similar in sequence are usually found to be superimposable in structure as well.

With a detailed knowledge of the biochemistry of a protein, you can create a multiple alignment by hand. This is a painstaking process, however. The challenge of automatic alignment is that it is hard to define exactly what an optimal multiple alignment is, and impossible to set a standard for a single correct multiple alignment. In theory, there is one underlying evolutionary process and one evolutionarily correct alignment to be generated from any group of sequences. However, the differences between sequences can be so great in parts of an alignment that there isn't an apparent, unique solution to be found by an alignment algorithm. Those same divergent regions are often structurally unalignable as well. Most of the insight that we derive from multiple alignments comes from analyzing the regions of similarity, not from attempting to align the very diverged regions.

The dynamic programming algorithm used for pairwise sequence alignment can theoretically be extended to any number of sequences. However, the time and memory requirements of this algorithm increase exponentially with the number of sequences. Dynamic programming alignment of two sequences takes seconds. Alignment of four relatively short sequences takes a few hours. Beyond that, it becomes impractical to align sequences this way. The program MSA is an implementation of an algorithm that reduces the complexity of the dynamic programming problem for multiple sequences to some extent. It can align about seven relatively short (200 -300) protein sequences in a reasonable amount of time. However, MSA is of little use when comparing large numbers of sequences.

Progressive Strategies for Multiple Alignment

A common approach to multiple sequence alignment is to progressively align pairs of sequences. The general progressive strategy can be outlined as follows: a starting pair of sequences is selected and aligned, then each subsequent sequence is aligned to the previous alignment. Like the Needleman-Wunsch and Smith-Waterman algorithms for sequence alignment, progressive alignment is an instance of a heuristic algorithm. Specifically, it is a greedy algorithm. Greedy algorithms decompose a problem into pieces, then choose the best solution to each piece without paying attention to the problem as a whole. In the case of progressive alignment, the overall problem (alignment of many sequences) is decomposed into a series of pairwise alignment steps.

Because it is a heuristic algorithm, progressive alignment isn't guaranteed to find the best possible alignment. In practice, however, it is efficient and produces biologically meaningful results. Progressive alignment methods differ in several respects: how they choose the initial pair of sequences to align, whether they align every subsequent sequence to a single cumulative alignment or create subfamilies, and how they score individual alignments and alignments of individual sequences to previous alignments.

Multiple Alignment with ClustalW

One commonly used program for progressive multiple sequence alignment is ClustalW. The heuristic used in ClustalW is based on phylogenetic analysis. First, a pairwise distance matrix for all the sequences to be aligned is generated, and a guide tree is created using the neighbor-joining algorithm. Then, each of the most closely related pairs of sequences—the outermost branches of the tree—are aligned to each other using dynamic programming. Next, each new alignment is analyzed to build a sequence profile. Finally, alignment profiles are aligned to each other or to other sequences (depending on the topology of the tree) until a full alignment is built.

This strategy produces reasonable alignments under a range of conditions. It's not foolproof; for distantly related sequences, it can build on the inaccuracies of pairwise alignment and phylogenetic analysis. But for sequence sets with some recognizably related pairs, it builds on the strengths of these methods. Pairwise sequence alignment by dynamic programming is very accurate for closely related sequences regardless of which scoring matrix or penalty values are used. Phylogenetic analysis is relatively unambiguous for closely related sequences. Using multiple sequences to create profiles increases the accuracy of pairwise alignment for more distantly related sequences.

There are many parameters involved in multiple sequence alignment. There are, of course, scoring matrices and gap penalties associated with the pairwise alignment steps. In addition, there are weighting parameters that alter the scoring matrix used in sequence-profile and profile-profile alignments. In ClustalW, these are set from the Multiple Alignment submenu or the Profile Structure Alignments submenu. In ClustalX, they are set from the Alignment pulldown menu.

The pairwise alignment parameters are familiar and have the same meaning in multiple alignment as they do in pairwise alignment. The multiple alignment parameters include gap opening and gap extension penalties for the multiple alignment process—to be used when fine-tuning alignments—and a maximum allowable delay, in terms of sequence length, for the start of divergent sequences at the beginning of the alignment.

One of ClustalW's heuristics is that, in protein sequence alignment, different scoring matrices are used for each alignment based on expected evolutionary distance. If two sequences are close neighbors in the tree, a scoring matrix optimized for close relationships aligns them. Distant neighbors are aligned using matrices optimized for distant relationships. Thus, when prompted to choose a series of matrices in the Multiple Alignment Parameters menu, it means just that: use BLOSUM62 for close relationships and BLOSUM45 for more distant relationships, rather than the same scoring matrix for all pairwise alignments.

Another heuristic that ClustalW uses is scalable gap penalties for protein profile alignments. A gap opening next to a conserved hydrophobic residue can be penalized more heavily than a gap opening next to a hydrophilic residue. A gap opening too close to another gap can be penalized more heavily than an isolated gap. These parameters are set in the Protein Gap Parameters menu.

Although ClustalW is run from the Unix command line, it is menu-driven and doesn't rely on command-line options. To start the program, you can simply type clustalw, and a menu of options is presented:

**************************************************************
******** CLUSTAL W (1.8) Multiple Sequence Alignments  ********  
**************************************************************
     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program) 

This menu, along with subsequent menus that appear after you input your sequences, guides you through the use of ClustalW in a fairly straightforward fashion. Alignments are written in a plain-text format.

While ClustalW is simple to install and use on a Linux workstation, ClustalX, the X Windows-based graphical user interface for ClustalW, isn't so easy to compile. However, ClustalX runs in its own X window, has pulldown menus, and allows viewing and plotting of multiple sequence alignments in a color-coded format. It also allows you to append sequences to an alignment one at a time, and to produce color PostScript output of specified sequence ranges in an alignment from different files, if desired, along with other convenient features. To install ClustalX on a Linux machine, you need:

  • The ClustalX binaries

  • The NCBI software toolkit source distribution

  • The LessTif libraries

The first thing you need to do is install the LessTif libraries. This distribution is extremely easy to work with. The LessTif libraries are available from http://www.lesstif.org and may also be available within your Linux distribution. The NCBI Toolkit (available from http://www.ncbi.nlm.nih.gov) should compile completely as long as your LessTif libraries are installed in /usr/X11R6/lib. If the NCBI Toolkit installation produces the file libvibrant.a, the command clustalx will work.

Viewing and Editing Alignments with Jalview

Usage: Jalview alignmentfile FORMAT

Viewing alignments is useful, but it's often necessary to edit them as well. Alignment editing is one of the few bioinformatics functions that's actually been done better for the Windows platform than for Unix, but if you've installed Java support on your workstation you can use a program called Jalview, written by Michele Clamp and available from the EBI at http://www.ebi.ac.uk/~michele/jalview/contents.html.[*]

To use Jalview on a Linux workstation, download the full Unix distribution of the version you want. Unpack the distribution, then edit the file Jalview to reflect your local environment. Jalview is a script that sets up the environment for Jalview and actually starts the program. Specifically, you want to set the environment variable CLASSPATH to reflect the location of the class file in your JDK ( Java Development Kit) and the location of the Jalview classes (your jalview.jar file). Set the environment variable JAVA_EXE to point to your Java executable:

setenv CLASSPATH /usr/jdk118/lib/classes.zip:/usr/local/jalview/jalview.jar
setenv JAVA_EXE /usr/jdk118/bin/java 

Jalview can read an alignment (.aln ) file from Clustal, as well as several other alignment formats. We focus on using Jalview as an alignment editor, but it does have other functions you can explore. It's also capable of searching databases if you specify them as a command-line option.

To run Jalview, make a link to the Jalview script in your working directory.

The common alignment formats that Jalview recognizes are MSF, CLUSTAL, and FASTA. These need to be indicated in all capital letters when the command is given.

The Jalview window is an active place: click with care. You can select individual sequences by clicking on their names at the left of the window, and you can select ranges of sequence by clicking on the numerical labels at the top of the sequence alignment. A red box appears to indicate the selected rows.

As in any other menu interface, the File menu contains file import and export options. Sequence alignments can be read from a file or even from a web URL. The Edit pulldown contains commands that allow you to delete, copy, and move selected sequences or columns. You can also manipulate the alignment by hand. Clicking on any letter in the alignment allows you to open a gap and move everything to the left of that letter over by dragging in either direction with the mouse. The Colour menu contains options for color-coding alignments, most of which are most informative for protein sequence. The Calculate menu contains options for calculating consensus sequences and phylogenetic trees.

Sequence Logos

Another way to view sequence alignments, and one which has become quite popular recently, is the sequence logo format developed by Tom Schneider of the National Cancer Institute. This format is especially good for shorter sequence regions, such as protein motifs. Consensus sequences represent each position in an alignment with the residue that is most commonly found in that position. Other information in the alignment, such as whether there are any other residues that occur at that site and with what relative frequencies they occur, is lost in a consensus sequence.

Sequence logos, as illustrated in Figure 8-1, are a graphical way to represent relative frequencies, information content, order of substitution preference, and other characteristics of each site in an alignment.

A sequence logo

Figure 8-1. A sequence logo

In a sequence logo, the letters in the column at each sequence position represent the consensus sequence in more detail than a standard single-letter consensus sequence would. The total height of a column represents the amount of information contained in that sequence position.[†] The sizes of the individual letters depict their relative frequency of occurrence.

The software for creating sequence logos is part of a larger group of programs called the DELILA package, which was originally developed in the Pascal language. You actually need only two of the many DELILA programs (alpro and makelogo) to create logos from aligned sequences. Pascal compilers aren't among the compilers commonly found on Linux systems, but there is a standard GNU Pascal compiler you can download if you're feeling adventurous. The other way to compile the software is to use the C versions of the programs that are now available. Because these programs have been automatically translated from Pascal, they require that the p2c (Pascal-to-C) libraries are installed on your system.

An easier approach for the novice is to use the sequence logo web server at the University of Cambridge, which (as of this writing) is actually recommended by the author of the DELILA programs and hence, we assume, does exactly what it's supposed to do. Aligned sequences can be submitted to this server in FASTA alignment format, which can be generated by ClustalX.

Phylogenetic Analysis

Having covered some of the basics of multiple sequence alignment, we now introduce one of its applications: phylogenetic inference. Phylogenetic inference is the process of developing hypotheses about the evolutionary relatedness of organisms based on their observable characteristics. Traditionally, phylogenetic analyses have been based on the gross anatomy of species. When Linneaus developed the system of classification into kingdoms, phyla, genera, and species, the early biologists sorted living things into a symbolic Tree of Life, which we saw in Figure 1-3. This tree-based representation of the relationships among species is a phylogenetic tree; it has since been adopted as a convenient schematic for depicting evolutionary relatedness based on sequence similarity. The quantitative nature of sequence relationships has allowed the development of more rigorous methods and rules for tree drawing.

While hand-drawn trees of life may branch fancifully according to what is essentially an artist's conception of evolutionary relationships, modern phylogenetic trees are strictly binary; that is, at any branch point, a parent branch splits into only two daughter branches. Binary trees can approximate any other branching pattern, and the assumption that trees are binary greatly simplifies the tree-building algorithms.

The length of branches in a quantitative phylogenetic tree can be determined in more than one way. Evolutionary distance between pairs of sequences, relative to other sequences in an input data set, is one way to assign branch length.

While a phylogeny of species generally has a root, assuming that all species have a specific common ancestor, a phylogenetic tree derived from sequence data may be rooted or unrooted. It isn't too difficult to calculate the similarity between any two sequences in a group and to determine where branching points belong. It is much harder to pinpoint which sequence in such a tree is the common ancestor, or which pair of sequences can be selected as the first daughters of a common ancestor. While some phylogenetic inference programs do offer a hypothesis about the root of a tree, most simply produce unrooted trees. Figure 8-2 and Figure 8-3 illustrate rooted and unrooted phylogenetic trees.

A rooted phylogenetic tree

Figure 8-2. A rooted phylogenetic tree

An unrooted phylogenetic tree

Figure 8-3. An unrooted phylogenetic tree

A phylogeny inferred from a protein or nucleic acid sequence has only a passing resemblance to a whole-organism tree of life (a true tree) that represents actual speciation events. A single phylogeny may be a tree, and it may describe a biological entity, but it takes far more than a single evolutionary analysis to draw conclusions about whole-organism phylogeny. Sequence-based phylogenies are quantitative. When they are built based on sufficient amounts of data, they can provide valuable, scientifically valid evidence to support theories of evolutionary history. However, a single sequence-based phylogenetic analysis can only quantitatively describe the input data set. It isn't valid as a quantitative tool beyond the bounds of that data set, and if you are using phylogenetic analysis tools to develop evolutionary hypotheses, it is critical to remember this point.

It has been shown, by comparative analysis of phylogenies generated for different protein and gene families, that one protein may evolve more quickly than another, and that a single protein may evolve more quickly in some organisms than in others. Thus, the phylogenetic analysis of a sequence family is most informative about the evolution of that particular gene. Only by analysis of much larger sets of data can theories of whole-organism phylogeny be suggested.

Phylogenetic Trees Based on Pairwise Distances

One of the easiest to understand algorithms for tree drawing is the pairwise distance method. This method produces a rooted tree. The algorithm is initialized by defining a matrix of distances between each pair of sequences in the input set. Sequences are then clustered according to distance, in effect building the tree from the branches down to the root.

Distances can be defined by more than one measure, but one of the more common and simple measures of dissimilarity between DNA sequences is the Jukes-Cantor distance, which is logarithmically related to the fraction of sites at which two sequences in an alignment differ. The fraction of matching positions in an ungapped alignment between two unrelated DNA sequences approaches 25%. Consequently, the Jukes-Cantor distance is scaled such that it approaches infinity as the fraction of unmatched residue pairs approaches 75%.

The pairwise clustering procedure used for tree drawing (UPGMA, unweighted pair group method using arithmetic averages) is intuitive. To begin with, each sequence is assigned to its own cluster, and a branch (or leaf ) of the tree is started for that sequence at height zero in the tree. Then, the two clusters that are closest together in terms of whatever distance measure has been chosen are merged into a single cluster. A branch point (or node) is defined that connects the two branches. The node is placed at a height in the tree that reflects the distance between the two leaves that have been joined. This process is repeated iteratively, until there are only two clusters left. When they are joined, the root of the tree is defined. The branch lengths in a tree constructed using this process theoretically reflect evolutionary time.

Phylogenetic Trees Based on Neighbor Joining

Neighbor joining is another distance matrix method. It eliminates a possible error that can occur when the UPGMA method is used. UPGMA produces trees in which the branches that are closest together by absolute distance are placed as neighbors in the tree. This assumption places a restriction on the topology of the tree that can lead to incorrect tree construction under some conditions.

In order to get around this problem, the neighbor-joining algorithm searches not just for minimum pairwise distances according to the distance metric, but for sets of neighbors that minimize the total length of the tree. Neighbor joining is the most widely used of the distance-based methods and can produce reasonable trees, especially when evolutionary distances are short.

Phylogenetic Trees Based on Maximum Parsimony

A more widely used algorithm for tree drawing is called parsimony. Parsimony is related to Occam's Razor, a principle formulated by the medieval philosopher William of Ockham that states the simplest explanation is probably the correct one.[‡] Parsimony searches among the set of possible trees to find the one requiring the least number of nucleic acid or amino acid substitutions to explain the observed differences between sequences.

The only sites considered in a parsimony analysis of aligned sequences are those that provide evolutionary information—that is, those sites that favor the choice of one tree topology over another. A site is considered to be informative if there is more than one kind of residue at the site, and if each type of residue is represented in more than one sequence in the alignment. Then, for each possible tree topology, the number of inferred evolutionary changes at each site is calculated. The topology that is maximally parsimonious is that for which the total number of inferred changes at all the informative sites is minimized. In some cases there may be multiple tree topologies that are equally parsimonious.

As the number of sequences increases, so does the number of possible tree topologies. After a certain point, it is impossible to exhaustively enumerate the scores of each topology. A shortcut algorithm that finds the maximally parsimonious tree in such cases is the branch-and-bound algorithm. This algorithm establishes an upper bound for the number of allowed evolutionary changes by computing a tree using a fast or arbitrary method. As it evaluates other trees, it throws out any exceeding this upper bound before the calculation is completed.

Phylogenetic Trees Based on Maximum Likelihood Estimation

Maximum likelihood methods also evaluate every possible tree topology given a starting set of sequences. Maximum likelihood methods are probabilistic; that is, they search for the optimal choice by assigning probabilities to every possible evolutionary change at informative sites, and by maximizing the total probability of the tree. Maximum likelihood methods use information about amino acid or nucleotide substitution rates, analogous to the substitution matrices that are used in multiple sequence alignment.

Software for Phylogenetic Analysis

There is a variety of phylogenetic analysis software available for many operating systems. With such a range of choices, which package do you use? One of the most extensive listings currently available is maintained by Dr. Joe Felsenstein, the author of the PHYLIP package, and is accessible from the PHYLIP web page (http://evolution.genetics.washington.edu/phylip.html). If you don't want to follow our example and use PHYLIP, you can easily find information about other packages.

PHYLIP

The most widely distributed phylogenetic analysis package is PHYLIP. It contains 30 programs that implement different phylogenetic analysis algorithms. Each of the programs runs separately, from the command line. By default, most of the programs look for an input file called infile and write an output file called outfile. Rather than entering parameters via command-line flags, as with BLAST, the programs have an interactive text interface that prompts you for information.

The following are the PHYLIP programs you are most likely to use when you're just getting started analyzing protein and DNA sequence data:

PROTPARS

Infers phylogenies from protein sequence input using the parsimony method

PROTDIST

Computes an evolutionary distance matrix from protein sequence input, using maximum likelihood estimation

DNAPARS

Infers phylogenies from DNA sequence input using parsimony

DNAPENNY

Finds all maximally parsimonious phylogenies for a set of sequences using a branch-and-bound search

DNAML

Infers phylogenies from DNA sequence input using maximum likelihood estimation

DNADIST

Computes a distance matrix from DNA sequence input using the Jukes-Cantor distance or one of three other distance criteria

NEIGHBOR

Infers phylogenies from distance matrix data using either the pairwise clustering or the neighbor joining algorithm

DRAWGRAM

Draws a rooted tree based on output from one of the phylogeny inference programs

DRAWTREE

Draws an unrooted tree based on output from one of the phylogeny inference programs

CONSENSE

Computes a consensus tree from a group of phylogenies

RETREE

Allows interactive manipulation of a tree by the user—not based on data

PHYLIP is a flexible package, and the programs can be used together in many ways. To analyze a set of protein sequences with PHYLIP, you can:

  1. Read a multiple protein sequence alignment using PROTDIST and create a distance matrix.

  2. Input the distance matrix to NEIGHBOR and generate a phylogeny based on neighbor joining.

  3. Read the phylogeny into DRAWTREE and produce an unrooted phylogenetic tree.

Or, you can:

  1. Read a multiple sequence alignment using PROTPARS and produce a phylogeny based on parsimony.

  2. Read the phylogeny using DRAWGRAM and produce a rooted tree.

Each of the PHYLIP programs is exhaustively documented in the * . doc files available with the PHYLIP distribution. This documentation has been converted into HTML by several groups. Links to web-based PHYLIP documentation are available from the PHYLIP home page.

Several organizations have made PHYLIP servers available on the Web; the version of PHYLIP in the SDSC Biology Workbench produces downloadable PostScript output.

The PHYLIP input format

PHYLIP's molecular sequence analysis programs can accept sequence data in an aligned (interleaved) format:

39 
Archaeopt  CGATGCTTAC CGCCGATGCT
Hesperorni CGTTACTCGT TGTCGTTACT
Baluchithe TAATGTTAAT TGTTAATGTT
B. virgini TAATGTTCGT TGTTAATGTT
Brontosaur CAAAACCCAT CATCAAAACC
B.subtilis GGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC

where the first 10 characters are that sequence's name followed by the sequence in aligned form. Subsequent rows follow. In a sequential format, the complete first sequence is given, then the second complete sequence, etc. However, in either case, the sequences must be prealigned by another program. PHYLIP doesn't have a utility for computing multiple sequence alignments.

If you examine the phylogeny output from PHYLIP, you'll find it's represented by codes indicating each of the sequences, arranged in nested parentheses. This is called Newick notation. The pattern of the parentheses indicates the topology of the tree. The innermost parentheses surround the terminal branches of the tree, e.g., (A,B), and each subsequent set of parentheses joins another pair of branches, e.g., ((A,B),(C,D)). If the algorithm that generates the phylogeny produces branch lengths, these branch lengths are associated explicitly with each branch within the Newick notation: e.g., ((A:1.2,B:1.5):1.0,(C:2.5,D:0.5):1.2).

Generating input for PHYLIP with ClustalX

The multiple sequence alignment program ClustalX, which we discussed earlier in this chapter, draws phylogenetic trees with the neighbor joining method. Perhaps more importantly, it can read sequences in various input formats and then write PHYLIP-format files from multiple sequence alignments, using a simple Save As command from within the ClustalX interface.

Profiles and Motifs

In addition to studying relationships between sequences, one of the most successful applications of multiple sequence alignments is in discovering novel, related sequences. This profile- or motif-based analysis uses knowledge derived from multiple alignments to construct and search for sequence patterns. In this section, we first introduce some of the concepts behind motifs, then describe tools that use these principles to search sequence databases.

First, by way of a refresher, a multiple sequence alignment is an alignment of anywhere from three to hundreds of sequences. Multiple sequence alignments can span the full sequence of the proteins involved or a single region of similarity, depending on their purpose. Multiple sequence alignments, such as the one shown in Figure 8-4, are generally built up by iterative pairwise comparison of sequences and sequence groups, rather than by explicit multiple alignment.

A multiple sequence alignment, shown using ClustalX

Figure 8-4. A multiple sequence alignment, shown using ClustalX

A sequence motif is a locally conserved region of a sequence, or a short sequence pattern shared by a set of sequences. The term "motif" most often refers to any sequence pattern that is predictive of a molecule's function, a structural feature, or family membership. Motifs can be detected in protein, DNA, and RNA sequences, but the most common use of motif-based analyses is the detection of sequence motifs that correspond to structural or functional features in proteins. Motifs are generated from multiple sequence alignments and can be displayed as patterns of amino acids (such as those in the Prosite database) or as sequence logos. For computational purposes, they can be represented as flexible patterns, position-specific scoring matrices, or profile hidden Markov models.

Motifs can be created for protein families , or sets of proteins whose members are evolutionarily related. Protein families can consist of many proteins that range from very similar to quite diverse. While the idea of a protein family is a fairly common concept, the method of selecting a protein family and defining its limits depends on the researcher who defines it. As in pairwise sequence comparison, there is a lower bound beyond which homology can't easily be detected. Motif-based methods can push this lower bound by detecting particularly subtle sequence patterns and distant homologs.

A sequence profile is a quantitative or qualitative method of describing a motif. A profile can be expressed in its most rudimentary form as a list of the amino acids occurring at each position in the motif. Early profile methods used simple profiles of this sort; however, modern profile methods usually weight amino acids according to their probability of being observed at each position.

Figure 8-5 shows a position-specific scoring matrix (PSSM), which is a matrix of scores representing a motif. Unlike a standard scoring matrix, the first dimension of the matrix is the length of the motif; the second dimension consists of the 20 amino acid possibilities. For each position in the matrix, there is a probability score for the occurrence of each amino acid. Most methods for developing position-specific scoring matrices normalize the raw probabilities with respect to a standard scoring matrix such as BLOSUM62.

PSSMs for sequence motifs common to zinc finger proteins

Figure 8-5. PSSMs for sequence motifs common to zinc finger proteins

Finally, a profile hidden Markov model (HMM) is the rigorous probabilistic formulation of a sequence profile. Profile HMMs contain the same probability information found in a PSSM; however, they can also account for gaps in the alignment, which tends to improve their sensitivity. Because profile analysis methods are still a subject of active research, there are many different programs and methods for motif discovery and profile building. We will focus on two of the easiest motif discovery packages to use, MEME and HMMer. We also describe the searchable databases of preconstructed protein family motifs—some with associated PSSMs or profile HMMs—offered by several organizations.

Motif Databases

We have seen that profiles and other consensus representations of sequence families can be used to search sequence databases. It shouldn't be too surprising, then, that there are motif databases that can be searched using individual sequences. Motif databases contain representations of conserved sequences shared by a sequence family. Today, their primary use is in annotation of unknown sequences: if you get a new gene sequence hot off the sequencer, scanning it against a motif database is a good first indicator of the function of the protein it encodes.

Motifs are generated by a variety of methods and with different objectives in mind. Some rely on automated analysis, but there is often a large amount of hands-on labor invested in the database by an expert curator. Because they store only those motifs that are present in reasonably large families, motif databases are small relative to GenBank, and they don't reflect the breadth of the protein structure or sequence databases. Be aware that an unsuccessful search against a motif database doesn't mean your sequence contains no detectable pattern; it could be part of a family that has not yet been curated or that doesn't meet the criteria of the particular pattern database you've searched. For proteins that do match defined families, a search against the pattern databases can yield a lot of homology information very quickly.

Blocks

Blocks, a service of the Fred Hutchinson Cancer Research Center, is an automatically generated database of ungapped multiple sequence alignments that correspond to the most conserved regions of proteins. Blocks is created using a combination of motif-detection methods, beginning with a step that exhaustively searches all spaced amino acid triplets in the sequence to discover a seed alignment, followed by a step that extends the alignment to find an aligned region of maximum length. The Blocks database itself contains more than 4,000 entries; it is extended to over 10,000 entries by inclusion of blocks created from entries in several other protein family databases (see Section 8.4.1.6). The Blocks server also provides several useful search services, including IMPALA (which uses the BLAST statistical model to compare a sequence against a library of profiles) and LAMA (Local Alignment of Multiple Alignments; Shmuel Pietrokovski's program for comparing an alignment of your own sequences against a database of Blocks).

PROSITE

PROSITE is an expert-curated database of patterns hosted by the Swiss Institute of Bioinformatics. It currently contains approximately 1,200 records, and is available for download as a structured flat file from http://ftp.expasy.ch. PROSITE uses a single consensus pattern to characterize each family of sequences. Patterns in PROSITE aren't developed based on automated analysis. Instead, they are carefully selected based on data published in the primary literature or on reviews describing the functionality of specific groups of proteins. A humorous cartoon on the PROSITE server indicates that the optimal method for identifying patterns requires only a human, chalk, and a chalkboard. PROSITE contains pattern information as well as position-specific scoring matrices that can detect new instances of the pattern.

Pfam

Pfam is a database of alignments of protein domain families. Pfam is made up of two databases: Pfam-A and Pfam-B. Pfam-A is a curated database of over 2,700 gapped profiles, most of which cover whole protein domains; Pfam-B entries are generated automatically by applying a clustering method to the sequences left over from the creation of Pfam-A. Pfam-A entries begin with a seed alignment, a multiple sequence alignment that the curators are confident is biologically meaningful and that may involve some manual editing. From each seed alignment, a profile hidden Markov model is constructed and used to search a nonredundant database of available protein sequences. A full alignment of the family is produced from the seed alignments and any new matches. This process can be iterated to produce more extensive families and detect remote matches. Pfam entries are annotated with information extracted from the scientific literature, and incorporate structural data where available. As a final note, Pfam is the database of profile HMMs used by the GeneWise genefinder to search for open reading frames.

PRINTS

PRINTS is a database of protein motifs similar to PROSITE, except that it uses "fingerprints" composed of more than one pattern to characterize an entire protein sequence. Motifs are often short relative to an entire protein sequence. In PRINTS, groups of motifs found in a sequence family can define a signature for that family.

COG

NCBI's Clusters of Orthologous Groups (COG) database is a different type of pattern database. COG is constructed by comparing all the protein sequences encoded in 21 complete genomes. Each cluster must consist of protein sequences from at least three separate genomes. The premise of COG is that proteins that are conserved across these genomes from many diverse organisms represent ancient functions that have been conserved throughout evolution. COG entries can be accessed by organism or by functional category from the NCBI web site. COG currently contains more than 2,100 entries.

Accessing multiple databases

So, which motif database should you use to analyze a new sequence? Because the comparisons are performed quickly and efficiently, we recommend you use as many as possible, keeping track of the best matches from each, their scores, and (if available) the significance of the hit. While Blocks uses InterPro as one of the sources for its own patterns, as of June 2000 it contains only ungapped patterns, omitting gapped profiles such as those contained in Pfam-A and PROSITE. Fortunately, all the motif databases discussed here have search interfaces available on the Web, most of which accept input in FASTA format or FASTA alignment format.

One service that allows integrated searching of many motif databases is the European Bioinformatics Institute's Integrated Resource of Protein Domains and Functional Sites (InterPro to its friends). InterPro allows you to compare a sequence against all the motifs from Pfam, PRINTS, ProDom, and PROSITE. InterPro motifs are annotated with the name of the source protein, examples of proteins in which the motif occurs, references to the literature, and related motifs.

Constructing and Using Your Own Profiles

Motif databases are useful if you're looking for protein families that are already well documented. However, if you think you've found a new motif you want to use to search GenBank, or you want to get creative and look for patterns in unusual places, you need to build your own profiles. Several software packages and servers are available for motif discovery , the process of finding and constructing your own motifs from a set of sequences. The simplest way to construct a motif is to find a well-conserved section out of a multiple sequence alignment. As usual, though, we encourage you to use automated approaches instead of doing things by hand: automation makes your work faster, more reproducible, and less error-prone. In addition to Block Maker, a number of other programs are commonly used to search for and discover motifs. In this section, we discuss the use of the MEME and HMMer programs, two packages commonly used for motif analysis.

Before we begin, though, here are two observations about motif discovery. First, as InterPro and Blocks grow, it is becoming increasingly difficult to find completely novel sequence motifs undocumented by one of their member databases. Be sure to check your motif against the set of known motifs, either by searching your sequences against the databases or by using a motif-comparison tool, such as the Blocks server's LAMA program. Second, in order to find patterns reliably and search with them, you need a lot of sequences. We have used these programs in projects where very few (5-10) sequences were available, but, as a rule of thumb, more than 20 sequences are needed for reasonable motif predictions. The more sequences you have, the more reliable the resulting motifs will be.

Finding new motifs with MEME

The MEME programs are a set of tools for motif analysis developed by Charles Elkan, Tim Bailey, and William Grundy of the University of California, San Diego. MEME is short for Multiple EM for Motif Elicitation (EM, in turn, is short for Expectation Maximization, a procedure from the world of statistics for predicting the values of "missing," or unobserved, values). They can be used over the Web (http://meme.sdsc.edu) or their C source code can be downloaded, compiled, and run on a local computer; here, we look at the web version. There are three programs in the MEME suite:

MEME

Discovers shared motifs in a set of unaligned sequences

MAST

Takes a motif discovered by MEME and uses it to search a sequence database

MetaMEME

Constructs a model from multiple MEME motifs and uses it to search a sequence database

When you submit a set of sequences to MEME, you are testing the hypothesis that, although though you don't know the overall alignment of the sequences, they share short regions of similarity. You begin using MEME by entering on a web form your email address and a set of sequences in which you wish to search for a motif. Sequences can be in one of several formats, although FASTA is preferred. At the bottom of the sumission page are some parameters you need to set regarding the number of times per sequence you expect a motif to occur, the number of motifs you expect to find, and the approximate width of each motif.

The results will be sent back to you in three emails. The first is just a confirmation message, letting you know that the job is being processed. The second (with the subject line "MEME Job xxxxx results:", where xxxxx is the job number assigned by the MEME server) contains MEME's prediction for the motifs in both human- and machine-readable form. This message is the one you need to search the database; be sure to save the contents of this message to a text file, so you can later submit it to MAST or MetaMEME. The third message (with the subject line "MEME job... MAST analysis:") is an HTML document (making it suitable for viewing in a web browser) that shows the location of each motif in the sequences you submitted. Each message is well documented and contains detailed explanations of the contents.

Searching for motifs with MAST and MetaMEME

The next step of a motif analysis is to see whether there are new occurrences of your motif in other sequences. The MEME server provides two distinct programs, MAST and MetaMEME, that allow you to search a sequence database using your new MEME motifs. MAST simply searches for occurrences of each motif and reports matching sequences, while MetaMEME combines multiple MEME motifs into a hidden Markov model and uses that model to search the database. Both MAST and MetaMEME take the MEME motif prediction from the second email[§] as input; MetaMEME also uses the original sequence file that generates the MEME motifs in creating its HMM. Both programs return results showing the position of each match, its score, and its statistical significance.

Motif discovery with other programs

As we mentioned previously, there are a number of programs that discover motifs in groups of unaligned sequences. Besides the ones we mentioned, you may want to try these: the SAM HMM programs developed by David Haussler and coworkers at University of California, Santa Cruz; the Emotif and Ematrix servers in the Brutlag group at Stanford University; and the ASSET, gibbs, and Probe tools available for download from NCBI. Again, a good thing to do early on is to use the LAMA program to compare your motif against the motifs in the Blocks database. If it looks like you really do have a novel motif, it can be useful to compare the results of one or more of these other motif discovery tools. If all the programs predict the same motif from the same sequences, you can be more confident in your results.

HMMer

HMMer is a software package for building profile HMMs. HMMer's central functionality is located in the hmmbuild program, which creates profile HMMs from sequence alignment, and the hmmcalibrate program, which calibrates search statistics for the HMM. The HMMer package also contains tools for generating new sequences probabilistically based on an HMM, searching sequence databases with a profile as the query, and searching profile databases with a query sequence, as well as the handy utility programs we list here:

getseq

Extracts a sequence from a large flat-file database by name. Handy to have around if you're selecting specific records out of a database from the command line.

hmmalign

Reads both a sequence file and a profile HMM and creates a multiple sequence alignment.

hmmbuild

Builds a profile HMM from a multiple sequence alignment. It can produce global results for the entire alignment or results for multiple local alignments.

hmmcalibrate

Reads an HMM and calibrates its search statistics.

hmmconvert

Converts an HMM into other profile formats, notably GCG profile format.

hmmemit

Generates sequences probabilistically based on a profile HMM. It can also generate a consensus sequence.

hmmfetch

Retrieves a profile HMM from a database if the name of the desired record is known.

hmmindex

Indexes a profile HMM database.

hmmpfam

Searches a profile HMM database (e.g., Pfam) with a query sequence. Use this if you're trying to annotate an unknown sequence.

hmmsearch

Searches a sequence database with a profile HMM. Use this if you're looking for more instances of a pattern in a sequence database.

sreformat

Converts a sequence or alignment file from one format to another. Handy to have around.

HMMer reads multiple sequence alignment files from several different sequence alignment programs, including ClustalW. The HMMer authors recommend ClustalW as a tool to generate multiple alignments for input into hmmbuild.

HMMer is available for download from Dr. Sean Eddy at Washington University (http://hmmer.wustl.edu). HMMer is a very well-behaved program, which installs without difficulty from source on Linux systems: just follow the directions in the INSTALL file. It even installs its own Unix manpages so you can access online help for each of the HMMer programs using the man command. Specific information about each of the HMMer programs' command-line options can also be viewed by running the program with the -h option.

Incorporating Motif Information into Pairwise Alignment

Multiple sequence information can optimize pairwise alignments. The BLAST package contains two new modes that use multiple alignment information to improve the specificity of database searches. These modes are accessed through the blastpgp program.

Position Specific Iterative BLAST (PSI-BLAST) is an enhancement of the original BLAST program that implements profiles to increase the specificity of database searches. Starting with a single sequence, PSI-BLAST searches a database for local alignments using gapped BLAST and builds a multiple alignment and a profile the length of the original query sequence. The profile is then used to search the protein database again, seeking local alignments. This procedure can be iterated any number of times. One caveat of using PSI-BLAST is that you need to know where to stop. Errors in alignment can be magnified by iteration, giving rise to false positives in the ultimate sequence search. PSI-BLAST can be used as a standalone by running the blastpgp program. However, the NCBI PSI-BLAST server is probably the optimal way to run a PSI-BLAST search. The server requires you to decide after each iteration whether to continue to another iteration, and you can hand-pick the sequences that contribute to the profile at each step.

Pattern Hit Initiated BLAST (PHI-BLAST) takes a sequence and a preselected pattern found in that sequence as input to query a protein sequence database. The pattern must be expressed in PROSITE syntax, which is described in detail on the PHI-BLAST server site. PHI-BLAST can also initiate a series of PSI-BLAST iterations, and can be a standalone program or a (vastly more user-friendly) web server.



[*] Installing Java support involves installing a Java Development Kit and a Java Runtime Environment. IBM has ported JDK and JRE 1.1.8 to Linux. They are available at the IBM site, http://www.ibm.com/java/jdk/index.html. You have to register to download the kits, but it's free. The kits are available as easy-to-install RPMs. You won't encounter a lot of Java applications for bioinformatics, but when you do, it's nice to be able to run them.

[†] For a thorough discussion of sequence logos and of information content in biological sequence data, you can download some very readable papers from Dr. Tom Schneider's web site at the National Cancer Institute: http://www-lecb.ncifcrf.gov/~toms/index.html.

[‡] Or, in other words, "It is futile to do with more what can be done with fewer."

[§] You did save the second email to a text file as we suggested, didn't you?

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset