Chapter 7. Sequence Analysis, Pairwise Alignment, and Database Searching

We now begin our tour of bioinformatics tools in earnest. In the next five chapters, we describe some of the software tools and applications you can expect to see in current research in computational biology. From gene sequences to the proteins they encode to the complicated biological networks they are involved in, computational methods are available to help you analyze data and formulate hypotheses. We have focused on commonly used software packages and packages we have used; to attempt to encompass every detail of every program out there, however, we'd need to turn every chapter in this book into a book of its own.

The first tools we describe are those that analyze protein and DNA sequence data. Sequence data is the most abundant type of biological data available electronically. While other databases may eventually rival them in size, the importance of sequence databases to biology remains central. Pairwise sequence comparison, which we discuss in this chapter, is the most essential technique in computational biology. It allows you to do everything from sequence-based database searching, to building evolutionary trees and identifying characteristic features of protein families, to creating homology models. But it's also the key to larger projects, limited only by your imagination—comparing genomes, exploring the sequence determinants of protein structure, connecting expression data to genomic information, and much more.

The types of analysis that you can do with sequence data are:

  • Knowledge-based single sequence analysis for sequence characteristics

  • Pairwise sequence comparison and sequence-based searching

  • Multiple sequence alignment

  • Sequence motif discovery in multiple alignments

  • Phylogenetic inference

We divide our coverage of sequence analysis tools into two chapters. This chapter focuses on programs that operate on single sequences, or compare gene or protein sequences against each other. Chapter 8 is devoted to multiple sequence alignment methods.

Pairwise sequence comparison is the primary means of linking biological function to the genome and of propagating known information from one genome to another. In this chapter, we discuss the techniques of biological sequence analysis and, most importantly, how to assess the significance of results from sequence comparison. There are also a number of software tools available for doing pairwise sequence comparison. Table 7-1 provides a summary.

Table 7-1. Sequence Analysis Tools and Techniques

What you do

Why you do it

What you use to do it

Gene finding

Identify possible coding regions in genomic DNA sequences

GENSCAN, GeneWise, PROCRUSTES, GRAIL

DNA feature detection

Locate splice sites, promoters, and sequences involved in regulation of gene expression

CBS Prediction Server

DNA translation and reverse translation

Convert a DNA sequence into protein sequence or vice versa

"Protein machine" server at EBI

Pairwise sequence alignment (local)

Locate short regions of homology in a pair of longer sequences

BLAST, FASTA

Pairwise sequence alignment (global)

Find the best full-length alignment between two sequences

ALIGN

Sequence database search by pairwise comparison

Find sequence matches that aren't recognized by a keyword search; find only matches that actually have some sequence homology

BLAST, FASTA, SSEARCH

Chemical Composition of Biomolecules

Sequence analysis techniques can be applied to DNA and RNA (nucleotide) sequences or to protein (amino-acid) sequences. To understand why DNA and protein sequences are informative, you need to know a bit about the chemistry of DNA and proteins. In the context of the sequence analysis applications we discuss in this chapter, it's perfectly fine to think of a DNA sequence as pure information. If you really want to, you can skip over the chemical structures and think of DNA as a string of letters. But keep this fact in mind: the single-letter sequence code that describes DNA and is a simplified representation of a 3D chemical entity, and in some cases the 3D structure of the DNA is really significant.

Proteins, at least at first glance, are more chemically complicated than DNA, and it's impossible to separate the information content of their sequences from the chemical properties of the amino acids they're built from. You can't safely forget about the chemistry of proteins when you're analyzing their sequences, so we'll discuss protein chemistry thoroughly at the beginning of Chapter 9, before we introduce techniques for protein structure analysis. As discussed in Chapter 2, DNA is the medium for storing information in cells, and it stores and transmits that information through the sequence of nucleotides that make up the DNA chain. DNA occurs as a "double helix"—two long sequences of nucleotides that are chemical mirror images of each other. This double-helical structure and the chemistry that forces a specific pattern of pairing between nucleotides in the two halves of the helix is what gives DNA the ability to replicate itself and faithfully pass its information from cell to cell and generation to generation. The chemistry of pairing between nucleotide chains also allows the DNA sequence to be transcribed into RNA and translated into proteins.

Composition of DNA and RNA

DNA and RNA are polymer chains composed of a small alphabet of chemically similar compounds. The individual units are called nucleotides. As you can see in Figure 7-1, nucleotides are made up of three distinct parts: a cyclic base, a cyclic sugar (deoxyribose or ribose, respectively), and a phosphate group. Base utilization is different in DNA than in RNA. The DNA code consists of patterns built up from the A (adenine), T (thymine), G (guanine), and C (cytosine) nucleotides, while the RNA code substitutes U (uracil) for T.

The "backbone" bits of DNA and RNA—ribose and deoxyribose phosphates

Figure 7-1. The "backbone" bits of DNA and RNA—ribose and deoxyribose phosphates

Figure 7-2 shows the five nucleotides, which are also referred to as bases. In hydrolyzed double-stranded DNA, there are always equal amounts of A and T nucleotides (A = T). The amounts of G and C in the solution are also always equal (G = C). This is called Chargaff's rule after the researcher who discovered the relationships between A and T, G and C. (Note that there can different amounts of A and T, G, and C; the ratio of A-T to G-C base pairs can vary widely from species to species.)

The five "bases" that commonly appear in DNA or RNA

Figure 7-2. The five "bases" that commonly appear in DNA or RNA

Watson and Crick Solve the Structure of DNA

The quantitative relationships between adenine and thymine, and cytosine and guanine led Watson and Crick to propose a structural model for DNA in 1953, and later Crick's central dogma of biology. Watson and Crick's model of DNA was based on several observations:

  • The x-ray crystallography experiments of their colleague Rosalind Franklin who observed a diffraction pattern from DNA that suggested a helical molecule with a regular repeating structure at a spacing of 3.4 angstroms

  • Chargaff's rules

  • Experimental evidence that the bases were connected by hydrogen bonds in the DNA molecule

  • The knowledge of the correct structural conformations of the bases from x-ray crystallography

What Watson and Crick did was to combine this disparate information to propose the double helix. The double helix of DNA, which has now been determined in atomic detail using x-ray crystallography, is a structure in which adenine pairs with thymine, and guanine with cytosine by hydrogen bonding (Figure 7-3). The hydrogen bonded base pairs form the core of the molecule.

Two common base pairs, A-T and G-C

Figure 7-3. Two common base pairs, A-T and G-C

As shown in Figure 7-4, the base pairs stack on top of and parallel to each other with a spacing of 3.4 angstroms. They are held together in sequence by covalent chemical bonds between the sugar group of one nucleotide and the phosphate group of the next. This chain has a directionality: the end left with an exposed phosphate group is called the 5' end, while the end with the exposed ribose group is the 3' end.

Schematic of the DNA chain

Figure 7-4. Schematic of the DNA chain

The specific chemical pairing of nucleotides in DNA and RNA sequences suggests a mechanism by which each strand of DNA can serve as the template for the synthesis of a complementary strand. The use of a similar nucleotide code in RNA suggests that DNA can also be used as a template for synthesis of RNA. From these two pieces of evidence, Crick proposed his central dogma: that DNA directs its own replication and its transcription into RNA and that RNA is translated into protein.

Development of DNA Sequencing Methods

If you just digest DNA into its four component bases and measure the quantity of each, it tells you nothing about the DNA sequence. Modern methods for DNA sequencing rely on controlled biochemical reactions that allow the base content at each position in the DNA sequence to be quantitated independently. The chemical cleavage method for sequencing DNA relies on the specificity of chemical reagents (reactive substances) to break DNA chains at four specific types of sites. There are reagents that break or cleave the chain specifically after G nucleotides and reagents that cleave specifically after C nucleotides. There are also reagents that cleave less specifically: one to cleave after A and G nucleotides and one to cleave after C and T nucleotides. The method Maxam and Gilbert designed was conceptually simple. Four samples of DNA are required for this method. One type of reagent is mixed with each sample in a quantity that causes each DNA chain in the sample to be broken only one time, on average, at a random location. One end of the DNA is radioactively labeled, and the other is not, so only one piece of each broken chain is radioactive after the chain is cleaved. DNA fragments of different sizes can be separated using an electric current to drive them through a viscous medium called a gel. The larger the fragment, the more it's slowed by the gel, so at the end of some period of time, different-sized radioactive pieces of DNA are spread out at regular intervals down the gel. Figure 7-5 shows a partial autoradiogram of a DNA sequencing gel. Each set of four closely spaced lanes represents an individual sequencing experiment. The gel is read from bottom to top. Each band on the gel identifies the nucleotide present at the position in the sequence, depending on which of the four lanes it appears in. (Image courtesy of Dr. Dennis Dean, Virginia Tech.) If each DNA chain is broken once after a random A, C, G, or T, a uniform distribution of fragments that map the entire sequence of the DNA is created. Depending on which sample the radioactive piece is from, the last base in its sequence is known, and the sequence can be read off the gel from end to end.

DNA sequencing gel

Figure 7-5. DNA sequencing gel

Sanger's chain-terminator procedure is the most commonly used sequencing chemistry in modern laboratories. This procedure takes advantage of an enzyme called DNA polymerase, which builds a complementary strand of DNA for an existing single strand. In Sanger's method, the DNA polymerase reaction is carried out in the presence of specific analogues of nucleotides that, when they are incorporated, cause the synthesis of the complementary strand to stop. Four samples are prepared, each containing a small amount of one type of chain terminator. Analogously to the Maxam and Gilbert method, a uniform distribution of DNA fragments is generated, each with a known end residue. The fragments are analyzed based on the strength of this fluorescence signal, giving the sequence of the complementary strand to the original DNA.

The chain termination method is easily automated, and computer-compatible sequencing systems that use this method are readily available. Most genome sequence data is currently generated using this method, though new sequencing methods that don't involve chain cleavage or chain termination are in development. We discuss the process of sequencing data analysis and genome assembly further in Chapter 11.

The Chemical Composition of Proteins

Unlike DNA, protein polymers consist of a common set of building blocks called amino acids. There are 20 amino acids that make up the standard chemical alphabet used to build proteins. Amino acids are small molecules that share a common motif, of three substituent chemical groups arranged around a central carbon atom. One of the substituent groups is always an amino group; another is always carboxylic acid group. To form the protein polymer, the amino and carboxyl groups react with each other and form a bond called the peptide bond. The third substituent on the central carbon of an amino acid is variable, and it's this property that makes the amino acids into a code for storing information. The sequence of amino acids in a protein is referred to as the protein's primary structure. Protein sequence can be subjected to the same analyses (described later) for DNA sequence. As we describe sequence analysis methods, we will point out ways in which these methods differ for proteins and DNA.

Mechanisms of Molecular Evolution

The discovery of DNA as the molecular basis of heredity and evolution made it possible to understand the process of evolution in a whole new way. Darwin's theory of evolution by natural selection describes the observable process of evolution and speciation. However, it doesn't explain how information is passed from generation to generation, nor does it explain the mechanisms that give rise to, or that limit, variation within each generation.

The two halves of the double-helical DNA molecule serve as a template for replication of the DNA molecule. Even though the molecular rules governing replication of DNA are specific, replication doesn't always occur with perfect fidelity. When a piece of DNA is replicated incorrectly and the error is not corrected by the cell's repair machinery, it's called a mutation.

Mutations can occur in any part of an organism's DNA: in the middle of genes that code for proteins or functional RNA molecules, in the middle of regulatory sequences that govern when a gene is turned on, or out in the "middle of nowhere", in the regions between gene sequences. Mutations can have dramatic effects on the organism's phenotype (its visible or measurable characteristics) or they can have no apparent effect. Over time—thousands or millions of years—mutations that are beneficial or at least not harmful to a species can become fixed in the population, meaning that the mutated form of the gene occurs with a certain frequency among all individuals of a particular species. Over longer time scales, enough mutations may accumulate that new species develop.

There are two classes of mutations: point mutations, in which a change affects a single nucleotide in the DNA sequence; and segmental mutations, which can affect anywhere from a few to many hundreds of adjacent nucleotides.

Point mutations usually result from a single mismatch, in which one nucleotide is mispaired with the template DNA as a new complementary DNA strand is being built. Point mutations become significant only if they occur in the middle of a coding region or signal sequence, and then only if they cause a change in functionality. In coding regions, point mutations can either be synonymous, meaning that the mutated strand codes for the same amino acid as it did before the mutation occurred, or nonsynonymous. The genetic code (which was shown back in Figure 2-3) is degenerate; that is, several different three-letter combinations code for each amino acid. The groups of codons which code for each amino acid are by no means random; instead, nature has arranged a fail-safe mechanism in which several codons that differ by only one nucleotide represent a single amino acid, thereby allowing a little room for synonymous replication errors in DNA.

Segmental mutations, which can result in insertion or deletion of long stretches of DNA, can occur by many different mechanisms, all of which involve mismatching of a strand of DNA either with the wrong partner or with a part of itself. Segmental mutations can result in duplications of whole genes or even large regions of chromosomes; some genetic events can even result in the duplication of entire genomes. Generated by gene and chromosome duplication, redundant copies of genes can be repurposed (through a slow process of mutational trial and error) to perform new functions in the cell. A detailed discussion of these mechanisms is given in the excellent book Fundamentals of Molecular Evolution; see the Bibliography.

Both types of mutation leave traces in the evolutionary record, that is, in the DNA sequences of living things. Since mutations tend to be preserved only if they are functionally useful (or at least, not harmful), there is a tendency for functionally important parts of sequences to be conserved (to remain constant throughout the evolutionary process) while noncoding or nonfunctional sequences diverge wildly. This tendency to conserve functionally important sequences is the basis for the whole field of sequence analysis; it lets us draw evolutionary connections between genes that are related in sequence.

By comparative study of DNA sequences, and on a larger scale, of whole genomes, it's possible to develop quantitative methods for understanding when and how mutational events occurred, as well as how and why they were preserved to survive in existing species and populations. Genomics and bioinformatics—the production of genome data and the development of tools for analyzing it—have made it possible to examine the evolutionary record and make increasingly quantitative statements about the evolutionary relationship of one species to another. Taxonomies can begin to be based not merely on anatomy but on quantitative measurements of differences in the genetic code. Both point mutations and segmental mutations are explicitly modeled in the scoring schemes for comparison of protein and DNA sequences discussed later in this chapter. Changes in the identity of the residue (nucleotide or amino acid) at a given position in the sequence are scored using standard substitution scores (for example, a positive score for a match and a negative score for a mismatch) or substitution matrices. Insertions and deletions are scored with penalties for gap opening and gap extension.

Genefinders and Feature Detection in DNA

Once a large chunk of DNA has been mapped and sequenced, the task of understanding its function begins. In this section, we describe some programs that search the sequence for genes and other biologically important features. A feature is a sequence pattern with some functional significance, such as start and stop codons, splice sites (in the case of eukaryotes), and sequences that are bound by proteins in order to regulate gene expression. Some features can be found by searching for a specific sequence, such as the restriction site cleaved by a given restriction enzyme. Others, such as promoters and genes, aren't so easy to pick out. Analysis of single DNA sequences in search of sequence features is a rapidly growing research area in bioinformatics.

There are two reasons that genefinding and feature detection are such notoriously difficult problems. First, there are a huge number of protein-DNA interactions, many of which have not yet been experimentally characterized, and some of which differ from organism to organism. More importantly, we don't always know what constitutes a binding sequence. Current promoter detection algorithms yield about 20-40 false positives for each real promoter identified. Some proteins bind to specific sequences; others are more flexible in their preference for attachment sites. To complicate matters further, a protein can bind in one part of a chromosome but affect a completely different region hundreds or thousands of base pairs away.

Predicting Gene Locations

Genefinders are programs that identify (or try to, anyway) all the open reading frames in unannotated DNA. They use a variety of approaches to locate genes, but the most successful combine content-based and pattern-recognition approaches. Content-based methods for gene prediction take advantage of the fact that the distribution of nucleotides in genes is different than in non-genes. The GRAIL family of programs developed at Oak Ridge National Laboratories uses a neural network to combine evidence from seven different statistical measures of DNA content (frame bias, periodicities, fractal dimension, coding 6-tuples, in-frame 6-tuples, k-tuple commonality, and repetitive 6-tuple words); subsequent versions measure additional features to better exploit these different types of data. At each position in the DNA sequence, the program weighs each type of information, integrates them, and comes up with a score that represents the likelihood that the region in question is in an ORF or an intergenic region. Pattern-recognition methods look for characteristic sequences associated with genes (start and stop codons, promoters, splice sites) to infer the presence and structure of a gene.

In isolation, each method goes only so far. You have a similar rate of success if you try to identify human faces by looking for either a characteristic skin texture (content) or the presence of a moustache (pattern), but not both. Not surprisingly, the current generation of genefinders combine both methods with additional knowledge, such as gene structure or sequences of other, known genes.

Some genefinders are accessible only though web interfaces, making the interaction very straightforward: the sequence that needs to be examined for genes is submitted to the program, it is processed, and the output is returned. On one hand, this eliminates the need for installation and maintenance of the genefinder on your system, and it provides a relatively uniform interface for the different programs. On the other, if you plan to rely on the results of a genefinder, you should take the time to understand underlying algorithm, find out if the model is specific for a given species or family, and, in the case of content-based models, know which sequences they are. The accuracy of a genefinder can be misleadingly high if it is trained on the same sequence with which you test it.

Some commonly used programs in gene finding include Oak Ridge National Labs' GRAIL, GENSCAN (developed by Chris Burge, now at MIT, and Samuel Karlin at Stanford), PROCRUSTES (developed by Pavel Pevzner and coworkers), and GeneWise (developed by Ewan Birney and Richard Durbin). GRAIL combines evidence from a variety of signal and content information using a neural network. GENSCAN combines information about content statistics with a probabilistic model of gene structure. PROCRUSTES and GeneWise find open reading frames by translating the DNA sequence and comparing the resulting protein sequence with known protein sequences. PROCRUSTES compares potential ORFs with close homologs, while GeneWise compares the gene against a single sequence or a model of an entire protein family.

Feature Detection

In addition to their role in genefinder systems, feature-detection algorithms can be used on their own to find patterns in DNA sequences. Frequently, these tools help interpret freshly sequenced DNA or choose targets for designing PCR primers or microarray oligomers. Some starting places for tools like these include the Center for Biological Sequence Analysis at the Technical University of Denmark (which has several web-based applications for finding intron-exon splice sites and transcription start sites in eukaryotic DNA), the CodeHop server at the Fred Hutchinson Cancer Research Center (which predicts PCR primers based on conserved protein sequences), and the Tools collection at the European Bioinformatics Institute.

In addition to these special-purpose tools, another popular approach is to use motif discovery programs that automatically find common patterns in sequences. We will examine these programs in greater detail when we look at multiple sequence analysis methods.

DNA Translation

Before a protein can be synthesized, its sequence must be translated from the DNA. Translation of DNA sequence into protein sequence isn't conceptually or computationally difficult. All that is required is the DNA sequence, a genetic code, and a program that reads in one type of sequence and outputs the other.

Any DNA sequence can be translated in six possible ways. The sequence can be translated backward and forward. Because each amino acid in a protein is specified by three bases in the DNA sequence, there are three possible translations of any DNA sequence in each direction: one beginning with the very first character in the sequence, one beginning with the second character, and one beginning with the third character.

Figure 7-6 shows "back-translation" of a protein sequence (shown on the top line) into DNA, using the bacterial and plant plastid genetic code. As you can see, back-translation of a protein sequence into DNA isn't unique. Each amino acid in the short sequence shown can be represented by as many as six codons, and the possible codons can be combined in many ways to produce not one, but hundreds of possible coding sequences, even for a short peptide. However, note that nature has grouped the codons "sensibly": alanine (A) is always specified by a "G-C-X" codon, arginine (R) is specified either by a "C-G-X" codon or an "A-G-pyrimidine" codon, etc. This reduces the number of potential sequences that have to be checked if you (for example) try to write a program to compare a protein sequence to a DNA sequence database.[*]

Back-translation from a protein sequence

Figure 7-6. Back-translation from a protein sequence

There are no markers in the DNA sequence to indicate where one codon ends and the next one begins. Consequently, unless the location of the start codon is known ahead of time, a double-stranded DNA sequence can be interpreted in any of six ways: an open reading frame can start at nucleotide i, at i+1, or at i+2 on either the observed or complementary strand. To account for this uncertainty, when a protein is compared with a set of DNA sequences, the DNA sequences are translated into all six possible amino acid sequences, and the protein query sequence is compared with these resulting conceptual translations. This exhaustive translation is called a "six-frame translation" and is illustrated in Figure 7-7.

A DNA sequence and its translation in three of six possible reading frames

Figure 7-7. A DNA sequence and its translation in three of six possible reading frames

Because of the large number of codon possibilities for some amino acids, back-translation of a protein into DNA sequence can result in an extremely large number of possible sequences. However, codon usage statistics for different species are available and can be used to suggest the most likely back-translation out of the range of possibilities.

BLAST and FASTA dynamically translate query and database sequences so you don't need to worry about translating a database before you do a sequence comparison. However, in the event that you need to produce a six-frame translation of a single DNA sequence or translate a protein back into a set of possible DNA sequences, and you don't want to script it yourself, the Protein Machine server (http://www.ebi.ac.uk/translate/) at the European Bioinformatics Institute (EBI) will do it for you.

Pairwise Sequence Comparison

Comparison of protein and DNA sequences is one of the foundations of bioinformatics. Our ability to perform rapid automated comparisons of sequences facilitates everything from assignment of function to a new sequence, to prediction and construction of model protein structures, to design and analysis of gene expression experiments. As biological sequence data has accumulated, it has become apparent that nature is conservative. A new biochemistry isn't created for each new species, and new functionality isn't created by the sudden appearance of whole new genes. Instead, incremental modifications give rise to genetic diversity and novel function. With this premise in mind, detection of similarity between sequences allows you to transfer information about one sequence to other similar sequences with reasonable, though not always total, confidence.

Before you can make comparative statements about nucleic acid or protein sequences, a sequence alignment is needed. The basic concept of selecting an optimal sequence alignment is simple. The two sequences are matched up in an arbitrary way. The quality of the match is scored. Then one sequence is moved with respect to the other and the match is scored again, until the best-scoring alignment is found.

What sounds simple in principle isn't at all simple in practice. Choosing a good alignment by eye is possible, but life is too short to do it more than once or twice. An automated method for finding the optimal alignment out of the thousands of alternatives is clearly the right approach, but in order for the method to be consistent and biologically meaningful, several questions must be answered. How should alignments be scored? A scoring scheme can be as simple as +1 for a match and -1 for a mismatch, but what is the best scoring scheme for the data? Should gaps be allowed to open in the sequences to facilitate better matches elsewhere? If gaps are allowed, how should they be scored? Given the correct scoring parameters, what is the best algorithm for finding the optimal alignment of two sequences? And when an alignment is produced, is it necessarily significant? Can an alignment of similar quality be produced for two random sequences? Through the rest of this section, we consider each of these questions in greater detail.

Figure 7-8 shows examples of three kinds of alignment. These are three pairwise sequence alignments generated using a program called ALIGN. In each alignment, the sequences being compared are displayed, one above the other, such that matching residues are aligned. Identical matches are indicated with a colon (:) between the matching residues, while similarities are indicated with a single dot (.). Information about the alignment is presented at the top, including percent identity (the number of identical matches divided by the length of the alignment) and score. Finally, gaps in one sequence relative to another are represented by dashes (-) for each position in that sequence occupied by a gap.

Three alignments: high scoring, low scoring but meaningful, and random

Figure 7-8. Three alignments: high scoring, low scoring but meaningful, and random

The first alignment is a high-scoring one: it shows a comparison of two closely related proteins (two hemoglobin molecules, one from a sea lamprey and one from a hagfish). Compare that alignment with the second, a comparison of two distantly related proteins (again, two hemoglobin molecules, in this case taken from lamprey and rice). Cursory inspection shows fewer identical residues are shared by the sequences in the low-scoring alignment than in the high-scoring one. Still, there are several similarities or conservative changes—changes in which one amino acid has been replaced by another, chemically similar residue. The third alignment is a random alignment, a comparison between two unrelated sequences (the lamprey hemoglobin and a human retinol binding protein). Notice that, in addition to the few identities and conservative mutations between the two, large gaps have been opened in both sequences to achieve this alignment. Gene families aren't likely to evolve in this way, and given the lack of similarity between the sequences, you can conclude that these proteins are unrelated.

In describing sequence comparisons, several different terms are commonly used. Sequence identity, sequence similarity, and sequence homology are the most important of these terms. Each means something slightly different, though they are often casually used interchangeably.

Sequence identity refers to the occurrence of exactly the same nucleic acid or amino acid in the same position in two aligned sequences. Sequence similarity is meaningful only when possible substitutions are scored according to the probability with which they occur. In protein sequences, amino acids of similar chemical properties are found to substitute for each other much more readily than dissimilar amino acids. These propensities are represented in scoring matrices that score sequence alignments. Two amino acids are considered similar if one can be substituted for another with a positive log odds score from a scoring matrix (described in the next section).

Sequence homology is a more general term that indicates evolutionary relatedness among sequences. It is common to speak of a percentage of sequence homology when comparing two sequences, although that percentage may indicate a mixture of identical and similar sites. Finally, sequence homology refers to the evolutionary relatedness between sequences. Two sequences are said to be homologous if they are both derived from a common ancestral sequence. The terms similarity and homology are often used interchangeably to describe sequences, but, strictly speaking, they mean different things. Similarity refers to the presence of identical and similar sites in the two sequences, while homology reflects a stronger claim that the two sequences share a common ancestor.

Scoring Matrices

What you really want to learn when evaluating a sequence alignment is whether a given alignment is random, or meaningful. If the alignment is meaningful, you want to gauge just how meaningful it is. You attempt to do this by constructing a scoring matrix.

A scoring matrix is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment. The values in a scoring matrix are logarithms of ratios of two probabilities. One is the probability of random occurrence of an amino acid in a sequence alignment. This value is simply the product of the independent frequencies of occurrence of each of the amino acids. The other is the probability of meaningful occurrence of a pair of residues in a sequence alignment. These probabilities are derived from samples of actual sequence alignments that are known to be valid.

In order to score an alignment, the alignment program needs to know if it is more likely that a given amino acid pair has occurred randomly, or that it has occurred as a result of an evolutionary event. The logarithm of the ratio of the probability of meaningful occurrence to the probability of random occurrence is positive if the probability of meaningful occurrence is greater, and negative if the probability of random occurrence is greater. Because the scores are logarithms of probability ratios, they can be meaningfully added to give a score for the entire sequence. The more positive the score, the more likely the alignment is to be significant.

Figure 7-9 shows an example of a BLOSUM45 matrix, a popular substitution matrix for amino acids.

The BLOSUM45 matrix, a popular substitution matrix for amino acids

Figure 7-9. The BLOSUM45 matrix, a popular substitution matrix for amino acids

Substitution matrices for amino acids are complicated because they reflect the chemical nature and frequency of occurrence of the amino acids. For example, in the BLOSUM matrix, glutamic acid (E) has a positive score for substitution with aspartic acid (D) and also with glutamine (Q). Both these substitutions are chemically conservative. Aspartic acid has a sidechain that is chemically similar to glutamic acid, though one methyl group shorter. On the other hand, glutamine is similar in size and chemistry to glutamic acid, but it is neutral while glutamic acid is negatively charged. Substitution scores for glutamic acid with residues such as isoleucine (I) and leucine (L) are negative. These residues have neutral, nonpolar sidechains and are chemically different from glutamic acid. The scores on the diagonal of the matrix reflect the frequency of occurrence of each amino acid. For example, with a positive score of 15, it is extremely unlikely that the alignment of a rare tryptophan (W) with another tryptophan is coincidence, while the more common serine (S) has a positive score of only 4 for a match with another serine. It's important to remember that these scores are logarithms, which means that a match of two serines is far from being mere coincidence.

Substitution matrices for bases in DNA or RNA sequence are very simple. By default, the sequence alignment program BLAST uses the scheme of assigning a standard reward for a match and a standard penalty for a mismatch, with no regard to overall frequencies of bases. In most cases, it is reasonable to assume that A:T and G:C occur in roughly equal proportions.

Commonly used substitution matrices include the BLOSUM and PAM matrices. When using BLAST, you need to select a scoring matrix. Most automated servers select a default matrix for you (usually something like BLOSUM62), and if you're just doing a quick sequence search, it's fine to accept the default.

BLOSUM matrices are derived from the Blocks database, a set of ungapped alignments of sequence regions from families of related proteins. A clustering approach sorts the sequences in each block into closely related groups, and the frequencies of substitutions between these within a family derives the probability of a meaningful substitution. The numerical value (e.g., 62) associated with a BLOSUM matrix represents the cutoff value for the clustering step. A value of 62 indicates that sequences were put into the same cluster if they were more than 62% identical. By allowing more diverse sequences to be included in each cluster, lower cutoff values represent longer evolutionary time scales, so matrices with low cutoff values are appropriate for seeking more distant relationships. BLOSUM62 is the standard matrix for ungapped alignments, while BLOSUM50 is more commonly used when generating alignments with gaps.

Point accepted mutation (PAM) matrices are scaled according to a model of evolutionary distance from alignments of closely related sequences. One PAM "unit" is equivalent to an average change in 1% of all amino acid positions. The most commonly used PAM matrix is PAM250. However, comparison of results using PAM and BLOSUM matrices suggest that BLOSUM matrices are better at detecting biologically significant similarities.

Gap Penalties

DNA sequences change not only by point mutation, but by insertion and deletion of residues as well. Consequently, it is often necessary to introduce gaps into one or both of the sequences being aligned to produce a meaningful alignment between them. Most algorithms use a gap penalty to represent the validity of adding a gap in an alignment.

The addition of a gap has to be costly enough, in terms of the overall alignment score, that gaps will open only where they are really needed and not all over the sequence. Most sequence alignment models use affine gap penalties, in which the cost of opening a gap in a sequence is different from the cost of extending a gap that has already been started. Of these two penalties—-the gap opening penalty and the gap extension penalty—-the gap opening penalties tend to be much higher than the associated extension penalty. This tendency reflects the tendency for insertions and deletions to occur over several residues at a time.

Gap penalties are intimately tied to the scoring matrix that aligns the sequences: the best pair of gap opening and extension penalties for one scoring matrix doesn't necessarily work with another. Scores of -11 for gap opening and -1 for gap extension are commonly used in conjunction with the BLOSUM 62 matrix for gapped-BLAST, while BLOSUM50 uses a -12/-1 penalty.

Dynamic Programming

Dynamic programming methods are a general class of algorithms that are often seen both in sequence alignment and other computational problems. They were first described in the 1950s by Richard Bellman of Princeton University as a general optimization technique. Dynamic programming seems to have been introduced[†] to biological sequence comparison by Saul Needleman and Christian Wunsch, who apparently were unaware of the similarity between their method and Bellman's.

As we mentioned, dynamic programming algorithms solve optimization problems, problems in which there are a large number of possible solutions, but only one (or a small number of ) best solutions. A dynamic programming algorithm finds the best solution by first breaking the original problem into smaller subproblems and then solving. These pieces of the larger problem have a sequential dependency; that is, the fourth piece can be solved only with the answer to the third piece, the third can be solved only with the answer to the second, and so on. Dynamic programming works by first solving all these subproblems, storing each intermediate solution in a table along with a score, and finally choosing the sequence of solutions that yields the highest score. The goal of the dynamic programming algorithm is to maximize the total score for the alignment. In order to do this, the number of high-scoring residue pairs must be maximized and the number of gaps and low-scoring pairs must be minimized.

In sequence comparison, the overall problem is finding the best alignment between two sequences. This problem is broken down into subproblems of aligning each residue from one sequence with each residue from the other. The solution is a decision as to whether the residues should be aligned with each other, a gap should be introduced in the first sequence, or a gap should be introduced in the second sequence. Each high-scoring choice rules out the other two low-scoring possibilities, so that if information about the accumulated scores is stored at each step, every possible alignment need not be evaluated.

The algorithm uses an (m x n) matrix of scores (illustrated in Figure 7-10) in which m and n are the lengths of the sequences being aligned. Starting with the alignment of a gap against itself (which is given the initial score zero), the algorithm fills in the matrix one row at a time. At each position in the matrix, the algorithm computes the scores that result for each of its three choices, selects the one that yields the highest score, then stores a pointer at the current position to the preceding position that was used to arrive at the high score. When every position in the matrix has been filled in, a traceback step is performed, and the highest-scoring path along the pointers is followed back to the beginning of the alignment.

A matrix of scores comparing two sequences; continuous high-scoring matches are highlighted

Figure 7-10. A matrix of scores comparing two sequences; continuous high-scoring matches are highlighted

Global Alignment

One alignment scenario you may encounter is the alignment of two sequences along their whole length. The algorithm for alignment of whole sequences is called the Needleman-Wunsch algorithm. In this scenario, an optimal alignment is built up from high-scoring alignments of subsequences, stepping through the matrix from top left to bottom right. Only the best-scoring path can be traced through the matrix, resulting in an optimal alignment.

Using ALIGN to produce a global sequence alignment

Now that we have seen the theory behind the global alignment of two sequences, let's examine a program that implements this algorithm. ALIGN is a simple utility for computing global alignments. It is part of the FASTA software distribution, described later in this chapter. The programs in the FASTA distribution are easily run from the Unix command line, although many of them have been incorporated into the SDSC Biology Workbench web-based sequence analysis software, if you prefer to access them through a point-and-click interface. The FASTA programs compile easily under Linux; however, once they are compiled, you need to link them into your /usr/local/bin directory or some other sensible location by hand.

To run ALIGN and any of the other FASTA programs, you need sequence data in FASTA format. This is one of the most frequently used sequence formats and probably the simplest. To use ALIGN, each of the sequences you are aligning should be in a separate file.

A sequence in FASTA format[‡] looks like this:

>2HHB:A HEMOGLOBIN (DEOXY) - CHAIN A 
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR 

The FASTA format is very flexible, and it is one of the most commonly used formats for sequence analysis programs. A FASTA file contains one or more records in FASTA format, separated by empty lines. Each record consists of a human-readable comment followed by a nucleotide or protein sequence. The comment appears in the first line of the record; it must begin with a greater-than (>) symbol followed by one or more identifiers for the sequence. The comment may contain information about the molecule represented by the sequence, such as the protein or gene name and source organism. In the previous example, the identifier is a PDB code (2HHB), followed by a description of the sequence (the A chain of a deoxyhemoglobin protein). The remainder of the record contains the sequence itself, divided into separate lines by line breaks. Lines are usually 60 characters long, but the format doesn't specify a line length. Programs that take FASTA data as input (such as ALIGN) usually make allowances for FASTA's free-form nature. Still, it's a good practice to check the program's documentation to make sure that your data is appropriately formatted.

To use ALIGN, simply enter align at the command prompt. You are then prompted for sequence filenames. Results are sent to standard output. The ASCII format for pairwise alignments that is produced by FASTA is still commonly used, although there is a trend toward use of more easily parsed alignment formats:

Output scoring matrix: BLOSUM50, gap penalties: -12/-2 43.2% identity;          
Global alignment score: 374

                10        20        30        40              50    
2HHB_A V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
       : :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::      :.
2HHB:B VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP             

               10          20        30        40        50        
            60        70        80        90       100       110
2HHB_A QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
       .::.:::::  :.....::.:.. .....::.::  ::.::: ::.::.. :. .:: :.
2HHB:B KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF
       60        70        80        90       100       110        
           120       130       140
2HHB_A PAEFTPAVHASLDKFLASVSTVLTSKYR
         :::: :.:. .: .:.:...:. ::.
2HHB:B GKEFTPPVQAAYQKVVAGVANALAHKYH
      120       130       140

The FASTA distribution contains a sample HTML form and CGI script for use with the program LALIGN, a pairwise local alignment program. The script can be modified to work with the ALIGN program if a web-based interface is desired.

Local Alignment

The most commonly used sequence alignment tools rely on a strategy called local alignment. The global alignment strategy discussed earlier assumes that the two sequences to be aligned are known and are to be aligned over their full length. In the scenarios that are encountered most often with sequence alignment, however, you are either searching with one sequence against a sequence database looking for unknown sequences, or searching a very long DNA sequence, such as part of a genome, for partial segments that match a query sequence. In protein or gene sequences that do have some evolutionary relatedness, but which have diverged significantly from each other, short homologous segments may be all the evidence of sequence homology that remains.

The version of the dynamic programming algorithm that performs local alignment of two sequences is known as the Smith-Waterman algorithm. Named for its inventors, Dr. Temple Smith and Dr. Michael Waterman, this algorithm is similar to the Needleman-Wunsch algorithm except that an additional choice is allowed when tracing through the matrix. A local alignment isn't required to extend from beginning to end of the two sequences being aligned. If the cumulative score up to some point in the sequence is negative, the alignment can be abandoned and a new alignment started. The alignment can also end anywhere in the matrix.

Tools for local alignment

One of the most frequently reported implementations of the Smith-Waterman algorithm for database searching is the program SSEARCH, which is part of the FASTA distribution described later. LALIGN, also part of the FASTA package, is an implementation of the Smith-Waterman algorithm for aligning two sequences.

Sequence Queries Against Biological Databases

A common application of sequence alignment is searching a database for sequences that are similar to a query sequence. In these searches, an alignment of a sequence hundreds or thousands of residues long is matched against a database of at least tens of thousands of comparably sized sequences. Using dynamic programming-based methods, this isn't very practical unless special-purpose hardware is available. Instead, for routine searches, special heuristic database-searching methods are used. Heuristic methods exploit knowledge about sequences and alignment statistics to make these large-scale searches efficient and practical. While they don't guarantee optimal alignments, they make sensitive searches of large sequence databases possible. In this section, we describe BLAST and FASTA, two commonly used database-searching programs.

Local Alignment-Based Searching Using BLAST

By far, the most popular tool for searching sequence databases is a program called BLAST (Basic Local Alignment Search Tool). BLAST is the algorithm at the core of most online sequence search servers.[§] It performs pairwise comparisons of sequences, seeking regions of local similarity, rather than optimal global alignments between whole sequences.

BLAST can perform hundreds or even thousands of sequence comparisons in a matter of minutes. And in less than a few hours, a query sequence can be compared to an entire database to find all similar sequences. BLAST is so popular for this purpose that it's become a verb in the computational biology community, as in "I BLASTed this sequence against GenBank and came up with three matches."

The BLAST algorithm

Local sequence alignment searching using a standard Smith-Waterman algorithm is a fairly slow process. The BLAST algorithm, which speeds up local sequence alignment, has three basic steps. First, it creates a list of all short sequences (called words in BLAST terminology) that score above a threshold value when aligned with the query sequence. Next, the sequence database is searched for occurrences of these words. Because the word length is so short (3 residues for proteins, 11 residues for nucleic acids), it's possible to search a precomputed table of all words and their positions in the sequences for improved speed. These matching words are then extended into ungapped local alignments between the query sequence and the sequence from the database. Extensions are continued until the score of the alignment drops below a threshold. The top-scoring alignments in a sequence, or maximal-scoring segment pairs (MSPs), are combined where possible into local alignments. Originally, BLAST searched only for ungapped alignments. However, new additions to the BLAST software package that do search for gapped alignments have since been introduced.

NCBI BLAST and WU-BLAST

There are two implementations of the BLAST algorithm: NCBI BLAST and WU-BLAST. Both implementations can be used as web services and as downloadable software packages. NCBI BLAST is available from the National Center for Biotechnology Information (NCBI), while WU-BLAST is an alternate version that grew out of NCBI BLAST 1.4 and is developed and maintained by Dr. Warren Gish and coworkers at Washington University.

NCBI BLAST is the more commonly used of the two. The most recent versions of this program have focused on the development of methods for comparing multiple-sequence profiles (see Chapter 8). WU-BLAST, on the other hand, has developed a different system for handling gaps as well as a number of features (such as filtering for repeats) that are useful for searching genome sequences. Consequently, TIGR, Stanford's yeast genome server, Berkeley's Drosophila genome project, and others use WU-BLAST 2.0 as the sequence-comparison tool for searching their genome sequence data via the Web. As of this writing, WU-BLAST 2.0, the most recent version of the software, is copyrighted. NCBI BLAST and WU-BLAST's previous version, 1.4, are both in the public domain and freely available to all researchers. Because of its ubiquity we focus on NCBI BLAST in the following section, but WU-BLAST is an alternative. For more information on these flavors of BLAST see the NCBI web site at http://www.ncbi.nlm.nih.gov/BLAST, or the WU-BLAST web site at http://blast.wustl.edu.

What do the various BLAST programs do?

Frequent users of BLAST can also download and install BLAST binaries on their own machines. BLAST installs easily on a Linux system. Simply create a new directory (e.g., /usr/local/blast), move the archive into it, and extract. Here are the four main executable programs in the BLAST distribution:

[blastall]

Performs BLAST searches using one of five BLAST programs: blastp, blastn, blastx, tblastn, or tblastx

[blastpgp]

Performs searches in PSI-BLAST or PHI-BLAST mode

[bl2seq]

Performs a local alignment of two sequences

[formatdb]

Converts a FASTA-format flat file sequence database into a BLAST database

blastall encompasses all the major options for ungapped and gapped BLAST searches. A full list of its command-line arguments can be displayed with the command blastall - :

[-p]

Program name. Its options include:

blastp

Protein sequence (PS) query versus PS database

blastn

Nucleic acid sequence (NS) query versus NS database

blastx

NS query translated in all six reading frames versus PS database

tblastn

PS query versus NS database dynamically translated in all six reading frames

tblastx

Translated NS query versus translated NS database—computationally intensive

[-d]

Database name. Each indexed database consists of several files; the name is the common portion of those filenames.

[-i]

Query sequence filename. Defaults to standard input if not specified.

[-e]

Expectation value cutoff. (See Section 7.8.1.5.)

[-m]

Alignment view. Its options include:

0

Pairwise

1

Master-slave, show identities

2

Master-slave, no identities

3

Flat master-slave, show identities

4

Flat master-slave, no identities

5

Master-slave, no identities, blunt ends

6

Flat master-slave, no identities, blunt ends

[-o]

Output file name. Defaults to standard output if not specified.

[-G]

Gap opening penalty. Defaults to 11 (for BLOSUM63 matrix).

[-E]

Gap extension penalty. Defaults to 1 (for BLOSUM63 matrix).

[-q]

Nucleotide mismatch penalty. blastn only. Defaults to -3.

[-r]

Reward for nucleotide match. blastn only. Defaults to 1.

[-b]

Number of alignments to show.

[-g]

Perform gapped alignment. T/F. Unavailable with tblastx.

[-M]

Scoring matrix. Defaults to BLOSUM62.

[-T]

Produce HTML output. T/F. Defaults to F.

These are the command-line options you are most likely to use, but there are a large number of other options available. As you become familiar with BLAST, you may want to learn to use them.

blastpgp allows you to use two new BLAST modes: PHI-BLAST (Pattern Hit Initiated BLAST) and PSI-BLAST (Position Specific Iterative BLAST). PHI-BLAST uses protein motifs, such as those found in PROSITE and other motif databases, to increase the likelihood of finding biologically significant matches. PSI-BLAST uses an iterative alignment procedure to develop position-specific scoring matrices, which increases its capability to detect weak pattern matches. Both methods are discussed further when we get to multiple sequence analysis in Chapter 8.

bl2seq allows the comparison of two known sequences using the blastp or blastn programs. Most of the command-line options for bl2seq are similar to those for blastall.

Building a local database with formatdb

Usage: formatdb -i araseed.nt -p F -o T

Although BLAST is available on many web servers, one of the benefits of installing and using BLAST locally is the ability to create your own sequence databases. For instance, you may have a set of sequences that aren't yet published or publicly distributed. They're not in the GenBank database, so if you can't run BLAST on your own machine, how do you search them?

The program formatdb accepts an input sequence database either in FASTA format or in NCBI's ASN.1 format (described in Chapter 12 ). On the program command line it is necessary to specify the input filename, whether the input is a protein or nucleic acid sequence, and whether you want to create indexes for the database or not. There are other command-line options available, which can be viewed by trying to run formatdb with no command-line options specified.

The files created are:

araseed.nt.nhr
araseed.nt.nin
araseed.nt.nsd
araseed.nt.nsi
araseed.nt.nsq

Evaluating BLAST results

A BLAST search in a sequence database can produce dozens or hundreds of candidate alignments. Out of these alignments, how can you tell which are really significantly homologous, and which are merely the best matches between unrelated sequences? BLAST provides three related pieces of information that allow you to interpret its results: raw scores, bit scores, and E-values.

The raw score for a local sequence alignment is the sum of the scores of the maximal-scoring segment pairs (MSPs) that make up the alignment. Because of differences between scoring matrices, raw scores aren't always directly comparable. Bit score s are raw scores that have been converted from the log base of the scoring matrix that creates the alignment to log base 2. This rescaling allows bit scores to be compared between alignments.

E-values provide information about the likelihood that a given sequence alignment is significant. An alignment's E-value indicates the number of alignments one expects to find with a score greater than or equal to the observed alignment's score in a search against a random database. Thus, a large E-value (5 or 10) indicates that the alignment probably has occurred by chance, and that the target sequence has been aligned to an unrelated sequence in the database. E-values of 0.1 or 0.05 are typically used as cutoffs in sequence database searches. Using a larger E-value cutoff in a database search allows more distant matches to be found, but it also results in a higher rate of spurious alignments. Of the three, E-values are the values most often reported in the literature.

There is a limit beyond which sequence similarity becomes uninformative about the relatedness of the sequences being compared. This limit is encountered below approximately 25% sequence similarity for protein sequences of normal length, although research continues to push at this boundary. In the case of protein sequences with low sequence similarity that are still believed to be related, structural analysis techniques may provide evidence for such a relationship. Where structure is unknown, sequences with low similarity are categorized as unrelated, but that may mean only that the evolutionary distance between sequences is so great that a relationship can't be detected.

Local Alignment Using FASTA

Another heuristic method for local sequence alignment is the FASTA algorithm. FASTA predates BLAST, and it is still actively maintained by Dr. William Pearson at the University of Virginia. Like BLAST, it is available both as a service over the Web and as a downloadable set of programs. In this section, we describe the current version of the FASTA algorithm and some of the programs included in the FASTA distribution.

The FASTA algorithm

FASTA first searches for short sequences (called ktups)[‖] that occur in both the query sequence and the sequence database. Then, using the BLOSUM50 matrix, the algorithm scores the 10 ungapped alignments that contain the most identical ktups. These ungapped alignments are tested for their ability to be merged into a gapped alignment without reducing the score below a threshold. For those merged alignments that score over the threshold, an optimal local alignment of that region is then computed, and the score for that alignment (called the optimized score) is reported.

FASTA ktups are shorter than BLAST words, typically 1 or 2 for proteins, and 4 or 6 for nucleic acids. Lower ktup values result in slower but more sensitive searches, while higher ktup values yield faster searches with fewer false positives.

The FASTA programs

The FASTA distribution contains search programs that are analogous to the main BLAST modes, with the exception of PHI-BLAST and PSI-BLAST, as well as programs for global and local pairwise alignment and other useful functions. The FASTA programs listed here all compile easily on a Linux system:

[fasta]

Compares a protein sequence against a protein database (or a DNA sequence against a DNA database) using the FASTA algorithm

[ssearch]

Compares a protein sequence against a protein database (or DNA sequence against a DNA database) using the Smith-Waterman algorithm

[fastx /fasty]

Compares a DNA sequence against a protein database, performing translations on the DNA sequence

[tfastx /tfasty]

Compares a protein sequence against a DNA database, performing translations on the DNA sequence database

[align]

Computes the global alignment between two DNA or protein sequences

[lalign]

Computes the local alignment between two DNA or protein sequences

As of this writing, all these programs except ALIGN and LALIGN are available in the FASTA 3.3 package; for now, ALIGN and LALIGN are available only in the FASTA 2 distribution.

Multifunctional Tools for Sequence Analysis

Several research groups and companies have assembled web-based interfaces to collections of sequence tools. The best of these have fully integrated tools, public databases, and the ability to save a record of user data and activities from one use to another. If you're searching for matches to just one or a few sequences and you want to search the standard public databases, these portals can save you a lot of time while providing most of the functionality and ease of use of a commercial sequence analysis package. In some cases, you'll have to pay for a license or subscription to access the full functionality of these sites; others are freely accessible.

NCBI SEALS

The NCBI SEALS project aims to develop a Perl-based command-line environment for Systematic Analysis of Lots of Sequences. SEALS is far from a fully automated genome analysis tool, and it isn't intended to be. What SEALS does provide is an enhancement to the command-line environment on Unix systems. It is composed of a large suite of scripts with a variety of useful functions: converting file formats, manipulating BLAST results and FASTA files, database retrieval, piping files into Netscape, and a host of other features that make your data easier to look at without requiring a resource-sucking GUI. SEALS runs on Unix systems and is probably most useful for those who are already Unix aficionados. Before you write a script to process a lot of sequences, check to see if the process you want has been implemented in SEALS.

The Biology Workbench

The San Diego Supercomputing Center offers access to sequence-analysis tools through the Biology Workbench. This resource has been freely available to academic users in one form or another since 1995. Users obtain a login and password at the SDSC site, and work sessions and data can be saved securely on the server.

The Biology Workbench offers keyword and sequence-based searching of nearly 40 major sequence databases and over 25 whole genomes. Both BLAST and FASTA are implemented as search and alignment tools in the Workbench, along with several local and global alignment tools, tools for DNA sequence translation, protein sequence feature analysis, multiple sequence alignment, and phylogenetic tree drawing. The Workbench group has not yet implemented profile tools, such as MEME, HMMer, or sequence logo tools, although PSI-BLAST is available for sequence searches.

Although its interface can be somewhat cumbersome, involving a lot of window scrolling and button clicking, the Biology Workbench is still the most comprehensive, convenient, and accessible of the web-based toolkits. One of its main benefits is that many sequence file formats are accepted and translated by the software. Users of the Workbench need never worry about file type incompatibility and can move seamlessly from keyword-based database search, to sequence-based search, to multiple alignment, to phylogenetic analysis.

DoubleTwist

Another entry into the sequence analysis portal arena is DoubleTwist at http://doubletwist.com. This site allows you to submit a sequence for comparison to multiple databases using BLAST. It also provides "agents" that monitor databases for new additions that match a submitted sequence and automatically notifies the user. These services, as well as access to the EcoCyc pathways database and to an online biology research protocols database, are free with registration at the site at the time of this writing.



[*] The more computationally efficient solution to this problem is simply to translate the DNA sequence database in all six reading frames.

[†] Or, as mathematicians might say, "rediscovered." Because computational biology combines research from so many different areas, this independent discovery happens often and is only noticed later.

[‡] Also known as Pearson format, after the author of the FASTA software, William Pearson.

[§] To give you perspective on how long the common tools of bioinformatics have been available, the original BLAST paper by Altschul et al. was published in the Journal of Molecular Biology in October 1990.

[‖] An abbreviation for k tuples, or ordered sequences of k residues.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset