Chapter 2. Computational Approaches to Biological Questions

There is a standard range of techniques that are taught in bioinformatics courses. Currently, most of the important techniques are based on one key principle: that sequence and structural homology (or similarity) between molecules can be used to infer structural and functional similarity. In this chapter, we'll give you an overview of the standard computer techniques available to biologists; later in the book, we'll discuss how specific software packages implement these techniques and how you should use them.

Molecular Biology's Central Dogma

Before we go any further, it's essential that you understand some basics of cell and molecular biology. If you're already familiar with DNA and protein structure, genes, and the processes of transcription and translation, feel free to skip ahead to the next section.

The central dogma of molecular biology states that:

DNA acts as a template to replicate itself, DNA is also transcribed into RNA, and RNA is translated into protein.

As you can see, the central dogma sums up the function of the genome in terms of information. Genetic information is conserved and passed on to progeny through the process of replication. Genetic information is also used by the individual organism through the processes of transcription and translation. There are many layers of function, at the structural, biochemical, and cellular levels, built on top of genomic information. But in the end, all of life's functions come back to the information content of the genome.

Put another way, genomic DNA contains the master plan for a living thing. Without DNA, organisms wouldn't be able to replicate themselves. The raw "one-dimensional" sequence of DNA, however, doesn't actually do anything biochemically; it's only information, a blueprint if you will, that's read by the cell's protein synthesizing machinery. DNA sequences are the punch cards; cells are the computers.

DNA is a linear polymer made up of individual chemical units called nucleotides or bases. The four nucleotides that make up the DNA sequences of living things (on Earth, at least) are adenine, guanine, cytosine, and thymine—designated A, G, C, and T, respectively. The order of the nucleotides in the linear DNA sequence contains the instructions that build an organism. Those instructions are read in processes called replication, transcription, and translation.

Replication of DNA

The unusual structure of DNA molecules gives DNA special properties. These properties allow the information stored in DNA to be preserved and passed from one cell to another, and thus from parents to their offspring. Two strands of DNA form a double-helical structure, twining around each other in a regular pattern along their full length, which can be millions of nucleotides. The two halves of the double helix are held together by bonds between the nucleotides on each strand. The nucleotides pair in particular ways: A can pair only with T, and G can pair only with C. Each of these pairs is referred to as a base pair, and the length of a DNA sequence is often described in base pairs (or bp), kilobases (1,000 bp), megabases (1 million bp), etc.

Each strand in the DNA double helix is a chemical "mirror image" of the other. If there is an A on one strand, there will always be a T opposite it on the other. If there is a C on one strand, its partner will always be a G.

When a cell divides to form two new daughter cells, DNA is replicated by untwisting the two strands of the double helix and using each strand as a template to build its chemical mirror image, or complementary strand. This process is illustrated in Figure 2-1.

Figure 2-1. Schematic replication of one strand of the DNA helix
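
To see how naturally these pairing rules translate into computation, here is a minimal Python sketch (not drawn from any particular software package) that builds the complementary strand for a short, invented sequence of bases:

    # Base-pairing rules expressed as a lookup table.
    COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

    def complement(strand):
        """Return the complementary strand, read in the same direction."""
        return "".join(COMPLEMENT[base] for base in strand.upper())

    def reverse_complement(strand):
        """Return the complementary strand in its own 5'-to-3' direction."""
        return complement(strand)[::-1]

    template = "ATGGCTTAC"
    print(complement(template))          # TACCGAATG
    print(reverse_complement(template))  # GTAAGCCAT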

Genomes and Genes

The entire DNA sequence that codes for a living thing is called its genome. The genome doesn't function as one long sequence, however. It's divided into individual genes. A gene is a small, defined section of the entire genomic sequence, and each gene has a specific, unique purpose.

There are three classes of genes. Protein-coding genes are templates for generating molecules called proteins. Each protein encoded by the genome is a chemical machine with a distinct purpose in the organism. RNA-specifying genes are also templates for chemical machines, but the building blocks of RNA machines are different from those that make up proteins. Finally, untranscribed genes are regions of genomic DNA that have some functional purpose but don't achieve that purpose by being transcribed or translated to create another molecule.

Transcription of DNA

DNA can act not only as a template for making copies of itself but also as a blueprint for a molecule called ribonucleic acid (RNA). The process by which DNA is transcribed into RNA is called transcription and is illustrated in Figure 2-2. RNA is structurally similar to DNA. It's a polymeric molecule made up of individual chemical units, but the chemical backbone that holds these units together is slightly different from the backbone of DNA, allowing RNA to exist in a single-stranded form as well as in a double helix. These single-stranded molecules still form base pairs between different parts of the chain, causing RNA to fold into 3D structures. The individual chemical units of RNA are designated A, C, G, and U (uracil, which takes the place of thymine).

Figure 2-2. Schematic of DNA being transcribed into RNA

The genome provides a template for the synthesis of a variety of RNA molecules: the three main types of RNA are messenger RNA, transfer RNA, and ribosomal RNA. Messenger RNA (mRNA) molecules are RNA transcripts of genes. They carry information from the genome to the ribosome, the cell's protein synthesis apparatus. Transfer RNA (tRNA) molecules are untranslated RNA molecules that transport amino acids, the building blocks of proteins, to the ribosome. Finally, ribosomal RNA (rRNA) molecules are the untranslated RNA components of ribosomes, which are complexes of protein and RNA. rRNAs are involved in anchoring the mRNA molecule and catalyzing some steps in the translation process. Some viruses also use RNA instead of DNA as their genetic material.

Translation of mRNA

Translation of mRNA into protein is the final major step in putting the information in the genome to work in the cell.

Like DNA, proteins are linear polymers built from an alphabet of chemically variable units. The protein alphabet is a set of small molecules called amino acids.

Unlike DNA, the chemical sequence of a protein has physicochemical "content" as well as information content. Each of the 20 amino acids commonly found in proteins has a different chemical nature, determined by its side chain—a chemical group that varies from amino acid to amino acid. The chemical sequence of the protein is called its primary structure, but the way the sequence folds up to form a compact molecule is as important to the function of the protein as is its primary structure. The secondary and tertiary structure elements that make up the protein's final fold can bring distant parts of the chemical sequence of the protein together to form functional sites.

As shown in Figure 2-3, the genetic code is the code that translates DNA into protein. It takes three bases of DNA (called a codon) to code for each amino acid in a protein sequence. Simple combinatorics tells us that there are 4 × 4 × 4 = 64 possible ordered triplets of the four nucleotides, so there are 64 possible codons but only 20 amino acids. The genetic code is therefore redundant: most amino acids are specified by more than one codon, and a few codons have the special function of telling the cell's translation machinery to stop translating an mRNA molecule. Figure 2-4 shows how RNA is translated into protein.

Figure 2-3. The genetic code

Figure 2-4. Synthesis of protein with standard base pairing
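
The codon-to-amino-acid lookup just described is easy to sketch in code. The following toy Python example uses only a handful of the 64 codons (a real translation tool would load the full standard genetic code) to show how redundancy and stop codons work:

    CODON_TABLE = {
        "AUG": "Met",                                 # also the usual start codon
        "UUU": "Phe", "UUC": "Phe",                   # redundancy: two codons, one amino acid
        "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
        "UAA": "stop", "UAG": "stop", "UGA": "stop",  # stop codons
    }

    def translate(mrna):
        """Translate an mRNA string codon by codon until a stop codon."""
        protein = []
        for i in range(0, len(mrna) - 2, 3):
            amino_acid = CODON_TABLE.get(mrna[i:i + 3], "???")  # "???" marks codons missing from this toy table
            if amino_acid == "stop":
                break
            protein.append(amino_acid)
        return protein

    print(translate("AUGUUUGGAUAA"))   # ['Met', 'Phe', 'Gly']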

Molecular Evolution

Errors in the replication of DNA are relatively common. If these errors occur in the reproductive cells of an organism, they can be passed on to its progeny. Alterations in the sequence of DNA are known as mutations. Mutations can have harmful results, making the progeny less likely to survive to adulthood; they can also have beneficial results, or they can be neutral. If a mutation doesn't kill the organism before it reproduces, the mutation can become fixed in the population over many generations. The slow accumulation of such changes is responsible for the process known as evolution. Access to DNA sequences gives us access to a more precise understanding of evolution. Our understanding of the molecular mechanism of evolution as a gradual process of accumulating DNA sequence mutations is the justification for developing hypotheses based on DNA and protein sequence comparison.

What Biologists Model

Now that we've completed our ultra-short course in cell biology, let's look at how to apply it to problems in molecular biology. One of the most important exercises in biology and bioinformatics is modeling. A model is an abstract way of describing a complicated system. Turning something as complex (and confusing) as a chromosome, or the cycle of cell division, into a simplified representation that captures all the features you are trying to study can be extremely difficult. A model helps us see the larger picture. One feature of a good model is that it makes systems that are otherwise difficult to study easier to analyze using quantitative approaches. Bioinformatics tools rely on our ability to extract relevant parameters from a biological system (be it a single molecule or something as complicated as a cell), describe them quantitatively, and then develop computational methods that use those parameters to compute the properties of a system or predict its behavior.

To help you understand what a model is and what kind of analysis a good model makes possible, let's look at three examples on which bioinformatics methods are based.

Accessing 3D Molecules Through a 1D Representation

In reality, DNA and proteins are complicated 3D molecules, composed of thousands or even millions of atoms bonded together. However, DNA and proteins are both polymers, chains of repeating chemical units (monomers) with a common backbone holding them together. Each chemical unit in the polymer has two subsets of atoms: a subset of atoms that doesn't vary from monomer to monomer and that makes up the backbone of the polymer, and a subset of atoms that does vary from monomer to monomer.

In DNA, four nucleic acid monomers (A, T, C, and G) are commonly used to build the polymer chain. In proteins, 20 amino acid monomers are used. In a DNA chain, the four nucleic acids can occur in any order, and the order they occur in determines what the DNA does. In a protein, amino acids can occur in any order, and their order determines the protein's fold and function.

Not too long after the chemical natures of DNA and proteins were understood, researchers recognized that it was convenient to represent them by strings of single letters. Instead of being represented as detailed chemical entities, the nucleotides in a DNA sequence could be written simply as A, T, C, and G. Thus, a short piece of DNA that contains thousands of individual atoms can be represented by a sequence of a few hundred letters. Figure 2-5 illustrates the simplified way to represent a polymer chain.

Figure 2-5. Simplifying the representation of a polymer chain

Not only does this abstraction save storage space and provide a convenient form for sharing sequence information, it represents the nature of a molecule uniquely and correctly and ignores levels of detail (such as atomic structure of DNA and many proteins) that are experimentally inaccessible. Many computational biology methods exploit this 1D abstraction of 3D biological macromolecules.

The abstraction of nucleic acid and protein sequences into 1D strings has been one of the most fruitful modeling strategies in computational molecular biology, and analysis of character strings is a long-standing area of research in computer science.[*] One of the elementary questions you can ask about strings is, "Do they match?" There are well-established algorithms in computer science for finding exact and inexact matches in pairs of strings. These algorithms are applied to find pairwise matches between biological sequences and to search sequence databases using a sequence query.
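
As a concrete illustration of string matching, here is a minimal Python sketch of exact and inexact (Hamming-distance) matching between a short pattern and a longer sequence; both sequences are invented. It is only an illustration of the idea, since real database search tools use far more sophisticated algorithms:

    def find_matches(text, pattern, max_mismatches=0):
        """Positions where pattern matches text with at most max_mismatches substitutions."""
        hits = []
        for start in range(len(text) - len(pattern) + 1):
            window = text[start:start + len(pattern)]
            mismatches = sum(a != b for a, b in zip(window, pattern))
            if mismatches <= max_mismatches:
                hits.append(start)
        return hits

    sequence = "GATTACAGATTTCA"
    print(find_matches(sequence, "GATT"))        # exact matches: [0, 7]
    print(find_matches(sequence, "GATTACA", 1))  # allowing one mismatch: [0, 7]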

In addition to matching individual sequences, string-based methods from computer science have been successfully applied to a number of other problems in molecular biology. For example, algorithms for reconstructing a string from a set of shorter substrings can assemble DNA sequences from overlapping sequence fragments. Techniques for recognizing repeated patterns in single sequences or conserved patterns across multiple sequences allow researchers to identify signatures associated with biological structures or functions. Finally, multiple sequence-alignment techniques allow the simultaneous comparison of several related sequences, from which evolutionary relationships can be inferred.

This simplifying abstraction of DNA and protein sequence seems to ignore a lot of biology. The cellular context in which biomolecules exist is completely ignored, as are their interactions with other molecules and their molecular structure. And yet it has been shown over and over that matches between biological sequences—for example, in the detection of similarity in eye-development genes in humans and flies, as we discussed in Chapter 1—can be biologically meaningful.

Abstractions for Modeling Protein Structure

There is more to biology than sequences. Proteins and nucleic acids also have complex 3D structures that provide clues to their functions in the living organism. Molecular structures are usually represented as collections of atoms, each of which has a defined position in 3D space. Structure analysis can be performed on static structures, or movements and interactions in the molecules can be studied with molecular simulation methods.

Standard molecular simulation approaches model proteins as a collection of point masses (atoms) connected by bonds. The bond between two atoms has a standard length, derived from experimental chemistry, and an associated applied force that constrains the bond at that length. The angle between three adjacent atoms has a standard value and an applied force that constrains the bond angle around that value. The same is true of the dihedral angle described by four adjacent atoms. In a molecular dynamics simulation, energy is added to the molecular system by simulated "heating." Following standard Newtonian laws, the atoms in the molecule move. The energy added to the system provides an opposing force that moves atoms in the molecule out of their standard conformations. The actions and reactions of hundreds of atoms in a molecular system can be simulated using this abstraction.
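
The following toy Python sketch shows the spirit of this abstraction for the simplest possible system: two atoms joined by one harmonic bond, integrated with a velocity-Verlet step. It assumes the NumPy library is available for the vector arithmetic, and the force constant, bond length, masses, and time step are arbitrary illustration values, not parameters from any real force field:

    import numpy as np

    k = 100.0   # bond force constant (arbitrary units)
    r0 = 1.0    # standard bond length
    mass = 1.0
    dt = 0.01

    pos = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])  # bond starts stretched past r0
    vel = np.zeros_like(pos)

    def bond_forces(pos):
        """Harmonic restoring force pulling the bond back toward length r0."""
        d = pos[1] - pos[0]
        r = np.linalg.norm(d)
        f = -k * (r - r0) * d / r   # force on atom 1; atom 0 feels the opposite
        return np.array([-f, f])

    for step in range(5):
        forces = bond_forces(pos)
        vel += 0.5 * dt * forces / mass              # first half kick
        pos += dt * vel                              # drift
        vel += 0.5 * dt * bond_forces(pos) / mass    # second half kick
        print(f"step {step}: bond length = {np.linalg.norm(pos[1] - pos[0]):.4f}")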

However, the computational demands of molecular simulations are huge, and there is some uncertainty both in the force field (the collection of standard forces that model the molecule) and in the modeling of nonbonded interactions (interactions between nonadjacent atoms). So it has not proven possible to predict protein structure using the all-atom modeling approach.

Some researchers have recently had moderate success in predicting protein topology for simple proteins using an intermediate level of abstraction—more than linear sequence, but less than an all-atom model. In this case, the protein is treated as a series of beads (representing the individual amino acids) on a string (representing the backbone). Beads may have different characters to represent the differences in the amino acid sidechains. They may be positively or negatively charged, polar or nonpolar, small or large. There are rules governing which beads will attract each other. Like charges repel; unlike charges attract. Polar groups cluster with other polar groups, and nonpolar with nonpolar. There are also rules governing the string; mainly that it can't pass through itself in the course of the simulation. The folding simulation itself is conducted through sequential or simultaneous perturbation of the position of each bead.
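
A toy version of the bead-on-a-string idea can be written in a few lines. The sketch below, loosely in the spirit of simple lattice ("HP") models but not any published method, scores a folded 2D conformation by counting favorable contacts between nonpolar beads:

    def contact_energy(sequence, coords):
        """Count favorable contacts between nonpolar (H) beads that aren't chain neighbors."""
        energy = 0
        for i in range(len(sequence)):
            for j in range(i + 2, len(sequence)):        # skip bonded chain neighbors
                if sequence[i] == "H" and sequence[j] == "H":
                    dx = abs(coords[i][0] - coords[j][0])
                    dy = abs(coords[i][1] - coords[j][1])
                    if dx + dy == 1:                     # adjacent lattice sites
                        energy -= 1                      # each H-H contact is favorable
        return energy

    # A 4-bead chain folded into a square: the two H beads end up in contact.
    print(contact_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))   # -1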

Mathematical Modeling of Biochemical Systems

Using theoretical models in biology goes far beyond the single molecule level. For years, ecologists have been using mathematical models to help them understand the dynamics of changes in interdependent populations. What effect does a decrease in the population of a predator species have on the population of its prey? What effect do changes in the environment have on population? The answers to those questions are theoretically predictable, given an appropriate mathematical model and a knowledge of the sizes of populations and their standard rates of change due to various factors.

In molecular biology, a similar approach, called metabolic control analysis, is applied to biochemical reactions that involve many molecules and chemical species. While cells contain hundreds or thousands of interacting proteins, small molecules, and ions, it's possible to create a model that describes and predicts a small corner of that complicated metabolism. For instance, if you are interested in the biological processes that maintain different concentrations of hydrogen ions on either side of the mitochondrial inner membrane in eukaryotic cells, it's probably not necessary for your model to include metabolic pathways that are only distantly connected to that process, such as those involved in biosynthesis of the heme structure.

Metabolic models describe a biochemical process in terms of the concentrations of chemical species involved in a pathway, and the reactions and fluxes that affect those concentrations. Reactions and fluxes can be described by differential equations; they are essentially rates of change in concentration. What makes metabolic simulation interesting is the possibility of modeling dozens of reactions simultaneously to see what effect they have on the concentration of particular chemical species. Using a properly constructed metabolic model, you can test different assumptions about cellular conditions and fine-tune the model to simulate experimental observations. That, in turn, can suggest testable hypotheses to drive further research.
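
The sketch below shows what such a model looks like at its very simplest: one reaction converting a substrate S to a product P, described by a Michaelis-Menten rate equation and integrated with Euler steps in Python. The rate constants and starting concentrations are arbitrary illustration values:

    vmax, km = 1.0, 0.5   # enzyme parameters (arbitrary units)
    s, p = 10.0, 0.0      # starting concentrations of substrate and product
    dt = 0.01

    for step in range(1000):
        rate = vmax * s / (km + s)   # Michaelis-Menten flux through the reaction
        s -= rate * dt               # substrate is consumed...
        p += rate * dt               # ...and product accumulates
    print(f"after {1000 * dt:.0f} time units: S = {s:.2f}, P = {p:.2f}")

A real metabolic model couples many such rate equations, one per reaction, so that the flux through one step changes the concentrations seen by the others.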

Why Biologists Model

We've mentioned more than once that theoretical modeling provides testable hypotheses, not definitive answers. It sometimes isn't so easy to maintain this distinction, especially with pairwise sequence comparison, which seems to provide such ready answers. Even identification of genes based on sequence similarity ultimately needs to be validated experimentally. It's not sufficient to say that an unknown DNA sequence is similar to the sequence of a well-characterized gene and conclude that it must therefore have an identical function. The two sequences could be distantly related but have evolved to have different functions. However, it's altogether reasonable to use sequence similarity as the starting point for verification; if sequence homology suggests that an unknown gene is similar to citrate synthases, your first experimental approach might be to test the unknown gene product for citrate synthase activity.

One of the main benefits of using computational tools in biology is that it becomes easier to preselect targets for experimentation in molecular biology and biochemistry. Using everything from sequence profiling methods to geometric and physicochemical analysis of protein structures, researchers can focus narrowly on the parts of a sequence or structure that appear to have some functional significance. Only a decade ago, this focusing might have been done using "shotgun" approaches to site-directed mutagenesis, in which random single-residue mutants of a protein were created and characterized in order to select possible targets. Functional genomics and metabolic reconstruction efforts are beginning to provide biochemists with a framework for narrowing their research focus as well.

For the researcher focused on developing bioinformatics methods, the discovery of general rules and properties in data is by far the most interesting category of problems that can be addressed using a computer. It's also a diverse category and one we can't give you many rules for. Researchers have found interesting and useful properties in everything from sequence patterns to the separation of atoms in molecular structures and have applied these findings to produce such tools as genefinders, secondary structure prediction tools, profile methods, and homology modeling tools.

Bioinformatics researchers are still working on problems for which reasonably successful solutions already exist, from basecalling to sequence alignment to genome comparison to protein structure modeling, attempting to improve the accuracy and range of these procedures. Information-technology experts are currently developing database structures and query tools for everything from gene-expression data to intermolecular interactions. As in any other field of research, there are many niches of inquiry available, and the only way to find them is to delve into the current literature.

Computational Methods Covered in This Book

Molecular biology research is a fast-growing area. The amount and type of data that can be gathered is exploding, and the trend of storing this data in public databases is spilling over from genome sequence to all sorts of other biological datatypes. The information landscape for biologists is changing so rapidly that anything we say in this book is likely to be somewhat behind the times before it even hits the shelves.

Yet, since the inception of the Human Genome Project, a core set of computational approaches has emerged for dealing with the types of data that are currently shared in public databases—DNA, protein sequence, and protein structure. Although databases containing results from new high-throughput molecular biology methods have not yet grown to the extent the sequence databases have, standard methods for analyzing these data have begun to emerge.

While not exhaustive, the following list gives you an overview of the computational methods we address in this book:

Using public databases and data formats

The first key skill for biologists is to learn to use online search tools to find information. Literature searching is no longer a matter of looking up references in a printed index. You can find links to most of the scientific publications you need online. There are central databases that collect reference information so you can search dozens of journals at once. You can even set up "agents" that notify you when new articles are published in an area of interest. Searching the public molecular-biology databases requires the same skills as searching for literature references: you need to know how to construct a query statement that will pluck the particular needle you're looking for out of the database haystack. Tools for searching biochemical literature and sequence databases are introduced in Chapter 6.

Sequence alignment and sequence searching

As mentioned in Chapter 1, being able to compare pairs of DNA or protein sequences and extract partial matches has made it possible to use a biological sequence as a database query. Sequence-based searching is another key skill for biologists; a little exploration of the biological databases at the beginning of a project often saves a lot of valuable time in the lab. Identifying homologous sequences provides a basis for phylogenetic analysis and sequence-pattern recognition. Sequence-based searching can be done online through web forms, so it requires no special computing skills, but to judge the quality of your search results you need to understand how the underlying sequence-alignment method works and go beyond simple sequence alignment to other types of analysis. Tools for pairwise sequence alignment and sequence-based database searching are introduced in Chapter 7.
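
To give a feel for what the underlying sequence-alignment method is doing, here is a minimal Python sketch of the dynamic programming recurrence at the heart of global pairwise alignment, with an invented match/mismatch/gap scoring scheme and invented sequences. Real search tools such as BLAST layer heuristics and statistics on top of this core idea:

    def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
        """Best score for aligning all of a with all of b (Needleman-Wunsch-style)."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):
            score[i][0] = i * gap   # a prefix of a aligned entirely against gaps
        for j in range(1, cols):
            score[0][j] = j * gap   # a prefix of b aligned entirely against gaps
        for i in range(1, rows):
            for j in range(1, cols):
                diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
        return score[-1][-1]

    print(global_alignment_score("GATTACA", "GCATGCA"))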

Gene prediction

Gene prediction is only one of a cluster of methods for attempting to detect meaningful signals in uncharacterized DNA sequences. Until recently, most sequences deposited in GenBank were already characterized at the time of deposition. That is, someone had already gone in and, using molecular biology, genetic, or biochemical methods, figured out what the gene did. However, now that the genome projects are in full swing, there's a lot of DNA sequence out there that isn't characterized.

Software for prediction of open reading frames, genes, exon splice sites, promoter binding sites, repeat sequences, and tRNA genes helps molecular biologists make sense out of this unmapped DNA. Tools for gene prediction are introduced in Chapter 7.
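
As an illustration of the simplest of these signals, the following Python sketch scans the forward strand of an invented DNA sequence for open reading frames (a start codon followed in-frame by a stop codon). Real gene finders use much richer statistical models and examine both strands:

    STOP_CODONS = {"TAA", "TAG", "TGA"}

    def find_orfs(dna, min_codons=2):
        """Return (start, end) positions of simple ORFs on the forward strand."""
        orfs = []
        for frame in range(3):
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i                           # remember the start codon
                elif codon in STOP_CODONS and start is not None:
                    if (i - start) // 3 >= min_codons:  # keep only ORFs long enough to be interesting
                        orfs.append((start, i + 3))
                    start = None
        return orfs

    print(find_orfs("CCATGAAATTTGGGTAACC"))   # [(2, 17)]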

Multiple sequence alignment

Multiple sequence-alignment methods assemble pairwise sequence alignments for many related sequences into a picture of sequence homology among all members of a gene family. Multiple sequence alignments aid in visual identification of sites in a DNA or protein sequence that may be functionally important. Such sites are usually conserved; that is, the same amino acid is present at that site in each one of a group of related sequences. Multiple sequence alignments can also be quantitatively analyzed to extract information about a gene family. Multiple sequence alignments are an integral step in phylogenetic analysis of a family of related sequences, and they also provide the basis for identifying sequence patterns that characterize particular protein families. Tools for creating and editing multiple sequence alignments are introduced in Chapter 8.
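
The following toy Python sketch shows the kind of quantitative information that can be read directly out of a multiple alignment: for each column of a small invented alignment, it reports the most common residue and how many sequences share it, a crude measure of conservation:

    from collections import Counter

    alignment = [          # an invented toy alignment; "-" marks a gap
        "MKT-LLV",
        "MKS-LLI",
        "MKT-LMV",
    ]

    for column in range(len(alignment[0])):
        residues = Counter(seq[column] for seq in alignment)
        residue, count = residues.most_common(1)[0]
        print(f"column {column}: {residue} conserved in {count}/{len(alignment)} sequences")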

Phylogenetic analysis

Phylogenetic analysis attempts to describe the evolutionary relatedness of a group of sequences. A traditional phylogenetic tree or cladogram groups species into a diagram that represents their relative evolutionary divergence. Branchings of the tree that occur furthest from the root separate individual species; branchings that occur close to the root group species into kingdoms, phyla, classes, families, genera, and so on.

The information in a molecular sequence alignment can be used to compute a phylogenetic tree for a particular family of gene sequences. The branchings in phylogenetic trees represent evolutionary distance based on sequence similarity scores or on information-theoretic modeling of the number of mutational steps required to change one sequence into the other. Phylogenetic analysis of protein sequence families describes not the evolution of the entire organism but evolutionary change in specific coding regions, although our ability to create broader evolutionary models based on molecular information will expand as the genome projects provide more data to work with. Tools for phylogenetic analysis are introduced in Chapter 8.
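
As a hint of how alignments feed into tree building, here is a minimal Python sketch of the p-distance (the fraction of aligned positions at which two sequences differ), computed for a few invented, already-aligned sequences. Constructing the tree itself from such a distance matrix is a separate step:

    def p_distance(a, b):
        """Fraction of differing sites between two aligned, equal-length sequences."""
        differences = sum(x != y for x, y in zip(a, b))
        return differences / len(a)

    sequences = {"seqA": "ATGCTAGC", "seqB": "ATGCTTGC", "seqC": "ATACTTGA"}
    names = sorted(sequences)
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:          # each pair once
            print(name_a, name_b, p_distance(sequences[name_a], sequences[name_b]))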

Extraction of patterns and profiles from sequence data

A motif is a sequence of amino acids that defines a substructure in a protein that can be connected to function or to structural stability. In a group of evolutionarily related gene sequences, motifs appear as conserved sites. Sites in a gene sequence tend to be conserved—to remain the same in all or most representatives of a sequence family—when there is selection pressure against copies of the gene that have mutations at that site. Nonessential parts of the gene sequence will diverge from each other in the course of evolution, so the conserved motif regions show up as a signal in a sea of mutational noise. Sequence profiles are statistical descriptions of these motif signals; profiles can help identify distantly related proteins by picking out a motif signal even in a sequence that has diverged radically from other members of the same family. Tools for profile analysis and motif discovery are introduced in Chapter 8.
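
The following Python sketch shows the simplest possible sequence profile: a position-by-position frequency table built from a few invented instances of a DNA motif. Real profile methods add pseudocounts, background frequencies, and log-odds scoring on top of this idea:

    from collections import Counter

    motif_instances = ["GATTACA", "GATAACA", "GCTTACA", "GATTCCA"]   # invented examples

    profile = []
    for column in zip(*motif_instances):          # walk the alignment column by column
        counts = Counter(column)
        profile.append({base: counts[base] / len(column) for base in "ACGT"})

    for position, frequencies in enumerate(profile):
        print(position, {base: round(f, 2) for base, f in frequencies.items()})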

Protein sequence analysis

The amino-acid content of a protein sequence can be used as the basis for many analyses, from computing the isoelectric point and molecular weight of the protein and the characteristic peptide mass fingerprints that will form when it's digested with a particular protease, to predicting secondary structure features and post-translational modification sites. Tools for feature prediction are introduced in Chapter 9, and tools for proteomics analysis are introduced in Chapter 11.
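
As a small example of this kind of calculation, the following Python sketch computes an approximate molecular weight from a protein sequence using rounded average residue masses; isoelectric points and peptide fingerprints can be computed from the sequence in a similar spirit:

    RESIDUE_MASS = {   # approximate average masses of amino acid residues, in daltons
        "G": 57.1, "A": 71.1, "S": 87.1, "P": 97.1, "V": 99.1,
        "T": 101.1, "C": 103.1, "L": 113.2, "I": 113.2, "N": 114.1,
        "D": 115.1, "Q": 128.1, "K": 128.2, "E": 129.1, "M": 131.2,
        "H": 137.1, "F": 147.2, "R": 156.2, "Y": 163.2, "W": 186.2,
    }
    WATER = 18.0   # one water accounts for the free N- and C-termini

    def molecular_weight(sequence):
        return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER

    print(round(molecular_weight("MKTAYIAKQR"), 1))   # an arbitrary example peptide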

Protein structure prediction

It's a lot harder to determine the structure of a protein experimentally than it is to obtain DNA sequence data. One very active area of bioinformatics and computational biology research is the development of methods for predicting protein structure from protein sequence. Methods such as secondary structure prediction and threading can help determine how a protein might fold, classifying it with other proteins that have similar topology, but they don't provide a detailed structural model. The most effective and practical method for protein structure prediction is homology modeling—using a known structure as a template to model a structure with a similar sequence. In the absence of homology, there is no way to predict a complete 3D structure for a protein. Tools for protein structure prediction are introduced in Chapter 9.

Protein structure property analysis

Protein structures have many measurable properties that are of interest to crystallographers and structural biologists. Protein structure validation tools are used by crystallographers to measure how well a structure model conforms to structural rules extracted from existing structures or chemical model compounds. These tools may also analyze the "fitness" of every amino acid in a structure model for its environment, flagging such oddities as buried charges with no countercharge or large patches of hydrophobic amino acids found on a protein surface. These tools are useful for evaluating both experimental and theoretical structure models.

Another class of tools can calculate internal geometry and physicochemical properties of proteins. These tools usually are applied to help develop models of the protein's catalytic mechanism or other chemical features. Some of the most interesting properties of protein structures are the locations of deeply concave surface clefts and internal cavities, both of which may point to the location of a cofactor binding site or active site. Other tools compute hydrogen-bonding patterns or analyze intramolecular contacts. Particularly interesting properties include the electrostatic potential field surrounding the protein and other electrostatically controlled parameters, such as individual amino acid pKa values, protein solvation energies, and binding constants. Methods for protein property analysis are discussed in Chapter 10.

Protein structure alignment and comparison

Even when two gene sequences aren't apparently homologous, the structures of the proteins they encode can be similar. New tools for computing structural similarity are making it possible to detect distant homologies by comparing structures, even in the absence of much sequence similarity. These tools are also useful for comparing constructed homology models to the known protein structures they are based on. Protein structure alignment tools are introduced in Chapter 10.

Biochemical simulation

Biochemical simulation uses the tools of dynamical systems modeling to simulate the chemical reactions involved in metabolism. Simulations can extend from individual metabolic pathways to transmembrane transport processes and even properties of whole cells or tissues. Biochemical and cellular simulations traditionally have relied on the ability of the scientist to describe a system mathematically, developing a system of differential equations that represent the different reactions and fluxes occurring in the system. However, new software tools can build the mathematical framework of a simulation automatically from a description provided interactively by the user, making mathematical modeling accessible to any biologist who knows enough about a system to describe it according to the conventions of dynamical systems modeling. Dynamical systems modeling tools are discussed in Chapter 11.

Whole genome analysis

As more and more genomes are sequenced completely, the analysis of raw genome data has become a more important task. There are a number of perspectives from which one can look at genome data: for example, it can be treated as a long linear sequence, but it's often more useful to integrate DNA sequence information with existing genetic and physical map data. This allows you to navigate a very large genome and find what you want. The National Center for Biotechnology Information (NCBI) and other organizations are making a concerted effort to provide useful web interfaces to genome data, so that users can start from a high-level map and navigate to the location of a specific gene sequence.

Genome navigation is far from the only issue in genomic sequence analysis, however. Annotation frameworks, which integrate genome sequence with results of gene finding analysis and sequence homology information, are becoming more common, and the challenge of making and analyzing complete pairwise comparisons between genomes is beginning to be addressed. Genome analysis tools are discussed in Chapter 11.

Primer design

Many molecular biology protocols require the design of oligonucleotide primers. Proper primer design is critical for the success of polymerase chain reaction (PCR), oligo hybridization, DNA sequencing, and microarray experiments. Primers must hybridize with the target DNA to provide a clear answer to the question being asked, but they must also have appropriate physicochemical properties; they must not self-hybridize or dimerize; and they should not have multiple targets within the sequence under investigation. There are several web-based services that allow users to submit a DNA sequence and automatically select appropriate primers, or to compute the properties of a desired primer DNA sequence. Primer design tools are discussed in Chapter 11.
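
Two of the physicochemical properties mentioned above are easy to sketch in Python: GC content and a rough melting temperature using the simple Wallace rule (2 degrees per A or T, 4 per G or C), which is only an approximation for short oligonucleotides. The primer sequence below is just an arbitrary illustration:

    def gc_content(primer):
        primer = primer.upper()
        return (primer.count("G") + primer.count("C")) / len(primer)

    def wallace_tm(primer):
        primer = primer.upper()
        weak = primer.count("A") + primer.count("T")     # A-T pairs contribute less to stability
        strong = primer.count("G") + primer.count("C")   # G-C pairs contribute more
        return 2 * weak + 4 * strong

    primer = "AGCGGATAACAATTTCACACAGGA"   # an arbitrary example sequence
    print(f"GC content = {gc_content(primer):.0%}, estimated Tm = {wallace_tm(primer)} C")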

DNA microarray analysis

DNA microarray analysis is a relatively new molecular biology method that expands on classic probe hybridization methods to provide access to thousands of genes at once. Microarray experiments are amenable to computational analysis because of the uniform, standardized nature of their results—a grid of equally sized spots, each identifiable with a particular DNA sequence. Computational tools are required to analyze larger microarrays because the resulting images are so visually complex that comparison by hand is no longer feasible.

The main tasks in microarray analysis as it's currently done are an image analysis step, in which individual spots on the array image are identified and signal intensity is quantitated, and a clustering step, in which spots with similar signal intensities are identified. Computational support is also required for the chip-design phase of a microarray experiment to identify appropriate oligonucleotide probe sequences for a particular set of genes and to maintain a record of the identity of each spot in a grid that may contain thousands of individual experiments. Array analysis tools are discussed in Chapter 11.
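
As a hint of what the clustering step works with, the following Python sketch computes the correlation between invented expression profiles for three genes across four conditions; genes whose profiles are strongly correlated are candidates to cluster together:

    from statistics import correlation   # available in Python 3.10 and later

    gene_a = [2.1, 4.0, 6.2, 8.1]   # expression of gene A across four conditions
    gene_b = [1.9, 4.2, 5.8, 8.3]   # gene B rises and falls with gene A
    gene_c = [7.9, 6.1, 3.8, 2.2]   # gene C shows the opposite pattern

    print(round(correlation(gene_a, gene_b), 2))   # close to +1: candidates to cluster together
    print(round(correlation(gene_a, gene_c), 2))   # close to -1: anticorrelated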

Proteomics analysis

Before they're ever crystallized and biochemically characterized, proteins are often studied using a combination of gel electrophoresis, partial sequencing, and mass spectrometry. 2D gel electrophoresis can separate a mixture of thousands of proteins into distinct components; the individual spots of material can be blotted or even cut from the gel and analyzed. Simple computational tools can provide some information to aid in the process of analyzing protein mixtures. It's trivial to compute molecular weight and pI from a protein sequence; by using these values in combination, sets of candidate identities can be found for each spot on a gel. It's also possible to compute, from a protein sequence, the peptide fingerprint that is created when that protein is broken down into fragments by enzymes with specific protein cleavage sites. Mass spectrometry analyses of protein fragments can be compared to computed peptide fingerprints to further limit the search. Proteomics tools are covered in Chapter 11.
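
The peptide-fingerprint calculation mentioned above can be sketched in a few lines of Python. The example below cuts an invented sequence with trypsin-like rules (after K or R, but not when the next residue is proline); pairing each fragment with a computed mass, as in the molecular-weight sketch earlier, gives the fingerprint that is compared against mass spectrometry data:

    def tryptic_digest(sequence):
        """Cut after K or R, except when the next residue is P (a trypsin-like rule)."""
        fragments, current = [], ""
        for i, residue in enumerate(sequence):
            current += residue
            next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
            if residue in "KR" and next_residue != "P":
                fragments.append(current)
                current = ""
        if current:
            fragments.append(current)   # keep whatever is left after the last cleavage site
        return fragments

    print(tryptic_digest("MKTAYIAKQRPLMDENK"))   # ['MK', 'TAYIAK', 'QRPLMDENK']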

A Computational Biology Experiment

Computer-based research projects and computational analysis of experimental data must follow the same principles that any other scientific study does. Your results must clearly address the question you set out to answer, and they must be reproducible by someone else using the same input data and following the same process.

If you're already doing research in experimental biology, you probably have a pretty good understanding of the scientific method. Although your data, your method, and your results are all encoded in computer files rather than sitting on your laboratory bench, the process of designing a computational "experiment" is the same as you are used to.

Although it's easy in these days of automation to simply submit a query to a search engine and use the results without thinking too much about it, you need to understand your method and analyze your results thoroughly in the same way you would when applying a laboratory protocol. Sometimes that's easier said than done. So let's take a walk through the steps involved in defining an experiment in computational biology.

Identifying the Problem

A scientific experiment always begins with a question. A question can be as broad as "What is the catalytic mechanism of protein X?" It's not always possible to answer a complex question about how something works with one experiment. The question needs to be broken down into parts, each of which can be formulated as a hypothesis.

A hypothesis is a statement that is testable by experiment. In the course of solving a problem, you will probably formulate a number of testable statements, some of them trivial and some more complex. For instance, as a first approach to answering the question, "What is the catalytic mechanism of protein X?", you might come up with a preliminary hypothesis such as: "There are amino acids in protein X that are conserved in other proteins that do the same thing as protein X." You can test this hypothesis by using a computer program to align the sequences of as many protein X-type proteins as you can find, and look for amino acids that are identical among all or most of the sequences. Subsequently you'd move to another hypothesis such as: "Some of these conserved amino acids in the protein X family have something to do with the catalytic mechanism." This more complex hypothesis can then be broken down into a number of smaller ones, each of them testable (perhaps by a laboratory experiment, or perhaps by another computational procedure).

A research project can easily become interminable if the goals are ill-defined or the question can't feasibly be answered. On the other hand, if you aren't careful, it's easy to keep adding questions to a project on the basis of new information, allowing the goal to keep creeping out of reach every time its completion is close. It's easy to do this with computational projects, because the cost of materials and resources is low once the initial expense of buying computers and software is covered. It seems no big loss to just keep playing around on the computer.

We have found that this pathological condition can be avoided if, before embarking on a computational project, some time is spent on sketching out a specification of the project's goals and timeline. If you plan to write a project spec, it's easier to start from written answers to questions such as the following:

  • What is the question this project is trying to answer?

  • What is the final form you expect the results to take? Is the goal to produce a computer program, a data set that will be used in an ongoing project, a journal publication, etc.? What are the requirements for success or completion of the project?

  • What is the approximate timeline of the project?

  • What is the general outline of the project? Here, it would be appropriate to break the project down into constituent parts and describe what you think needs to be done to finish each part.

  • How does your project fit in with the work of others? If you're a lone wolf, you don't have to worry about this, but research scientists tend to run in packs. It's good to have a clear understanding of where your work is dependent on others. If you are writing a project spec for a group of people to work on, indicate who is responsible for each part of the work.

  • At what point will it be unprofitable to continue?

Thinking through questions like these not only gives you a clearer idea of what your projects are trying to achieve, but also gives you an outline by which you can organize your research.

Separating the Problem into Simpler Components

In Chapter 7 through Chapter 14, we cover many of the common protocols for using bioinformatics tools and databases in your research. Coming up with the series of steps in those protocols wasn't rocket science. The key to developing your own bioinformatics computer skills is this: know what tools are available and know how to use them. Then you can take a modular approach to the problems you want to solve, breaking them down into distinct modules such as sequence searching, sequence profile detection, homology modeling, model evaluation, etc., for each of which there are established computational methods.

Evaluating Your Needs

As you break down a problem into modular components, you should be evaluating what you have, in terms of available data and starting points for modeling, and what you need. Getting from point A to point B, and from point C to point D, won't help you if there's absolutely no way to get from point B to point C. For instance, if you can't find any homologous sequences for an unknown DNA sequence, it's unlikely you'll get beyond that point to do any further modeling. And even if you do find a group of sequences with a distinctive profile, you shouldn't base your research plans on developing a structural model if there are no homologous structures in the Protein Data Bank (PDB). It's just common sense, but be sure that there's a likely way to get to the result you want before putting time and effort into a project.

Selecting the Appropriate Data Set

In a laboratory setting, materials are the physical objects or substances you use to perform an experiment. It's necessary for you to record certain data about your materials: when they were made, who prepared them, possibly how they were prepared, etc.

The same sort of documentation is necessary in computational biology, but the difference is that you will be experimenting on data, not on a tangible object or substance. The source data you work with should be distinguished from the derived data that constitutes the results of your experiment. You will probably get your source data from one of the many biomolecular databases. In Chapter 13, you will learn more about how information is stored in databases and how to extract it. You need to record where your source data came from and what criteria or method you use to extract your source data set from the source database.

For example, if you are building a homology model of a protein, you need to account for how you selected the template structures on which you based your model. Did you find them using the unknown sequence to search the PDB? Did that approach provide sufficient template structures, or did you, perhaps, use sequence profile-based methods to help identify other structures that are more distantly related to your unknown? Each step you take should be documented.

Criteria for selecting a source data set in computational biology can be quite complex and nontrivial. For instance, statistical studies of sequence information or of structural data from proteins are often based on a nonredundant subset of available protein data. This means that data for individual proteins is excluded from the set if the proteins are too similar in sequence to other proteins that are being included. Inclusion of two structure datafiles that describe the same protein crystallized under slightly different conditions, for example, can bias the results of a computational study. Each step of such a selection process needs to be documented, either within your own records, or by reference to a published source.
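
As an illustration of one such selection criterion, here is a minimal Python sketch of a greedy nonredundancy filter on equal-length toy sequences: a sequence is kept only if it is less than 90% identical to every sequence already kept. Real culling procedures use proper alignments and much more careful identity definitions:

    def identity(a, b):
        """Fraction of identical positions between two equal-length sequences."""
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def nonredundant(sequences, threshold=0.9):
        kept = []
        for sequence in sequences:
            if all(identity(sequence, other) < threshold for other in kept):
                kept.append(sequence)
        return kept

    candidates = ["ATGCTAGCAT", "ATGCTAGCAA", "TTGGAAGCAT", "ATGCTAGCAT"]
    print(nonredundant(candidates))   # duplicates and near-duplicates are dropped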

It's important to remember that all digital sequence and structure data is derived data. By the time it reaches you, it has been through at least one or two processing steps, each of which can introduce errors. DNA sequences have been processed by basecalling software and assembled into maps, analyzed for errors, and possibly annotated according to structure and function, all by tools developed by other scientists as human and error-prone as yourself. Protein structure coordinates are really just a very good guess at where atoms fit into observed electron density data, and electron density maps in turn have been extrapolated from patterns of x-ray reflections. This isn't to say that you should not use or trust biological data, but you should remember that there is some amount of uncertainty associated with each unambiguous-looking character in a sequence or atomic coordinate in a structure. Crystallographers provide parameters, such as R-factors and B-values, which quantify the uncertainty of coordinates in macromolecular structures to some extent, but in the case of sequences, no such estimates are routinely provided within the datafile.

Identifying the Criteria for Success

Critical evaluation of results is key to establishing the usefulness of computer modeling in biology. In the context of presenting various tools you can use, we've discussed methods for evaluating your results, from using BLAST E-values to pick the significant matches out of a long list of results to evaluating the geometry of a protein structural model. Before you start computing molecular properties or developing a computational model, take inventory of what you know, and look for further information. Then try to see the ways in which that information can validate your results. This is part of breaking down your problem into steps.

Computational methods almost always produce results. It's not like spectroscopy, where if there's nothing significant in the cuvette, you don't get a signal. If you do a BLAST search, you almost always get some hits back. You need to know how to distinguish meaningful results from garbage so you don't end up comparing apples to oranges (or superoxide dismutases to alcohol dehydrogenases). If you calculate physicochemical properties of a protein molecule or develop a biochemical pathway simulation, you get a file full of numbers. The best possible way to evaluate your results is to have some experimental results to compare them to.

Before you apply a computational method, decide how to evaluate your results and what criteria they need to meet for you to consider the approach successful.

Performing and Documenting a Computational Experiment

When managing results for a computational project, you should make a distinction between primary results and results of subsequent analyses. You should include a separate section in your results directory for any analysis steps you may perform on the data (for instance, the results of statistical tests or curve fitting). This section should include any observations you may have made about the data or the data collection. Keep separate the results, which are the data you get from executing the experiment, and the analysis, which is the insight you bring to the data you have collected.

One tendency that is common to users of computational biology software is to keep data and notes about positive results while neglecting to document negative results. Even if you've done a web-based BLAST search against a database and found nothing, that is information. And if you've written or downloaded a program that is supposed to do something, but it doesn't work, that information is valuable too, to the next person who comes in to continue your project and would otherwise waste time trying to figure out what works and what doesn't.

Documentation issues in computational biology

Many researchers, even those who do all their work on the computer, maintain paper laboratory notebooks, which are still the standard recording device for scientific research. Laboratory notebooks provide a tangible physical record of research activities, and maintenance of lab records in this format is still a condition of many research grants.

However, laboratory notebooks can be an inconvenient source of information for you or for others who are trying to duplicate or reconstruct your work. Lab notebooks are organized linearly, with entries sorted only by date. They aren't indexed (unless you have a lot more free time than most researchers do). They can't be searched for information about particular subjects, except by the unsatisfactory strategy of sitting down and skimming the whole book beginning to end.

Computer filesystems provide an intuitive basis for clear and precise organization of research records. Information about each part of a project can be stored logically, within the file hierarchy, instead of sequentially. Instead of (or in addition to) a paper notebook on your bookshelf, you will have an electronic record embedded within your data. If your files are named systematically and simple conventions are used in structuring your electronic record, Unix tools such as the grep command will allow you to search your documentation for occurrences of a particular word or date and to find the needed information much more quickly than you would by reading through a paper notebook.

Electronic notebooks

While you can get by with homegrown strategies for building an electronic record of your work, you may want to try one of the commercial products that are available. Or, if you're looking for a freeware implementation of the electronic notebook concept, you can obtain a copy of the software being developed by the DOE2000 Electronic Notebook project. The eNote package lets you input text, upload images and datafiles, and create sketches and annotations. It's a Perl CGI program and will run on any platform with a web server and a Perl interpreter installed. When installed, it's accessible from a web URL on your machine, and you can update your notebook through a web form. The DOE project is designed to fulfill federal agency requirements for laboratory notebooks, as scientific research continues to move into the computer age.

The eNote package is extremely simple to install. It requires that you have a working web server installed on your machine. If you do, you can download the eNote archive and unpack it in its own directory, for example /usr/local/enote. The three files enote.pl, enotelib.pl, and sketchpad.pl are the eNote programs. You need to move enote.pl to the /home/httpd/cgi-bin directory (or wherever your executable CGI directory is; this one is the default on Red Hat Linux systems) and rename it enote.cgi. If you want to restrict access to the notebook, create a special subdirectory just for the eNote programs, and remember that the directory will show up in the URL path to access the CGI script. The sketchpad.pl file should also be moved to this directory, but it doesn't have to be renamed. Move the directories gifs and new-gifs to a web-accessible location. You can create a directory such as /home/httpd/enote for this purpose. Leave the file enotelib.pl and the directory sketchpad where you unpacked them.

Finally, you need to edit the first line in both enote.cgi and sketchpad.pl to point to the location of the Perl executable on your machine. Edit the enote.cgi script to reflect the paths where you installed the eNote script and its support files. You also need to choose a directory in which you want eNote to write entries. For instance, you may want to create a /home/enote/notebook directory and store eNote write files there. If so, be sure that directory is readable and writable by other users so the web server (which is usually identified as user nobody) can write there.

The eNote script also contains parameters that specify whether users of the notebook system can add, delete, and modify entries. If you plan to use eNote seriously, these are important parameters to consider. Would you allow users to tear unwanted pages out of a laboratory notebook or write over them so the original entry was unreadable? eNote allows you to maintain control over what users can do with their data.

The eNote interface is a straightforward web form, which also links to a Java sketchpad applet. If you want only specific users with logins on your machine to be able to access the eNote CGI script, you can set up a .htaccess file in the eNote subdirectory of your CGI directory. A .htaccess file is a file readable by your web server that contains directives restricting who can access a particular directory and from where it can be accessed. For more information on creating a .htaccess file, consult the documentation for the web server you are using, most likely Apache on most Linux systems.

If you do begin to use an electronic notebook for storing your laboratory notes, remember that you must save backups of your notebook frequently in case of system failures.



[*] A string is simply an unbroken sequence of characters. A character is a single letter chosen from a defined alphabet, whether that alphabet is binary code (strings of zeros and ones) or the larger set of letters, digits, and symbols that can be typed on a computer keyboard.
