Any result in bioinformatics, whether it is a sequence alignment, a structure prediction, or an analysis of gene expression patterns, should answer a biological question. For this reason, it is up to the investigators to interpret their results in the context of a clear question, and to make those results accessible to their colleagues. This interpretation step is the most important part of the scientific process. For your results to be useful, they must be interpretable. We'll say it again: if your results can't be interpreted, they won't help anybody, not even you.
In this chapter, we present computational tools that help you to make sense of your results. To this end, the chapter is organized so that it roughly parallels the data-analysis process. In the first part of this chapter, we introduce a number of programs that are used to visualize the sort of data arising from bioinformatics research. These programs range from general-purpose plotting and statistical packages for numerical data to programs dedicated to presenting sequence and structural information in an interpretable form. The second part of this chapter covers some tools for data mining—the process of finding, interpreting, and evaluating patterns in large sets of data—in the context of some bioinformatics applications.
The topics covered in this chapter are basically subdisciplines of
the larger area of computational statistics. As you have seen in
previous chapters, statistical methods are important because they
provide a check on the significance of the researcher's
discoveries. The human nervous system is very good at finding
patterns; a little too good, in fact. If you scrutinize a protein
sequence for long enough, you will begin to see patterns, whether
they're biologically significant (like part of a family
signature sequence, such as P.YTVF in chloramphenicol
acetyltransferase) or not (words or names, such as PER) amidst the
amino acids.[*] Thus, we use visualization to exploit the abilities of
the eye and brain to find patterns that may be interesting. We use
statistics and data mining to keep our intuition in check and to
restrict our searches to those patterns that can be quantitatively
and repeatedly shown to be significant.
Preparing your data (also known as preprocessing or data cleansing) is the most important part of data mining. It's also the least glamorous and one of the least-discussed parts. Preprocessing can be as simple as making sure your data is in the right format for the program that reads it, or it can involve extensive calculations.
As a bioinformatics researcher, you must be especially careful of your data. Your results and reputation are based on data that have been provided by other researchers. Consequently, you must be scrupulous in collecting and using that data. The following is a list of some general questions about data integrity to answer when you work on a project (this list isn't exhaustive; you will probably come up with other things to check that are specific to the project at hand):
Is your data what you expect it to be? For example, DNA sequences
should only contain As, Ts, Cs, and Gs (unless your program
understands the additional IUPAC symbols). Protein sequences should
contain only the 20 amino acids. You can use grep to quickly
check if your file contains lines with bad characters.
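If you prefer to script the check, the same idea takes only a few lines of Python. The alphabet below is a minimal sketch; extend it with the IUPAC ambiguity codes if your programs accept them, or pass the 20 amino acid letters for protein sequences.

```python
# A scripted version of the character check above. VALID_DNA is a
# minimal alphabet; extend it (e.g., with IUPAC ambiguity codes) to
# match what your programs accept.

VALID_DNA = set("ACGTacgt")

def bad_lines(lines, alphabet=VALID_DNA):
    """Return (line_number, line) pairs for sequence lines containing
    characters outside the given alphabet; FASTA header lines and
    blank lines are skipped."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith(">") or not line:
            continue
        if set(line) - alphabet:
            problems.append((number, line))
    return problems

example = [">seq1", "ACGTACGT", "ACGTXCGT"]  # 'X' is not a DNA base
print(bad_lines(example))                    # -> [(3, 'ACGTXCGT')]
```

In practice you would pass the function an open file handle rather than a list of strings; the logic is the same.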
Are your datafiles correctly formatted for the programs you plan to use? Be wary of more flexible formats. For example, some programs apply a length limit to the comment line in FASTA files, while other programs don't.
Be aware of sequence variants. Splice variants, mutations, deletions, sequencing errors, and inadvertent truncations of the sequence file all can result in a different sequence than you'd expect. It is up to you to track which differences in sequences or structures are biologically relevant and which are artifacts of the experimental process.
Unless the sequences you are working with have been given to you by a collaborator who has not yet deposited them in a sequence database, make sure that you can find each of your sequences in GenBank or another database.
When working with large tabular data, make sure that every field in the table has an appropriate value. Using a program such as XGobi is a good way to check this, since it complains if not every field has a value. A visualization tool such as XGobi is also useful if the values are out of register, since the resulting points will be outliers.
Does the program produce meaningful results on test data? When you use a new program, you should have some data for which you know the results, so you can test the program and make sure it gives the right answer and behaves in a reasonable fashion (these tests are sometimes called sanity checks). For example, if a program compares two sequences or structures, does it give the same result regardless of which order you pass the data to it?
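A sanity check of this kind is easy to automate. In the Python sketch below, identity_score is a hypothetical stand-in for whatever comparison program or function you are actually testing; the point is the symmetry assertion, not the scoring.

```python
# A sanity check: a pairwise comparison should give the same answer
# regardless of argument order. identity_score is a hypothetical
# stand-in for the program you are actually testing.

def identity_score(a, b):
    """Percent identity over the shorter of the two sequences."""
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / length

seq1, seq2 = "ACGTACGT", "ACGTTCGT"
forward = identity_score(seq1, seq2)
backward = identity_score(seq2, seq1)
assert forward == backward, "comparison is not symmetric!"
print(forward)  # -> 87.5
```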
Check for side effects produced by the programs you use. Does a program change any of its input? Changes can be as subtle as adding spaces between every 10 residues in a sequence file, normalizing numerical values, or changing the coordinate values of structural data.
For microarray data, have the values been normalized? If the programs you are using perform any kind of normalization, it is important that you determine how the normalization was achieved.
For protein structure data, are all the atom numbers and residue numbers sequential? Is the structure intact, or does it contain chain breaks or other anomalies? Are all residues labeled with the standard amino acid three-letter codes, and are all atoms labeled with the appropriate PDB atom codes?
Finally, make sure you understand all the programs being used, at least as far as knowing the format and meaning of their input and output.
If you are going to be working with images, you need some basic tools for viewing graphics files under Linux and Unix. There are many graphics programs available; the three that we describe next are commonly available, easy to use, and free.
xzgv
is a program for viewing graphics files
under the X Window System. It can display the more
popular graphics formats (GIF, PNG, JPEG), as well as a variety of
others. For a simple graphics viewer, it has some handy features. For
example, it creates thumbnails (small versions of a picture that
preview the file) and can step through images one at a time, as a
slideshow.
Ghostview and
gv
are viewers for PostScript and Portable
Document Format (PDF) files. PostScript and PDF are page-description
languages developed at Adobe Systems, Inc.[†] Both programs allow you to page through a
document, jump to specific pages, print whole documents or selected
pages, and perform other simple document-navigation tasks. More and
more journals are distributing their articles electronically as PDF
files, so a document viewer such as gv
is very
useful for keeping up with literature.
Because it produces more compact files and cooperates with Windows applications, PDF seems to be overtaking PostScript as the more common format. Many documents are still distributed over the Net in PostScript, including preprints (particularly those written by people who use the LaTeX document formatting language) and the output of some web-based services (such as the BMERC secondary structure prediction server).
The GIMP (short for "GNU Image Manipulation Program") is an image-processing package with functionality and features similar to those of Adobe Photoshop. While the GIMP can open and view graphics files, using it as a simple viewer is probably overkill. However, when it comes to reformatting or modifying graphics for use in a presentation or paper, having image-manipulation software on hand is invaluable.
Tools for viewing sequence data, particularly multiple sequence alignments, were discussed in Chapter 8. As we mentioned in that chapter, one of the best ways to rapidly summarize information from a sequence alignment is to use a sequence logo. In this section, we discuss a geometric approach to visualizing sequence relationships and introduce TEXshade, a program for creating publication-quality sequence alignment figures.
TEXshade
(http://homepages.uni-tuebingen.de/beitz/tse.html
) is a package for marking up sequence alignments
written using LaTEX, a
document markup language invented for mathematicians and computer
scientists. This package is remarkably flexible, allowing you to
color and label aligned residues according to conservation and
chemical characteristics. In addition,
TEXshade can incorporate secondary structure
and accessibility information output from the DSSP program (described
in Chapter 6), as well as predictions of secondary
structure generated by the PHD prediction server. Finally,
TEXshade can automatically create
"fingerprints" that provide a bird's-eye view of an
alignment, in which columns of conserved residues are represented by
vertical lines. Like sequence logos, fingerprints can rapidly
summarize alignment data and find clusters of neighboring conserved
residues.
TEXshade is called from within a
LaTEX document. If you have a sequence
alignment stored in MSF format (the GCG multiple sequence alignment
format) in the file alignment.msf
, the following
LaTEX document produces an alignment formatted
according to TEXshade's defaults:
\documentclass{report}
\usepackage{texshade}
\begin{document}
\begin{texshade}{alignment.msf}
\end{texshade}
\end{document}
LaTEX is a document markup language similar to
HTML. In the preceding code example, the output of which is shown in
Figure 14-1, you are telling the
TEX interpreter that this document has a
beginning and an end, and that it contains only a
TEXshade alignment of the sequences in
alignment.msf
. You need to mark up the resulting
alignment by hand. If this sounds painful, the SDSC Biology Workbench
provides a graphical interface to TEXshade and
can automatically render TEXshade-formatted
sequence alignments as GIF or PostScript images.
Multiple sequence alignments and sequence logos represent similarities at every position of a group of aligned sequences. However, even with coloring of conserved residues, it isn't always easy to tell how the sequences are related. Sometimes, it's useful to look at an even higher level of abstraction to see how the sequences cluster. Phylogenetic trees represent one way to visualize relatedness.
DGEOM, a set of programs by Mark J. Forster and coworkers, takes a set of aligned sequences (either a multiple sequence alignment in GCG's MSF format, or a set of pairwise alignments) and represents them as points in a 3D space, where the distances between the points represent the evolutionary distances between the sequences. The points are written to a PDB file and can be viewed with your choice of protein structure viewers. Some may flinch at the idea of storing a representation of a sequence alignment in a file format intended to store structural data, but the program works well, and since high-quality structure visualization packages are easy to find, this approach avoids the need for a standalone graphics system. The programs are written in Perl and C, making them fairly easy to modify.
Another implementation of the geometric approach to viewing sequence relationships is the SeqSpace package developed by Chris Dodge and Chris Sander at the EBI. This package includes C++ programs for computing the sequence distances, and it uses Java viewers to render the points in 3D.
As
of
this writing, there is no standard package for visualizing
interactions between molecules in a pathway or network. The most
common way to represent molecular interactions schematically is in
the form of a graph.[‡]
Graphs can also be useful for illustrating other data that represents
interactions, including the output of cluster analyses and
interacting residues in protein structures. Biological networks, such
as metabolic and signaling pathways, tend to be very densely
connected, and creating readable, highly connected graphs isn't
an easy task. Fortunately, AT&T Research distributes GraphViz
(http://www.research.att.com/sw/tools/graphviz/
), a freely available suite of programs that allow you
to draw and edit graphs. This package has three features that make it
particularly attractive: it runs quickly, it has excellent
documentation, and it uses a flexible and intuitive syntax to
describe graphs.
The GraphViz package consists of five programs:

dot
Draws directed graphs

neato
Draws undirected graphs

dotty
A graphical interface for editing graphs

lefty
A language for editing graphs and other diagrams; used to write dotty

tcldot
A graphical interface to dot written in the Tcl language
To draw a directed graph of a small set of interactions, you can type
the following code into a text editor and save it to a file named
morphopath.dot
:
digraph G {
    size="3, 3";
    SHH -> "Early gene expression";
    "FGF-4" -> "Early gene expression";
    SHH -> "BMP-2";
    "BMP-2" -> "FGF-4";
    "FGF-4" -> SHH;
    "BMP-2" -> "Late gene expression";
    "FGF-4" -> "Late gene expression";
}
The dot
program is invoked using the following
command at the shell prompt:
% dot -Tps morphopath.dot -o morphopath.ps
This command tells the dot
program in the
GraphViz suite to produce a PostScript image of the graph described
by morphopath.dot
and to put the image in the
file morphopath.ps
(see Figure 14-2).
If you have some experience with the C programming language, you might recognize the similarity of the graph description format to a struct in C. This structured approach to describing the graph's layout makes it possible to define graphs within graphs, and also makes it straightforward to generate GraphViz files from Perl scripts. In addition to specifying connectivity, graphs produced by the GraphViz programs can be annotated with labels, different arrow types, and highlighted boxes.
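The same trick works from any scripting language. Here is a Python sketch that regenerates the morphopath.dot description shown earlier from a list of edges; the to_dot helper is our own invention, and nothing about the approach is specific to Python.

```python
# Generating a GraphViz dot description from a list of interactions.
# The edges mirror the morphopath.dot example in the text.

edges = [
    ("SHH", "Early gene expression"),
    ("FGF-4", "Early gene expression"),
    ("SHH", "BMP-2"),
    ("BMP-2", "FGF-4"),
    ("FGF-4", "SHH"),
    ("BMP-2", "Late gene expression"),
    ("FGF-4", "Late gene expression"),
]

def to_dot(edges, name="G"):
    """Build the text of a dot file describing a directed graph."""
    lines = ["digraph %s {" % name, '    size="3, 3";']
    for source, target in edges:
        # quoting every node name keeps multiword names legal in dot
        lines.append('    "%s" -> "%s";' % (source, target))
    lines.append("}")
    return "\n".join(lines)

print(to_dot(edges))
```

Writing the result to a file and running dot on it, as shown in the text, produces the same PostScript figure.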
Numerical data can always be fed into a spreadsheet program such as Microsoft Excel or StarOffice Calc and plotted using the program's built-in graphing functions. Often, this is the best way to make plots quickly. However, you will encounter situations in which you need more control over the plotting process than the spreadsheet provides. Two common examples of this kind of situation are in formatting your plots for publication and in dealing with high-dimensional data sets. If you do have to create figures from your data, we encourage you to take a look at Edward Tufte's books on visual explanations (see the Bibliography). They are full of examples and tips on making clean figures that clearly say what you mean.
In this section, we describe some programs that can create plots. In addition, we introduce two special-purpose programming languages that include good facilities for visualization as well as data analysis: the statistics language R (and its commercial cousin, S-plus) and the numerical programming language Matlab (and its free counterpart, Octave).
gnuplot
(http://www.gnuplot.org
) is
one of the more
widely used programs for producing plots of scientific data. Because
of its flexibility, longevity, and open source nature,
gnuplot
is loaded with features, including
scripting and facilities for including plots in documents. The dark
side of this flexibility and longevity, though, is a fairly
intimidating command syntax. Fortunately, a graphical interface to
gnuplot
called xgfe exists. xgfe
exists. xg fe
is good for quickly plotting
either data or a function, as shown in Figure 14-3.
You can find out more about xgfe
at http://home.flash.net/~dmishee/xgfe/xgfe.html
.
If you need to exert more control over the format of the output,
though, it behooves you to read through the
gnuplot
documentation and see what it can do.
Additionally, if you need to automatically generate many plots from
data, you may want to figure out how to control gnuplot
's behavior using Perl or another scripting
language.
Grace (http://plasma-gate.weizmann.ac.il/Grace/
) and its
predecessor,
xmgr
, are alternatives to
gnuplot
as a fairly powerful tool for plotting
2D data. Grace uses a simple graphical interface under the
X Window System, which
allows a fair amount of menu-driven customization of plots. Like
xgfe
, Grace provides the fairly simple 20%
functionality you need 80% of the time. In addition to its current
main distribution site at the Weizmann Institute of Science in Israel
(which always has the latest version), there are a number of mirror
sites from which Grace can be acquired. The home site also has a
useful FAQ and tutorial introduction to working with Grace.
Plotting programs such as Grace and gnuplot
work
well if your data has two or three variables that can be assigned to
the plot axes. Unfortunately, most interesting data in biology has a
much higher dimensionality. The science of investigating
high-dimensional data is known as multivariate
or multidimensional analysis. One significant
problem in dealing with multidimensional data is visualization. For
those who can't envision an 18-dimensional space, there is
XGobi (http://www.research.att.com/areas/stat/xgobi/
).
XGobi and XGvis are a set of programs
freely available from AT&T Labs. XGobi allows you to view data
with many variables three dimensions at a time as a constellation of
points you can rotate using a mouse. XGvis performs multidimensional
scaling, the intelligent squashing of high-dimensional data into a
space you can visualize (usually a 2D plot or a rotatable 3D plot).
Figure 14-4 shows output from XGobi.
XGobi has a huge number of features; here is a brief explanation to
get you started. XGobi takes as input a text file containing columns
of data. If you have a datafile named
xgdemo.dat
, it can be viewed in XGobi by typing
the following command at the shell prompt:
% xgobi xgdemo.dat &
XGobi initially presents the points in a 2D scatterplot. Selecting Rotation from the View menu at the top of the window shows a moving 3D plot of the points that you can control with the mouse by clicking within the data points and moving the mouse. Selecting Grand Tour or Correlation Tour from the View menu rotates the points in an automated tour of the data space.
The variable widgets (the circles along the right side of the XGobi interface) represent each of the variables in the data. The line in each widget represents the orientation of that variable's axis in the plot. If the data contains more than three variables, you can select the variables to be represented by clicking first within the widget of the variable you want to dismiss, and then within the widget of the variable to be displayed. Finally, clicking on the name of the corresponding variable displays a menu of transformations for that axis (e.g., natural logarithms, common logs, squares, and square roots). Like the GraphViz graph drawing programs, XGobi and XGvis are superbly documented and easy to install on Linux systems if you follow the instructions on the XGobi home page. Some Linux distributions (such as SuSE) even include XGobi.
In this section, we introduce two new programming languages that are well adapted for data analysis. The proposition of learning more languages after just learning Perl may seem a little perverse. Who would want to learn a whole language just to do data analysis? If your statistics requirements can be satisfied with a spreadsheet and calculator, these packages may not be for you. Also, as we saw in the last chapter, there are facilities for creating numerically sophisticated applications using Perl, particularly the PDL.
However, many problems in bioinformatics require the use of involved numerical or statistical calculations. The time required to develop and debug such software is considerable, and it may not be worth your time to work on such code if it's used only once or twice. Fortunately, in the same way that Perl makes developing data-handling programs easy, data analysis languages (for lack of a better term) ease the prototyping and rapid development of data analysis programs. In the next sections, we introduce R (and its commercial cousin, S-plus), a language for doing statistics; and Matlab (and its free cousin, Octave), a language for doing numerical mathematics.
R is a free implementation of the S statistics programming language developed at AT&T Bell Laboratories. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland. Both R and its commercial cousins (S-plus 3.x, 4.x, and 2000) are available for the Unix and Win32 platforms, and both have a syntax that has been described as "not dissimilar," so we use R to refer to both languages.
R is usually run within an interpreted environment. Instead of writing whole programs that are executed from the command line, R provides its own interactive environment in which statements can be run one at a time or as whole programs. To start the R environment, type in R at the command prompt and hit the Enter key. You should see something like this:
R : Copyright 2000, The R Development Core Team
Version 1.1.1  (August 15, 2000)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type "?license" or "?licence" for distribution details.

R is a collaborative project with many contributors.
Type "?contributors" for a list.

Type "demo()" for some demos, "help()" for on-line help, or
"help.start()" for a HTML browser interface to help.
Type "q()" to quit R.
The angle bracket (>) at the bottom of this screen is the R prompt, similar to the shell prompt in Unix. In the following examples, we write the R prompt before the things that the user (that's you) is supposed to type.
Before anything else, you should know two commands. Arguably, the most important command in any interactive environment is the one that lets you exit back out to the operating system. In R, the command to quit is:
> q()
The second most useful command is the one that provides access to
R's online help system, help(). The default help() command, with
no arguments, returns an explanation of how to use help(). If
you want help on a specific R function, put the name of the function
in the parentheses following help. So, for
example, if you want to learn how the source()
function works, you can type:
> help(source)
You can also use ? as shorthand for help(). So,
instead of typing help(source) in the example,
you can just enter ?source.
As mentioned earlier, the R environment is interactive. If you type the following:
> 2 + 2
R tells you that two plus two does, in fact, equal four:
> 2 + 2
[1] 4
The R response (4) is preceded by a bracketed number ([1]), which indicates the position of the answer in the output vector. Unlike Perl, R has no scalar variables. Instead, single numbers like the previous answer are stored in a vector of length one. Also, note that the first element in the vector is numbered 1 instead of 0.[§]
The <- operator is used as the assignment operator. Its function is similar to the = operator in Perl. Typing:
> a <- 2 + 2
produces no output, since the result of the calculation is stored in the variable on the left side of the assignment operator. In order to see what value a variable contains, enter its name at the R prompt:
> a <- 2 + 2
> a
[1] 4
Just as Perl has basic datatypes that are useful for working with text, R has datatypes that are useful for doing statistics. We have already seen the vector; R also has matrices, arrays (which are a multidimensional generalization of matrices), and factors (lists of strings that label vectors of the same length).
The place to go for more information
on R is the Comprehensive R Archive Network (CRAN). You can find the
CRAN site nearest to you either by using your favorite search engine
or off a link from the R Project home page (http://www.R-project.org
). CRAN has a number
of packages for specific statistical applications implemented in R
and available as RPM files (for information on installing RPMs, see
Chapter 3). If your project requires sampling,
clustering, regression, or factor analysis, R can be a lifesaver. R
can even be made to directly access XGobi as an output system, so
that the results of your computations can be plotted in two or more
dimensions.
You can try R without having to install it, thanks to Rweb
(http://www.math.montana.edu/Rweb/
), a service provided by the Montana State University
Mathematics Department. Rweb accepts your R code, runs it, and
returns a page with the results of the calculation. If you want to
use R for anything beyond simple demonstrations, though, it's
faster to download the RPM files and run R on a local computer.
If you find that R is useful in your work, we vigorously recommend
you supplement the R tutorial, An Introduction to
R, and the R FAQ (http://cran.r-project.org/
) with the third
edition of Modern Applied Statistics with S-Plus
(see the Bibliography). Both Venables and Ripley
are now part of the R development core team, and although their text
is primarily an S-plus book, supplements are available from
Ripley's web site (http://www.stats.ox.ac.uk/~ripley/
) that make the examples in the book more easily used under R.
GNU Octave
(http://www.gnu.org/software/octave/octave.html
) is a freely available
programming language whose syntax and functions are similar to
Matlab, a commercial
programming environment from The MathWorks, Inc. (http://www.mathworks.com/products/matlab/
).
Matlab is popular among engineers for quickly writing programs that
perform large numbers of numerical computations. Octave (or Matlab)
is worth a look if you want to write quick prototypes of
number-crunching programs, particularly for data analysis or
simulation. Both Octave and Matlab are available for Unix and Windows
systems. Octave source code, binaries, and documentation are all
available online; they are also distributed as part of an increasing
number of Linux distributions.
Octave produces graphical output using the
gnuplot
package mentioned previously. While this
arrangement works well enough, it is rather spartan compared to the
plotting capabilities of Matlab. In fact, if you are a student, we
will take off our open source hats and strongly encourage you to take
advantage of the academic pricing on Matlab; it will add years to
your life. As a further incentive, a number of the data mining and
machine learning tools discussed in the next section are available as
Matlab packages.
This section has described solutions to data presentation problems
that arise frequently in bioinformatics. For some of the most current
work on visualization in bioinformatics, see the European
Bioinformatics Institute's visualization information off the
Projects link on their industrial relations page (http://industry.ebi.ac.uk
). Links to more
online visualization and data mining resources are available off the
web page for this book. Table 14-1 shows the tools
and techniques that are used for data visualization.
Table 14-1. Data Visualization Tools and Techniques
One of the most exciting areas of modern biology is the application of data mining methods to biological databases. Many of these methods can equally well fall into the category of machine learning, the name used in the artificial intelligence community for the larger family of programs that adapt their behavior with experience. We present here a summary of some techniques that have appeared in recent work in bioinformatics. The list isn't comprehensive but will hopefully provide a starting point for learning about this growing area.
A word of caution: anthropomorphisms have a tendency to creep into discussions of data mining and machine learning, but there is nothing magical about them. Programs are said to "learn" or be "trained," but they are always just following well-defined sets of instructions. As with any of the tools we've described in this book, data mining tools are supplements, rather than substitutes, for human knowledge and intuition. No program is smart enough to take a pile of raw data and generate interesting results, much less a publication-quality article ready for submission to the journal of your choice. As we've stressed before, the creation of a meaningful question, the experimental design, and the meaningful interpretation of results are your responsibility and yours alone.
The topics addressed by data mining are ones that statisticians and applied mathematicians have worked on for decades. Consequently, the division between statistics and data mining is blurry at best. If you do work with data mining or machine learning techniques, you will want to have more than a passing familiarity with traditional statistical techniques. If your problem can be solved by the latest data-mining algorithm or a straightforward statistical calculation, you would do well to choose the simple calculation. By the same token, please avoid the temptation to devise your own scoring method without first consulting a statistics book to see if an appropriate measure already exists. In both cases, it will be easier to debug and easier to explain your choice of a standard method over a nonstandard one to your colleagues.
Machine learning methods can be broadly divided into supervised and unsupervised learning. Learning is said to be supervised when a learning algorithm is given a set of labeled examples from which to learn (the training set) and is then tested on a set of unlabeled examples (the test set). Unsupervised learning is performed when data is available, but the correct labels for each example aren't known. The objective of running the learning algorithm on the data is to find some patterns or trends that will aid in understanding the data. For example, the MEME program, introduced in Chapter 8, performs unsupervised learning in order to find sequence motifs in a set of unaligned sequences. It isn't known ahead of time whether each sequence contains the pattern, where the pattern is, or what the pattern looks like.
Cluster analysis is another kind of unsupervised learning that has received some attention in the analysis of microarray data. Clustering, as shown in Figure 14-5, is the procedure of classifying data such that similar items end up in the same class while dissimilar items don't, when the actual classes aren't known ahead of time. It is a standard technique for working with multidimensional data. Figure 14-5 shows two panels with unadorned dots on the left and dots surrounded by cluster boundaries on the right.
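To make the idea concrete, here is a minimal k-means sketch in Python. k-means is only one of many clustering methods (microarray studies often use hierarchical clustering or self-organizing maps, among others), and the toy 2D points and starting centers below are invented for illustration.

```python
# A minimal k-means sketch: each point is attached to the nearest
# cluster center, then each center is moved to the mean of its group,
# and the two steps repeat until the assignment settles down.

def kmeans(points, centers, rounds=10):
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # update step: move each center to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)  # two centers, one near each clump of points
```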
In this section, we describe some data mining methods commonly reported in the bioinformatics literature. The purpose of this section is to provide an executive summary of these complex data analysis techniques. You aren't expected to be able to implement these algorithms in your programming language of choice. However, if you see any of these methods used to analyze data in a paper, you should be able to recognize the method and, if necessary, evaluate the way in which it was applied. As with any technique in experimental biology, it's important to understand the machine learning methods used in computational biology well enough to judge whether they have been applied appropriately and correctly.
In its simplest form, a decision tree is a list of questions with yes or no answers, hierarchically arranged, that lead to a decision. For instance, to determine whether a stretch of DNA is a gene, we might have a tree like the one shown in Figure 14-6.
A tree like this one is easy to work through, since it has a finite number of possibilities at each branch, and any path through the tree leads to a decision. The structure of the tree and the rules at each of the branches are determined from the data by a learning algorithm. Techniques for learning decision trees were described by Leo Breiman and coworkers in the early 1980s, and were later popularized in the machine learning community by J. R. Quinlan, whose freely available C4.5 decision tree software and its commercial successor, C5, are standards in the field.
One major advantage of decision trees over other machine learning techniques is that they produce models that can be interpreted by humans. This is an important feature, because a human expert can look at a set of rules learned by a decision tree and determine whether the learned model is plausible given real-world constraints.[‖] In biology, tree classifiers tend to be used in pattern recognition problems, such as finding gene splice sites or identifying new occurrences of a protein family member. The MORGAN genefinder developed by Steven Salzberg and coworkers is an example of a decision tree approach to genefinding.
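In code, a simple decision tree is nothing more than nested yes/no questions leading to a decision. The features and thresholds in the following Python sketch are invented for illustration; they are not the actual rules of Figure 14-6 or of any real genefinder.

```python
# A decision tree in its simplest form: nested yes/no questions.
# All feature names and thresholds here are hypothetical.

def looks_like_gene(features):
    """features is a dict with the (hypothetical) keys
    'has_start_codon', 'orf_length', and 'gc_content'."""
    if not features["has_start_codon"]:
        return False                  # no start codon: reject
    if features["orf_length"] < 300:
        return False                  # very short ORF: reject
    if features["gc_content"] > 0.4:
        return True                   # long, GC-rich ORF: accept
    return False

candidate = {"has_start_codon": True, "orf_length": 900, "gc_content": 0.55}
print(looks_like_gene(candidate))  # -> True
```

What a learning algorithm such as C4.5 adds is a way of choosing the questions and thresholds automatically from training data, rather than by hand as we did here.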
Neural networks are statistical models used in pattern recognition and classification. Originally developed in the 1940s as a mathematical model of memory, neural networks are sometimes also called connectionist models because of their representation as nodes (which are usually variables) connected by weighted functions. Figure 14-7 shows the process by which a neural network is constructed. Please note, though, that there is nothing particularly "neural" about these models, nor are there actually physical nodes and connections involved. The idea behind neural networks is that, by working in concert, these simple processing elements can perform more complex computations.
A neural network is composed of a set of nodes connected in a defined topology, where each node has input and output connections to other nodes. In general, a neural network receives an input pattern (for example, an amino acid sequence whose secondary structure is to be predicted), which sets the values of the nodes in the first layer (the input layer). These values are propagated according to transfer functions (the connections) to the next layer of nodes, which propagate their values to the next layer, until the output layer is reached. The pattern of activation of the output layer is the output of the network. Neural networks are used extensively in bioinformatics; examples include the PHD (http://www.embl-heidelberg.de/predictprotein/predictprotein.html) and PSIPRED (http://insulin.brunel.ac.uk/psipred/) secondary structure predictors described in Chapter 9, and the GRAIL genefinder (http://compbio.ornl.gov/grailexp/) mentioned in Chapter 7.
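The layer-by-layer propagation described above can be sketched in a few lines. This is a forward pass only, with arbitrary illustrative weights rather than a trained model (training, by backpropagation or otherwise, is a separate step): two input nodes feed a hidden layer of two nodes, which feeds a single output node through sigmoid transfer functions.

```python
import math

# A minimal feedforward pass: input layer -> hidden layer -> output.
# Weights are arbitrary illustrative values, not a trained model.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_out):
    # Propagate the input-layer values through the hidden layer...
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in w_hidden]
    # ...and on to the single output node.
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

w_hidden = [[0.5, -0.6], [0.3, 0.8]]   # weights into each hidden node
w_out = [1.0, -1.0]                    # weights into the output node
print(forward([1.0, 0.0], w_hidden, w_out))
```

The output is simply the activation of the final layer; for a secondary structure predictor, one output node per class (helix, strand, coil) would be used instead of a single node.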
Genetic algorithms are optimization algorithms. They search a large number of possible solutions for the best one, where "best" is determined by a cost function or fitness function. Like neural networks, these models were inspired by biological ideas (in this case, population genetics), but there is nothing inherently biological about them. In a genetic algorithm, a number of candidate solutions are generated at random. These candidate solutions are encoded as chromosomes. Parts of each chromosome are then exchanged à la homologous recombination between real chromosomes. The resulting recombined strategies are then evaluated according to the fitness function, and the highest scoring chromosomes are propagated to the next generation. This recombination and propagation loop continues until a suitably good solution is found. Genetic algorithms are frequently used in molecular simulations, such as docking and folding of proteins.
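The generate-recombine-select loop can be shown with a deliberately simple fitness function. In this sketch the chromosomes are bit strings and fitness just counts the 1s (the classic "OneMax" toy problem); a molecular application would instead encode, say, docking poses or torsion angles, but the loop is the same.

```python
import random

# A toy genetic algorithm: bit-string chromosomes, fitness = number
# of 1s ("OneMax"). Illustrative only; real applications encode
# domain-specific solutions such as docking poses.

random.seed(0)
LENGTH, POP, GENERATIONS = 20, 30, 50

def fitness(chrom):
    return sum(chrom)

def crossover(a, b):
    point = random.randrange(1, LENGTH)      # single-point recombination
    return a[:point] + b[point:]

def mutate(chrom, rate=0.05):
    return [bit ^ 1 if random.random() < rate else bit for bit in chrom]

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]                 # the fittest half survives
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children

print(fitness(max(pop, key=fitness)))        # best score found
```

Because the fittest parents are carried over unchanged each generation, the best score never decreases; the recombination and mutation steps supply the variation that lets it climb.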
In late 1998, a machine learning tool called the support vector machine (SVM) began to attract a great deal of attention in bioinformatics. Support vector machines were developed by Vladimir Vapnik of Bell Laboratories, an early pioneer of machine learning methods. They have been applied to a wide range of problems, from optical character recognition to financial time series analysis and recognizing spam (the email variety, not the lunch meat). SVMs were first applied to biological problems by Tommi Jaakola (now at MIT), David Haussler, and coworkers at UC Santa Cruz, who used them for protein sequence classification. They have since been applied to many of the standard computational biology challenge problems (function prediction, structure prediction, and genefinding) but have gained the most attention for their use in microarray analysis.
Support vector machines are supervised classifiers that try to find a linear separation between different classes of points in a high-dimensional space. In a 2D space, this separator is a line; in 3D, it's a plane. In general, the separating surface is called a hyperplane. Support vector machines have two special features. First, instead of finding just any separating hyperplane, they are guaranteed to find the optimal one: the one whose placement yields the largest separation between the two classes. The data points nearest the frontier between the two classes are called the support vectors.[#] Second, although SVMs are linear classifiers, they can classify nonlinearly separable sets of points by transforming the original data points into a higher-dimensional space in which they can be separated by a linear surface.
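What "classifying with a hyperplane" means is easy to show. The sketch below uses a fixed, hand-picked hyperplane w·x + b = 0 rather than one learned by an SVM (the optimization that maximizes the margin is the part we omit): points are labeled by which side of the surface they fall on, and the margin is just the geometric distance from a point to that surface.

```python
import math

# Classifying 2D points with a fixed separating hyperplane w.x + b = 0.
# The weights are hand-picked for illustration; a trained SVM would
# choose w and b to maximize the margin to the support vectors.

w = [1.0, 1.0]
b = -3.0

def side(point):
    # Sign of w.x + b tells us which side of the hyperplane we're on.
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return +1 if score >= 0 else -1

def distance(point):
    # Geometric distance from the point to the hyperplane.
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return abs(score) / math.hypot(*w)

print(side([2.0, 2.0]))      # +1: above the line x + y = 3
print(side([0.5, 0.5]))      # -1: below it
print(round(distance([2.0, 2.0]), 3))
```

An SVM's "kernel trick" amounts to computing these same dot products after mapping the points into a higher-dimensional space, which is how a linear surface there can separate points that are not linearly separable here.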
Table 14-2 shows some of the most popular data-mining tools and techniques.
Table 14-2. Data Mining Tools and Techniques

| What you do | Why you do it | What you use to do it |
|---|---|---|
| Clustering | To find similar items when a classification scheme isn't known ahead of time | Clustering algorithms, self-organizing maps |
| Classification | To label each piece of data according to a classification scheme | |
| Regression | To extrapolate a trend from a few examples | Regression algorithms, neural networks, SVMs, decision trees |
| Combining estimators | | Voting methods, mixture methods |
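As an example of the clustering row of Table 14-2, here is a bare-bones k-means sketch on one-dimensional expression-like values. The data and the choice of k are invented for illustration; the point is the alternating assign-points/recompute-centers loop that most clustering algorithms share in some form.

```python
import random

# A bare-bones k-means clustering sketch on 1D values, illustrating
# the "clustering" row of Table 14-2. Data and k are illustrative.

def kmeans(values, k, iterations=10):
    centers = random.sample(values, k)           # random initial centers
    for _ in range(iterations):
        # Assign each value to its nearest center...
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # ...then move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

random.seed(1)
data = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
print(kmeans(data, 2))
```

On well-separated data like this, the loop converges in a few iterations to one center per natural group; no classification scheme had to be known in advance.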
[*] When you start to see sequence motifs in words or people's names, it's time to take a break.
[‖] The canonical decision-tree urban legend comes from an application of trees by a long-distance telephone company that wanted to learn about churn, the process of losing customers to other long-distance companies. They discovered that an abnormally large number of their customers over the age of 70 were subject to churn. A human recognized something the program did not: humans can die of old age. So, being able to interpret your results can be useful.
[#] Vectors, in this case, refer to the coordinates of the data points. For example, on a 2D map, you might have pairs of (x,y) coordinates representing the location of the data points. These ordered pairs are the vectors.