Chapter 14. Visualization and Data Mining

Any result in bioinformatics, whether it is a sequence alignment, a structure prediction, or an analysis of gene expression patterns, should answer a biological question. For this reason, it is up to the investigators to interpret their results in the context of a clear question, and to make those results accessible to their colleagues. This interpretation step is the most important part of the scientific process. For your results to be useful, they must be interpretable. We'll say it again: if your results can't be interpreted, they won't help anybody, not even you.

In this chapter, we present computational tools that help you to make sense of your results. To this end, the chapter is organized so that it roughly parallels the data-analysis process. In the first part of this chapter, we introduce a number of programs that are used to visualize the sort of data arising from bioinformatics research. These programs range from general-purpose plotting and statistical packages for numerical data to programs dedicated to presenting sequence and structural information in an interpretable form. The second part of this chapter covers some tools for data mining—the process of finding, interpreting, and evaluating patterns in large sets of data—in the context of some bioinformatics applications.

The topics covered in this chapter are basically subdisciplines of the larger area of computational statistics. As you have seen in previous chapters, statistical methods are important because they provide a check on the significance of the researcher's discoveries. The human nervous system is very good at finding patterns; a little too good, in fact. If you scrutinize a protein sequence for long enough, you will begin to see patterns, whether they're biologically significant (like part of a family signature sequence, such as P.YTVF in chloramphenicol acetyltransferase) or not (words or names, such as PER) amidst the amino acids.[*] Thus, we use visualization to exploit the abilities of the eye and brain to find patterns that may be interesting. We use statistics and data mining to keep our intuition in check and to restrict our searches to those patterns that can be quantitatively and repeatedly shown to be significant.

Preparing Your Data

Preparing your data (also known as preprocessing or data cleansing) is the most important part of data mining. It's also the least glamorous and one of the least-discussed parts. Preprocessing can be as simple as making sure your data is in the right format for the program that reads it, or it can involve extensive calculations.

As a bioinformatics researcher, you must be especially careful of your data. Your results and reputation are based on data that have been provided by other researchers. Consequently, you must be scrupulous in collecting and using that data. The following is a list of some general questions about data integrity to answer when you work on a project (this list isn't exhaustive; you will probably come up with other things to check that are specific to the project at hand):

  • Is your data what you expect it to be? For example, DNA sequences should only contain As, Ts, Cs, and Gs (unless your program understands the additional IUPAC symbols). Protein sequences should contain only the 20 amino acids. You can use grep to quickly check if your file contains lines with bad characters, as shown in the example following this list.

  • Are your datafiles correctly formatted for the programs you plan to use? Be wary of more flexible formats. For example, some programs apply a length limit to the comment line in FASTA files, while other programs don't.

  • Be aware of sequence variants. Splice variants, mutations, deletions, sequencing errors, and inadvertent truncations of the sequence file all can result in a different sequence than you'd expect. It is up to you to track which differences in sequences or structures are biologically relevant and which are artifacts of the experimental process.

  • Unless the sequences you are working with have been given to you by a collaborator who has not yet deposited them in a sequence database, make sure that you can find each of your sequences in GenBank or another database.

  • When working with large tabular data, make sure that every field in the table has an appropriate value. Using a program such as XGobi is a good way to check this, since it complains if not every field has a value. A visualization tool such as XGobi is also useful if the values are out of register, since the resulting points will be outliers.

  • Does the program produce meaningful results on test data? When you use a new program, you should have some data for which you know the results, so you can test the program and make sure it gives the right answer and behaves in a reasonable fashion (these tests are sometimes called sanity checks ). For example, if a program compares two sequences or structures, does it give the same result regardless of which order you pass the data to it?

  • Check for side effects produced by the programs you use. Does a program change any of its input? Changes can be as subtle as adding spaces between every 10 residues in a sequence file, normalizing numerical values, or changing the coordinate values of structural data.

  • For microarray data, have the values been normalized? If the programs you are using perform any kind of normalization, it is important that you determine how the normalization was achieved.

  • For protein structure data, are all the atom numbers and residue numbers sequential? Is the structure intact, or does it contain chain breaks or other anomalies? Are all residues labeled with the standard amino acid three-letter codes, and are all atoms labeled with the appropriate PDB atom codes?
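
For instance, the following one-liners (a sketch only; the filenames rawseq.txt and myseqs.fasta are hypothetical) count lines containing anything other than the four standard bases, first in a file of raw sequence and then in a FASTA file whose header lines must be skipped:

% grep -c '[^ACGTacgt]' rawseq.txt
% grep -v '^>' myseqs.fasta | grep -c '[^ACGTacgt]'

A count of zero means no unexpected characters were found; if your data legitimately contains IUPAC ambiguity codes such as N, add them to the bracketed character class.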

Finally, make sure you understand all the programs being used, at least as far as knowing the format and meaning of their input and output.

Viewing Graphics

If you are going to be working with images, you need some basic tools for viewing graphics files under Linux and Unix. There are many graphics programs available; the three that we describe next are commonly available, easy to use, and free.

xzgv

xzgv is a program for viewing graphics files under the X Window System. It can display the more popular graphics formats (GIF, PNG, JPEG), as well as a variety of others. For a simple graphics viewer, it has some handy features. For example, it creates thumbnails (small versions of a picture that preview the file) and can step through images one at a time, as a slideshow.

Ghostview and gv

Ghostview and gv are viewers for PostScript and Portable Document Format (PDF) files. PostScript and PDF are page-description languages developed at Adobe Systems, Inc.[†] Both programs allow you to page through a document, jump to specific pages, print whole documents or selected pages, and perform other simple document-navigation tasks. More and more journals are distributing their articles electronically as PDF files, so a document viewer such as gv is very useful for keeping up with the literature.

Because it produces more compact files and cooperates with Windows applications, PDF seems to be overtaking PostScript as the more common format. Many documents are still distributed over the Net in PostScript, however, including preprints (particularly those written by people who use the LaTeX document formatting language) and the output of some web-based services (such as the BMERC secondary structure prediction server).

The GIMP

The GIMP (short for "GNU Image Manipulation Program") is an image processing package with similar functionality and features to Adobe Photoshop. While the GIMP can open and view graphics files, using it as a simple viewer is probably overkill. However, when it comes to reformatting or modifying graphics to use in a presentation or paper, having image-manipulation software on hand is invaluable.

Sequence Data Visualization

Tools for viewing sequence data, particularly multiple sequence alignments, were discussed in Chapter 8. As we mentioned in that chapter, one of the best ways to rapidly summarize information from a sequence alignment is to use a sequence logo. In this section, we discuss a geometric approach to visualizing sequence relationships and introduce TEXshade, a program for creating publication-quality sequence alignment figures.

Making Publication-Quality Alignments with TEXshade

TEXshade (http://homepages.uni-tuebingen.de/beitz/tse.html) is a package for marking up sequence alignments written using LaTEX, a document markup language invented for mathematicians and computer scientists. This package is remarkably flexible, allowing you to color and label aligned residues according to conservation and chemical characteristics. In addition, TEXshade can incorporate secondary structure and accessibility information output from the DSSP program (described in Chapter 6), as well as predictions of secondary structure generated by the PHD prediction server. Finally, TEXshade can automatically create "fingerprints" that provide a bird's-eye view of an alignment, in which columns of conserved residues are represented by vertical lines. Like sequence logos, fingerprints can rapidly summarize alignment data and find clusters of neighboring conserved residues.

TEXshade is called from within a LaTEX document. If you have a sequence alignment stored in MSF format (the GCG multiple sequence alignment format) in the file alignment.msf, the following LaTEX document produces an alignment formatted according to TEXshade's defaults:

\documentclass{report}
\usepackage{texshade}
\begin{document}

\begin{texshade}{alignment.msf}
\end{texshade}

\end{document}

LaTEX is a document markup language similar to HTML. In the preceding code example, the output of which is shown in Figure 14-1, you are telling the TEX interpreter that this document has a beginning and an end, and that it contains only a TEXshade alignment of the sequences in alignment.msf. You need to mark up the resulting alignment by hand. If this sounds painful, the SDSC Biology Workbench provides a graphical interface to TEXshade and can automatically render TEXshade-formatted sequence alignments as GIF or PostScript images.

Figure 14-1. A TEXshade alignment and its corresponding fingerprint

Viewing Sequence Distances Geometrically

Multiple sequence alignments and sequence logos represent similarities at every position of a group of aligned sequences. However, even with coloring of conserved residues, it isn't always easy to tell how the sequences are related. Sometimes, it's useful to look at an even higher level of abstraction to see how the sequences cluster. Phylogenetic trees represent one way to visualize relatedness.

DGEOM, a set of programs by Mark J. Forster and coworkers, takes a set of aligned sequences (either a multiple sequence alignment in GCG's MSF format, or a set of pairwise alignments) and represents them as points in a 3D space, where the distances between the points represent the evolutionary distances between the sequences. The points are written to a PDB file and can be viewed with your choice of protein structure viewers. Some may flinch at the idea of storing a representation of a sequence alignment in a file format intended to store structural data, but the program works well, and since high-quality structure visualization packages are easy to find, this approach avoids the need for a standalone graphics system. The programs are written in Perl and C, making them fairly easy to modify.

Another implementation of the geometric approach to viewing sequence relationships is the SeqSpace package developed by Chris Dodge and Chris Sander at the EBI. This package includes C++ programs for computing the sequence distances, and it uses Java viewers to render the points in 3D.

Networks and Pathway Visualization

As of this writing, there is no standard package for visualizing interactions between molecules in a pathway or network. The most common way to represent molecular interactions schematically is in the form of a graph.[‡] Graphs can also be useful for illustrating other data that represents interactions, including the output of cluster analyses and interacting residues in protein structures. Biological networks, such as metabolic and signaling pathways, tend to be very densely connected, and drawing readable pictures of such highly connected graphs isn't an easy task. Fortunately, AT&T Research distributes GraphViz (http://www.research.att.com/sw/tools/graphviz/), a freely available suite of programs that allow you to draw and edit graphs. This package has three features that make it particularly attractive: it runs quickly, it has excellent documentation, and it uses a flexible and intuitive syntax to describe graphs.

The GraphViz package consists of five programs:

dot

Draws directed graphs

neato

Draws undirected graphs

dotty

A graphical interface for editing graphs

lefty

A language for editing graphs and other diagrams; used to write dotty

tcldot

A graphical interface to dot written in the Tcl language

To draw a directed graph of a small set of interactions, you can type the following code into a text editor and save it to a file named morphopath.dot:

digraph G {
    size="3,3";
    SHH -> "Early gene expression";
    "FGF-4" -> "Early gene expression";
    SHH -> "BMP-2";
    "BMP-2" -> "FGF-4";
    "FGF-4" -> SHH;
    "BMP-2" -> "Late gene expression";
    "FGF-4" -> "Late gene expression";
}

The dot program is invoked using the following command at the shell prompt:

% dot -Tps morphopath.dot -o morphopath.ps

This command tells the dot program in the GraphViz suite to produce a PostScript image of the graph described by morphopath.dot and to put the image in the file morphopath.ps (see Figure 14-2).

Figure 14-2. morphopath pathway output

If you have some experience with the C programming language, you might recognize the similarity of the graph description format to a struct in C. This structured approach to describing the graph's layout makes it possible to define graphs within graphs, and also makes it straightforward to generate GraphViz files from Perl scripts. In addition to specifying connectivity, graphs produced by the GraphViz programs can be annotated with labels, different arrow types, and highlighted boxes.
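
As a minimal sketch of that approach (the script name, output filename, and interaction list here are made up for illustration), a few lines of Perl can turn a table of interactions into a dot file:

#!/usr/bin/perl -w
# make_dot.pl -- write a GraphViz description of a list of pairwise interactions
use strict;

# each pair is an edge pointing from the first molecule to the second
my @interactions = (
    [ 'SHH',   'BMP-2' ],
    [ 'BMP-2', 'FGF-4' ],
    [ 'FGF-4', 'Late gene expression' ],
);

open( DOT, '>', 'pathway.dot' ) or die "can't write pathway.dot: $!";
print DOT "digraph G {\n";
foreach my $edge (@interactions) {
    printf DOT qq(    "%s" -> "%s";\n), $edge->[0], $edge->[1];
}
print DOT "}\n";
close( DOT );

Running the script and then % dot -Tps pathway.dot -o pathway.ps produces a drawing of the hard-coded pathway; in practice, the @interactions list would be filled in from a parsed data file or database query.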

Working with Numerical Data

Numerical data can always be fed into a spreadsheet program such as Microsoft Excel or StarOffice Calc and plotted using the program's built-in graphing functions. Often, this is the best way to make plots quickly. However, you will encounter situations in which you need more control over the plotting process than the spreadsheet provides. Two common examples of this kind of situation are in formatting your plots for publication and in dealing with high-dimensional data sets. If you do have to create figures from your data, we encourage you to take a look at Edward Tufte's books on visual explanations (see the Bibliography). They are full of examples and tips on making clean figures that clearly say what you mean.

In this section, we describe some programs that can create plots. In addition, we introduce two special-purpose programming languages that include good facilities for visualization as well as data analysis: the statistics language R (and its commercial cousin, S-plus) and the numerical programming language Matlab (and its free counterpart, Octave).

gnuplot and xgfe

gnuplot (http://www.gnuplot.org) is one of the more widely used programs for producing plots of scientific data. Because of its flexibility, longevity, and open source nature, gnuplot is loaded with features, including scripting and facilities for including plots in documents. The dark side of this flexibility and longevity, though, is a fairly intimidating command syntax. Fortunately, a graphical interface to gnuplot called xgfe exists. xgfe is good for quickly plotting either data or a function, as shown in Figure 14-3. You can find out more about xgfe at http://home.flash.net/~dmishee/xgfe/xgfe.html.

Figure 14-3. Output from xgfe/gnuplot

If you need to exert more control over the format of the output, though, it behooves you to read through the gnuplot documentation and see what it can do. Additionally, if you need to automatically generate many plots from data, you may want to figure out how to control gnuplot's behavior using Perl or another scripting language.
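
As a small example of the scripted approach (a sketch only; the datafile expression.dat, its two-column layout, and the labels are hypothetical), the following commands can be saved to a file and run in batch mode:

# plot_expression.gp -- plot column 2 against column 1 and write a PostScript file
set terminal postscript eps
set output "expression.eps"
set xlabel "Time (hours)"
set ylabel "Relative expression level"
plot "expression.dat" using 1:2 with linespoints title "gene X"

% gnuplot plot_expression.gp

A Perl script that needs to produce dozens of such plots can simply write out one of these command files per dataset and invoke gnuplot on each.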

Grace: The Pocketknife of Data Visualization

Grace (http://plasma-gate.weizmann.ac.il/Grace/) and its predecessor, xmgr, are a fairly powerful alternative to gnuplot for plotting 2D data. Grace uses a simple graphical interface under the X Window System, which allows a fair amount of menu-driven customization of plots. Like xgfe, Grace provides the fairly simple 20% functionality you need 80% of the time. In addition to its current main distribution site at the Weizmann Institute of Science in Israel (which always has the latest version), there are a number of mirror sites from which Grace can be acquired. The home site also has a useful FAQ and tutorial introduction to working with Grace.

Multidimensional Analysis: XGobi and XGvis

Plotting programs such as Grace and gnuplot work well if your data has two or three variables that can be assigned to the plot axes. Unfortunately, most interesting data in biology has a much higher dimensionality. The science of investigating high-dimensional data is known as multivariate or multidimensional analysis. One significant problem in dealing with multidimensional data is visualization. For those who can't envision an 18-dimensional space, there is XGobi (http://www.research.att.com/areas/stat/xgobi/). XGobi and XGvis are a set of programs freely available from AT&T Labs. XGobi allows you to view data with many variables three dimensions at a time as a constellation of points you can rotate using a mouse. XGvis performs multidimensional scaling, the intelligent squashing of high-dimensional data into a space you can visualize (usually a 2D plot or a rotatable 3D plot). Figure 14-4 shows output from XGobi.

Figure 14-4. Screenshot from XGobi

XGobi has a huge number of features; here is a brief explanation to get you started. XGobi takes as input a text file containing columns of data. If you have a datafile named xgdemo.dat, it can be viewed in XGobi by typing the following command at the shell prompt:

% xgobi xgdemo.dat &

XGobi initially presents the points in a 2D scatterplot. Selecting Rotation from the View menu at the top of the window shows a moving 3D plot of the points that you can control with the mouse by clicking within the data points and moving the mouse. Selecting Grand Tour or Correlation Tour from the View menu rotates the points in an automated tour of the data space.

The variable widgets (the circles along the right side of the XGobi interface) represent each of the variables in the data. The line in each widget represents the orientation of that variable's axis in the plot. If the data contains more than three variables, you can select the variables to be represented by clicking first within the widget of the variable you want to dismiss, and then within the widget of the variable to be displayed. Finally, clicking on the name of the corresponding variable displays a menu of transformations for that axis (e.g., natural logarithms, common logs, squares, and square roots). Like the GraphViz graph drawing programs, XGobi and XGvis are superbly documented and easy to install on Linux systems if you follow the instructions on the XGobi home page. Some Linux distributions (such as SuSE) even include XGobi.

Programming for Data Analysis

In this section, we introduce two new programming languages that are well adapted for data analysis. The proposition of learning more languages after just learning Perl may seem a little perverse. Who would want to learn a whole language just to do data analysis? If your statistics requirements can be satisfied with a spreadsheet and calculator, these packages may not be for you. Also, as we saw in the last chapter, there are facilities for creating numerically sophisticated applications using Perl, particularly the PDL.

However, many problems in bioinformatics require the use of involved numerical or statistical calculations. The time required to develop and debug such software is considerable, and it may not be worth your time to work on such code if it's used only once or twice. Fortunately, in the same way that Perl makes developing data-handling programs easy, data analysis languages (for lack of a better term) ease the prototyping and rapid development of data analysis programs. In the next sections, we introduce R (and its commercial cousin, S-plus), a language for doing statistics; and Matlab (and its free cousin, Octave), a language for doing numerical mathematics.

R and S-plus

R is a free implementation of the S statistics programming language developed at AT&T Bell Laboratories. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland. Both R and its commercial cousins (S-plus 3.x, 4.x, and 2000) are available for the Unix and Win32 platforms, and both have a syntax that has been described as "not dissimilar," so we use R to refer to both languages.

R is usually run within an interpreted environment. Instead of writing whole programs that are executed from the command line, R provides its own interactive environment in which statements can be run one at a time or as whole programs. To start the R environment, type in R at the command prompt and hit the Enter key. You should see something like this:

R : Copyright 2000, The R Development Core Team 
Version 1.1.1  (August 15, 2000) 

R is free software and comes with ABSOLUTELY NO WARRANTY. 
You are welcome to redistribute it under certain conditions. 
Type    "?license" or "?licence" for distribution details. 

R is a collaborative project with many contributors. 
Type    "?contributors" for a list. 

Type    "demo(  )" for some demos, "help(  )" for on-line help, or 
        "help.start(  )" for a HTML browser interface to help. 
Type    "q(  )" to quit R.

The angle bracket (>) that appears after this banner is the R prompt, similar to the shell prompt in Unix. In the following examples, we write the R prompt before the things that the user (that's you) is supposed to type.

Before anything else, you should know two commands. Arguably, the most important command in any interactive environment is the one that lets you exit back out to the operating system. In R, the command to quit is:

> q(  )

The second most useful command is the one that provides access to R's online help system, help( ). The default help( ) command, with no arguments, returns an explanation of how to use help( ). If you want help on a specific R function, put the name of the function in the parentheses following help. So, for example, if you want to learn how the source( ) function works, you can type:

> help(source)

You can also use ? as shorthand for help( ). So, instead of typing help(source) in the example, you can just enter ?source.

As mentioned earlier, the R environment is interactive. If you type the following:

> 2 + 2

R tells you that two plus two does, in fact, equal four:

> 2 + 2 
[1] 4

The R response (4) is preceded by a bracketed number ([1]), which indicates the position of the answer in the output vector. Unlike Perl, R has no scalar variables. Instead, single numbers like the previous answer are stored in a vector of length one. Also, note that the first element in the vector is numbered 1 instead of 0.[§]

The <- operator is used as the assignment operator. Its function is similar to the = operator in Perl. Typing:

> a <- 2 + 2

produces no output, since the result of the calculation is stored in the variable on the left side of the assignment operator. In order to see what value a variable contains, enter its name at the R prompt:

> a <- 2 + 2 
> a 
[1] 4

Just as Perl has basic datatypes that are useful for working with text, R has datatypes that are useful for doing statistics. We have already seen the vector; R also has matrices, arrays (which are a multidimensional generalization of matrices), and factors (lists of strings that label vectors of the same length).
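
As a brief illustration (the numbers and labels are arbitrary), the following interactive session creates a vector, a matrix, and a factor, and applies a few simple functions to them:

> v <- c(1.2, 3.4, 5.6, 7.8)           # a numeric vector of length four
> mean(v)
[1] 4.5
> m <- matrix(1:6, nrow=2)             # a 2-by-3 matrix, filled column by column
> dim(m)
[1] 2 3
> f <- factor(c("wt", "mut", "wt"))    # a factor labeling three observations
> levels(f)
[1] "mut" "wt"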

Online resources for R

The place to go for more information on R is the Comprehensive R Archive Network (CRAN). You can find the CRAN site nearest to you either by using your favorite search engine or off a link from the R Project home page (http://www.R-project.org). CRAN has a number of packages for specific statistical applications implemented in R and available as RPM files (for information on installing RPMs, see Chapter 3). If your project requires sampling, clustering, regression, or factor analysis, R can be a lifesaver. R can even be made to directly access XGobi as an output system, so that the results of your computations can be plotted in two or more dimensions.

You can try R without having to install it, thanks to Rweb (http://www.math.montana.edu/Rweb/), a service provided by the Montana State University Mathematics Department. Rweb accepts your R code, runs it, and returns a page with the results of the calculation. If you want to use R for anything beyond simple demonstrations, though, it's faster to download the RPM files and run R on a local computer.

If you find that R is useful in your work, we vigorously recommend you supplement the R tutorial, An Introduction to R, and the R FAQ (http://cran.r-project.org/) with the third edition of Modern Applied Statistics with S-Plus (see the Bibliography). Both Venables and Ripley are now part of the R development core team, and although their text is primarily an S-plus book, supplements are available from Ripley's web site (http://www.stats.ox.ac.uk/~ripley/) that make the examples in the book more easily used under R.

Matlab and Octave

GNU Octave (http://www.gnu.org/software/octave/octave.html) is a freely available programming language whose syntax and functions are similar to Matlab, a commercial programming environment from The MathWorks, Inc. (http://www.mathworks.com/products/matlab/). Matlab is popular among engineers for quickly writing programs that perform large numbers of numerical computations. Octave (or Matlab) is worth a look if you want to write quick prototypes of number-crunching programs, particularly for data analysis or simulation. Both Octave and Matlab are available for Unix and Windows systems. Octave source code, binaries, and documentation are all available online; they are also distributed as part of an increasing number of Linux distributions.
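
As a small example of the kind of throwaway numerical prototyping both languages are designed for (the data points are invented), the following code fits a straight line to a handful of (x, y) measurements by least squares and runs unchanged in Octave or Matlab:

x = [1; 2; 3; 4; 5];
y = [2.1; 3.9; 6.2; 8.1; 9.8];
A = [x, ones(size(x))];    % design matrix: one column for the slope, one for the intercept
coeff = A \ y;             % the backslash operator solves the least-squares problem
fprintf('slope = %.2f, intercept = %.2f\n', coeff(1), coeff(2));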

Octave produces graphical output using the gnuplot package mentioned previously. While this arrangement works well enough, it is rather spartan compared to the plotting capabilities of Matlab. In fact, if you are a student, we will take off our open source hats and strongly encourage you to take advantage of the academic pricing on Matlab; it will add years to your life. As a further incentive, a number of the data mining and machine learning tools discussed in the next section are available as Matlab packages.

Visualization: Summary

This section has described solutions to data presentation problems that arise frequently in bioinformatics. For some of the most current work on visualization in bioinformatics, see the European Bioinformatics Institute's visualization information off the Projects link on their industrial relations page (http://industry.ebi.ac.uk). Links to more online visualization and data mining resources are available off the web page for this book. Table 14-1 shows the tools and techniques that are used for data visualization.

Table 14-1. Data Visualization Tools and Techniques

What you do                           Why you do it                                   What you use to do it
View graphics files                   To view results and check figures              xzgv
View PDF or PostScript files          To read articles in electronic form            gv, Adobe Acrobat Reader
Manipulate graphics files             For preparation of figures                     The GIMP
Plot data in two or three dimensions  To summarize data for presentations            Grace, gnuplot
Multidimensional visualization        To explore data with more than three variables XGobi
Multidimensional scaling              To view high-dimensional data in 2D or 3D      XGvis
Plot graphical structures             To draw networks and pathways                  GraphViz programs
Print sequence alignment clearly      To format sequence alignment for publication   TEXshade
Statistics-heavy programming for      For rapid prototyping of statistical           R (or S-plus)
  data analysis                         data-analysis tools
Numerically intensive programming     For rapid prototyping of tools that make       GNU Octave (or Matlab)
  for data analysis                     heavy use of matrices

Data Mining and Biological Information

One of the most exciting areas of modern biology is the application of data mining methods to biological databases. Many of these methods can equally well fall into the category of machine learning, the name used in the artificial intelligence community for the larger family of programs that adapt their behavior with experience. We present here a summary of some techniques that have appeared in recent work in bioinformatics. The list isn't comprehensive but will hopefully provide a starting point for learning about this growing area.

A word of caution: anthropomorphisms have a tendency to creep into discussions of data mining and machine learning, but there is nothing magical about them. Programs are said to "learn" or be "trained," but they are always just following well-defined sets of instructions. As with any of the tools we've described in this book, data mining tools are supplements, rather than substitutes, for human knowledge and intuition. No program is smart enough to take a pile of raw data and generate interesting results, much less a publication-quality article ready for submission to the journal of your choice. As we've stressed before, the creation of a meaningful question, the experimental design, and the meaningful interpretation of results are your responsibility and yours alone.

Problems in Data Mining and Machine Learning

The topics addressed by data mining are ones that statisticians and applied mathematicians have worked on for decades. Consequently, the division between statistics and data mining is blurry at best. If you do work with data mining or machine learning techniques, you will want to have more than a passing familiarity with traditional statistical techniques. If your problem can be solved either by the latest data-mining algorithm or by a straightforward statistical calculation, you would do well to choose the simple calculation. By the same token, please avoid the temptation to devise your own scoring method without first consulting a statistics book to see if an appropriate measure already exists. In either case, a standard method will be easier to debug, and easier to justify to your colleagues, than a nonstandard one.

Supervised and unsupervised learning

Machine learning methods can be broadly divided into supervised and unsupervised learning. Learning is said to be supervised when a learning algorithm is given a set of labeled examples from which to learn (the training set) and is then tested on a set of unlabeled examples (the test set). Unsupervised learning is performed when data is available, but the correct labels for each example aren't known. The objective of running the learning algorithm on the data is to find some patterns or trends that will aid in understanding the data. For example, the MEME program introduced in Chapter 8 performs unsupervised learning in order to find sequence motifs in a set of unaligned sequences. It isn't known ahead of time whether each sequence contains the pattern, where the pattern is, or what the pattern looks like.

Cluster analysis is another kind of unsupervised learning that has received some attention in the analysis of microarray data. Clustering, as shown in Figure 14-5, is the procedure of classifying data such that similar items end up in the same class while dissimilar items don't, when the actual classes aren't known ahead of time. It is a standard technique for working with multidimensional data. Figure 14-5 shows two panels with unadorned dots on the left and dots surrounded by cluster boundaries on the right.

Figure 14-5. Clustering

A Collection of Data Mining Techniques

In this section, we describe some data mining methods commonly reported in the bioinformatics literature. The purpose of this section is to provide an executive summary of these complex data-analysis techniques. You aren't expected to be able to implement these algorithms in your programming language of choice. However, if you see any of these methods used to analyze data in a paper, you should be able to recognize the method and, if necessary, evaluate the way in which it was applied. As with any technique in experimental biology, it is important to understand the machine learning methods used in computational biology well enough to know whether they have been applied appropriately and correctly.

Decision trees

In its simplest form, a decision tree is a list of questions with yes or no answers, hierarchically arranged, that lead to a decision. For instance, to determine whether a stretch of DNA is a gene, we might have a tree like the one shown in Figure 14-6.

Figure 14-6. Simple gene decision tree

A tree like this one is easy to work through, since it has a finite number of possibilities at each branch, and any path through the tree leads to a decision. The structure of the tree and the rules at each of the branches are determined from the data by a learning algorithm. Techniques for learning decision trees were described by Leo Breiman and coworkers in the early 1980s, and were later popularized in the machine learning community by J. R. Quinlan, whose freely available C4.5 decision tree software and its commercial successor, C5, are standards in the field.

One major advantage of decision trees over other machine learning techniques is that they produce models that can be interpreted by humans. This is an important feature, because a human expert can look at a set of rules learned by a decision tree and determine whether the learned model is plausible given real-world constraints.[‖] In biology, tree classifiers tend to be used in pattern recognition problems, such as finding gene splice sites or identifying new occurrences of a protein family member. The MORGAN genefinder developed by Steven Salzberg and coworkers is an example of a decision tree approach to genefinding.

Neural networks

Neural networks are statistical models used in pattern recognition and classification. Originally developed in the 1940s as a mathematical model of memory, neural networks are sometimes also called connectionist models because of their representation as nodes (which are usually variables) connected by weighted functions. Figure 14-7 shows the process by which a neural network is constructed. Please note, though, that there is nothing particularly "neural" about these models, nor are there actually physical nodes and connections involved. The idea behind neural networks is that, by working in concert, these simple processing elements can perform more complex computations.

Figure 14-7. Neural network diagram

A neural network is composed of a set of nodes that are connected in a defined topology, where each node has input and output connections to other nodes. In general, a neural network will receive an input pattern (for example, an amino acid sequence whose secondary structure is to be predicted), which sets the values of the nodes on the first layer (the input layer). These values are propagated according to transfer functions (the connections) to the next layer of nodes, which propagate their values to the next layer, until the output layer is reached. The pattern of activation of the output layer is the output of the network. Neural networks are used extensively in bioinformatics problems; examples include the PHD (http://www.embl-heidelberg.de/predictprotein/predictprotein.html) and PSIPRED (http://insulin.brunel.ac.uk/psipred/) secondary structure predictors described in Chapter 9, and the GRAIL genefinder (http://compbio.ornl.gov/grailexp/) mentioned in Chapter 7.
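
In the simplest feed-forward case, each connection carries a numerical weight, and a node's output is just a squashed, weighted sum of its inputs: output_j = f( sum_i w_ij * x_i + b_j ), where f is typically a sigmoid function. Training the network consists of adjusting the weights w_ij and biases b_j until the outputs match the known answers for a set of training examples.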

Genetic algorithms

Genetic algorithms are optimization algorithms. They search a large number of possible solutions for the best one, where "best" is determined by a cost function or fitness function. Like neural networks, these models were inspired by biological ideas (in this case, population genetics), but there is nothing inherently biological about them. In a genetic algorithm, a number of candidate solutions are generated at random. These candidate solutions are encoded as chromosomes. Parts of each chromosome are then exchanged à la homologous recombination between real chromosomes. The resulting recombined strategies are then evaluated according to the fitness function, and the highest scoring chromosomes are propagated to the next generation. This recombination and propagation loop continues until a suitably good solution is found. Genetic algorithms are frequently used in molecular simulations, such as docking and folding of proteins.

Support vector machines

In late 1998, a machine learning tool called the support vector machine (SVM) began to attract a great deal of attention in bioinformatics. Support vector machines were developed by Vladimir Vapnik of Bell Laboratories, an early pioneer of machine learning methods. They have been applied to a wide range of problems, from optical character recognition to financial time series analysis and recognizing spam (the email variety, not the lunch meat). SVMs were first applied to biological problems by Tommi Jaakola (now at MIT), David Haussler, and coworkers at UC Santa Cruz, who used them for protein sequence classification. They have since been applied to many of the standard computational biology challenge problems (function prediction, structure prediction, and genefinding) but have gained the most attention for their use in microarray analysis.

Support vector machines are supervised classifiers that try to find a linear separation between different classes of points in a high-dimensional space. In a 2D space, this separator is a line; in 3D, it's a plane. In general, this separating surface is called a hyperplane. Support vector machines have two special features. First, instead of just finding any separating hyperplane, they are guaranteed to find the optimal one, or the one whose placement yields the largest separation between the two classes. The data points nearest the frontier between the two classes are called the support vectors.[#] Second, although SVMs are linear classifiers, they can classify nonlinearly separable sets of points by transforming the original data points into a higher dimensional space in which they can be separated by a linear surface.
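
In the standard formulation, the separating hyperplane is the set of points x satisfying w · x + b = 0, and a new example is classified by the sign of w · x + b. Training chooses w and b to minimize ||w|| subject to y_i (w · x_i + b) >= 1 for every training example (x_i, y_i), which is equivalent to maximizing the margin 2/||w|| between the two classes; only the support vectors end up constraining the solution.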

Table 14-2 shows some of the most popular data-mining tools and techniques.

Table 14-2. Data Mining Tools and Techniques

What you do           Why you do it                                 What you use to do it
Clustering            To find similar items when a classification   Clustering algorithms, self-organizing maps
                        scheme isn't known ahead of time
Classification        To label each piece of data according to a    Decision trees, neural networks, SVMs
                        classification scheme
Regression            To extrapolate a trend from a few examples    Regression algorithms, neural networks, SVMs, decision trees
Combining estimators  To improve reliability of prediction          Voting methods, mixture methods



[*] When you start to see sequence motifs in words or people's names, it's time to take a break.

[†] Adobe makes a PDF reader as well, named Acrobat Reader, which is available at no cost for Windows, Mac OS, Linux, and a handful of Unix systems.

[‡] Here we are talking about the graph theory kind of graph, in which a set of dots (nodes) are connected by a set of lines (edges). Directed graphs are those in which the edge connecting two nodes has a direction. Edges in directed graphs are usually represented using arrows pointing in the direction of a connection. In an undirected graph, the edges have no direction.

[§] Actually, R vectors do have a zero element, but it doesn't accept values assigned to it, and it returns numeric(0), which is an empty vector.

[‖] The canonical decision-tree urban legend comes from an application of trees by a long-distance telephone company that wanted to learn about churn, the process of losing customers to other long-distance companies. They discovered that an abnormally large number of their customers over the age of 70 were subject to churn. A human recognized something the program did not: humans can die of old age. So, being able to interpret your results can be useful.

[#] Vectors, in this case, refer to the coordinates of the data points. For example, on a 2D map, you might have pairs of (x,y) coordinates representing the location of the data points. These ordered pairs are the vectors.
