Chapter VII.2. Bioinformatics

Bioinformatics, or computational biology, combines computer science with molecular biology to solve biological problems on a molecular level. This basically means using computers to study proteins and genes to predict protein structures, drug interactions, and gene splicing.

Because bioinformatics embraces both computer science and molecular biology, there are two common paths to working in bioinformatics. The first involves studying computers and then learning about molecular biology so you'll know what your programs are supposed to do. The second involves studying molecular biology and then learning computer programming so you can write programs to aid in your research.

Which path you choose depends on your main interest. Not all computer scientists want to know or study molecular biology and not all molecular biologists want to go through the hassle of learning computer programming. As a result, bioinformatics is a rare combination of diverse skills that will be in high demand in the near future. If the idea of using a computer to study cloning, genetic engineering, and cures for diseases appeals to you, bioinformatics may be the perfect outlet for your talent.

Note

The terms bioinformatics and computational biology are often used interchangeably. Technically, bioinformatics focuses more on creating algorithms and writing programs whereas computational biology focuses more on using computers as tools for biological research.

The Basics of Bioinformatics

To understand bioinformatics, you must first understand its purpose. Before computers, biologists had two ways to study any problem. First, they could perform an experiment in a laboratory under controlled conditions, which is known as in vitro, or in glass.

A second way to study a problem is to perform an experiment on a living organism, such as a guinea pig or a human volunteer. Because this type of experiment takes place in a living creature, it's called in vivo, or in life.

Both in vitro and in vivo experiments are expensive and time-consuming. Performing in vitro experiments requires laboratory equipment whereas performing in vivo experiments requires live subjects.

Bioinformatics offers biologists a third way to conduct experiments: in silico, or in silicon. Instead of using expensive laboratory equipment and living creatures, bioinformatics lets biologists conduct simulated experiments with a computer.

What makes in silico experiments just as valid as in vitro or in vivo experiments is that they all work with molecules. An in vitro experiment studies molecules in a test tube, an in vivo experiment studies molecules in a live animal, and in silico experiments study molecules as nothing more than data inside the computer. Specifically, in silico experiments (bioinformatics) represent molecules as strings that the computer manipulates.

By using knowledge of how different molecules interact, bioinformatics can simulate molecular interactions, such as how a certain drug might interact with cancer cells. This not only makes experimenting faster, but easier and less expensive to conduct as well. After a bioinformatics experiment confirms a certain result, biologists can go to the next step — testing actual drugs and living cells in test tubes (in vitro) or on living creatures (in vivo).

Representing molecules

Bioinformatics manipulates molecules. Of course, biologists don't care about every molecule in existence, just the ones involved in life, such as proteins. Four important molecules that biologists study are the bases that carry the information in deoxyribonucleic acid, or DNA. Each of these four molecules is identified by a single letter: adenine (A), cytosine (C), guanine (G), and thymine (T).

When these molecules form a DNA strand, they link together in a sequence, such as:

ACTGTTG

In a computer, such a sequence of molecules can be represented as a string, such as this Perl assignment:

$DNA = 'ACTGTTG';

Of course, these aren't the only four molecules that biologists study, but the idea is the same. Represent every molecule as a single letter and then re-create the molecular structure as nothing more than a string.

Unfortunately, most molecular structures consist of long strings of redundant one-letter codes. Trying to read these long molecular structures, let alone manipulate them by hand, is nearly impossible. That's where computers and bioinformatics come in.

Computers simplify and automate the tedious process of examining and manipulating molecular structures. Biologists simply have to type the molecular structure correctly and then tell the computer how to manipulate that structure as a series of strings.
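Because everything depends on typing the structure correctly, a program can check a string for stray letters before doing anything else. The following is a minimal sketch in Python (one of the scripting languages covered later in this chapter). The name is_valid_dna is a hypothetical helper made up for this example, and the check assumes that only the four letters A, C, G, and T are allowed.

```python
# Hypothetical helper for illustration; not part of any bioinformatics library.
DNA_LETTERS = set("ACGT")

def is_valid_dna(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G, and T."""
    return len(sequence) > 0 and set(sequence) <= DNA_LETTERS

print(is_valid_dna("ACTGTTG"))   # True
print(is_valid_dna("ACTGXTG"))   # False, because X is not a DNA letter
```

A check like this catches typing mistakes early, before a mistyped letter can quietly distort later results.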

Manipulating molecules in a computer

The type of programming language used to manipulate strings of molecules is irrelevant. What's more important is how to manipulate molecular structures. The simplest form of string manipulation is concatenation, which joins multiple strings into one.

In the world of biology, concatenation is similar to gene splicing — biologists can experiment with tearing a molecular structure apart and putting it back together again to see what they can create. In Perl, concatenation can be as simple as the following example:

$DNA1 = 'ACTGTTG';
$DNA2 = 'TGTACCT';
$DNA3 = "$DNA1$DNA2";
print $DNA3;

This simple Perl program would print:

ACTGTTGTGTACCT
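Joining two strings end to end is the simplest case. To mimic cutting a sequence apart and putting it back together around a new piece, you can combine string slicing with concatenation. Here is a rough sketch in Python, where splice is a hypothetical function name and the cut position is chosen arbitrarily for illustration:

```python
def splice(host, insert, position):
    """Cut host at position and join insert between the two pieces."""
    return host[:position] + insert + host[position:]

dna1 = "ACTGTTG"
dna2 = "TGTACCT"

# Cutting after the third letter places dna2 between ACT and GTTG.
print(splice(dna1, dna2, 3))          # ACTTGTACCTGTTG

# Cutting at the very end is the same as plain concatenation.
print(splice(dna1, dna2, len(dna1)))  # ACTGTTGTGTACCT
```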

Another way to manipulate strings (molecular structures) is by replacing individual molecules with other ones, which can simulate mutation. A mutation simulation program could pick a molecule at random and replace it with another molecule. So the initial structure might look like this:

CCCCCCCCCCC

Then each mutation could progressively scramble the structure by a single molecule, such as:

CCCCCCCCCCC
CCCCCCCTCCC
CCCCACCTCCC
CCCCACCTCCG
CACCACCTCCG
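A mutation simulation along these lines might be sketched in Python as follows. The function name mutate and the fixed random seed are assumptions made for this example, and the sketch always swaps in a letter different from the one it replaces, so every step changes exactly one position.

```python
import random

BASES = "ACGT"

def mutate(sequence, rng):
    """Replace one randomly chosen letter with a different DNA letter."""
    position = rng.randrange(len(sequence))
    # Exclude the current letter so the sequence always changes.
    choices = [base for base in BASES if base != sequence[position]]
    return sequence[:position] + rng.choice(choices) + sequence[position + 1:]

rng = random.Random(42)   # fixed seed so repeated runs give the same mutations
strand = "CCCCCCCCCCC"
for _ in range(4):
    strand = mutate(strand, rng)
    print(strand)
```

Because positions are picked at random, a later mutation can land on a letter that has already changed, so after four steps the strand differs from the original in at most four places.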

Mutation and concatenation are just two ways to manipulate molecular structures within a computer. If you created half a DNA sequence, you still need to determine the other half. Because DNA consists of two strands bound together in a double helix form, it's easy to determine the second sequence of DNA after you know the first one. That's because each adenine (A) links up with thymine (T) and each cytosine (C) links up with guanine (G).

The two strands of DNA are complementary sequences. To calculate a complementary sequence when you know only one of the sequences, you can use a simple program that replaces every A with a T, every C with a G, every T with an A, and every G with a C. A Perl program to do this might look like this:

$DNA = 'ACTGTTG';
($compDNA = $DNA) =~ tr/ACGT/TGCA/;

The tr command tells Perl to translate, or swap, one character for another. The second line first copies $DNA into $compDNA and then applies tr/ACGT/TGCA/ to the copy, which translates every A into a T, every C into a G, every G into a C, and every T into an A all at once. (Without the =~ binding, tr would operate on Perl's default variable $_ and return only a count of the characters it translated.)

The second step in determining a complementary sequence is to reverse the order of that sequence. That's because sequences are always written in a specific direction, starting with the end of the sequence known as 5' phosphoryl (also known as 5 prime or 5') and ending with 3' hydroxyl (known as 3 prime or 3'). The two strands of DNA run in opposite directions, so to display the complementary sequence correctly, you have to reverse it using Perl's reverse command:

$DNA = 'ACTGTTG';
# Copy $DNA, then swap each letter for its partner in the copy.
($compDNA = $DNA) =~ tr/ACGT/TGCA/;
# In scalar context, reverse returns the string with its characters reversed.
$revDNA = reverse $compDNA;
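For comparison, the same two steps (swap each letter for its partner, then reverse the result) can be sketched in Python. The function name reverse_complement is chosen for this example. str.maketrans and str.translate are standard Python string methods that do roughly what Perl's tr does.

```python
# Translation table that maps each DNA letter to its partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Swap every letter for its partner, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]

print(reverse_complement("ACTGTTG"))   # CAACAGT
```

Applying the function twice returns the original sequence, which makes a handy sanity check.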

Tip

It's important to know both sequences that make up a double-stranded DNA molecule so you can use either one to search for information. When faced with an unknown structure, there's a good chance someone else has already discovered this identical molecular structure. So all you have to do is match your molecular structure with a database of known structures to determine what you have.

Searching Databases

After biologists discover a specific molecular structure, they store information about that sequence in a database. That way other biologists can study that sequence so everyone benefits from this slowly growing body of knowledge.

Unfortunately, there isn't just one database, but several databases that specialize in storing different types of information:

  • GenBank stores nucleotide sequences.

  • Swiss-Prot stores protein sequences.

  • OMIM (Online Mendelian Inheritance in Man) stores human genes and genetic disorders data.

After you find a particular sequence, you can look up articles about it in PubMed, a database of articles published in biomedical and life science journals.

Although it's possible to search these databases manually, it's usually much faster and easier to write a program that can send a list of sequences to a database, search that database for known sequences that match the ones sent, and then retrieve a list of those known sequences for further study.

Because searching databases is such a common task, biologists have created a variety of tools to standardize and simplify this procedure. One of the more popular tools is the Basic Local Alignment Search Tool, otherwise known as BLAST.

BLAST can look for exact matches or just sequences that are similar to yours within specified limits, such as a sequence that's no more than ten percent different. This process of matching up sequences is called sequence alignment, or just alignment.
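To get a feel for what "ten percent different" means, here is a deliberately simplified sketch in Python. It is not how BLAST measures similarity. It only compares two sequences of equal length letter by letter, and the function name percent_difference and the sample sequences are made up for illustration.

```python
def percent_difference(seq_a, seq_b):
    """Return the percentage of positions at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 100.0 * mismatches / len(seq_a)

# The two ten-letter sequences differ only in their last letter.
print(percent_difference("ACTGTTGACA", "ACTGTTGACT"))   # 10.0
```

Real alignment tools also have to cope with inserted and deleted letters, which a position-by-position comparison like this one ignores.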

By finding an exact match of your sequence in a database, you can identify what you have. By comparing your sequence with similar ones, you can better understand the possible characteristics of your sequence. For example, a cat is more similar to a dog than to a rattlesnake, so a cat would likely behave more like a dog than like a rattlesnake.

Note

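The BLAST algorithm and computer program were developed by researchers at the U.S. National Center for Biotechnology Information (NCBI), part of the National Institutes of Health, together with collaborators at Pennsylvania State University and the University of Arizona (www.ncbi.nlm.nih.gov/BLAST).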

The basic idea behind BLAST is to compare one sequence (called a query sequence) with a database to find exact matches of a certain number of characters, such as four. For example, suppose you had a sequence like this:

ATCACCACCTCCG

With BLAST, you could specify that you only want to find matches of four characters or more, such as:

ATCACC TGGTATC

Although you could type molecular sequences by hand, it's far easier to let the computer do it for you, especially if you want to compare a large number of sequences with BLAST. After BLAST gets through comparing your sequences, it returns a list of matching sequences.
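The idea of looking for exact matches of four characters can be sketched in a few lines of Python. This is a toy illustration of word matching only, not the BLAST program itself, which goes on to extend and score the matches it finds. The function name shared_words and the comparison sequence GGATCACCTT are invented for this example.

```python
def shared_words(query, subject, word_size=4):
    """Return (word, query_position, subject_position) for each exact match."""
    # Index every word in the subject by the positions where it starts.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    # Look up every word of the query in that index.
    hits = []
    for i in range(len(query) - word_size + 1):
        word = query[i:i + word_size]
        for j in index.get(word, []):
            hits.append((word, i, j))
    return hits

for hit in shared_words("ATCACCACCTCCG", "GGATCACCTT"):
    print(hit)
```

Each hit reports the matching four-letter word and where it starts in each sequence, counting from zero.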

Note

Using BLAST to compare sequences to a database of known sequences is an example of data mining. (See Chapter 1 of this mini-book for more information about data mining.)

You could scan through this list of matches yourself, but once again, that's likely to be tedious, slow, and error-prone. Writing a program that can parse through reports generated by BLAST to look for certain characteristics is much simpler. Essentially, you can use the computer to automate sending data to BLAST and then have the computer filter through the results so you see only the sequences that you care about.
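As a rough sketch of such a filter, the following Python code assumes the report is in the tab-separated, 12-column table that the BLAST+ command-line programs write when run with -outfmt 6. In that table, the first two columns name the query and matching sequences, the third column holds the percent identity, and the eleventh holds the e-value. The function name filter_hits, the cutoff values, and the sample rows are all invented for illustration.

```python
def filter_hits(report_text, min_identity=90.0, max_evalue=1e-5):
    """Return (query, subject, identity, evalue) for hits passing both cutoffs."""
    kept = []
    for line in report_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue   # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, evalue = float(fields[2]), float(fields[10])
        if identity >= min_identity and evalue <= max_evalue:
            kept.append((query, subject, identity, evalue))
    return kept

# Invented sample rows for illustration only (not real BLAST output).
sample = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t350\n"
    "query1\tsubjB\t82.00\t150\t27\t0\t5\t154\t40\t189\t2e-30\t120\n"
    "query2\tsubjC\t95.00\t40\t2\t0\t1\t40\t7\t46\t0.002\t30.1\n"
)
for hit in filter_hits(sample):
    print(hit)
```

With these cutoffs, only the first sample row survives. The second row is too dissimilar, and the e-value of the third row is too large.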

You can take the same approach with results from other databases. Because every database stores information in a slightly different format, you might also need to write a program that converts files from one database's format into another's.
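As one hypothetical example of a format conversion, the sketch below reads sequences in FASTA format (a widely used text format in which each record starts with a > header line followed by one or more lines of sequence) and writes one tab-separated line per record. The choice of formats, the function name fasta_to_tsv, and the sample records are assumptions made for this example.

```python
def fasta_to_tsv(fasta_text):
    """Convert FASTA text into lines of the form name<TAB>sequence."""
    records = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the record name.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)   # sequence lines may be split; join them later
    if name is not None:
        records.append((name, "".join(chunks)))
    return "\n".join(f"{n}\t{s}" for n, s in records)

sample = ">seq1 example record\nACTG\nTTG\n>seq2\nTGTACCT\n"
print(fasta_to_tsv(sample))
```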

Because every biologist is using different information to look for different results, there's no single bioinformatics program standard in the same way that everyone has flocked to a single word processor standard, like Microsoft Word. As a result, bioinformatics involves writing a lot of little custom programs to work with an ever-growing library of standard programs that biologists need and use every day.

Some biologists can learn programming and do much of this work themselves, but it's far more common for biologists to give their data to an army of bioinformatics technicians who take care of the programming details. That way the biologists can focus on what they do best (studying biology) while the programmers can focus on what they do best (writing custom programs). The only way these two groups can communicate is if biologists understand how programming can help them and the programmers understand what type of data and results the biologists need.

Bioinformatics Programming

Because biologists use a wide variety of computers (UNIX, Windows, Linux, and Macintosh), they need a programming language that's portable across all platforms. In addition, biologists need to work with existing programs, such as online databases. Finally, because most biologists aren't trained as programmers, they need a simple language that gets the job done quickly.

Although a language like C/C++ runs on multiple platforms, porting a program from Windows to Linux often requires rewriting to optimize the program for each particular operating system. While figuring out C/C++ isn't necessarily hard, it's not easy either.

A more appropriate programming language is a scripting language. Scripting languages, such as Perl, run on almost every operating system, are easy to learn and use, and include built-in commands for manipulating strings. Best of all, scripting languages are specifically designed to work with existing programs by feeding data to another program and retrieving the results back again.
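Feeding data to another program and reading back the results might look something like the following Python sketch. Because no particular bioinformatics tool can be assumed to be installed, the external program here is only a stand-in: a second Python process that reverses whatever text it receives. The function name run_tool is likewise made up for this example.

```python
import subprocess
import sys

def run_tool(command, input_text):
    """Send input_text to a program's standard input and return its output."""
    result = subprocess.run(
        command, input=input_text, capture_output=True, text=True, check=True
    )
    return result.stdout

# Stand-in for a real tool: a child Python process that reverses its input.
stand_in = [sys.executable, "-c",
            "import sys; print(sys.stdin.read().strip()[::-1])"]

print(run_tool(stand_in, "ACTGTTG").strip())   # GTTGTCA
```

In real use, you'd replace the stand-in command with the name and options of the program you want to drive.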

Although Perl has become the unofficial standard programming language for bioinformatics, biologists also rely on other programming languages because many people feel that Perl is too confusing. Perl's motto is "There's more than one way to do it" — you can perform the exact same action in Perl with entirely different commands.

For example, to concatenate two strings, Perl offers two different methods. The first is to place both variables side by side inside a double-quoted string, like this:

$DNA3 = "$DNA1$DNA2";

The second way to concatenate the same two strings uses the dot operator, like this:

$DNA3 = $DNA1 . $DNA2;

The second most popular language used in bioinformatics is Python. Python offers features similar to Perl's, but many people feel that Python is a simpler language to understand and use because its motto is, "There should be one-- and preferably only one --obvious way to do it." To concatenate strings in Python, where variable names don't take Perl's $ prefix, you can use this command:

DNA3 = DNA1 + DNA2

Another popular bioinformatics programming language is Java. Not only are many programmers already familiar with Java, but Java is also cross-platform: its compiler turns source code into bytecode that runs unchanged on any operating system with a Java Virtual Machine (JVM) installed. In comparison, Perl and Python programs are normally distributed as source code that you load and run through an interpreter. Java gives you the convenience of copying and running a compiled program without handing out the source code, although the computer still needs a JVM to run it.

Despite the advantages of other programming languages, Perl is still the language of bioinformatics. If you're going to work in bioinformatics, first learn Perl and then learn Python or Java.

Tip

Biologists have written subprograms in various programming languages to make writing bioinformatics programs easier:

  • Perl: BioPerl (www.bioperl.org)

  • Python: BioPython (http://biopython.org/wiki/Main_Page)

  • Java: BioJava (http://biojava.org/wiki/Main_Page)

  • C++: BioC++ (http://biocpp.sourceforge.net)

For true hard-core computer users, there's even a BioLinux (http://envgen.nox.ac.uk/biolinux.html), which is a version of the Linux operating system that comes loaded with various bioinformatics tools installed and ready to use right away for bioinformatics work.

Because bioinformatics programs tend to perform the same types of tasks over and over, these libraries of bioinformatics subprograms offer code for

  • Accessing databases

  • Transforming database information from one file format to another

  • Manipulating sequences

  • Searching and comparing sequences

  • Displaying results as graphs or 3-D structures

The field of bioinformatics is still growing and changing — the tools and techniques used today may become obsolete tomorrow. (If you've spent any time in the computer industry, you probably already know that applies to every aspect of computers by now.)

In most other fields of computer science, programmers spend more time maintaining and updating existing programs than writing new ones. In bioinformatics, every biologist has different needs, so you could actually spend more time writing custom programs and less time getting stuck patching up someone else's program.

With its curious mix of computer science and biology, bioinformatics is a unique field that's wide open for anyone interested in life science and computer science. If the idea of working in the growing field of biotechnology appeals to you, bioinformatics might be for you.
