Chapter 12. Automating Data Analysis with Perl

As we've seen in previous chapters, a vast assortment of software tools exists for bioinformatics. Even though it's likely that someone has already written what you need, you will still encounter many situations in which the best solution is to do it yourself. In bioinformatics, that often means writing programs that sift through mountains of data to extract just the information you require. Perl, the Practical Extraction and Reporting Language, is ideally suited to this task.

Why Perl?

There are a lot of programming languages out there. In our survey of bioinformatics software, we have already seen programs written in Java, C, and FORTRAN. So, why use Perl? The answer is efficiency.[*] Biological data is stored in enormous databases and text files. Sorting through and analyzing this data by hand (and it can be done) would take far too long, so the smart scientist writes computer tools to automate the process. Perl, with its highly developed capacity to detect patterns in data, and especially strings of text, is the most obvious choice. The next obvious choice would probably be Python. Python, the less well known of the two, is a fully object-oriented scripting language introduced by Guido van Rossum in 1988. Python has some outstanding contributed code, including a mature library for numerical methods, tools for building graphical user interfaces quickly and easily, and even a library of functions for structural biology. At the end of the day, however, it's the wealth of existing Perl code for bioinformatics, the smooth integration of that code onto Unix-based systems, cross-platform portability, and an incredibly enthusiastic user community that makes Perl our favorite scripting language for bioinformatics applications.

Perl has a flexible syntax, or grammar, so if you are familiar with programming in other languages such as C or BASIC, it is easy to write Perl code in a C-like or BASIC-like dialect. Perl also takes care of much of the dirty work involved in programming, such as memory allocation, so you can concentrate on solving the problem at hand. It's often the case that programming problems requiring many lines of code in C or Java may be solved in just a few lines of Perl.

Many excellent books have been written about learning and using Perl, so this single chapter obviously can't cover everything you will ever need to know about the language. Perl has a mountain of features, and it's unrealistic to assume you can master it without a serious commitment to learning the art of computer programming. Our aim in this chapter isn't to teach you how to program in Perl, but rather to show you how Perl can help you solve certain problems in bioinformatics. We will take you through some examples that are most immediately useful in real bioinformatics research, such as reading datafiles, searching for character strings, performing back-of-the-envelope calculations, and reporting findings to the user. And we'll explain how our sample programs work. The rest is up to you. The ability to program in any language—but especially in Perl, Python, or Java—is an important skill for any bioinformatician to have. We strongly suggest you take a programming class or obtain one of the books on our list of recommended reading and start from the beginning. You won't regret it.

Where Do I Get Perl?

Perl is available for a variety of operating systems, including Windows and Mac OS, as well as Linux and other flavors of Unix. It's distributed under an open source license, which means that it's essentially free. To obtain Perl from the Web, go to http://www.perl.com/pub/language/info/software.html and follow the instructions for downloading and installing it on your system.

Perl Basics

Once you've installed Perl, or confirmed with your system administrator that it's already installed on your system, you're ready to begin writing your first program. Writing and executing a Perl program can be broken into several steps: writing the program (or script) and saving it in a file, running the program, and reading the output.

Hello World

A Perl program is a text file that contains instructions written in the Perl language. The classic first program in Perl (and many other languages) is called "Hello, World!" It's written like this:

#!/usr/bin/perl -w
# Say hello
print "Hello, World!
";

"Hello, World!" is a short program, but it's still complete. The first line is called the shebang line and tells the computer that this is a Perl program. All Perl programs running on Unix begin with this line.[†] It's a special kind of comment to the Unix shell that tells it where to find Perl, and also instructs it to look for optional arguments. In our version of "Hello World!" we've included the optional argument -w at the end of the line. This argument tells Perl to give extra warning messages if you do something potentially dangerous in your program. It's a good idea to always develop your programs under -w.

The second line starts with a # sign. The # tells Perl that the line of text that follows is a comment, not part of the executable code. Comments are how humans tell each other what each part of the program is intended to do. Make a habit of including comments in your code. That way you and other people can add to your code, debug it successfully, and even more importantly, remember what it was supposed to do in the first place.

The third line calls the print function with a single argument that consists of a text string. At the end of the text string, there is a , which tells the computer to move to a new line after executing the print statement. The print statement ends with a semicolon, as do most statements in Perl.

To try this little program yourself, you can open a text editor such as vi , Emacs, or pico , and type the lines in. When you've finished entering the program, name the file hello.pl and save it in your directory of choice. While you're learning, you might consider creating a new directory (using the mkdir command, which we covered in Chapter 4) called Perl in your home directory. That way you'll always know where to look for your Perl programs.

Now make the file executable using the command:

% chmod +x hello.pl 

(If you need a refresher on chmod , this would be a good time to review the section on changing file permissions in Chapter 4.) To run the program, type:

% hello.pl

Because of the shebang line in our program, this command invokes the Perl interpreter, which reads the rest of the file and then translates your Perl source code into machine code the computer can execute. In this case you'll notice that Hello, World! appears on your computer screen, and then the cursor advances to a new line. You've now written and run your first Perl program!

A Bioinformatics Example

One of the strengths of Perl—and the reason that bioinformaticians love it—is that with a few lines of code, you can automate a tedious task such as searching for a nucleotide string contained in a block of sequence data. To do this, you need a slightly more complex Perl program that might look like this:

#!/usr/bin/perl -w
# Look for nucleotide string in sequence data
  
my $target = "ACCCTG";
my $search_string = 
    'CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA'.
    'CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT'.
    'ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC';
  
my @matches;
  
# Try to find a match in letters 1-6 of $search_string, then look at letters 2-7, 
# and so on. Record the starting offset of each match.

foreach my $i (0..length $search_string){
   if( $target eq substr( $search_string, $i, length $target)){
      push @matches, $i;
   }
}

# Make @matches into a comma-separated list for printing
print "My matches occurred at the following offsets: @matches.
";
  
print "done
";

This program is also short and simple, but it's still quite powerful. It searches for the target string "ACCCTG" in a sequence of data and keeps track of the starting location of each match. The program demonstrates variables and loops, which are two basic programming constructs you need to understand to make sense of what is going on.

Variables

A variable is a name that is associated with a data value, such as a string or a number. It is common to say that a variable stores or contains a value. Variables allow you to store and manipulate data in your programs; they are called variables because the values they represent can change throughout the life of a program.

Our sequence matching program declares four variables: $target , $search_string, @matches, and $i. The $ and @ characters indicate the kind of variable each one is. Perl has three kinds of variables built into the language: scalars, arrays, and hashes.

Unlike other programming languages, Perl doesn't require formal declaration of variables; they simply exist upon their first use whether you explicitly declare them or not. You may declare your variables, if you'd like, by using either my or our in front of the variable name. When you declare a variable, you give it a name. A variable name must follow two main rules: it must start with a letter or an underscore (the $ and @ characters aren't considered part of the name), and it must consist of letters, digits, and underscores. The best names are ones that clearly, concisely, and accurately describe the variable's role in the program. For example, it is easier to guess the role of a variable if it is named $target or $sequence, than if it were called $icxl.

Scalars

A scalar variable contains a single piece of data that is either a number or a string. The $ character indicates that a variable is scalar. The first two variables declared in our program are scalar variables:

my $target = "ACCCTG";
my $search_string = 
    "CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA".
    "CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT".
    "ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC";

In this case, "ACCCTG" is the target string we are seeking, and "CCACACCACACCCACAC..." is the sequence data in which we're hoping to find it.

In a scalar variable, a number can be either an integer (0, 1, 2, 3, etc.) or a real number (a number that contains a fractional portion, such as 5.6). A string is a chunk of text that's surrounded by quotes. For example:

"I am a string."
'I, too, am a string'

One of Perl's special features is that it has a number of built-in facilities for manipulating strings, which comes in handy when working with the flat text files common to bioinformatics. We cover flat text files and their more structured relatives, relational databases, in detail in Chapter 13.

Arrays

An array is an ordered list of data. In our sequence matching program, @matches is an array variable used to store the starting locations of all the matches. Each element stored in an array can be accessed by its position in the list, which is represented as a number. In Perl, array variables are given an @ prefix. For example, the following statement declares an array of numbers:

@a = ( 1, "4", 9 ); 

This statement declares an array of strings:

@names = ("T. Herman", "N. Aeschylus", "H. Ulysses", "Standish"); 

And this statement declares an array with both:

@mix = ("Caesar Augustus", "Tiberius", 18, "Caligula", "Claudius");

Note the syntax in the declarations: each element in the array is separated from its neighbors by a comma, each of the strings is quoted, and (unlike American English) the comma appears outside of the quotes.

Because an array is an ordered set of information, you can retrieve each element in an array according to its number. The individual elements in an array are written as $this_array[i], where i is the index of the array element being addressed. Note that i can be either a bare number (such as 21), or a numeric scalar variable (such as $n) that contains a bare number. Here is a Perl statement that uses the print operator to display the second number in @a and the third name in @names on the screen:

print "second number: $a[1]
 third name: $names[2]
"; 

You may be wondering why the element numbers here are one less than what you might think they should be. The reason is that positions in Perl arrays are numbered starting from zero. That is, the first element in an array is numbered 0, the second element is numbered 1, and so on. That's why, in the previous example, the second element in @a is addressed as $a[1]. This is an important detail to remember; mistakes in addressing arrays due to missing that crucial zero element are easy to make.

Hashes

A hash is also known as an associative array because it associates a name (or key, as it's called in Perl) with each piece of data (or value) stored in it. A real-world example of a hash is a telephone book, in which you look up a person's name in order to find her telephone number. Our sequence matching program doesn't use any hashes, but they can be quite handy in bioinformatics programs, as you'll see in a later example. Perl uses the % prefix to indicate hash variables (e.g., %sequences). There are a number of ways to declare a hash and its contents as a list of key/value pairs. Here is the syntax for one declaration style:

%hash = (
    key1 => "value1",
    key2 => "value2", ...
    last_key => "last_value" ); 

A value can then be retrieved from this hash using the corresponding key, as follows:

$value = $hash{"key2"};

For example, you can declare a hash that relates each three-letter amino acid code to its one-letter symbol:

my %three_to_one = (
    ALA => A, CYS => C, ASP => D, GLU => E,
    PHE => F, GLY => G, HIS => H, ILE => I,
    LYS => K, LEU => L, MET => M, ASN => N,
    PRO => P, GLN => Q, ARG => R, SER => S,
    THR => T, VAL => V, TRP => W, TYR => Y
 ); 

The hash entry with the one-letter code for arginine can then be displayed using the following statement:

print "The one-letter code for ARG is $three_to_one{ARG}
"; 

Because there are many popular sequence databases, another place where hashes can be immediately useful is in keeping track of which sequence ID in one database corresponds to a sequence ID in the next. In the following example, we define a hash in which each of the keys is a GenBank identifier (GI) number of a particular enzyme, and each value is the corresponding SWISS-PROT identifier of the same enzyme. Using this hash, a program can take the more cryptic GI number and automatically find the associated SWISS-PROT ID:

#!/usr/bin/perl -w
# define the hash relating GI numbers to SWISSPROT IDs
 %sods = (
     g134606  => "SODC_DROME",
     g134611  => "SODC_HUMAN",
     g464769  => "SODC_CAEEL",
     g1711426 => "SODC_ECOLI" );

# retrieve a value from %sods 
$genbank_id = "g134611";
$swissprot_id = $sods{$genbank_id};
print "$genbank_id is the same as $swissprot_id
";

If you save the previous script to a file, make the file executable, and run it, you should see:

g134611 is the same as SODC_HUMAN

In the first part of this script, you are declaring the hash relating GenBank IDs to SWISS-PROT IDs. In the second part, you access the information stored in that hash. The first step is to assign one of the GenBank IDs to the variable $genbank_id. Then you can retrieve the SWISS-PROT ID that %sods has associated with the string in $genbank_id, and store the SWISS-PROT ID in the variable $swissprot_id. Finally, print the values of the two scalar variables. This example is obviously rather contrived, but it should give you an idea of how useful hashes can be in bioinformatics programs.

Loops

Now that we've talked about scalar, array, and hash variables in Perl, let's return to our sequence matching program and talk about the other main programming construct it employs. A loop is a programming device that repeatedly executes a specific set of commands until a particular condition is reached. Our program uses a foreach loop to iterate through the search string:

foreach my $i (0..length $search_string){
   if( $target eq substr( $search_string, $i, length $target)){
      push @matches, $i;
   }
}

The first time through this loop, Perl starts at 0 and looks at the first six-letter combination in the search string, compares it to the target string, and, if there is a match, records it in @matches. The second cycle of the loop looks at letters 2-7, the third looks at letters 3 -8, and so on. Perl stops executing this sequence when it reaches the end of the search string. At this point, the loop is done, and the program moves on to the next section, where it prints the results. Don't worry if you don't understand all the code in the loop; all that's important right now is that you have a general understanding of what the code is doing.

Subroutines

Although we don't use them in any of our example programs, the use of subroutines in programs is worth mentioning. All modern programming languages provide a way to bundle up a set of statements into a subroutine so that they can be invoked concisely and repeatedly. In Perl, you can create a subroutine with the sub declaration:

sub greet {
    my ($name) = shift;
    print "Hello, $name!
";
}

Once this greet subroutine has been declared, you can invoke it as follows:

greet("world");    # Prints "Hello, world!"
greet("Per");      # Prints "Hello, Per!"

Here, "world" and "Per" are arguments —values passed into the subroutine, where they are then stored in $name. Our greet subroutine just prints a single line and then returns. Usually, subroutines do something a bit more complicated, possibly returning a value:

$length = calculate_length($sequence);

This sets $length to whatever the calculate_length( ) subroutine returns when provided with the single argument $sequence. When a subroutine is used for its return value, it's often called a function.

Pattern Matching and Regular Expressions

A major feature of Perl is its pattern matching, and particularly its use of regular expressions. A regular expression (or regex in the Perl vernacular) is a pattern that can be matched against a string of data. We first encountered regular expressions in our discussion of the Unix command grep , back in Chapter 5. grep, as you may recall, searches for occurrences of patterns in files. When you tell grep to search for a pattern, you describe what you're looking for in terms of a regular expression. As you know, much of bioinformatics is about searching for patterns in data.

Let's look at a Perl example. Say you have a string, such as a DNA sequence, and you want to make sure that there are no illegal characters in it. You can use a regular expression to test for illegal characters as follows:

#!/usr/bin/perl
# check for non-DNA characters
my $string = "CCACACCACACCCACACaCCCaCaCATCACACCACACACCACACTACACCCA*CACACACA";
if( $string =~ m/([^ATCG])/i) {
  print "Warning! Found: $1 in the string";
}

This program contains the regular expression [^ATCG]. Translated into English, the regular expression says "look for characters in $string that don't match A, T, C, or G." (The i at the end of the statement tells Perl to match case insensitively; that is, to pay no attention to case. Perl's default is to treat A differently from a.) If Perl encounters something other than the declared pattern, the program prints out the offending character. The output of this program is:

Warning! Found * in the string

If instead you want to search for a particular combination of letters, like "CAT", you can change the regular expression to read CAT:

#!/usr/bin/perl
# check for CATs
my $string = 
   "CCACACCACACCCACACaCCCaCaCATCACACCACACACCACACTACACCCA*CACACACA";
if( $string =~ m/CAT/i ){
  print "Meow.";
}

The output of this modified program is:

Meow.

Parsing BLAST Output Using Perl

Now that you know enough about how Perl is written to understand these simple programs, let's apply it to one of the most common problems in bioinformatics: parsing BLAST output. As you already know, the result of a BLAST search is often a multimegabyte file full of raw data. The results of several searches can quickly become overwhelming. But by writing a fairly simple program in Perl, you can automate the process of looking for a single string or multiple strings in your data.

Consider the following block of data:

. . .
gb|AC005288.1|AC005288  Homo sapiens chromosome 17, clone hC...   268  2e-68
gb|AC008812.7|AC008812  Homo sapiens chromosome 19 clone CTD...   264  3e-67
gb|AC009123.6|AC009123  Homo sapiens chromosome 16 clone RP1...   262  1e-66
emb|AL137073.13|AL137073  Human DNA sequence from clone RP11...   260  5e-66
gb|AC020904.6|AC020904  Homo sapiens chromosome 19 clone CTB...   248  2e-62
>gb|AC007421.12|AC007421 Homo sapiens chromosome 17, clone hRPC.1030_O_14,  
complete sequence
Query: 3407 accgtcataaagtcaaacaattgtaacttgaaccatcttttaactcaggtactgtgtata 3466
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1366 accgtcataaagtcaaacaattgtaacttgaaccatcttttaactcaggtactgtgtata 1425
Query: 3467 tacttacttctccccctcctctgttgctgcagatccgtgggcgtgagcgcttcgagatgt 3526
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1426 tacttacttctccccctcctctgttgctgcagatccgtgggcgtgagcgcttcgagatgt 1485
Query: 3527 tccgagagctgaatgaggccttggaactcaaggatgcccaggctgggaaggagccagggg 3586
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1486 tccgagagctgaatgaggccttggaactcaaggatgcccaggctgggaaggagccagggg 1545
Query: 3587 ggagcagggctcactccaggtgagtgacctcagccccttcctggccctactcccctgcct 3646
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1546 ggagcagggctcactccaggtgagtgacctcagccccttcctggccctactcccctgcct 1605
Query: 3647 tcctaggttggaaagccataggattccattctcatcctgccttcatggtcaaaggcagct 3706
. . .

This is only a small portion of what you might find in a report from a BLAST search. (This is actual data from a BLAST report. The entire file, blast.dat, is too large to reproduce here.) The first six lines of this sample contain information about the BLAST search, as well as other "noise" that's of no importance to the search. The next 13 lines, and the ones that follow it in the actual report, contain the data to analyze. You want the Perl program to look at both the "Query" and "Sbjct" lines in this BLAST report and find the number of occurrences of the following substrings:

  • 'gtccca'

  • 'gcaatg'

  • 'cagct'

  • 'tcggga'

  • Missing data (represented by dashes in the nucleotide sequence strings)

At the same time, you want the program to ignore irrelevant material such as information about the search and other noise. The program should then generate a report file called report.txt that describes the findings for these strings.

In this program you need to create two very long scalar variables to represent each sequence for searching. Let's call them $query_src, and $sbjct_src. In any BLAST output, you'll notice that sometimes the "Sbjct" and "Query" lines aren't contiguous; that is, there are gaps in the data. From a programming perspective, the fact that the gaps exist isn't important; you simply want to read the nucleotides into your scalars consecutively. Here is a sample portion of BLAST data:

Query: 1165  gagcccaggagttcaagaccagcctgggtaacatgatgaaacctcgtctctac 1217
             |||| ||||||||  ||||||||||||| |||| | |||||||  ||||||||
Sbjct: 11895 gagctcaggagtttgagaccagcctggggaacacggtgaaaccctgtctctac 11843
Query: 1170  caggagttcaagaccagcctg 1190
             |||||||||||||||||||||
Sbjct: 69962 caggagttcaagaccagcctg 69942
Query: 1106  tggtggctcacacctgcaatcccagcact 1134
             ||||||||||| |||| ||||||||||||
Sbjct: 77363 tggtggctcacgcctgtaatcccagcact 77335

In spite of the fact that the line numbers aren't contiguous, the sequence for "Query" starts with 'gagccca' and still ends with 'agcact', and will be 103 (53 + 21 + 29) characters long. As you'll see shortly, the program is designed to ignore the gaps (and the line numbers) and input the data properly. Frequent BLAST users may also notice that in a full BLAST report, each sequence is grouped by E-values. We are ignoring that (usually) important fact in the program.

The Perl program used to search for the five substrings can be broken down into three parts:

  • Inputting the data and preparing it for analysis

  • Searching the data and looking for the patterns

  • Compiling the results and storing them in report.txt

Let's go through the program step by step. Here are the first few lines:

#!/usr/bin/perl
# Search through a large datafile, looking for particular sequences

use strict;

my $REPORT_FILE = "report.txt";
my $blast_file  = $ARGV[0] || 'blast.dat';

unless ( -e $blast_file ) {
  die "$0: ERROR: missing file: $blast_file";
}

This code makes sure that the data is in good order. Since you'll be reading large amounts of data into the variables, tell Perl to tighten up its rules with the line use strict;. This forces you to be more explicit about how you want Perl to do things. use strict is particularly useful when developing large programs or programs you want to modify and reuse. Go on to declare some variables, and in the last few lines, tell Perl to make sure that data actually exists in the input file blast.dat.

In the next block of code, the program reads the sequences into variables:

# First, slurp all the Query sequences into one scalar. Same for the 
# Sbjct sequences.
my ($query_src, $sbjct_src);

# Open the blast datafile and end program (die) if we can't find it
open (IN, $blast_file) or die "$0: ERROR: $blast_file: $!";

# Go through the blast file line by line, concatenating all the Query and 
# Sbjct sequences.
while (my $line = <IN>) {
  chomp $line;
  print "Processing line $.
";

In this section you read all the "Query" sequences into one scalar variable, and the "Sbjct" sequences into another. The program then opens the file for reading with:

 open (IN, $blast_file) or die "$0: ERROR: $blast_file: $!";  

or prints an error message if for some reason it can't find the file. Next, the program goes through the file line by line, removing the line-break characters with chomp $line;. And finally, with a print function, you ask the program to display the current row as it reads in the data.

Now that you have the data in memory, you need to sort through it and extract the material you want. Remember that the datafile included a lot of superfluous material you want to ignore. To do that, instruct Perl to consider only those lines that begin with Query or Sbjct. Now Query and Sbjct, in addition to having the desired sequence data, also have line numbers of varying length you don't want. In order to read the sequence data correctly, you must design the program in such a way that you skip over line numbers no matter how many characters they have, and always land on the first nucleotide. You'll notice in this line of data:

Query: 1165 gagcccaggagttcaagaccagcctgggtaacatgatgaaacctcgtctctac 1217 

that there is a space between Query, the beginning line number, the sequence data, and the ending line number. Since this happens to be true for all the query and subject lines, it becomes the key to how to read the data correctly. To be sure you get only what you want, split all the query and subject data into a four column array called @words. That task is accomplished by the following lines of code:

    my @words = split /s+/, $line;
    if ($line =~ /^Query/) {
	$query_src .= $words[2];
    } elsif ($line =~ /^Sbjct/) {
	$sbjct_src .= $words[2];
    }
}

# We've now read the blast file, so we can close it.
close IN;

Once you've read the data into @words, you then instruct the program to take only the data from column two of @words (which is filled only with nucleotide sequence data) and store it in the variables $query_src and $sbjct_src. The program then closes the file and moves to a new line. You now have just the data you want, stored in a form you can use.

The next part of the program performs the analysis:

# Now, look for these given sequences...
my @patterns = ('gtccca', 'gcaatg', 'cagct', 'tcggga', '-'),

# ...and when we find them, store them in these hashes
my (%query_counts, %sbjct_counts);

# Search and store the sequences
foreach my $pattern (@patterns) {
   while ( $query_src =~ /$pattern/g ) {
       $query_counts{ $pattern }++;
   }
   while ( $sbjct_src =~ /$pattern/g ) {
       $sbjct_counts{ $pattern }++;
   }
}

The program sets up a loop that runs five times; once for each search string or pattern. Within each iteration of the outer foreach loop, the program runs inner while loops that advance counters each time they find a pattern match. The results are stored in separate hashes called %query_counts and %sbjct_counts.

Here is the last section of the program, which produces the output:

# Create an empty report file
open (OUT, ">$REPORT_FILE") or die "$0: ERROR: Can't write $REPORT_FILE";

# Print the header of the report file, including 
# the current date and time
print OUT "Sequence Report
", 
    "Run by O'Reilly on ", scalar localtime, "
",
    "
NOTE: In the following reports, a dash (-) represents
",
    "      missing data in the chromosomal sequence

", 
    "Total length of 'Query' sequence: ",
    length $query_src, " characters
", "Results for 'Query':
";

# Print the Query matches
foreach my $key ( sort @patterns ) {
    print OUT "	'$key' seen $query_counts{$key}
";
}

print OUT "
Total length of 'Sbjct' sequence: ",
    length $sbjct_src, " characters
", "Results for 'Sbjct':
";

# Print the Sbjct matches
foreach my $key ( sort @patterns ) {
    print OUT "	'$key' seen $sbjct_counts{$key}
";
}

close OUT;

__END__

This code compiles and formats the results and dumps them into a file called report.txt. If you open report.text you see:

Sequence Report
Run by O'Reilly on Tue Jan  9 15:52:48 2001

NOTE: In the following reports, a dash (-) represents
      missing data in the chromosomal sequence

Total length of 'Query' sequence: 1115 characters
Results for 'Query':
        '-' seen 7
        'cagct' seen 11
        'gcaatg' seen 1
        'gtccca' seen 6
        'tcggga' seen 1

Total length of 'Sbjct' sequence: 5845 characters
Results for 'Sbjct':
        '-' seen 12
        'cagct' seen 2
        'gcaatg' seen 6
        'gtccca' seen 1
        'tcggga' seen 6

In this example the results were sent to a file. You can just as easily ask Perl to generate an HTML-coded file you can view with your web browser. Or you can make the process interactive and use Perl to create a CGI script that generates a web form to analyze the data and give you back your results.

We've only scratched the surface in terms of what this sort of program can do. You can easily modify it to look for more general patterns in the data or more specific ones. For example, you might search for `tcggga' and `gcaatg', but only count them if they are connected by `cagct'. You also might search only for breaks in the data. And after all the searches are complete, you might use Perl to automatically store all the results in a database.

If you're feeling a little confused by all this, don't panic. We aren't expecting you to understand all the code we've shown you. As we said at the beginning of the chapter, the purpose of the code isn't to teach you to program in Perl, but to show you how Perl works, and also to show you that programming isn't really all that difficult. If you have what it takes to design an experiment, then you have what it takes to program in Perl or any other language.

Applying Perl to Bioinformatics

The good news is the more you practice programming, the more you learn. And the more you learn, the more you can do. Programming in Perl is all about analyzing data and building tools. As we've said before, biological data is proliferating at an astounding rate. The only chance biologists have of keeping up with the job of analyzing it is by developing libraries of reusable software tools. In Perl, there are a huge number of reusable functions available for use in your programs. Rather than being wrapped in a complete program, a group of related functions are packaged as a module . In your programs, you can use various modules to access the functions they support. There are Perl modules for other scientific disciplines, as well as games, graphics programming, video, artificial intelligence, statistics, and music. And they're all free.

To distinguish them from other Perl files, modules have a .pm suffix. To use functions from a module (say, CGI.pm) in your Perl programs, include the following line after the shebang line:

use CGI;

The primary source of all modules is the Comprehensive Perl Archive Network, or CPAN. CPAN (http://www.cpan.org) is a collection of sites located around the world, each of which mirrors the contents of the main CPAN site in Finland. To find the CPAN site nearest you, check the Perl web site (http://www.perl.com).

Because there are so many modules available, before you sit down to write a new function, it is worth your time to check the CPAN archive to see if anyone has already written it for you. In this section, we briefly describe some Perl modules that are particularly useful for solving common problems in bioinformatics. This list is by no means comprehensive; you should keep an eye on CPAN for current developments.

Bioperl

The Bioperl Project (along with its siblings, Biopython, BioJava, and Bioxml) is dedicated to the creation of an open source library of modules for bioinformatics research. The general idea is that common items in bioinformatics (such as sequences and sequence alignments) are represented as objects in Bioperl. Thus, if you use Bioperl, instead of having to constantly rewrite programs that read and write sequence files, you simply call the appropriate functions from Bio::SeqIO, and dump the resulting sequence data into a sequence object.

Bioperl isn't limited to storing sequences: it currently contains modules for generating and storing sequence alignments, managing annotation data, parsing output from the sequence-database search programs BLAST and HMMer, and has other modules on the way. In addition to the core Bioperl distribution, the ScriptCentral script repository at the Bioperl web site (http://www.bioperl.org) hosts a collection of biology-related scripts. To learn more about downloading, installing, and using Bioperl, see http://www.bioperl.org.

CGI.pm

CGI.pm is a module for programming interactive web pages. The functions it provides are geared toward formatting web pages and creating and processing forms in which users enter information. If you have used the Web, you almost certainly have used web pages written using CGI.pm. For example, let's create a page that asks the user what his favorite color is using an HTML form. When the user enters the data, the script stores it in a field named `color'. When the user hits "Submit," the same page is loaded, only this time, $query->param(`color') contains the name of a color, so the print statement after the "else" is executed. The CGI script looks like this:

#!/usr/bin/perl 

use CGI;                              # Load Perl's CGI module

my $query = new CGI;                  # Create a CGI object named $query

                                      # Send the HTML header and <HTML> tag
print $query->header, $query->start_html;

                    # If the user is visiting the site for the first time, ask him
                    # what his favorite color is  

unless ($query->param('color')) {     # Page 1: Asking the user
    print $query->start_form, "What is your favorite color? ", 
          $query->popup_menu( -name   => "color", 
			      -values => ["red", "green", "blue"] ),
          $query->submit,
          $query->end_form;
} else {                               # Page 2: Telling the user  
    print "Your favorite color is ", $query->param('color'),
}

print $query->end_html;                # Send the </HTML> tag

The results of this script are shown in Figure 12-1.

Our CGI script generates an interactive web page

Figure 12-1. Our CGI script generates an interactive web page

LWP

If CGI.pm automates the Web from the server's perspective, the Library for Web Programming (LWP) automates web interaction from the perspective of the client. Using LWP, Perl programs can submit data to forms, retrieve web pages, and eliminate much of the tedium of manually interacting with web services through a browser. For example, let's say you want to retrieve and print out the HTML source for the main page of http://www.oreilly.com. You can use the LWP::Simple module as follows:

#!/usr/bin/perl
use LWP::Simple;
print get("http://www.oreilly.com");

This retrieves the HTML source code for http://www.oreilly.com and displays it on your screen.

PDL

The Perl Data Language (which is abbreviated PDL and pronounced "piddle") is a module for doing math with matrices. It is frequently used for scientific applications and image processing in conjunction with the GIMP (since computer representations of images are just matrices). In computational biology, PDL is invaluable for working with microarray expression data and scoring matrices, as well as data that begins as images. For example, 2D gels that measure protein-protein interaction are usually stored as images, and image processing tricks can locate and compare gel features.

Why do you need a whole library to do linear algebra with Perl? PDL allows you to work with matrices of arbitrary dimensionality as if they were scalar variables. For example, a 2D matrix constructed using standard Perl arrays looks like $a[$i][$j]. If you wanted to add two array-based matrices (let's call them @a and @b) and store the result to another matrix, @c, you have to write code that looks like this:

for( $i=0; $i<$row_max; $i++ ) {
    for( $j=0; $j<$col_max; $j++ ) {
       $c[$i][$j] = $a[$i][$j] + $b[$i][$j];
    }
}

so that you end up writing two loops, the outer one to iterate over each of the rows, and the inner to iterate over each column. With PDL, you simply write:

$c = $a + $b; 

In other words, when you define your multidimensional arrays as piddles (PDL's name for its matrix data object) instead of Perl arrays, PDL makes it look like you are working with simple scalar objects, even if you are working with several-megabyte matrices. In addition, PDL comes with an interactive mode called perldl that is useful for trying out calculations with PDL, similar to the interactive modes provided by the numerical data analysis packages R and Octave (which we will meet in Chapter 14).

DBI

DBI (short for database interface) is a module for writing programs that interact with relational databases. It allows you to write programs that put data into databases, query databases, and extract records from databases, without ever having to pay attention to the specific database you are using. For example, a script written with DBI can be used with a MySQL database or an Oracle database with only minor changes.

GD

The GD.pm module allows you to generate graphics using Perl programs. GD is often used to create simple, customized plots on web pages, such as web server usage statistics. PaintBlast.pm, a module that generates graphical representations of sequence alignments from BLAST output, is an example of a GD application. It `s available from Bioperl's ScriptCentral.



[*] Efficiency from the programmer's point of view, that is. It takes far less programming time to extract data with Perl than with C or with Java.

[†] Strictly speaking, the shebang line isn't necessary on Windows or Macintosh; neither of those systems has a usr/bin/perl. It's good programming form to always include the line, however, since it's the best place to indicate your optional arguments in Perl. On other platforms, you can run the program by invoking the Perl interpreter explicitly, as in perl hello.pl.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset