As we've seen in previous chapters, a vast assortment of software tools exists for bioinformatics. Even though it's likely that someone has already written what you need, you will still encounter many situations in which the best solution is to do it yourself. In bioinformatics, that often means writing programs that sift through mountains of data to extract just the information you require. Perl, the Practical Extraction and Reporting Language, is ideally suited to this task.
There are a lot of programming languages out there. In our survey of bioinformatics software, we have already seen programs written in Java, C, and FORTRAN. So, why use Perl? The answer is efficiency.[*] Biological data is stored in enormous databases and text files. Sorting through and analyzing this data by hand (and it can be done) would take far too long, so the smart scientist writes computer tools to automate the process. Perl, with its highly developed capacity to detect patterns in data, and especially strings of text, is the most obvious choice. The next obvious choice would probably be Python. Python, the less well known of the two, is a fully object-oriented scripting language introduced by Guido van Rossum in 1988. Python has some outstanding contributed code, including a mature library for numerical methods, tools for building graphical user interfaces quickly and easily, and even a library of functions for structural biology. At the end of the day, however, it's the wealth of existing Perl code for bioinformatics, the smooth integration of that code onto Unix-based systems, cross-platform portability, and an incredibly enthusiastic user community that makes Perl our favorite scripting language for bioinformatics applications.
Perl has a flexible syntax, or grammar, so if you are familiar with programming in other languages such as C or BASIC, it is easy to write Perl code in a C-like or BASIC-like dialect. Perl also takes care of much of the dirty work involved in programming, such as memory allocation, so you can concentrate on solving the problem at hand. It's often the case that programming problems requiring many lines of code in C or Java may be solved in just a few lines of Perl.
Many excellent books have been written about learning and using Perl, so this single chapter obviously can't cover everything you will ever need to know about the language. Perl has a mountain of features, and it's unrealistic to assume you can master it without a serious commitment to learning the art of computer programming. Our aim in this chapter isn't to teach you how to program in Perl, but rather to show you how Perl can help you solve certain problems in bioinformatics. We will take you through some examples that are most immediately useful in real bioinformatics research, such as reading datafiles, searching for character strings, performing back-of-the-envelope calculations, and reporting findings to the user. And we'll explain how our sample programs work. The rest is up to you. The ability to program in any language—but especially in Perl, Python, or Java—is an important skill for any bioinformatician to have. We strongly suggest you take a programming class or obtain one of the books on our list of recommended reading and start from the beginning. You won't regret it.
Perl is available for a variety of operating systems, including
Windows and Mac OS, as well as
Linux and other flavors of Unix. It's distributed under an open
source license, which means that it's essentially free. To
obtain Perl from the Web, go to http://www.perl.com/pub/language/info/software.html
and follow the instructions for downloading and installing it on your
system.
Once you've installed Perl, or confirmed with your system administrator that it's already installed on your system, you're ready to begin writing your first program. Writing and executing a Perl program can be broken into several steps: writing the program (or script) and saving it in a file, running the program, and reading the output.
A Perl program is a text file that contains instructions written in the Perl language. The classic first program in Perl (and many other languages) is called "Hello, World!" It's written like this:
#!/usr/bin/perl -w # Say hello print "Hello, World! ";
"Hello, World!" is a short program, but it's still
complete. The first line is called the shebang line and tells the
computer that this is a Perl program. All Perl programs running on
Unix begin with this line.[†] It's
a special kind of comment to the Unix shell that tells it where to
find Perl, and also instructs it to look for optional arguments. In
our version of "Hello World!" we've included the
optional argument -w
at the end of the line.
This argument tells Perl to give extra warning messages if you do
something potentially dangerous in your program. It's a good
idea to always develop your programs under -w
.
The second line starts with a # sign. The # tells Perl that the line of text that follows is a comment, not part of the executable code. Comments are how humans tell each other what each part of the program is intended to do. Make a habit of including comments in your code. That way you and other people can add to your code, debug it successfully, and even more importantly, remember what it was supposed to do in the first place.
The third line calls the print
function with a
single argument that consists of a text string. At the end of the
text string, there is a
, which tells the
computer to move to a new line after executing the
print
statement. The print
statement ends with a semicolon, as do most statements in Perl.
To try this little program yourself, you can open a text editor such
as vi
, Emacs, or
pico
, and type the lines in. When
you've finished entering the program, name the file
hello.pl
and save it in your directory of
choice. While you're learning, you might consider creating a
new directory (using the mkdir
command, which we
covered in Chapter 4) called
Perl
in your home directory. That way
you'll always know where to look for your Perl programs.
Now make the file executable using the command:
% chmod +x hello.pl
(If you need a refresher on
chmod
, this
would be a good time to review the section on changing file
permissions in Chapter 4.) To run the program,
type:
% hello.pl
Because of the shebang line in our program, this command invokes the
Perl interpreter, which reads the rest of the file and then
translates your Perl source code into machine code the computer can
execute. In this case you'll notice that Hello,
World!
appears on your computer screen, and then the
cursor advances to a new line. You've now written and run your
first Perl program!
One of the strengths of Perl—and the reason that bioinformaticians love it—is that with a few lines of code, you can automate a tedious task such as searching for a nucleotide string contained in a block of sequence data. To do this, you need a slightly more complex Perl program that might look like this:
#!/usr/bin/perl -w # Look for nucleotide string in sequence data my $target = "ACCCTG"; my $search_string = 'CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA'. 'CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT'. 'ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC'; my @matches; # Try to find a match in letters 1-6 of $search_string, then look at letters 2-7, # and so on. Record the starting offset of each match. foreach my $i (0..length $search_string){ if( $target eq substr( $search_string, $i, length $target)){ push @matches, $i; } } # Make @matches into a comma-separated list for printing print "My matches occurred at the following offsets: @matches. "; print "done ";
This program is also short and simple, but it's still quite powerful. It searches for the target string "ACCCTG" in a sequence of data and keeps track of the starting location of each match. The program demonstrates variables and loops, which are two basic programming constructs you need to understand to make sense of what is going on.
A variable is a name that is associated with a data value, such as a string or a number. It is common to say that a variable stores or contains a value. Variables allow you to store and manipulate data in your programs; they are called variables because the values they represent can change throughout the life of a program.
Our sequence matching program declares four variables:
$target
,
$search_string
, @matches
,
and $i
. The
$ and @ characters indicate the kind of
variable each one is. Perl has three kinds of variables built into
the language: scalars, arrays, and hashes.
Unlike other programming languages, Perl doesn't require formal
declaration of variables; they simply exist upon their first use
whether you explicitly declare them or not. You may declare your
variables, if you'd like, by using either
my
or our
in front of the
variable name. When you declare a variable, you give it a name. A
variable name must follow two main rules: it must start with a letter
or an underscore (the $ and @ characters aren't considered part
of the name), and it must consist of letters, digits, and
underscores. The best names are ones that clearly, concisely, and
accurately describe the variable's role in the program. For
example, it is easier to guess the role of a variable if it is named
$target
or $sequence
, than
if it were called $icxl
.
A scalar variable contains a single piece of data that is either a number or a string. The $ character indicates that a variable is scalar. The first two variables declared in our program are scalar variables:
my $target = "ACCCTG"; my $search_string = "CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA". "CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT". "ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC";
In this case, "ACCCTG" is the target string we are seeking, and "CCACACCACACCCACAC..." is the sequence data in which we're hoping to find it.
In a scalar variable, a number can be either an integer (0, 1, 2, 3, etc.) or a real number (a number that contains a fractional portion, such as 5.6). A string is a chunk of text that's surrounded by quotes. For example:
"I am a string." 'I, too, am a string'
One of Perl's special features is that it has a number of built-in facilities for manipulating strings, which comes in handy when working with the flat text files common to bioinformatics. We cover flat text files and their more structured relatives, relational databases, in detail in Chapter 13.
An array is an ordered list of data. In our
sequence matching program, @matches
is an array
variable used to store the starting locations of all the matches.
Each element stored in an array can be accessed by its position in
the list, which is represented as a number. In Perl, array variables
are given an @ prefix. For example, the following statement declares
an array of numbers:
@a = ( 1, "4", 9 );
This statement declares an array of strings:
@names = ("T. Herman", "N. Aeschylus", "H. Ulysses", "Standish");
And this statement declares an array with both:
@mix = ("Caesar Augustus", "Tiberius", 18, "Caligula", "Claudius");
Note the syntax in the declarations: each element in the array is separated from its neighbors by a comma, each of the strings is quoted, and (unlike American English) the comma appears outside of the quotes.
Because an array is an ordered set of information, you can retrieve
each element in an array according to its number. The individual
elements in an array are written as
$this_array[i]
, where i is
the index of the array element being addressed. Note that
i can be either a bare number (such as
21
), or a numeric scalar variable (such as
$n
) that contains a bare number. Here is a Perl
statement that uses the print
operator to
display the second number in @a
and the third
name in @names
on the screen:
print "second number: $a[1] third name: $names[2] ";
You may be wondering why the element numbers here are one less than
what you might think they should be. The reason is that positions in
Perl arrays are numbered starting from zero. That is, the first
element in an array is numbered 0, the second element is numbered 1,
and so on. That's why, in the previous example, the second
element in @a
is addressed as
$a[1]
. This is an important detail to remember;
mistakes in addressing arrays due to missing that crucial zero
element are easy to make.
A hash
is also known as an associative
array
because it associates a name (or key, as
it's called in Perl) with each piece of data (or value) stored
in it. A real-world example of a hash is a telephone book, in which
you look up a person's name in order to find her telephone
number. Our sequence matching program doesn't use any hashes,
but they can be quite handy in bioinformatics programs, as
you'll see in a later example. Perl uses the
% prefix
to indicate hash variables (e.g., %sequences
).
There are a number of ways to declare a hash and its contents as a
list of key/value pairs. Here is the syntax for one declaration
style:
%hash = ( key1 => "value1", key2 => "value2", ... last_key => "last_value" );
A value can then be retrieved from this hash using the corresponding key, as follows:
$value = $hash{"key2"};
For example, you can declare a hash that relates each three-letter amino acid code to its one-letter symbol:
my %three_to_one = ( ALA => A, CYS => C, ASP => D, GLU => E, PHE => F, GLY => G, HIS => H, ILE => I, LYS => K, LEU => L, MET => M, ASN => N, PRO => P, GLN => Q, ARG => R, SER => S, THR => T, VAL => V, TRP => W, TYR => Y );
The hash entry with the one-letter code for arginine can then be displayed using the following statement:
print "The one-letter code for ARG is $three_to_one{ARG} ";
Because there are many popular sequence databases, another place where hashes can be immediately useful is in keeping track of which sequence ID in one database corresponds to a sequence ID in the next. In the following example, we define a hash in which each of the keys is a GenBank identifier (GI) number of a particular enzyme, and each value is the corresponding SWISS-PROT identifier of the same enzyme. Using this hash, a program can take the more cryptic GI number and automatically find the associated SWISS-PROT ID:
#!/usr/bin/perl -w # define the hash relating GI numbers to SWISSPROT IDs %sods = ( g134606 => "SODC_DROME", g134611 => "SODC_HUMAN", g464769 => "SODC_CAEEL", g1711426 => "SODC_ECOLI" ); # retrieve a value from %sods $genbank_id = "g134611"; $swissprot_id = $sods{$genbank_id}; print "$genbank_id is the same as $swissprot_id ";
If you save the previous script to a file, make the file executable, and run it, you should see:
g134611 is the same as SODC_HUMAN
In the first part of this script, you are declaring the hash relating
GenBank IDs to SWISS-PROT IDs. In the second part, you access the
information stored in that hash. The first step is to assign one of
the GenBank IDs to the variable $genbank_id
.
Then you can retrieve the SWISS-PROT ID that
%sods
has associated with the string in
$genbank_id
, and store the SWISS-PROT ID in the
variable $swissprot_id
. Finally, print the
values of the two scalar variables. This example is obviously rather
contrived, but it should give you an idea of how useful hashes can be
in bioinformatics programs.
Now that we've talked about scalar,
array, and hash variables in Perl, let's return to our sequence
matching program and talk about the other main programming construct
it employs. A loop is a programming device that
repeatedly executes a specific set of commands until a particular
condition is reached. Our program uses a foreach
loop to iterate through the search string:
foreach my $i (0..length $search_string){ if( $target eq substr( $search_string, $i, length $target)){ push @matches, $i; } }
The first time through this loop, Perl starts at 0 and looks at the
first six-letter combination in the search string, compares it to the
target string, and, if there is a match, records it in
@matches
. The second cycle of the loop looks at
letters 2-7, the third looks at letters 3 -8, and so on. Perl stops
executing this sequence when it reaches the end of the search string.
At this point, the loop is done, and the program moves on to the next
section, where it prints the results. Don't worry if you
don't understand all the code in the loop; all that's
important right now is that you have a general understanding of what
the code is doing.
Although we
don't use them in any of our example programs, the use of
subroutines in programs is worth mentioning. All modern programming
languages provide a way to bundle up a set of statements into a
subroutine so that they can be invoked concisely
and repeatedly. In Perl, you can create a subroutine with the
sub
declaration:
sub greet { my ($name) = shift; print "Hello, $name! "; }
Once this greet
subroutine has been declared,
you can invoke it as follows:
greet("world"); # Prints "Hello, world!" greet("Per"); # Prints "Hello, Per!"
Here, "world"
and "Per"
are
arguments
—values
passed into the subroutine, where they are then stored in
$name
. Our greet
subroutine
just prints a single line and then returns. Usually, subroutines do
something a bit more complicated, possibly returning a value:
$length = calculate_length($sequence);
This sets $length
to whatever the
calculate_length( )
subroutine returns when
provided with the single argument $sequence
.
When a subroutine is used for its return value, it's often
called a
function.
A major feature of Perl is its
pattern matching, and particularly
its use of regular expressions. A
regular expression (or
regex
in the Perl vernacular) is a
pattern that can be matched against a string of data. We first
encountered regular expressions in our discussion of the Unix command
grep
, back in Chapter 5.
grep
, as you may recall, searches for
occurrences of patterns in files. When you tell
grep
to search for a pattern, you describe what
you're looking for in terms of a regular expression. As you
know, much of bioinformatics is about searching for patterns in data.
Let's look at a Perl example. Say you have a string, such as a DNA sequence, and you want to make sure that there are no illegal characters in it. You can use a regular expression to test for illegal characters as follows:
#!/usr/bin/perl # check for non-DNA characters my $string = "CCACACCACACCCACACaCCCaCaCATCACACCACACACCACACTACACCCA*CACACACA"; if( $string =~ m/([^ATCG])/i) { print "Warning! Found: $1 in the string"; }
This program contains the regular expression
[^ATCG]
. Translated into English, the regular
expression says "look for characters in
$string
that don't match A, T, C, or
G." (The i
at the end of the statement
tells Perl to match case insensitively; that is,
to pay no attention to case. Perl's default is to treat A
differently from a.) If Perl encounters something other than the
declared pattern, the program prints out the offending character. The
output of this program is:
Warning! Found * in the string
If instead you want to search for a particular combination of
letters, like "CAT", you can change the regular
expression to read CAT
:
#!/usr/bin/perl # check for CATs my $string = "CCACACCACACCCACACaCCCaCaCATCACACCACACACCACACTACACCCA*CACACACA"; if( $string =~ m/CAT/i ){ print "Meow."; }
The output of this modified program is:
Meow.
Now that you know enough about how Perl is written to understand these simple programs, let's apply it to one of the most common problems in bioinformatics: parsing BLAST output. As you already know, the result of a BLAST search is often a multimegabyte file full of raw data. The results of several searches can quickly become overwhelming. But by writing a fairly simple program in Perl, you can automate the process of looking for a single string or multiple strings in your data.
Consider the following block of data:
. . . gb|AC005288.1|AC005288 Homo sapiens chromosome 17, clone hC... 268 2e-68 gb|AC008812.7|AC008812 Homo sapiens chromosome 19 clone CTD... 264 3e-67 gb|AC009123.6|AC009123 Homo sapiens chromosome 16 clone RP1... 262 1e-66 emb|AL137073.13|AL137073 Human DNA sequence from clone RP11... 260 5e-66 gb|AC020904.6|AC020904 Homo sapiens chromosome 19 clone CTB... 248 2e-62 >gb|AC007421.12|AC007421 Homo sapiens chromosome 17, clone hRPC.1030_O_14, complete sequence Query: 3407 accgtcataaagtcaaacaattgtaacttgaaccatcttttaactcaggtactgtgtata 3466 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 1366 accgtcataaagtcaaacaattgtaacttgaaccatcttttaactcaggtactgtgtata 1425 Query: 3467 tacttacttctccccctcctctgttgctgcagatccgtgggcgtgagcgcttcgagatgt 3526 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 1426 tacttacttctccccctcctctgttgctgcagatccgtgggcgtgagcgcttcgagatgt 1485 Query: 3527 tccgagagctgaatgaggccttggaactcaaggatgcccaggctgggaaggagccagggg 3586 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 1486 tccgagagctgaatgaggccttggaactcaaggatgcccaggctgggaaggagccagggg 1545 Query: 3587 ggagcagggctcactccaggtgagtgacctcagccccttcctggccctactcccctgcct 3646 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 1546 ggagcagggctcactccaggtgagtgacctcagccccttcctggccctactcccctgcct 1605 Query: 3647 tcctaggttggaaagccataggattccattctcatcctgccttcatggtcaaaggcagct 3706 . . .
This is only a small portion of what you might find in a report from
a BLAST search. (This is actual data from a BLAST report. The entire
file, blast.dat
, is too large to reproduce
here.) The first six lines of this sample contain information about
the BLAST search, as well as other "noise" that's
of no importance to the search. The next 13 lines, and the ones that
follow it in the actual report, contain the data to analyze. You want
the Perl program to look at both the "Query" and
"Sbjct" lines in this BLAST report and find the number of
occurrences of the following substrings:
'gtccca'
'gcaatg'
'cagct'
'tcggga'
Missing data (represented by dashes in the nucleotide sequence strings)
At the same time, you want the program to ignore irrelevant material
such as information about the search and other noise. The program
should then generate a report file called
report.txt
that describes the findings for these
strings.
In this program you need to create two very long scalar variables to
represent each sequence for searching. Let's call them
$query_src
, and $sbjct_src
.
In any BLAST output, you'll notice that sometimes the
"Sbjct" and "Query" lines aren't
contiguous; that is, there are gaps in the data. From a programming
perspective, the fact that the gaps exist isn't important; you
simply want to read the nucleotides into your scalars consecutively.
Here is a sample portion of BLAST data:
Query: 1165 gagcccaggagttcaagaccagcctgggtaacatgatgaaacctcgtctctac 1217 |||| |||||||| ||||||||||||| |||| | ||||||| |||||||| Sbjct: 11895 gagctcaggagtttgagaccagcctggggaacacggtgaaaccctgtctctac 11843 Query: 1170 caggagttcaagaccagcctg 1190 ||||||||||||||||||||| Sbjct: 69962 caggagttcaagaccagcctg 69942 Query: 1106 tggtggctcacacctgcaatcccagcact 1134 ||||||||||| |||| |||||||||||| Sbjct: 77363 tggtggctcacgcctgtaatcccagcact 77335
In spite of the fact that the line numbers aren't contiguous, the sequence for "Query" starts with 'gagccca' and still ends with 'agcact', and will be 103 (53 + 21 + 29) characters long. As you'll see shortly, the program is designed to ignore the gaps (and the line numbers) and input the data properly. Frequent BLAST users may also notice that in a full BLAST report, each sequence is grouped by E-values. We are ignoring that (usually) important fact in the program.
The Perl program used to search for the five substrings can be broken down into three parts:
Inputting the data and preparing it for analysis
Searching the data and looking for the patterns
Compiling the results and storing them in
report.txt
Let's go through the program step by step. Here are the first few lines:
#!/usr/bin/perl # Search through a large datafile, looking for particular sequences use strict; my $REPORT_FILE = "report.txt"; my $blast_file = $ARGV[0] || 'blast.dat'; unless ( -e $blast_file ) { die "$0: ERROR: missing file: $blast_file"; }
This code makes sure that the data is in good order. Since
you'll be reading large amounts of data into the variables,
tell Perl to tighten up its rules with the line
use
strict;
. This forces
you to be more explicit about how you want Perl to do things.
use strict
is particularly useful when
developing large programs or programs you want to modify and reuse.
Go on to declare some variables, and in the last few lines, tell Perl
to make sure that data actually exists in the input file
blast.dat
.
In the next block of code, the program reads the sequences into variables:
# First, slurp all the Query sequences into one scalar. Same for the # Sbjct sequences. my ($query_src, $sbjct_src); # Open the blast datafile and end program (die) if we can't find it open (IN, $blast_file) or die "$0: ERROR: $blast_file: $!"; # Go through the blast file line by line, concatenating all the Query and # Sbjct sequences. while (my $line = <IN>) { chomp $line; print "Processing line $. ";
In this section you read all the "Query" sequences into one scalar variable, and the "Sbjct" sequences into another. The program then opens the file for reading with:
open (IN, $blast_file) or die "$0: ERROR: $blast_file: $!";
or prints an error message if for some reason it can't find the
file. Next, the program goes through the file line by line, removing
the line-break characters with chomp $line;
. And
finally, with a print function, you ask the program to display the
current row as it reads in the data.
Now that you have the data in memory, you need to sort through it and
extract the material you want. Remember that the datafile included a
lot of superfluous material you want to ignore. To do that, instruct
Perl to consider only those lines that begin with
Query
or Sbjct
. Now
Query
and Sbjct
, in
addition to having the desired sequence data, also have line numbers
of varying length you don't want. In order to read the sequence
data correctly, you must design the program in such a way that you
skip over line numbers no matter how many characters they have, and
always land on the first nucleotide. You'll notice in this line
of data:
Query: 1165 gagcccaggagttcaagaccagcctgggtaacatgatgaaacctcgtctctac 1217
that there is a space between Query
, the
beginning line number, the sequence data, and the ending line number.
Since this happens to be true for all the query and subject lines, it
becomes the key to how to read the data correctly. To be sure you get
only what you want, split all the query and subject data into a four
column array called @words
. That task is
accomplished by the following lines of code:
my @words = split /s+/, $line; if ($line =~ /^Query/) { $query_src .= $words[2]; } elsif ($line =~ /^Sbjct/) { $sbjct_src .= $words[2]; } } # We've now read the blast file, so we can close it. close IN;
Once you've read the data into @words
, you
then instruct the program to take only the data from column two of
@words
(which is filled only with nucleotide
sequence data) and store it in the variables
$query_src
and $sbjct_src
.
The program then closes the file and moves to a new line. You now
have just the data you want, stored in a form you can use.
The next part of the program performs the analysis:
# Now, look for these given sequences... my @patterns = ('gtccca', 'gcaatg', 'cagct', 'tcggga', '-'), # ...and when we find them, store them in these hashes my (%query_counts, %sbjct_counts); # Search and store the sequences foreach my $pattern (@patterns) { while ( $query_src =~ /$pattern/g ) { $query_counts{ $pattern }++; } while ( $sbjct_src =~ /$pattern/g ) { $sbjct_counts{ $pattern }++; } }
The program sets up a loop that runs five times; once for each search
string or pattern. Within each iteration of the outer
foreach
loop, the program runs inner
while
loops that advance counters each time they
find a pattern match. The results are stored in separate hashes
called %query_counts
and
%sbjct_counts
.
Here is the last section of the program, which produces the output:
# Create an empty report file open (OUT, ">$REPORT_FILE") or die "$0: ERROR: Can't write $REPORT_FILE"; # Print the header of the report file, including # the current date and time print OUT "Sequence Report ", "Run by O'Reilly on ", scalar localtime, " ", " NOTE: In the following reports, a dash (-) represents ", " missing data in the chromosomal sequence ", "Total length of 'Query' sequence: ", length $query_src, " characters ", "Results for 'Query': "; # Print the Query matches foreach my $key ( sort @patterns ) { print OUT " '$key' seen $query_counts{$key} "; } print OUT " Total length of 'Sbjct' sequence: ", length $sbjct_src, " characters ", "Results for 'Sbjct': "; # Print the Sbjct matches foreach my $key ( sort @patterns ) { print OUT " '$key' seen $sbjct_counts{$key} "; } close OUT; __END__
This code compiles and formats the results and dumps them into a file
called report.txt
. If you open
report.text
you see:
Sequence Report Run by O'Reilly on Tue Jan 9 15:52:48 2001 NOTE: In the following reports, a dash (-) represents missing data in the chromosomal sequence Total length of 'Query' sequence: 1115 characters Results for 'Query': '-' seen 7 'cagct' seen 11 'gcaatg' seen 1 'gtccca' seen 6 'tcggga' seen 1 Total length of 'Sbjct' sequence: 5845 characters Results for 'Sbjct': '-' seen 12 'cagct' seen 2 'gcaatg' seen 6 'gtccca' seen 1 'tcggga' seen 6
In this example the results were sent to a file. You can just as easily ask Perl to generate an HTML-coded file you can view with your web browser. Or you can make the process interactive and use Perl to create a CGI script that generates a web form to analyze the data and give you back your results.
We've only scratched the surface in terms of what this sort of program can do. You can easily modify it to look for more general patterns in the data or more specific ones. For example, you might search for `tcggga' and `gcaatg', but only count them if they are connected by `cagct'. You also might search only for breaks in the data. And after all the searches are complete, you might use Perl to automatically store all the results in a database.
If you're feeling a little confused by all this, don't panic. We aren't expecting you to understand all the code we've shown you. As we said at the beginning of the chapter, the purpose of the code isn't to teach you to program in Perl, but to show you how Perl works, and also to show you that programming isn't really all that difficult. If you have what it takes to design an experiment, then you have what it takes to program in Perl or any other language.
The good news is the more you practice programming, the more you learn. And the more you learn, the more you can do. Programming in Perl is all about analyzing data and building tools. As we've said before, biological data is proliferating at an astounding rate. The only chance biologists have of keeping up with the job of analyzing it is by developing libraries of reusable software tools. In Perl, there are a huge number of reusable functions available for use in your programs. Rather than being wrapped in a complete program, a group of related functions are packaged as a module . In your programs, you can use various modules to access the functions they support. There are Perl modules for other scientific disciplines, as well as games, graphics programming, video, artificial intelligence, statistics, and music. And they're all free.
To distinguish them from other Perl files, modules have a
.pm
suffix. To use functions from a module
(say, CGI.pm) in your Perl programs, include the following line after
the shebang line:
use CGI;
The primary source of all modules is the
Comprehensive Perl
Archive Network, or CPAN. CPAN (http://www.cpan.org
) is a collection of sites
located around the world, each of which mirrors the contents of the
main CPAN site in Finland. To find the CPAN site nearest you, check
the Perl web site (http://www.perl.com
).
Because there are so many modules available, before you sit down to write a new function, it is worth your time to check the CPAN archive to see if anyone has already written it for you. In this section, we briefly describe some Perl modules that are particularly useful for solving common problems in bioinformatics. This list is by no means comprehensive; you should keep an eye on CPAN for current developments.
The Bioperl Project (along with its siblings, Biopython, BioJava, and Bioxml) is dedicated to the creation of an open source library of modules for bioinformatics research. The general idea is that common items in bioinformatics (such as sequences and sequence alignments) are represented as objects in Bioperl. Thus, if you use Bioperl, instead of having to constantly rewrite programs that read and write sequence files, you simply call the appropriate functions from Bio::SeqIO, and dump the resulting sequence data into a sequence object.
Bioperl isn't limited to storing sequences: it currently
contains modules for generating and storing sequence alignments,
managing annotation data, parsing output from the sequence-database
search programs BLAST and HMMer, and has other modules on the way. In
addition to the core Bioperl distribution, the ScriptCentral script
repository at the Bioperl web site (http://www.bioperl.org
) hosts a collection of
biology-related scripts. To learn more about downloading, installing,
and using Bioperl, see http://www.bioperl.org
.
CGI.pm is a
module for programming interactive web pages. The functions it
provides are geared toward formatting web pages and creating and
processing forms in which users enter information. If you have used
the Web, you almost certainly have used web pages written using
CGI.pm. For example, let's create a page that asks the user
what his favorite color is using an HTML form. When the user enters
the data, the script stores it in a field named
`color'. When the user hits
"Submit," the same page is loaded, only this time,
$query->param(`color')
contains
the name of a color, so the print statement
after the "else" is executed. The CGI script looks like
this:
#!/usr/bin/perl use CGI; # Load Perl's CGI module my $query = new CGI; # Create a CGI object named $query # Send the HTML header and <HTML> tag print $query->header, $query->start_html; # If the user is visiting the site for the first time, ask him # what his favorite color is unless ($query->param('color')) { # Page 1: Asking the user print $query->start_form, "What is your favorite color? ", $query->popup_menu( -name => "color", -values => ["red", "green", "blue"] ), $query->submit, $query->end_form; } else { # Page 2: Telling the user print "Your favorite color is ", $query->param('color'), } print $query->end_html; # Send the </HTML> tag
The results of this script are shown in Figure 12-1.
If CGI.pm automates the Web from the server's perspective,
the
Library for Web
Programming (LWP) automates web interaction from the perspective of
the client. Using LWP, Perl programs can submit data to forms,
retrieve web pages, and eliminate much of the tedium of manually
interacting with web services through a browser. For example,
let's say you want to retrieve and print out the HTML source
for the main page of http://www.oreilly.com
. You
can use the LWP::Simple module as follows:
#!/usr/bin/perl use LWP::Simple; print get("http://www.oreilly.com");
This retrieves the HTML source code for http://www.oreilly.com
and displays it on your
screen.
The Perl Data Language (which is abbreviated PDL and pronounced "piddle") is a module for doing math with matrices. It is frequently used for scientific applications and image processing in conjunction with the GIMP (since computer representations of images are just matrices). In computational biology, PDL is invaluable for working with microarray expression data and scoring matrices, as well as data that begins as images. For example, 2D gels that measure protein-protein interaction are usually stored as images, and image processing tricks can locate and compare gel features.
Why do you need a whole library to do linear algebra with Perl? PDL allows you to work with matrices of arbitrary dimensionality as if they were scalar variables. For example, a 2D matrix constructed using standard Perl arrays looks like $a[$i][$j]. If you wanted to add two array-based matrices (let's call them @a and @b) and store the result to another matrix, @c, you have to write code that looks like this:
for( $i=0; $i<$row_max; $i++ ) { for( $j=0; $j<$col_max; $j++ ) { $c[$i][$j] = $a[$i][$j] + $b[$i][$j]; } }
so that you end up writing two loops, the outer one to iterate over each of the rows, and the inner to iterate over each column. With PDL, you simply write:
$c = $a + $b;
In other words, when you define your multidimensional arrays as
piddles (PDL's name for its matrix data object) instead of Perl
arrays, PDL makes it look like you are working with simple scalar
objects, even if you are working with several-megabyte matrices. In
addition, PDL comes with an interactive mode called
perldl
that is useful for trying out
calculations with PDL, similar to the interactive modes provided by
the numerical data analysis packages R and Octave (which we will meet
in Chapter 14).
DBI (short for database interface) is a module for writing programs that interact with relational databases. It allows you to write programs that put data into databases, query databases, and extract records from databases, without ever having to pay attention to the specific database you are using. For example, a script written with DBI can be used with a MySQL database or an Oracle database with only minor changes.
The GD.pm module allows you to generate graphics using Perl programs. GD is often used to create simple, customized plots on web pages, such as web server usage statistics. PaintBlast.pm, a module that generates graphical representations of sequence alignments from BLAST output, is an example of a GD application. It `s available from Bioperl's ScriptCentral.
[*] Efficiency from the programmer's point of view, that is. It takes far less programming time to extract data with Perl than with C or with Java.
[†] Strictly speaking, the shebang line isn't necessary on
Windows or Macintosh; neither of those systems has a
usr/bin/perl
. It's good programming form
to always include the line, however, since it's the best place
to indicate your optional arguments in Perl. On other platforms, you
can run the program by invoking the Perl interpreter explicitly, as
in perl hello.pl
.