Perl may well be the best thing to happen to the Unix programming environment in years; it is worth the price of admission to Linux alone.[48] Perl is a text- and file-manipulation language, originally intended to scan large amounts of text, process it, and produce nicely formatted reports from that data. However, as Perl has matured, it has developed into an all-purpose scripting language capable of doing everything from managing processes to communicating via TCP/IP over a network. Perl is free software originally developed by Larry Wall, the Unix guru who brought us the rn newsreader and various popular tools, such as patch. Today it is maintained by Larry and a group of volunteers.
Perl’s main strength is that it incorporates the most widely used features of other powerful languages, such as C, sed, awk, and various shells, into a single interpreted script language. In the past, performing a complicated job required juggling these various languages into complex arrangements, often entailing sed scripts piping into awk scripts piping into shell scripts and eventually piping into a C program. Perl gets rid of the common Unix philosophy of using many small tools to handle small parts of one large problem. Instead, Perl does it all, and it provides many different ways of doing the same thing. In fact, this chapter was written by an artificial intelligence program developed in Perl. (Just kidding, Larry.)
Perl provides a nice programming interface to many features that were sometimes difficult to use in other languages. For example, a common task of many Unix system administration scripts is to scan a large amount of text, cut fields out of each line of text based on a pattern (usually represented as a regular expression), and produce a report based on the data. Let’s say we want to process the output of the Unix last command, which displays a record of login times for all users on the system, like so:
mdw       ttypf    loomer.vpizza.co Sun Jan 16 15:30 - 15:54  (00:23)
larry     ttyp1    muadib.oit.unc.e Sun Jan 16 15:11 - 15:12  (00:00)
johnsonm  ttyp4    mallard.vpizza.c Sun Jan 16 14:34 - 14:37  (00:03)
jem       ttyq2    mallard.vpizza.c Sun Jan 16 13:55 - 13:59  (00:03)
linus     FTP      kruuna.helsinki. Sun Jan 16 13:51 - 13:51  (00:00)
linus     FTP      kruuna.helsinki. Sun Jan 16 13:47 - 13:47  (00:00)
If we want to count up the total login time for each user (given in parentheses in the last field), we could write a sed script to splice the time values from the input, an awk script to sort the data for each user and add up the times, and another awk script to produce a report based on the accumulated data. Or, we could write a somewhat complex C program to do the entire task — complex because, as any C programmer knows, text processing functions within C are somewhat limited.
However, you can easily accomplish this task with a simple Perl script. The facilities of I/O, regular-expression pattern matching, sorting by associative arrays, and number crunching are all easily accessed from a Perl program with little overhead. Perl programs are generally short and to the point, without a lot of technical mumbo jumbo getting in the way of what you want your program to actually do.
Using Perl under Linux is really no different than on other Unix systems. Several good books on Perl already exist, including the O’Reilly books Programming Perl, by Larry Wall, Randal L. Schwartz, and Tom Christiansen; Learning Perl, by Randal L. Schwartz and Tom Christiansen; Advanced Perl Programming by Sriram Srinivasan; and Perl Cookbook by Tom Christiansen and Nathan Torkington. Nevertheless, we think Perl is such a great tool that it deserves something in the way of an introduction. After all, Perl is free software, as is Linux; they go hand in hand.
What we really like about Perl is that it lets you immediately jump to the task at hand; you don’t have to write extensive code to set up data structures, open files or pipes, allocate space for data, and so on. All these features are taken care of for you in a very friendly way.
The example of login times, just discussed, serves to introduce many of the basic features of Perl. First, we’ll give the entire script (complete with comments) and then a description of how it works. This script reads the output of the last command (see the previous example) and prints an entry for each user on the system, describing the total login time and number of logins for each. (Line numbers are printed to the left of each line for reference):
 1  #!/usr/bin/perl
 2
 3  while (<STDIN>) {   # While we have input...
 4    # Find lines and save username, login time
 5    if (/^(\S*)\s*.*\((.*):(.*)\)$/) {
 6      # Increment total hours, minutes, and logins
 7      $hours{$1} += $2;
 8      $minutes{$1} += $3;
 9      $logins{$1}++;
10    }
11  }
12
13  # For each user in the array...
14  foreach $user (sort(keys %hours)) {
15    # Calculate hours from total minutes
16    $hours{$user} += int($minutes{$user} / 60);
17    $minutes{$user} %= 60;
18    # Print the information for this user
19    print "User $user, total login time ";
20    # Perl has printf, too
21    printf "%02d:%02d, ", $hours{$user}, $minutes{$user};
22    print "total logins $logins{$user}.\n";
23  }
Line 1 tells the loader that this script should be executed through Perl, not as a shell script. Line 3 is the beginning of the program. It is the head of a simple while loop, which C and shell programmers will be familiar with: the code within the braces from lines 4-10 should be executed while a certain expression is true. However, the conditional expression <STDIN> looks funny. Actually, this expression reads a single line from the standard input (represented in Perl through the name STDIN) and makes the line available to the program. This expression returns a true value whenever there is input.
Perl reads input one line at a time (unless you tell it to do otherwise). It also reads by default from standard input, again, unless you tell it to do otherwise. Therefore, this while loop will continuously read lines from standard input, until there are no lines left to be read.
The evil-looking mess on line 5 is just an if statement. As with most programming languages, the code within the braces (on lines 7-9) will be executed if the expression that follows the if is true. But what is the expression between the parentheses? Those readers familiar with Unix tools, such as grep and sed, will peg this immediately as a regular expression: a cryptic but useful way to represent a pattern to be matched in the input text. Regular expressions are usually found between delimiting slashes (/.../).
This particular regular expression matches lines of the form:
mdw ttypf loomer.vpizza.co Sun Jan 16 15:30 - 15:54 (00:23)
This expression also “remembers” the username (mdw) and the total login time for this entry (00:23). You needn’t worry about the expression itself; building regular expressions is a complex subject. For now, all you need to know is that this if statement finds lines of the form given in the example, and splices out the username and login time for processing.
The username is assigned to the variable $1, the hours to the variable $2, and the minutes to $3. (Variables in Perl begin with the $ character, but unlike the shell, the $ must be used when assigning to the variable as well.) This assignment is done by the regular expression match itself (anything enclosed in parentheses in a regular expression is saved for later use to one of the variables $1 through $9).
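To see the capture variables at work outside the full script, here is a minimal sketch that applies the same regular expression to a single line. The helper name parse_last_line is our own invention for illustration, not part of Perl or of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the username, hours, and minutes out of one
# line of `last` output. Returns an empty list when the line doesn't match.
sub parse_last_line {
    my ($line) = @_;
    if ($line =~ /^(\S*)\s*.*\((.*):(.*)\)$/) {
        # $1, $2, $3 hold the text matched by the three capture groups
        return ($1, $2, $3);
    }
    return;
}

my @fields = parse_last_line(
    "mdw ttypf loomer.vpizza.co Sun Jan 16 15:30 - 15:54 (00:23)");
print join("|", @fields), "\n";    # prints mdw|00|23
```

Escaped parentheses, \( and \), match literal parentheses in the input, while the unescaped ones only mark the pieces to be remembered.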
Lines 6-9 actually process these three pieces of information. And they do it in an interesting way: through the use of an associative array. Whereas a normal array is indexed with a number as a subscript, an associative array is indexed by an arbitrary string. This lends itself to many powerful applications; it allows you to associate one set of data with another set of data gathered on the fly. In our short program, the keys are the usernames, gathered from the output of last. We maintain three associative arrays, all indexed by username: hours, which records the total number of hours the user logged in; minutes, which records the number of minutes; and logins, which records the total number of logins.
As an example, referencing the variable $hours{'mdw'} returns the total number of hours that the user mdw was logged in. Similarly, if the username mdw is stored in the variable $1, referencing $hours{$1} produces the same effect.
In lines 6-9, we increment the values of these arrays according to the data on the present line of input. For example, given the input line:
jem ttyq2 mallard.vpizza.c Sun Jan 16 13:55 - 13:59 (00:03)
line 7 increments the value of the hours array, indexed with $1 (the username, jem), by the number of hours that jem was logged in (stored in the variable $2). The Perl increment operator += is equivalent to the corresponding C operator. Line 8 increments the value of minutes for the appropriate user similarly. Line 9 increments the value of the logins array by one, using the ++ operator.
Associative arrays are one of the most useful features of Perl. They allow you to build up complex databases while parsing text. It would be nearly impossible to use a standard array for this same task. We would first have to count the number of users in the input stream and then allocate an array of the appropriate size, assigning a position in the array to each user (through the use of a hash function or some other indexing scheme). An associative array, however, allows you to index data directly using strings and without regard for the size of the array in question. (Of course, performance issues always arise when attempting to use large arrays, but for most applications this isn’t a problem.)
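As a small illustration of indexing by string without sizing anything in advance, the following sketch tallies logins from a list of usernames. The function name count_logins and the sample names are assumptions made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how often each username appears.
# The hash grows on its own as new keys show up.
sub count_logins {
    my @users = @_;
    my %logins;
    $logins{$_}++ for @users;    # a missing entry starts out as 0
    return %logins;
}

my %logins = count_logins(qw(mdw larry linus mdw linus linus));
foreach my $user (sort keys %logins) {
    print "$user: $logins{$user}\n";
}
# prints:
# larry: 1
# linus: 3
# mdw: 2
```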
Let’s move on. Line 14 uses the Perl foreach statement, which you may be used to if you write shell scripts. (The foreach loop actually breaks down into a for loop, much like that found in C.) Here, in each iteration of the loop, the variable $user is assigned the next value in the list given by the expression sort(keys %hours). %hours simply refers to the entire associative array hours that we have constructed. The function keys returns a list of all the keys used to index the array, which is in this case a list of usernames. Finally, the sort function sorts the list returned by keys. Therefore, we are looping over a sorted list of usernames, assigning each username in turn to the variable $user.
Lines 16 and 17 simply correct for situations where the number of minutes is 60 or more: line 16 determines the total number of hours contained in the minutes entry for this user and increments hours accordingly, and line 17 keeps only the leftover minutes. The int function returns the integral portion of its argument. (Yes, Perl handles floating-point numbers as well; that’s why use of int is necessary.)
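The arithmetic is easy to check by hand. The sketch below wraps the same two statements in a function of our own naming (normalize_time is hypothetical) and runs it on 1 hour and 135 minutes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fold whole hours out of a minute count.
sub normalize_time {
    my ($hours, $minutes) = @_;
    $hours   += int($minutes / 60);   # 135 / 60 is 2.25; int gives 2
    $minutes %= 60;                   # 135 % 60 leaves 15
    return ($hours, $minutes);
}

my ($h, $m) = normalize_time(1, 135);
print "$h hours, $m minutes\n";       # prints 3 hours, 15 minutes
```

Without int, the division would add 2.25 to the hour count rather than 2.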
Finally, lines 19-22 print the total login time and number of logins for each user. The simple print function just prints its arguments, like the awk function of the same name. Note that variable evaluation can be done within a print statement, as on lines 19 and 22. However, if you want to do some fancy text formatting, you need to use the printf function (which is just like its C equivalent). In this case, we wish to set the minimum output length of the hours and minutes values for this user to 2 characters wide, and to left-pad the output with zeroes. To do this, we use the printf command on line 21.
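The effect of the %02d conversion can be seen in isolation with a short sketch. The function format_time is a name we made up; sprintf is used instead of printf only so the result can be returned as a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: render hours and minutes as HH:MM.
# %02d means "at least two digits, padded on the left with zeroes".
sub format_time {
    my ($hours, $minutes) = @_;
    return sprintf("%02d:%02d", $hours, $minutes);
}

print format_time(1, 7), "\n";      # prints 01:07
print format_time(153, 3), "\n";    # prints 153:03
```

The width is a minimum, not a limit, so a three-digit hour count is printed in full.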
If this script is saved in the file logintime, we can execute it as follows:
papaya$ last | logintime
User johnsonm, total login time 01:07, total logins 11.
User kibo, total login time 00:42, total logins 3.
User linus, total login time 98:50, total logins 208.
User mdw, total login time 153:03, total logins 290.
papaya$
Of course, this example doesn’t serve well as a Perl tutorial, but it should give you some idea of what it can do. We encourage you to read one of the excellent Perl books out there to learn more.
The previous example introduced the most commonly used Perl features by demonstrating a living, breathing program. There is much more where that came from — in the way of both well-known and not-so-well-known features.
As we mentioned, Perl provides a report-generation mechanism beyond the standard print and printf functions. Using this feature, the programmer defines a report “format” that describes how each page of the report will look. For example, we could have included the following format definition in our example:
format STDOUT_TOP =
User            Total login time      Total logins
--------------  --------------------  -------------------
.
format STDOUT =
@<<<<<<<<<<<<<  @<<<<<<<<             @####
$user,          $thetime,             $logins{$user}
.
The STDOUT_TOP definition describes the header of the report, which will be printed at the top of each page of output. The STDOUT format describes the look of each line of output. Each field is described beginning with the @ character; @<<<< specifies a left-justified text field, and @#### specifies a numeric field. The line below the field definitions gives the names of the variables to use in printing the fields. Here, we have used the variable $thetime to store the formatted time string.
To use this report for the output, we replace lines 19-22 in the original script with the following:
$thetime = sprintf("%02d:%02d", $hours{$user}, $minutes{$user});
write;
The first line uses the sprintf function to format the time string and save it in the variable $thetime; the second line is a write command that tells Perl to go off and use the given report format to print a line of output.
Using this report format, we’ll get something looking like this:
User            Total login time      Total logins
--------------  --------------------  -------------------
johnsonm        01:07                    11
kibo            00:42                     3
linus           98:50                   208
mdw             153:03                  290
Using other report formats we can achieve different (and better-looking) results.
Perl comes with a huge number of modules that you can plug in to your programs for quick access to very powerful features. A popular online archive called CPAN (for Comprehensive Perl Archive Network) contains even more modules: net modules that let you send mail and carry on with other networking tasks, modules for dumping data and debugging, modules for manipulating dates and times, modules for math functions — the list could go on for pages.
If you hear of an interesting module, check first to see whether it’s already loaded on your system. You can look at the directories where modules are located (probably under /usr/lib/perl5) or just try loading in the module and see if it works. Thus, the command:
$ perl -MCGI -e 1
Can't locate CGI.pm in @INC...
gives you the sad news that the CGI.pm module is not on your system. CGI.pm is popular enough to be included in the standard Perl distribution, and you can install it from there, but for many modules you will have to go to CPAN (and some don’t make it into CPAN either). CPAN, which is maintained by Jarkko Hietaniemi and Andreas König, resides on dozens of mirror sites around the world because so many people want to download its modules. The easiest way to get onto CPAN is to visit http://www.perl.com/CPAN-local/.
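The same check can be made from inside a script. The following is a minimal sketch, assuming nothing beyond core Perl: it tries to load each module by name and reports whether that worked. The helper have_module is our own invention, and the result naturally depends on what is installed on your system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the named module can be loaded, 0 if not.
sub have_module {
    my ($name) = @_;
    # Turn a name like Date::Manip into the path Date/Manip.pm
    (my $file = "$name.pm") =~ s{::}{/}g;
    # eval traps the "Can't locate ..." error that require raises on failure
    return eval { require $file; 1 } ? 1 : 0;
}

foreach my $module (qw(strict Date::Manip Mail::Mailer)) {
    printf "%-14s %s\n", $module,
        have_module($module) ? "installed" : "not found";
}
```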
The following program — which we wanted to keep short, and therefore neglected to find a useful task to perform — shows two modules, one that manipulates dates and times in a sophisticated manner and another that sends mail. The disadvantage of using such powerful features is that a huge amount of code is loaded from them, making the runtime size of the program quite large:
#! /usr/local/bin/perl

# We will illustrate Date and Mail modules
use Date::Manip;
use Mail::Mailer;

# Illustration of Date::Manip module
if ( Date_IsWorkDay( "today", 1) ) {
    # Today is a workday
    $date = ParseDate( "today" );
}
else {
    # Today is not a workday, so choose next workday
    $date = DateCalc( "today", "+ 1 business day" );
}

# Convert date from compact string to readable string like "April 8"
$printable_date = UnixDate( $date, "%B %e" );

# Illustration of Mail::Mailer module
# Single quotes keep Perl from treating @you_want_to as an array
my ($to) = 'the_person@you_want_to.mail_to';
my ($from) = '[email protected]';
$mail = Mail::Mailer->new;
$mail->open(
    {
     From => $from,
     To => $to,
     Subject => "Automated reminder",
    }
);
print $mail <<"MAIL_BODY";
If you are at work on or after
$printable_date,
you will get this mail.
MAIL_BODY
$mail->close;
# The mail has been sent! (Assuming there were no errors.)
The reason packages are so easy to use is that Perl added object-oriented features in version 5. The Date module used in the previous example is not object-oriented, but the Mail module is. The $mail variable in the example is a Mailer object, and it makes mailing messages straightforward through methods like new, open, and close.
To do some major task like parsing HTML, just read in the proper CGI package and issue a new command to create the proper object; all the functions you need for parsing HTML will then be available.
If you want to give a graphical interface to your Perl script, you can use the Tk module, which originally was developed for use with the Tcl language, the Gtk module, which uses the newer GIMP Toolkit (GTK), or the Qt module, which uses the Qt toolkit that also forms the base of the KDE. The book Learning Perl/Tk by Nancy Walsh (O’Reilly) shows you how to do graphics with the Perl/Tk module.
Another abstruse feature of Perl is its ability to (more or less) directly access several Unix system calls, including interprocess communications. For example, Perl provides the functions msgctl, msgget, msgsnd, and msgrcv from System V IPC. Perl also supports the BSD socket implementation, allowing communications via TCP/IP directly from a Perl program. No longer is C the exclusive language of networking daemons and clients. A Perl program loaded with IPC features can be very powerful indeed, especially considering that many client-server implementations call for advanced text processing features such as those provided by Perl. It is generally easier to parse protocol commands transmitted between client and server from a Perl script, rather than write a complex C program to do the work.
As an example, take the well-known SMTP daemon, which handles the sending and receiving of electronic mail. The SMTP protocol uses text commands such as MAIL FROM and RCPT TO to enable the client to communicate with the server. Either the client or the server, or both, can be written in Perl, and can have full access to Perl’s text- and file-manipulation features as well as the vital socket communication functions.
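To give a feel for the text-processing half of such a program, here is a rough sketch of how a server might pick apart incoming command lines, using the standard SMTP command names MAIL FROM and RCPT TO. The function parse_smtp_command and the example.org address are made up for illustration; socket setup is left out entirely, and a real server would need to handle far more of the protocol.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one SMTP-style command line into a verb and
# its argument. Returns an empty list for lines it doesn't recognize.
sub parse_smtp_command {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                       # strip the line terminator
    if ($line =~ /^MAIL FROM:\s*<([^>]*)>/i) {  # sender address in <...>
        return ("MAIL", $1);
    }
    if ($line =~ /^RCPT TO:\s*<([^>]*)>/i) {    # recipient address in <...>
        return ("RCPT", $1);
    }
    if ($line =~ /^([A-Za-z]{4})\b\s*(.*)$/) {  # other four-letter verbs
        return (uc $1, $2);
    }
    return;
}

my ($verb, $arg) = parse_smtp_command('MAIL FROM:<mdw@example.org>' . "\r\n");
print "$verb $arg\n";    # prints MAIL mdw@example.org
```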
Perl is a fixture of CGI programming — that is, writing small programs that run on a web server and help web pages become more interactive.
One of the features of (some might say “problems with”) Perl is the ability to abbreviate, and obfuscate, code considerably. In the first script, we have used several common shortcuts. For example, input into the Perl script is read into the variable $_. However, most operations act on the variable $_ by default, so it’s usually not necessary to reference $_ by name.
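The following sketch shows a few built-in operations falling back on $_ when no argument is given. The function first_fields and its sample input are assumptions made up for this example; each comment spells out the longhand form.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first whitespace-separated field of
# every nonblank line.
sub first_fields {
    my @lines = @_;              # private copy, safe to modify
    my @names;
    for (@lines) {               # each element is aliased to $_ in turn
        chomp;                   # same as chomp($_)
        next unless /\S/;        # same as $_ =~ /\S/
        my ($first) = split;     # same as split(' ', $_)
        push @names, $first;
    }
    return @names;
}

print join(",", first_fields("mdw ttypf\n", "\n", "larry ttyp1\n")), "\n";
# prints mdw,larry
```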
Perl also gives you several ways of doing the same thing, which can, of course, be either a blessing or a curse depending on how you look at it. In Programming Perl, Larry Wall gives the following example of a short program that simply prints its standard input. All the following statements do the same thing:
while ($_ = <STDIN>) { print; }
while (<STDIN>) { print; }
for (;<STDIN>;) { print; }
print while $_ = <STDIN>;
print while <STDIN>;
The programmer can use the syntax most appropriate for the situation at hand.
Perl is popular, and not just because it is useful. Because Perl provides much in the way of eccentricity, it gives hackers something to play with, so to speak. Perl programmers are constantly outdoing each other with trickier bits of code. Perl lends itself to interesting kludges, neat hacks, and both very good and very bad programming. Unix programmers see it as a challenging medium to work with — because Perl is relatively new, not all the possibilities have been exploited. Even if you find Perl too baroque for your taste, there is still something to be said for its artistry. The ability to call oneself a “Perl hacker” is a point of pride within the Unix community.
[48] Truth be told, Perl also exists now on other systems, such as Windows. But it is not even remotely as well-known and ubiquitous there as it is on Linux.