11.3. Making Things Better

Here we discuss a few ways of improving your program's use of resources. You'll find terse descriptions of many more in Programming Perl (Chapter 8 in the second edition, Chapter 24 in the third).

11.3.1. Improving Execution Speed

It is common for improvements in execution speed to cost dearly in the readability department. Adding lookup caches, rearranging statement order, and inlining subroutine calls conspire to make a program less maintainable; therefore, as we have said before, you do these things only when you have to.

Once you've identified the code that is taking the most time, the next step is to optimize that code. A few suggestions follow.

Inside a loop, take every opportunity to get out of it as early as possible.


I was just now writing some code to parse a log file containing lines like

20010208001507:POP3-Server:[198.137.241.43]:gwbush

The first statement in the loop was:

next unless my ($year, $month, $day, $hour, $minute,
                $second, $user)
  = /(\d\d\d\d)(\d\d)(\d\d)(\d\d)(\d\d)(\d\d):POP3-Server:.*:(.*)/;

This may have been a natural translation of the obvious solution, but it was a tad slow on a 3 million line log file! Since many input lines did not match this pattern, I was able to speed up the elimination of those lines by inserting the following line of code at the beginning of the loop:

next unless index ($_, ':POP3-Server:') > 0;

(but not before making the mistake of leaving the parentheses out—can you see why that was wrong?).


This works because the string :POP3-Server: still uniquely identifies the lines we want. We can do more; in general, simple string operations work faster than regular expressions, so we can follow that line of code with these statements instead of the one that originally began the loop:

my ($year, $month, $day, $hour, $minute, $second)
  = unpack "A4A2A2A2A2A2", $_;
my ($user) = /:POP3-Server:.*:(.*)/;

However, this benchmarks no faster than the previous code. We can use one more piece of information about the line to get an improvement: the last characters before the username are ]:, and they don't occur anywhere else on the line. Benchmarking this code shows a speed increase of 50% over the alternative:

use Benchmark;
$line = <DATA>;
timethese (200000,
   {
   'unpack' => q{
      my ($year, $month, $day, $hour, $minute, $second)
         = unpack "A4A2A2A2A2A2", $line;
      my $user = substr $line, index ($line, ']:')+2;
      },
   'regex' => q{
      my ($year  , $month , $day, $hour,
          $minute, $second, $user)
          = ($line =~
/(\d\d\d\d)(\d\d)(\d\d)(\d\d)(\d\d)(\d\d):POP3-Server:.*:(.*)/
            );
      }
   });
__END__
20010208001507:POP3-Server:[198.137.241.43]:gwbush

Benchmark: timing 200000 iterations of regex, unpack...
     regex: 15 wallclock secs (15.63 usr +  0.02 sys =
                               15.65 CPU)
    unpack: 10 wallclock secs (10.42 usr +  0.00 sys =
                               10.42 CPU)

Memoize function calls with common arguments.


Using the CPAN module Memoize (http://search.cpan.org/search?dist=Memoize) by Mark-Jason Dominus to transparently remember the results of any function call gives you an easy way to trade memory for speed. You could do the same thing with slightly less execution overhead in your program by creating appropriate data structures and checking for the existence of an entry there before calling a function, but Memoize allows you to swap this capability in and out at the cost of a single line, and it can even save the results to disk to make them persistent.
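
For comparison, here is a minimal sketch of the hand-rolled alternative; the cache hash and the lookup function are our own illustrative names, not part of Memoize:

# expensive_lookup stands in for any costly, repeatable computation.
sub expensive_lookup
   {
   my $host = shift;
   return (gethostbyname $host)[0] || 'undefined';
   }

my %cache;
sub cached_lookup
   {
   my $host = shift;
   # Compute only on a cache miss; this simple scheme assumes the
   # answer depends on the argument alone and never goes stale.
   $cache{$host} = expensive_lookup($host) unless exists $cache{$host};
   return $cache{$host};
   }

With Memoize, the wrapper disappears: memoize('expensive_lookup') installs the same check around the original function in one line.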

We can illustrate this with an example that looks up the canonical Internet host name for various servers. It might be quite common to call gethostbyname for the same input more than once in, for instance, an application processing log files, and it's reasonable to assume that within a short period—say, a few minutes—the results won't change. Here's a benchmark:

use Benchmark;
use Memoize;

sub get_host_info
   {
     (gethostbyname shift)[0] || 'undefined';
   }

sub get_hosts
   {
   for (1..shift)
      {
      get_host_info "www.$_.com" for 'aaa' .. 'abc';
      }
   }

get_hosts(1);   # Prime the local cache

timethese (4,
   {
   bare => q{ get_hosts(10) },
   memo => q{ use Memoize;
              memoize q(get_host_info); get_hosts(10) }
   });

For benchmark comparison purposes only, we first do a lookup on all the hosts so that the local nameserver has a chance to put the information in its cache. That stops the second benchmark from having an unfair advantage. Then we look up the host information for several hosts 10 times each. The results are:

timing 4 iterations of bare, memo...
      bare: 533 wallclock secs ( 0.70 usr +  0.65 sys =
                                1.35 CPU)
      memo: 11 wallclock secs ( 0.28 usr +  0.01 sys =
                                0.29 CPU)
     (warning: too few iterations for a reliable count)

Four iterations is just enough to smooth out any discrepancies caused by random fluctuations, but the memoized chunk still executes fast enough to cause Benchmark to warn us. In this case, the point has already been made.

Do everything possible in Perl rather than calling external programs.


Calling out to an external program may be de rigueur in shell scripts, but it costs time (spawning another process) that you often don't need to expend in a Perl script. There is no need to call awk, sed, cut, head, grep, or similar text filters unless you're not under time pressure and it happens to read better.
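
For instance, filtering a log through an external grep spawns an extra process (and usually a shell) that a plain Perl loop avoids entirely; the file name here is made up for illustration:

# External filter: one fork/exec per backtick call.
print `grep POP3-Server maillog`;

# Pure Perl: same output, no extra process.
open LOG, '<', 'maillog' or die "can't open maillog: $!";
while (<LOG>)
   {
   print if /POP3-Server/;
   }
close LOG;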

Be aware of where the dividing line lies, though. While you can use modules to send e-mail for you, opening a pipe to a mail program that takes care of queuing it is likely to be both faster and easier to read.
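
For example, a piped open to a queuing mailer stays both fast and legible; the sendmail path and the recipient below are illustrative, so adjust them for your site:

# Let the mailer handle queuing and retries for us.
open MAIL, '| /usr/sbin/sendmail -oi -t'
   or die "can't start sendmail: $!";
print MAIL <<'END';
To: postmaster@example.com
Subject: nightly log summary

All quiet on the POP3 front.
END
close MAIL or warn "sendmail exited with status $?";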

11.3.2. Improving Memory Usage

Just as the main CPU performance improvements are found in loops, the main memory gains are found in large data structures, so focus on how big your arrays and hashes get. Don't try to economize by using scalars to save the data instead; that takes up even more space because of the overhead attached to any variable in Perl. Prefer hashes to scalars and arrays to hashes; here's why:

#!/usr/bin/perl -wl
sub procsize
   {
   (split /\s+/, `ps -lp $$ | tail -1`)[9];
   # May not be 9 on your system
   }
sub randstring { join '', map chr rand 255, 1..100 }

my $base = procsize();

eval "$$_ = randstring" for 'aaa' .. 'zzz';
print "Scalars: ", procsize() - $base;  $base = procsize;

$hash{$_} = randstring   for 'aaa' .. 'zzz';
print "Hash   : ", procsize() - $base;  $base = procsize;

push @array, randstring  for 'aaa' .. 'zzz';
print "Array  : ", procsize() - $base;

First we create a large number of scalars, each holding a random 100-character string. Then we create a hash with three-character keys and the same number and lengths of values, and populate an array similarly. After each step, we measure how much our process size has grown by looking at what ps prints in the SZ column. (It may be in a different column on your system; on non-Unix boxes you'll need to use whatever support your OS gives for measuring process size.) The output we got was

Scalars: 1749
Hash   : 682
Array  : 592

Packing multiple values in a scalar avoids the overhead of creating a new one. So instead of

push @array, $value;
#...
foreach $element (@array) { ... }

you can do

$scalar .= "$separator$value";
#...
while ($scalar =~ s/\Q$separator\E(.*?)(?=\Q$separator\E|\z)//s)
   { my $element = $1; ... }

(Or, if speed is an issue, cook up something a little wordier using substr and index, as sketched below. If you would otherwise be forming many small arrays, you can turn each one into a scalar that you split on $separator one at a time.) This is a last-ditch optimization that costs dearly in readability; comment the heck out of it if you have to use it.
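
Here is one way that wordier substr/index walk might look, assuming only that $separator never occurs inside the stored values:

my $pos = length $separator;          # skip the leading separator
while ($pos <= length $scalar)
   {
   my $next = index $scalar, $separator, $pos;
   $next = length $scalar if $next < 0;   # last element runs to the end
   my $element = substr $scalar, $pos, $next - $pos;
   # ... use $element ...
   $pos = $next + length $separator;
   }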

Don't form unnecessary lists.


Some operations generate large lists when all you want to do is go through the elements one at a time. Structured data can usually be parsed a line at a time:

while (<INPUT>) { ... }

rather than reading the whole file:

{
local $/;
$file = <INPUT>;
}

Although for data that isn't structured by lines, slurping the whole file may really be your best choice. The worst of both worlds would be to use the wrong loop statement:

for my $line (<INPUT>) { ... }

because that would read all the lines of the file into a list and then iterate through them one at a time.

If you look at that example with scorn, thinking, “But I would never do that,” then how about this?

for my $file (grep !/^\.\.?$/, readdir DIR) { ... }

Looks like an innocent enough standard idiom, and indeed it's quite safe, as long as you don't have any directories containing a humongous number of entries. But if you do, be aware that it requires temporary storage for a list nearly twice the size of those entries. If you'd like to bulletproof the code against that contingency, try this:

while (defined (my $file = readdir DIR))
   {
   next if $file =~ /^\.\.?$/;
   ...
   }

If you have very large hashes, consider tying them to a database file to trade speed and disk space for memory:

use DB_File;        # There are several alternatives to this
my %large_hash;
tie %large_hash,    # The hash to tie
    'DB_File',      # The type of database to use
    'database';     # The file to store the database in
# use %large_hash...
untie %large_hash;  # Close the database file

However, you could easily negate your savings by using a common idiom:

for my $key (keys %large_hash) { ... }  # wrong!

because this will cause Perl to form in memory the list of all the keys. Use each instead, because it returns only two things at a time (or one, in scalar context):

while (my ($key, $value) = each %large_hash) { ... }

(Although you shouldn't add or delete hash elements inside such a loop; deleting the key most recently returned by each is the only safe change.) If your hash values contain references (your hash keys can't, unless you're using the Tie::RefHash module), then you should use Gurusamy Sarathy's MLDBM (MultiLevel Database) module from CPAN (http://search.cpan.org/search?dist=MLDBM). Otherwise, the data structures your references point to will not get stored in the database.
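
A minimal sketch of MLDBM in use follows; the file name and data are invented, and note the documented catch that a nested structure must be stored back wholesale for a change to reach the disk:

use MLDBM qw(DB_File Storable);   # serialize values with Storable
use Fcntl;

my %large_hash;
tie %large_hash, 'MLDBM', 'users.db', O_CREAT|O_RDWR, 0644
   or die "can't tie users.db: $!";

$large_hash{gwbush} = { logins => 3, last_host => '198.137.241.43' };

# Read out a copy, change it, and store the whole value back;
# modifying $entry->{logins} alone would not be written through.
my $entry = $large_hash{gwbush};
$entry->{logins}++;
$large_hash{gwbush} = $entry;

untie %large_hash;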

Avoid unnecessary copying.


Perl loves to copy data around. When you write a subroutine, say

sub bake
   {
   my ($temperature, $how_long) = @_;
   [...]

it's convenient to think of its first line as a kind of prototype, but in fact it's copying data. You may need to rethink this approach when passing huge variables to a subroutine. Caveat: While it's fine to get your data directly from @_, be aware that @_ is an alias for your original arguments; be sure you don't modify any of them without documenting the action. Furthermore, named parameter passing works by copying @_ into a hash; you'll have to come up with another way to recognize nonpositional arguments.
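
When an argument is large, passing a reference copies only the reference, not the data; the subroutine and data below are invented for illustration:

# Only the one reference value travels through @_; the half-million
# numbers themselves are never copied.
sub total
   {
   my $numbers = shift;          # a reference to the caller's array
   my $sum = 0;
   $sum += $_ for @$numbers;
   return $sum;
   }

my @huge_list = (1 .. 500_000);
print total(\@huge_list), "\n";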

Sean M. Burke points out another application of this principle:

Once I wrote a program to do some pretty horrific data conversion on an idiosyncratic markup language. Now, line-at-a-time processing was right out—the notation was SGML-like, i.e., freeform with its white space. Most or all of the manipulation was of the form

$lexicon =~ s{<subhead>(.*?)</subhead>}
             {handle_subhead($1)}ieg;

The problem was that $lexicon was quite large (250+ KB), and applying lots and lots and LOTS of such regex search and replace operations took forever! Why? Because all the modifications of a large scalar involved copying it each time. One day it occurred to me that no replacement operation involved crossing entry boundaries (i.e., record boundaries), so I changed the program to chop the lexicon into entries, and then applied the string replacements to each of those via a foreach. Since each entry was only ~2KB, there was MUCH less painful swapping, so the program ran about 50 times faster!
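
A sketch of that restructuring might look like the following; the <entry> tag is invented here to stand in for whatever marks a record boundary in the real markup:

# Split the big scalar into per-entry chunks, transform each small
# chunk, then join the results back together.
my @entries = split /(?=<entry>)/, $lexicon;
for my $entry (@entries)
   {
   $entry =~ s{<subhead>(.*?)</subhead>}
              {handle_subhead($1)}ieg;
   }
$lexicon = join '', @entries;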

11.3.3. Improving Disk Space Usage

Avoid temporary files.


If your operating system supports pipes, use them. Even if your operating system doesn't support them, Perl on your system might; try it.[5]

[5] As of version 5.6.0, fork in Perl works on Windows.

If one part of your program writes a temporary file for another part to read, you can change this to use a pipe instead, but only if you can get those two parts of your program to execute reasonably concurrently. Otherwise you'll fill up the pipe when you write to it, and your program will twiddle its virtual thumbs waiting for the pipe to be unclogged. (If you've ever needed a plumber on the weekend, you know the feeling.) If you've coded something like this, consider breaking it into two separate programs that can be run concurrently and piped together.
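
As a sketch of the pipe version, the producing half can write straight to the consumer; 'summarize' is a made-up name for whatever program used to read the temporary file:

# Producer side: records go down the pipe as they are generated,
# instead of being parked in a temporary file first.
open LOG,     '<', 'maillog'        or die "can't open maillog: $!";
open SUMMARY, '| summarize --daily' or die "can't start summarize: $!";
while (<LOG>)
   {
   print SUMMARY $_ if /:POP3-Server:/;
   }
close SUMMARY or warn "summarize exited with status $?";
close LOG;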
