Chapter 19. Miscellanea

Advice is what we ask for when we already know the answer but wish we didn't.

Erica Jong, How to Save Your Own Life

This chapter contains a handful of guidelines that do not fit cleanly into any of the previous categories. They cover reasons for revision control, the intricacies of interfacing with other languages, the care and feeding of configuration files, the trouble with tied variables, the complexities of caching, flaws in formats, optimal optimization, and the cunning cruelty of cleverness.

Revision Control

Use a revision control system.

Maintaining control over the creation and modification of your source code[124] is utterly essential for robust team-based development. Just as you wouldn't use an editor without an Undo button or a word processor that can't merge documents, so too you shouldn't use a filesystem you can't rewind, or a development environment that can't integrate the work of many contributors.

Programmers make mistakes, and occasionally those mistakes will be catastrophic. They will reformat the disk with the most recent version of the code. Or they'll mistype an editor macro and write zeros all through the source of a critical core module. Or two developers will unwittingly edit the same file at the same time and half their changes will be lost. Revision control systems can prevent those kinds of problems.

Moreover, occasionally the very best debugging technique is to just give up, stop trying to get yesterday's modifications to work correctly, roll the code back to a known stable state, and start over again. Less drastically, comparing the current condition of your code with the most recent stable version from your repository (even just a line-by-line diff) can often help you isolate your recent "improvements" and work out which of them is the problem.
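
For example, under Subversion that compare-and-retreat approach takes only a command or two (a sketch; the filename is invented, and the other systems have equivalent commands):

> svn diff lib/Demo.pm      # Compare working copy against last committed version

> svn revert lib/Demo.pm    # Give up: roll the file back to that version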

Revision control systems such as RCS, CVS, Subversion, Monotone, darcs, Perforce, GNU arch, or BitKeeper can protect against calamities, and ensure that you always have a working fallback position if maintenance goes horribly wrong. The various systems have different strengths and limitations, many of which stem from fundamentally different views on what exactly revision control is. So it's a good idea to audition the various revision control systems and find the one that works best for you. Pragmatic Version Control Using Subversion, by Mike Mason (Pragmatic Bookshelf, 2005) and Essential CVS, by Jennifer Vesperman (O'Reilly, 2003) are useful starting points.

After all, rm * is never more than half a dozen keystrokes away.

Other Languages

Integrate non-Perl code into your applications via the Inline:: modules.

Occasionally you may need to use code resources that are not written in Perl. Most often this will be C code, but it might also be C++, Java, Python, Ruby, Tcl, Scheme, AWK, or even Basic.

The CPAN provides interface tools for hooking all of these languages up to a Perl program, but most of those tools are very challenging to use correctly. By far the most frequently used is xsubpp, a compiler for Perl's own "XS" interface description language (see the perlxstut manpage[125]).

Hooking Perl to C using XS requires you to write a shell .pm module to bootstrap an object file that has been compiled from C code, which was in turn generated by xsubpp from a .xs source file containing pseudo-C annotated with an XS interface description. If that sounds horribly complicated, then you have achieved an accurate understanding of the use of xsubpp. Example 19-1 shows just how much work is involved in even a very simple example.

Example 19-1. Creating a fast C-based rounding subroutine using XS

> cat Round.pm

package Round;
use strict;
use warnings;

use base qw( Exporter DynaLoader );
our $VERSION = '0.01';

our @EXPORT = qw( rounded );

bootstrap Round $VERSION;

1;
__END__

> cat rounded.pl

use Round;
use IO::Prompt;

while (my $num = prompt -num => 'Enter a number: ') {
    print rounded($num), "\n";
}

> cat Round.xs

#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"

MODULE = Round     PACKAGE = Round

int
rounded(arg)
    double  arg
CODE:
    int res;
    /* Round to the nearest integer (halves away from zero)... */
    if (arg > 0.0)      { res = floor(arg + 0.5); }
    else if (arg < 0.0) { res = ceil(arg - 0.5); }
    else                { res = 0; }
OUTPUT:
    res

> cat Makefile.PL

use ExtUtils::MakeMaker;
WriteMakefile(
    NAME         => 'Round',
    VERSION_FROM => 'Round.pm',
    LIBS         => ['-lm'],
);
> perl Makefile.PL

Checking if your kit is complete...
Looks good
Writing Makefile for Round

> make install

umask 0 && cp Round.pm ./blib/Round.pm
perl xsubpp -typemap typemap Round.xs >Round.tc && mv Round.tc Round.c
Please specify prototyping behavior for Round.xs (see perlxs man)
cc -c Round.c
Running Mkbootstrap for Round ()
chmod 644 Round.bs
LD_RUN_PATH="" ld -o ./blib/auto/Round/Round.sl -b Round.o
chmod 755 ./blib/auto/Round/Round.sl
cp Round.bs ./blib/auto/Round/Round.bs
chmod 644 ./blib/PA-RISC1.1/auto/Round/Round.bs
Manifying ./blib/man3/Round.3

It's probably not surprising that most Perl programmers recoil from that approach. A much less demanding alternative is to use the Inline module, which allows you to include standard C code directly in your Perl application. Example 19-2 shows the rounded.pl application from Example 19-1, but now reimplemented using Inline.

Example 19-2. Creating a fast C-based rounding subroutine using Inline::C

> cat rounded.pl

use Inline C => q{
    int
    rounded(double arg) {
        /* Round to the nearest integer (halves away from zero)... */
        if (arg > 0.0)      { return floor(arg + 0.5); }
        else if (arg < 0.0) { return ceil(arg - 0.5);  }
        else                { return 0;                }
    }
};

use IO::Prompt;
while (my $num = prompt -num => 'Enter a number: ') {
    print rounded($num), "\n";
}

Notice that, in this second version, there's no need for a separate .xs file, or a .pm wrapper module, or any explicit translation process, or compilation step. You just type your C code into your Perl source, as part of the use Inline C statement. Then, when rounded.pl is executed, Inline's import( ) subroutine parses the C code, builds a suitable XS representation, compiles it with the local C compiler, and loads the resulting object file back into the running process.

Of course, if that happened every time you ran the program, any performance benefit you might have gained by writing rounded( ) in C would clearly be more than overwhelmed by the costs of continually reparsing and recompiling. Fortunately, Inline caches any object files it builds and reparses and recompiles the original C source code only when that code actually changes. The very first time rounded.pl was run there would be a noticeable compilation delay, but thereafter the application would start almost instantly and reap the full performance benefits of its partial C implementation.
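
If embedding a wall of C in the middle of your Perl source bothers you, Inline can also load the code from the DATA section at the end of the file instead. Here's the same program restructured that way (a sketch of Inline's documented 'DATA' convention):

use Inline C => 'DATA';

use IO::Prompt;
while (my $num = prompt -num => 'Enter a number: ') {
    print rounded($num), "\n";
}

__DATA__
__C__
int rounded(double arg) {
    /* Round to the nearest integer (halves away from zero)... */
    if (arg > 0.0)      { return floor(arg + 0.5); }
    else if (arg < 0.0) { return ceil(arg - 0.5);  }
    else                { return 0;                }
}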

The second great advantage of using Inline is that there are other CPAN modules that allow it to also handle inlined C++, Java, Python, Ruby, Tcl, Scheme, AWK, bc, Basic, Parrot, and assembler[126]. You can even mix and match multiple languages within the same Perl program. For example, you might implement the number-crunching components in C, the GUI in Java, and the embedded artificial intelligence in Scheme.
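
For instance, here's a rough sketch of that kind of mixing, assuming the separate Inline::Python module is also installed (the function names are invented for illustration):

# Number-crunching in C...
use Inline C => q{
    double mean(double x, double y) { return (x + y) / 2.0; }
};

# Heuristics in Python...
use Inline Python => q{
def classify(score):
    if score >= 50:
        return "high"
    return "low"
};

# Plain Perl glue...
my $score = mean(42, 64);
print classify($score), "\n";    # prints "high"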

Configuration Files

Keep your configuration language uncomplicated.

If you're going to provide a configuration mechanism for your application, make it declarative and minimal. Keep in mind that configuration files are one of the few components of your system that are directly read by end-users, so they need to be simple. They're also one of the few components of your system that are directly written by end-users. So they need to be even simpler.

It's almost always enough to just support some variation on the widely used INI file format: named sections, individual key/value pairs, multiline values, repeated values (or lists), and comments. Example 19-3 shows a typical configuration file with all of those features.

Example 19-3. A simple configuration language

> cat ~/.demorc

[Interface]
# Configurable bits that others will see...

Author: Jan-Yu Eyrie
E-mail: eju@calnet

Disclaimer: This code is provided AS IS, and comes with
          : ABSOLUTELY NO WARRANTY OF ANY KIND WHATSOEVER!
          : It's buggy, slow, and will almost certainly
          : break your computer. Use at your own risk!

[Internals]
# Stuff no-one else sees...

# Look-up path for plug-ins...
lib: ~/lib/perl5
lib: ~/lib/perl
lib: /usr/share/lib/perl

[strict]    # Don't allow malformed inputs
[verbose]   # Report every step
[log]       # And log every transaction

Fancier features like nested or hierarchical data structures, separate syntaxes for lists and scalar values, special notations for boolean configuration variables, or character escapes, are almost always a bad idea. The extra syntax will confuse most users and—worse—make it far more likely that they'll inadvertently type something that's valid, but not what they intended.

Don't use XML as your configuration file format. It may be human-readable, but it's almost never human-comprehensible, and the ratio of mark-up to content is vastly too high. No-one wants to write or maintain a configuration file that looks like Example 19-4.

Example 19-4. An XML-based configuration language

> cat ~/.demoxml

<section name="Interface">
    <!-- Configurable bits that others will see... -->
    <var> <name>author</name> <value>Jan-Yu Eyrie</value> </var>
    <var> <name>e-mail</name> <value>eju@calnet</value> </var>
    <var>
        <name>disclaimer</name>
        <value>
             This code is provided AS IS, and comes with
             ABSOLUTELY NO WARRANTY OF ANY KIND WHATSOEVER!
             It's buggy, slow, and will almost certainly
             break your computer. Use at your own risk!
        </value>
    </var>
</section>
<section name="Internals">
    <!-- Stuff no-one else sees... -->
    <!-- Look-up path for plug-ins... -->
    <var>
        <name>lib</name>
        <value>
            <list>
                <item>~/lib/perl5</item>
                <item>~/lib/perl</item>
                <item>/usr/share/lib/perl</item>
            </list>
        </value>
    </var>
</section>
<section name="strict">    <!-- Don't allow malformed inputs --> </section>
<section name="verbose">   <!-- Report</comment -->              </section>
<section name="log">       <!-- And every step -->               </section>

Whatever format you choose, don't ever parse configuration files "manually" (i.e., with readlines and regexes and loops and all the other associated forms of torture). Don't write your own configuration file parsing module, either; there are already far too many configuration-file tools available on CPAN (see http://search.cpan.org/search?q=Config).

Evaluating the available modules to determine the best fit for your particular application is an onerous task, but in most cases it's probably sufficient to look at just three of them:

Config::General

This is a very powerful configuration file processor that covers almost all configuration needs very thoroughly. The configuration syntax it supports is fixed, but includes XML-ish blocks, Perl-like heredocs, and recursive file inclusion. The resulting configuration file format is therefore considerably more sophisticated than the style recommended earlier.

Config::Std

This configuration language and parser was specifically designed using the criteria suggested in this guideline (and elsewhere in this book). It supports a fixed syntax that is identical to that shown in Example 19-3.

Config::Tiny

This module is by far the smallest and fastest of the three. It parses the basic Windows INI format, which may be too minimal for some applications, as it doesn't support repeated or multiline values.
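
For instance, the entire read-modify-write cycle under Config::Tiny is just (a sketch; the filename is invented, and remember that repeated and multiline values aren't supported):

use Config::Tiny;

# Read the INI file into a two-level hash structure...
my $config = Config::Tiny->read('demo.ini');

# Access and update entries as ordinary nested hash look-ups...
my $author = $config->{Interface}{Author};
$config->{Interface}{Author} = 'Jan-Yu Eyrie';

# Write the modified configuration back out...
$config->write('demo.ini');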

All three of these alternatives allow you to read configuration files into an internal data structure, update that data structure, and then write it back in the appropriate configuration file syntax. For example, the program:

use Config::Std;

# Read in the config file...
read_config '~/.demorc' => my %config;

# Update the library path and disclaimer...
$config{Internals}{lib} = ['~/.plugins', '/lib/share/plugins'];
$config{Interface}{Disclaimer} = 'Whatever, dude!';

# Delete the verbose option...
delete $config{verbose};

# Add a "Limits" section...
$config{Limits}{max_time}  = 1000;
$config{Limits}{max_space} = 1e6;

# Write back config file...
write_config %config;

would update the configuration file shown in Example 19-3, to produce the file shown in Example 19-5.

Example 19-5. The configuration file, reloaded

> cat ~/.demorc

[Interface]
# Configurable bits that others will see...

Author: Jan-Yu Eyrie
E-mail: eju@calnet

Disclaimer: Whatever, dude!

[Internals]
# Stuff no-one else sees...

# Look-up path for plug-ins...
lib: ~/.plugins
lib: /lib/share/plugins

[strict]    # Don't allow malformed inputs

[log]       # And log every transaction

[Limits]

max_time: 1000
max_space: 1000000

Note that Config::Std has an important advantage here compared to most other configuration-file parsers. When it writes back the configuration file, it always preserves the original comments, as well as the order in which sections and their associated configuration variables appear.

Formats

Don't use formats.

The format statement is one of the oldest and most fundamental features of Perl. It implements the original "R" of the "Practical Extraction and Reporting Language".

And even here in the 21st century—where data is more typically restructured, marked-up, CSS'd, JavaScripted, hyperlinked, and finally browsed—a simple text-based report is still often a cleaner and more usable alternative, especially in command-line environments:

> contacts -find 'Damian'

 ==================================
| NAME           | AGE | ID NUMBER |
|----------------+-----+-----------|
| Damian M.      | 40  |    869942 |
| Conway         |     |           |
|==================================|
| COMMENTS                         |
|----------------------------------|
| Do not feed after midnight. Do   |
| not mix with quantum physics. Do |
| not allow subject to talk for    |
| "as long as he likes".           | ==================================

But building such a report with format, as in Example 19-6, has some serious drawbacks, especially in terms of best-practice programming. For a start, formats are statically defined (i.e., specified at compile time), so it's difficult to build a format as your program executes; you have to resort to a string eval (see Chapter 9). Formats rely on global variables for configuration, and on package variables for the data they are to format (see Chapter 5). They also have to write their formatted text to a named filehandle (see Chapter 10). That's three best-practice strikes against formats already.

Example 19-6. Building a report with format

# Predeclare report format with the necessary package variables...
our ($name, $ID, $age, $comments);

format CONTACT =
 ==================================
| NAME           | AGE | ID NUMBER |
|----------------+-----+-----------|
| ^<<<<<<<<<<<<< | ^|| | ^>>>>>>>> |~~
  $name,           $age, $ID,
|==================================|
| COMMENTS                         |
|----------------------------------|
| ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< |~~
  $comments,
 ==================================
.

# and later...

# Grab contact information...

($ID, $name, $age, my $comments_ref) = get_contact($search_string);

# Massage comments into a single string...
$comments = join "
", @{$comments_ref};

# Open output stream to STDOUT and write formatted data...
open *CONTACT, q{>-} or croak "Can't open stdout: $OS_ERROR";
write *CONTACT;

Formats aren't re-entrant or recursive either, so you can't use a format to format the page header of another format. And you can't (easily) pre-format data that's going to be squeezed into a particular field.

Formats only provide a limited (and non-extensible) range of field types; they can't (easily) format bulleted lists, tables, comma'd numbers, monetary amounts, fully justified text, or verbatim data (i.e., with no line filling). And although it's easy to give each formatted page a header, page footers are difficult to produce.

Perl 6 will remedy all these problems and deficiencies by replacing the format declaration with a form( ) function, which can be called at run time to produce a formatted string that can be either printed at once or used in further formatting operations.

This new approach to text-based report generation is also available in Perl 5, via the Perl6::Form CPAN module. For example, a report like that generated by Example 19-6 could also be produced without package variables, named filehandles, or explicit list flattening by the code in Example 19-7.

Example 19-7. Building a report with Perl6::Form

use Perl6::Form;

my ($ID, $name, $age, $comments_ref) = get_contact($search_string);

print form  {bullet=>'*'},
    ' =================================== ',
    '| NAME            | AGE | ID NUMBER |',
    '|-----------------+-----+-----------|',
    '| {[[[[[[[[[[[[[} | {|} | {>>>>>>>} |',
       $name,            $age, $ID,
    '|===================================|',
    '| COMMENTS                          |',
    '|-----------------------------------|',
    '| * {[[[[[[[[[[[[[[[[[[[[[[[[[[[[[} |',
         $comments_ref,
    ' =================================== ',;

Note the use of an asterisk as a bullet to improve the readability of the comments:

 ==================================
| NAME           | AGE | ID NUMBER |
|----------------+-----+-----------|
| Damian M.      | 40  |    869942 |
| Conway         |     |           |
|==================================|
| COMMENTS                         |
|----------------------------------|
| * Do not feed after midnight.    |
| * Do not mix with quantum        |
|   physics.                       |
| * Do not allow subject to talk   |
|   for "as long as he likes".     | ==================================

Ties

Don't tie variables or filehandles.

Ties provide a way of replacing the behaviour of any type of variable, or of a filehandle. The full mechanism is described in the standard perltie documentation[127].
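
The underlying mechanism is simple enough: you write a class whose specially named methods implement the standard variable operations. For example, here's a minimal sketch of a scalar tie that reports every access (Tie::Verbose is an invented name, not a real CPAN module):

package Tie::Verbose;

# tie() calls this to create the object hidden behind the variable...
sub TIESCALAR {
    my ($class, $initial_value) = @_;
    return bless \$initial_value, $class;
}

# Every read of the tied variable becomes a FETCH() call...
sub FETCH {
    my ($self) = @_;
    warn "Value read\n";
    return ${$self};
}

# Every assignment to it becomes a STORE() call...
sub STORE {
    my ($self, $new_value) = @_;
    warn "Value written\n";
    return ${$self} = $new_value;
}

package main;

tie my $watched, 'Tie::Verbose', 0;

$watched = 42;               # warns "Value written"
print $watched, "\n";        # warns "Value read", then prints 42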

Tied variables look exactly like ordinary scalars, arrays, or hashes, but they don't act exactly like them. Their whole purpose is to hide special non-standard behaviour inside a familiar interface. As such, they can be wonderfully Perlish and Lazy, making it easy (for example) to create a variable that automatically cycles through a sequence of values each time it's accessed:

use IO::Prompt;              # for prompt()
use List::Util qw( sum );    # for sum()

# Create a variable whose value cycles from zero to five...
use Tie::Cycle;
tie my $next_index, 'Tie::Cycle', [0..5];

# Read in monthly results...
my @cyclic_buffer;
while (my $next_val = prompt 'Next: ') {
    # Saving them in a six-month cyclic buffer...
    $cyclic_buffer[$next_index] = $next_val;
    # And printing the moving average each month...
    print 'Half-yearly moving average: ',
          sum(@cyclic_buffer)/@cyclic_buffer, "\n";
}

Every time $next_index is used as an index into @cyclic_buffer, it moves on to the next value in [0..5]. When there are no more values, it loops back to zero and starts again. So $cyclic_buffer[$next_index] is always the next element in the cyclic buffer, even though $next_index is never explicitly incremented or reset.

And that's the problem. If $next_index had been tied further away from the loop[128], it might easily seem to some maintainer that every new value is being assigned into the same element of the buffer. Tied variables make any code that uses them less maintainable, because they make normal variable operations behave in unexpected, non-standard ways.

They're also less efficient. A tied variable is actually a wrapper around some blessed object, and so every access on any tied variable requires a method call (instead of being implemented in highly optimized C code).

Finally, tied variables can sometimes make your code less robust, as it's very easy to subtly misimplement some aspect of the expected variable behaviour.

Tied variables can always be replaced with method calls on objects:

# Create an iterator object whose value cycles from zero to five...
use List::Cycle;
my $index = List::Cycle->new({ values => [0..5] });

# Read in monthly results...
my @cyclic_buffer;
while (my $next_val = prompt 'Next: ') {
    # Saving them in a six-month cyclic buffer...
    $cyclic_buffer[$index->next(  )] = $next_val;

    # And printing the moving average each month...
    print 'Half-yearly moving average: ',
           sum(@cyclic_buffer)/@cyclic_buffer, "\n";
}

Often they can also be replaced with a simple subroutine:

# Create a subroutine whose value cycles from zero to five...
{
    my $next   = -1;
    my @values = (0..5);

    sub next_index { return $next = ($next+1) % @values }
}

# Read in monthly results...
my @cyclic_buffer;
while (my $next_val = prompt 'Next: ') {
    # Saving them in a six-month cyclic buffer...
    $cyclic_buffer[next_index(  )] = $next_val;

    # And printing the moving average each month...
    print 'Half-yearly moving average: ',
        sum(@cyclic_buffer)/@cyclic_buffer, "\n";
}

Either way, the distinctive syntax of the method or subroutine calls provides an essential clue to the distinctive behaviour of the buffer index. So either of the previous solutions will be much more comprehensible, maintainable, and reliable than a tied variable.

Cleverness

Don't be clever.

Tied variables are a clever idea, but "cleverness" is the natural enemy of maintainable code. Unfortunately, Perl provides endless opportunities for cleverness.

For example, imagine coming across this result selector in production code:

$optimal_result = [$result1=>$result2]->[$result2<=$result1];

The syntactic symmetry is very elegant, of course, and devising it obviously provided the original developer with a welcome diversion from the tedium of everyday coding. But a clever line of code like that is a (recurring) nightmare to understand and to maintain, and imposes an unnecessary burden on everyone in the development and maintenance teams.

Cleverness doesn't have to be nearly that flagrant either. Having finally deduced that the example expression returns the smaller of the two results[129], you would almost certainly be tempted to immediately replace it with something like the following:

$optimal_result = $result1 <= $result2 ? $result1 : $result2;

While that's certainly an improvement in both readability and efficiency, it still requires some careful thought to verify that it's doing the right (i.e., minimizing) thing. And everyone who maintains this code will still have to decode that expression—possibly every time they come across it.

However, it's also possible to write that same expression in a way that's so obvious, straightforward, and plain-spoken that it requires no effort at all to verify that it implements the desired behaviour:

use List::Util qw( min );

$optimal_result = min($result1, $result2);

It's not "clever" and it's even marginally slower, but it is clean, clear, efficient, scalable, and easy to maintain. And that's always a much better choice.

Encapsulated Cleverness

If you must rely on cleverness, encapsulate it.

Very occasionally a genuine need for efficiency may (appear to) make it essential to use non-obvious Perl idioms such as:

# Make sure the requests are unique...
@requests  = keys %{ {map {$_=>1} @raw_requests} };

This statement takes each request in @raw_requests, converts it to a pair ($_=>1) in which the request is now the key, and uses that list of pairs to initialize an anonymous hash ({map {$_=>1} @raw_requests}), which folds every repeated request into the same hash key. The hash is then dereferenced (%{ {map {$_=>1} @raw_requests} }), and its unique keys are retrieved (keys %{ {map {$_=>1} @raw_requests} }) and finally assigned into @requests.

But an expression that complex should never be left raw in code. If it's kept at all, it should be kept as a dirty little secret, shamefully hidden away in a subroutine in some dark corner of your code:

sub unique {
    return keys %{ { map {$_=>1} @_ } };  # Mea culpa!
}

# and later...

@requests = unique(@raw_requests);

Apart from the obvious advantage that the request-handling code becomes vastly more readable, encapsulating the cleverness has another important benefit: when the cleverness proves not to be as clever as you first thought, it's very easy to replace it with something that's both slightly more readable and very much more efficient:

sub unique {
    my %uniq;            # Use keys of this hash to track unique values

    @uniq{@_} = (  );      # Use the args as those keys (the values don't matter)
    return keys %uniq;   # Return those unique values
}

In this version, the list of values that's passed in (@_) is used as the list of keys in a hash slice (@uniq{@_}) of the %uniq hash. Slicing that hash creates all the requested keys within the hash, with the values for repeated keys overwriting earlier ones. The slice produces a list of entries, which is assigned an empty list (@uniq{@_} = ( )), as only the keys themselves matter. The assignment is required, though, because the hash keys are created only when the slice is an lvalue. The keys of the post-sliced hash are therefore the unique values from the original argument list, which the subroutine then simply returns (return keys %uniq).

This version is faster than the previous one because it doesn't have to build an interim key/value/key/value . . . list and then use that to initialize the hash. Instead, the slicing operation creates all the required keys at once, and the empty list assignment means that no values need to be assigned (or have space allocated for them).

Outside the subroutine there are important benefits too. Because the "clever" code remains hidden away, every line of client code that uses unique( ) will immediately benefit from the improved performance, but without needing to be rewritten in any way.

Of course, later you'll probably realize that both versions of unique( ) share two flaws: they don't preserve the order of any list they're given (which is a problem if that list is a search path), and they convert all the list data to strings (which is a problem if that data was originally a list of objects or references).

However, once you realize that, you still won't have to change the client code, because you can just rewrite the encapsulated nastiness once again:

sub unique {
    my %seen;                        # Keys track values already seen
    return grep {!$seen{$_}++} @_;   # Keep only those not yet seen
}

This version uses a hash (%seen) merely as a look-up table indicating which values have already been encountered. Then it filters the original argument list, keeping only those values that have not yet been seen and incrementing the count of "sightings" for each element as it's examined ($seen{$_}++). The use of a post-increment is critical here as it's essential that the count be incremented every time, but that the count be zero the first time the grep encounters a particular element, so the filter will let that first instance through.

Because grep is merely a filter which passes acceptable values through unchanged, this version of unique( ) preserves both the type and the order of the arguments it's given. And, though it's not quite as fast as the sliced solution shown earlier, it still runs faster than the "clever" anonymous hash version.
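
For example, with some made-up request data:

my @raw_requests = qw( status status restart status stop );

my @requests = unique(@raw_requests);
# @requests is now ('status', 'restart', 'stop'), in original order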

Benchmarking

Don't optimize code—benchmark it.

It's natural to think that a single expression like:

keys %{ { map {$_=>1} @_ } }

will be more efficient than two statements:

my %seen;
return grep {!$seen{$_}++} @_;

But, unless you are deeply familiar with the internals of the Perl interpreter[130], intuitions about the relative performance of two constructs are exactly that: unconscious guesses.

The only way to know for sure which of two—or more—alternatives will perform better is to actually time each of them. The standard Benchmark module makes that easy, as Example 19-8 illustrates.

Example 19-8. Benchmarking the uniqueness functions

# A short list of not-quite-unique values...
our @data = qw( do re me fa so la ti do );

# Various candidates...
sub unique_via_anon {
    return keys %{ { map {$_=>1} @_ } };
}

sub unique_via_grep {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

sub unique_via_slice {
    my %uniq;
    @uniq{@_} = (  );
    return keys %uniq;
}

# Compare the current set of data in @data
sub compare {
    my ($title) = @_;

    print "
[$title]
";

    # Create a comparison table of the various timings, making sure that
    # each test runs at least 10 CPU seconds...
    use Benchmark qw( cmpthese );
    cmpthese -10, {
        anon   => 'my @uniq = unique_via_anon(@data)',
        grep   => 'my @uniq = unique_via_grep(@data)',
        slice  => 'my @uniq = unique_via_slice(@data)',
    };

    return;
}

compare('8 items, 12.5% repetition');

# Two copies of the original data...
@data = (@data) x 2;
compare('16 items, 56% repetition');

# One hundred copies of the original data...
@data = (@data) x 50;
compare('800 items, 99% repetition');

The cmpthese( ) subroutine is passed a number, followed by a reference to a hash of tests. The number specifies either the exact number of times to run each test (if the number is positive) or the absolute number of CPU seconds to run the test for (if the number is negative). Typical values are around 10,000 repetitions or 10 CPU seconds, but the module will warn you if the test is too short to produce an accurate benchmark.

The keys of the test hash are the names of your tests, and the corresponding values specify the code to be tested. Those values can be either strings (which are eval'd to produce executable code) or subroutine references (which are called directly).

Specifying your tests as strings usually produces results that are more accurate. Subroutine references have to be re-invoked each time each test is repeated, which adds a subroutine call overhead to every test and tends to obscure small relative differences in performance. So benchmarking via eval'd strings is the better practice here (despite the general exhortations against them in Chapter 8).

Unfortunately, eval'd code can see only those lexical variables in the scope where the eval is executed, not those in the scope where the string is created. That means the string eval inside cmpthese( ) cannot see lexicals in the scope where cmpthese( ) was called. So for accurate benchmarking, it's necessary to also ignore the strong admonitions in Chapter 5, and pass data to Benchmark tests using package variables (e.g. our @data) rather than lexicals.

Of course, if you use anonymous subroutines for your tests, they can see any lexicals in their declaration scope, in which case you can avoid both globals and eval, the only problem being that your results may be badly skewed by the extra subroutine call costs.
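
For comparison, the tests from Example 19-8 rewritten as anonymous subroutines would look like this (a sketch; note that the subroutines can now close over a lexical @data, avoiding the package variable):

my @data = qw( do re me fa so la ti do );

use Benchmark qw( cmpthese );
cmpthese -10, {
    anon  => sub { my @uniq = unique_via_anon(@data)  },
    grep  => sub { my @uniq = unique_via_grep(@data)  },
    slice => sub { my @uniq = unique_via_slice(@data) },
};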

The benchmarking code in Example 19-8 would print out something like the following:

[8 items, 12.5% repetition]
         Rate  anon  grep slice
anon  28234/s    --  -24%  -47%
grep  37294/s   32%    --  -30%
slice 53013/s   88%   42%    --

[16 items, 56% repetition]
         Rate  anon  grep slice
anon  21283/s    --  -28%  -51%
grep  29500/s   39%    --  -32%
slice 43535/s  105%   48%    --

[800 items, 99% repetition]
        Rate  anon  grep slice
anon   536/s    --  -65%  -89%
grep  1516/s  183%    --  -69%
slice 4855/s  806%  220%    --

Each of the tables printed has a separate row for each named test. The first column lists the absolute speed of each candidate in repetitions per second, while the remaining columns allow you to compare the relative performance of any two tests. For example, in the final test, tracing across the grep row to the anon column reveals that the grepped solution was 1.83 times (183%) faster than using an anonymous hash. Tracing further across the same row also indicates that grepping was 69% slower (-69% faster) than slicing.

Overall, the indication from the three tests is that the slicing-based solution is consistently the fastest for this particular set of data on this particular machine. It also appears that, as the data set increases in size, slicing scales much better than either of the other two approaches.

However, those two conclusions are effectively drawn from only three data points (namely, the three benchmarking runs). To get a more definitive comparison of the three methods, you'd also need to test other possibilities, such as a long list of non-repeating items or a short list with nothing but repetitions.

Better still, test on the real data that you'll actually be "unique-ing".

For example, if that data is a sorted list of a quarter million words, with only minimal repetitions, and which has to remain sorted, then test exactly that:

use Perl6::Slurp;

our @data = slurp '/usr/share/biglongwordlist.txt';

use Benchmark qw( cmpthese );
cmpthese 10, {
    # Note: the non-grepped solutions need a post-uniqification re-sort...
    anon   => 'my @uniq = sort(unique_via_anon(@data))',
    grep   => 'my @uniq =      unique_via_grep(@data)',
    slice  => 'my @uniq = sort(unique_via_slice(@data))',
};

Not surprisingly, this benchmark indicates that the grepped solution is markedly superior on a large sorted data set:

      s/iter  anon slice  grep
anon    4.28    --   -3%  -46%
slice   4.15    3%    --  -44%
grep    2.30   86%   80%    --

Perhaps more interestingly, the grepped solution still benchmarks as being marginally faster when the two hash-based approaches aren't re-sorted. This suggests that the better scalability of the sliced solution that was seen in the earlier benchmark is a localized phenomenon, and is eventually undermined by the growing costs of allocation, hashing, and bucket-overflows as the sliced hash grows very large.

Above all, that last example demonstrates that benchmarks only benchmark the cases you actually benchmark. And that useful conclusions about performance can only be drawn from benchmarking real data.

Memory

Don't optimize data structures—measure them.

Intuitions about the relative space efficiency of different data structures aren't very reliable, either. If you are concerned about the memory footprint of a data structure that you are using, the Devel::Size module makes it easy to see how heavy the burden actually is:

# This look-up table is handy, but seems to be too bloated...
my %lookup = load_lookup_table($file);

# So let's look at how much memory it's using...
use Devel::Size qw( size total_size );
use Perl6::Form;

my $hash_mem  = size(\%lookup);           # Storage overheads only
my $total_mem = total_size(\%lookup);     # Overheads plus actual data
my $data_mem  = $total_mem - $hash_mem;   # Data only

print form(
    'hash alone: {>>>,>>>,>>} bytes',  $hash_mem,
    'data alone: {>>>,>>>,>>} bytes',  $data_mem,
    '==============================',
    'total:      {>>>,>>>,>>} bytes',  $total_mem,
);

That might print something like:

hash alone:    8,704,075 bytes
data alone:    8,360,250 bytes
==============================
total:        17,064,325 bytes

which indicates that storing your 8.36MB of data in a hash has incurred an overhead of an additional 8.70MB for buckets, hash tables, keys, and other internals.

The total_size( ) subroutine takes a reference to a variable and returns the total number of bytes of memory used by that variable. This includes both:

  • The memory that the variable uses for its own implementation. For example, the buckets that are needed to implement a hash, or the flag bits that are used inside every scalar.

  • The memory used by the data that the variable stores. For example, the space required for the keys and values in a hash, or for the value in a scalar.

The size( ) subroutine also takes a variable reference, but returns only the number of bytes that the variable uses for itself, excluding the memory required to store its data.
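
The same tools make it easy to settle arguments about data-structure overhead empirically, rather than by guesswork. For example, here's a sketch (with invented data) comparing the footprint of the same 1000 key/value pairs stored two different ways:

use Devel::Size qw( total_size );

# The same 1000 number/square pairs, stored two different ways...
my %squares_as_hash  = map { $_ => $_**2  } 0..999;
my @squares_as_pairs = map { [$_, $_**2] } 0..999;

printf "as a hash:            %9d bytes\n", total_size(\%squares_as_hash);
printf "as an array of pairs: %9d bytes\n", total_size(\@squares_as_pairs);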

Caching

Look for opportunities to use caches.

It makes sense not to do the same calculation twice, if the result is small enough that it can reasonably be stored for reuse. The simplest form of that is putting a result into an interim variable whenever it will be used more than once. That is, instead of calling the same functions twice on the same data:

print form(
    'hash alone: {>>>,>>>,>>} bytes', size(\%lookup),
    'data alone: {>>>,>>>,>>} bytes', total_size(\%lookup)-size(\%lookup),
    '==============================',
    'total:      {>>>,>>>,>>} bytes', total_size(\%lookup),
);

call them once, store the results temporarily, and retrieve them each time they're needed:

my $hash_mem  = size(\%lookup);
my $total_mem = total_size(\%lookup);
my $data_mem  = $total_mem - $hash_mem;

print form(
    'hash alone: {>>>,>>>,>>} bytes',  $hash_mem,
    'data alone: {>>>,>>>,>>} bytes',  $data_mem,
    '==============================',
    'total:      {>>>,>>>,>>} bytes',  $total_mem,
);

This often has the additional benefit of allowing you to name the interim values in ways that make the code more comprehensible.

Subroutines like size( ) and total_size( ) and functions like rand( ) or readline( ) don't always return the same result when called with the same arguments. Such subroutines are good candidates for temporary and localized reuse of results, but not for longer-term caching.

On the other hand, pure functions like sqrt( ) and int( ) and crypt( ) do always return the same result for the same list of arguments, so their return values can be stored long-term and reused whenever they're needed again. For example, if you have a subroutine that returns a case-insensitive SHA-512 digest:

sub lc_digest {
    my ($text) = @_;

    use Digest::SHA qw( sha512 );
    return sha512(lc $text);
}

then you could (potentially) speed it up over many calls by giving it a private look-up table in which results can be cached as they're computed, as shown in Example 19-9.

Example 19-9. Adding a cache to a digest subroutine

{
    my %cache;

    sub lc_digest {
        my $text = lc shift;

        # Compute the answer only if it's not already known...
        if (!exists $cache{$text}) {
            use Digest::SHA qw( sha512 );
            $cache{$text} = sha512($text);
        }

        return $cache{$text};
    }
}

On the other hand, if the range of possible data for a computation is small and the number of computations is large, then it's often simpler and more efficient to pre-compute the entire look-up table and then access it directly, thereby eliminating the cost of a subroutine call. For example, suppose you were doing some kind of image processing and needed square roots for pixel intensity values in the range 0 to 255. You could write:

for my $row (@image_rows) {
    for my $pixel_value (@{$row}) {
        $pixel_value = sqrt($pixel_value);
    }
}

or you could dramatically reduce the number of sqrt operations by precomputing all possible values and creating a look-up table:

my @sqrt_of = map { sqrt $_ } 0..255;

for my $row (@image_rows) {
    for my $pixel_value (@{$row}) {
        $pixel_value = $sqrt_of[$pixel_value];
    }
}

For a thorough discussion of the many applications and advantages of caching, see Chapter 3 of Higher-Order Perl, by Mark Jason Dominus (Morgan Kaufmann, 2005).

Memoization

Automate your subroutine caching.

The logic required to implement a caching strategy is always the same: check whether the result is already cached; otherwise, compute and cache it; either way, return the cached result. So, as usual, there's a CPAN module that automates the task: Memoize.

To add caching to a subroutine (a process called memoization), you simply define the subroutine without any caching, and load Memoize. The module will automatically export a memoize( ) subroutine, which you then call, passing it a string containing the name of the subroutine you want cached. Like so:

sub lc_digest {
    my ($text) = @_;

    use Digest::SHA qw( sha512 );
    return sha512(lc $text);
}

use Memoize;
memoize( 'lc_digest' );

Notice how much cleaner this is than the "manually cached" version in Example 19-9.

It's also more reliable, as you can focus on getting the computation correct, and leave the details of the caching strategy to Memoize. For example, the caches that the module installs correctly differentiate between subroutine calls in list and scalar context. This is important, because the same subroutine called with the same arguments might still return different values, depending on whether it was expected to return a list or a single value. Forgetting this distinction is a very common error when implementing caching manually[131].

The memoize( ) subroutine has many other options for fine-tuning the kinds of caching it confers. The module documentation provides detailed descriptions of the many possibilities, including caching results in a database so that the cache persists between executions of your program.
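
For example, the module's documentation shows how to make the cache persistent by keeping it in a tied on-disk hash. Here's a sketch, assuming the standard DB_File module and an invented cache filename:

use Memoize;
use DB_File;
use Fcntl;

# Keep the scalar-context cache in a disk file, so it survives between runs...
tie my %disk_cache, 'DB_File', 'lc_digest.cache', O_RDWR|O_CREAT, 0644;

memoize( 'lc_digest', SCALAR_CACHE => [HASH => \%disk_cache] );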

Caching for Optimization

Benchmark any caching strategy you use.

Caching is a strategy that would seem to have no downside. After all, computing a value only once is obviously always going to be quicker than recomputing it many times[132]. That's true, of course, but it isn't the whole story. Occasionally, caching can backfire and actually make a computation slower.

It's certainly the case that computing once is always quicker than recomputing every time. However, caching isn't quite a case of computing-once; it's actually a case of computing-once-and-forever-after-rechecking-whether-you've-already-computed-and-if-so-then-accessing-the-previously-computed-value. That more complicated process may not always be quicker than recomputing every time. Searching and then accessing a look-up table has an intrinsic cost, which can occasionally be greater than redoing the entire calculation. Especially if the look-up table is a hash.

So, whenever you decide to add caching to a computation, it's essential to benchmark the resulting code, to make sure that the cache look-up costs aren't more expensive than the computation itself. For example, for the pixel square roots from the previous guideline, a simple speed comparison:

use Benchmark qw( cmpthese );

my @sqrt_of = map {sqrt $_} 0..255;

cmpthese -30, {
    recompute      => q{ for my $n (0..255) { my $res = sqrt $n      } },
    look_up_array  => q{ for my $n (0..255) { my $res = $sqrt_of[$n] } },
};

reveals that, in this instance, using a look-up table is only about 9% faster than just calling sqrt directly every time:

                 Rate         recompute   look_up_array
recompute       3951/s            --             -8%
look_up_array   4291/s            9%             --

You then need to decide whether that marginal performance improvement is enough to warrant the additional complexity in the code.

By the way, if you're using Memoize to install your caching, you can tell it to install the memoized version of your subroutine under a different name by using the INSTALL option. This makes it easy to benchmark the relative performance of two versions:

# Install a memoized version of lc_digest() as fast_lc_digest(  )...
memoize( 'lc_digest', INSTALL=>'fast_lc_digest' );

# See if it is actually any faster on real data sets...
cmpthese -30, {
    nomemo => q{ for my $text (@real_data) { my $res = lc_digest($text); }      },
    memo   => q{ for my $text (@real_data) { my $res = fast_lc_digest($text); } },
};

Profiling

Don't optimize applications—profile them.

In the previous guideline, the benchmarked comparison between repeatedly computing sqrt $pixel_value and repeatedly looking up $sqrt_of[$pixel_value] indicated that caching provided a 9% improvement:

                 Rate         recompute   look_up_array
recompute       3951/s            --             -8%
look_up_array   4291/s            9%             --

That sounds impressive, but it's important to keep those numbers in perspective. Each iteration of the test did 256 square root retrievals. So, overall, the test was achieving 1,011,456 (i.e., 3951 x 256) sqrt calls per second, compared to 1,098,496 @sqrt_of look-ups per second.

Suppose you were processing the 786,432 pixels of a typical 1024 x 768 image. Using the example performance figures, the repeated sqrt calls would require around 0.78 seconds to process that many pixels, whereas the look-up table would take only about 0.72 seconds. Adding a cache to this section of your code would save you a grand total of 0.06 seconds per image.

That's an all-too-common outcome when code is optimized: developers focus their efforts on those components that are easy to optimize, rather than on those components in which improvements will produce the greatest benefit.

How do you find those places where optimization will do the most good? By understanding where your application spends most of its time. And the easiest way to do that is to profile your program using the standard Devel::DProf module, which can determine how long your application spends within each subroutine in your source code. That is, instead of running your program in the usual way:

> perl application.pl datafile

run it under the auspices of the profiler module:

> perl -d:DProf application.pl datafile

The -d: debugging flag is a shorthand for -MDevel::, so specifying -d:DProf is the same as specifying -MDevel::DProf. That tells perl to include that profiling module before the start of the application.pl source.

The module itself simply watches every subroutine call within your code, noting how much time elapses between the invocation and return, and adding that duration to a record of the total time spent in each subroutine. At the end of the program, the module creates a file called tmon.out in the current directory.

It's possible to directly interpret the raw data in the file (see the module docs for details), but much easier to understand it by passing it through the standard dprofpp application.

For example, you could use Devel::DProf to profile a call to your new autoformat application, and then summarize the results with dprofpp like so:

> perl -d:DProf ~/bin/autoformat  < Ch_19.txt  > Ch_19_formatted.txt

> dprofpp tmon.out

Total Elapsed Time = 3.212516 Seconds
  User+System Time = 0.722516 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 16.6   0.120  0.416      3   0.0399 0.1387  main::BEGIN
 11.0   0.080  0.138      9   0.0089 0.0153  Text::Autoformat::BEGIN
 8.30   0.060  0.066      1   0.0600 0.0665  Getopt::Declare::parse
 7.89   0.057  0.057    221   0.0003 0.0003  Text::Balanced::_failmsg
 6.09   0.044  0.075      1   0.0437 0.0749  Text::Autoformat::autoformat
 5.54   0.040  0.089      3   0.0133 0.0298  Getopt::Declare::Arg::BEGIN
 5.26   0.038  0.252      1   0.0381 0.2516  Getopt::Declare::new
 4.01   0.029  0.059     26   0.0011 0.0023  Getopt::Declare::BEGIN
 3.88   0.028  0.111     70   0.0004 0.0016  Text::Balanced::extract_codeblock
 3.60   0.026  0.074    133   0.0002 0.0006  Text::Balanced::_match_codeblock
 2.77   0.020  0.020      1   0.0200 0.0199  Text::Balanced::ErrorMsg::BEGIN
 2.77   0.020  0.030      3   0.0066 0.0099  vars::BEGIN
 2.77   0.020  0.020      5   0.0040 0.0040  Exporter::as_heavy
 2.77   0.020  0.020     17   0.0012 0.0012  Exporter::import
 1.38   0.010  0.010      1   0.0100 0.0100  AutoLoader::AUTOLOAD

For each subroutine, the table produced by dprofpp shows (amongst other details):

  • The total time spent in the subroutine (%Time), as a percentage of the total time spent running the application

  • The actual amount of time spent within the subroutine (ExclSec)

  • The actual amount of time spent within the subroutine and any subroutines it called (CumulS)

  • The number of calls made to the subroutine (#Calls)

Looking at the output of the previous example, you can see that the program spends about 8% of its time in Text::Balanced::_failmsg( ), making a total of 221 calls to that subroutine. In contrast, it spends only about half that amount of time in 70 calls to Text::Balanced::extract_codeblock( ). So it probably makes sense to focus any optimization efforts on _failmsg( ) first.

If you need finer-grained profiling, the Devel::SmallProf CPAN module allows you to count how many times each line of your program is executed, which makes it easier to determine precisely what statement is causing a particular subroutine to be expensive:

> perl -d:SmallProf application.pl datafile

The result is a file named smallprof.out, which is actually a copy of your source code with each line prefixed by the number of times it was executed, the total amount of time spent executing the line, and the line number. Although the module lacks a utility like dprofpp to summarize its output, it is an excellent investigative tool once you have identified the main suspects using Devel::DProf.

Enbugging

Be careful to preserve semantics when refactoring syntax.

The guidelines in this book are designed to improve the robustness, efficiency, and maintainability of your code. However, you need to take extra care when applying them retrospectively to the source of existing applications.

If you're rewriting existing code to bring it into line with the practices suggested here, be sure that you preserve the existing behaviour of the code through the changes you make. For example, if you have a loop such as:

for (@candidates) { next unless m/^Name: (.+?); $/; $target_name = $1 and last }

you might refactor it to:

# Find the first candidate with a valid Name: field...
CANDIDATE:
for my $candidate (@candidates) {
    # Extract the contents of the Name: field...
    my ($name)
        = $candidate =~ m/^Name: (.+?); $/xms;

    # ...or try elsewhere...
    next CANDIDATE if !defined $name;

    # If name found, save it and we're done...
    $target_name = $name;
    last CANDIDATE;
}

However, adding the /xms (as recommended in Chapter 12) will alter the semantics of the pattern inside the regular expression. Specifically, it will change the meaning of ^, $, .+?, and the space character. Even though the pattern's syntax didn't change, its behaviour did. That's a particularly subtle way to break a piece of code.

In this case, you have to make sure that you apply all of the guidelines in Chapter 12, changing the pattern as well as the flags, so as to preserve the original behaviour:

# Find the first candidate with a valid Name: field...
CANDIDATE:
for my $candidate (@candidates) {
    # Extract the Name: field...
    my ($name)
        = $candidate =~ m{\A Name: \s+ ([^\n]+) ; \s+ \n? \z}xms;

    # ...or try elsewhere...
    next CANDIDATE if !defined $name;

    # If name found, save it and we're done...
    $target_name = $name;
    last CANDIDATE;
}

Good intentions don't prevent bad outcomes, especially when you're renovating existing code. Making obvious improvements can also introduce unobvious bugs. However, the recommendations in Chapter 18 can help you detect when virtue has become its own punishment. For example, having rewritten that statement, you could immediately rerun prove -r, which—assuming your test suite was sufficiently comprehensive—would highlight the unintended change in behaviour.



[124] And documentation, and data files, and document templates, and makefiles, and style sheets, and change logs, and any other resources your system requires. Revision control isn't only for source.

[125] If you dare!

[126] For an up-to-date list, see http://search.cpan.org/search?q=Inline.

[127] And more extensively explained in Chapter 9 of Object Oriented Perl (Manning, 1999) or Chapter 14 of Programming Perl (O'Reilly, 2000).

[128] And as the code is maintained and extended, that critical tying of the variable will drift away from the code where the magic variable is used.

[129] The first square brackets ([$result1=>$result2]) create an anonymous array containing the two results in order. Then the second brackets index into that array. The index they use is the integer result of the less-than-or-equal-to comparison: $result2<=$result1. That comparison is true (1) if $result2 is no bigger than $result1, in which case the resulting index (1) selects the second element ($result2) from the anonymous array. If $result1 is smaller than $result2, then the result of the comparison is false, which becomes zero when used as an index, which selects the first element ($result1) from the anonymous array. So the entire expression always selects whichever of the two results is smaller. Of course, it takes seven lines of careful analysis to work that out, but that's a small price to pay to enjoy High Art.

[130] In which case you already have far more serious personal issues to deal with.

[131] It's not, however, a defect in the manually cached version of lc_digest( ) in Example 19-9, because sha512( ) always returns a single value, regardless of its call context.

[132] At least, until your calculations turn quantum.
