Chapter 10. I/O

On two occasions I have been asked [by members of Parliament], "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?"
I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

Charles Babbage

Input and output are critical in any design, because they mediate the interface of an application or library. To most users of your software, what your I/O components do is their entire experience of what the software is. So good I/O practices are essential to usability.

I/O operations are also particularly susceptible to inefficiencies, especially on large data sets. I/O is frequently the bottleneck in a system, and usually doesn't scale well. So good I/O practices are essential to performance too.

Yet another concern is that I/O deals with the software's external environment, which is typically less reliable than its own internals. Dealing successfully with the multiple failure modes of operating systems, filesystems, network connections, and human beings requires careful and conservative programming. So good I/O practices are essential to robustness as well.

Filehandles

Don't use bareword filehandles.

One of the most efficient ways for Perl programmers to bring misery and suffering upon themselves and their colleagues is to write this:

open FILE, '<', $filename
    or croak "Can't open '$filename': $OS_ERROR";

Using a bareword like that as a filehandle causes Perl to store the corresponding input stream descriptor in the symbol table of the current package. Specifically, the stream descriptor is stored in the symbol table entry whose name is the same as the bareword; in this case, it's *FILE. By using a bareword, the author of the previous code is effectively using a package variable to store the filehandle.

If that symbol has already been used as a filehandle anywhere else in the same package, executing this open statement will close that previous filehandle and replace it with the newly opened one. That's going to be a nasty surprise for any code that was already relying on reading input with <FILE>[51].

The writer of this particular code also chose the imaginative name FILE for the filehandle. That's one of the commonest names used for package filehandles [52], so the chances of colliding with someone else's open filehandle are greatly enhanced.
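
For example, here is a minimal sketch of the kind of collision that results (the package, subroutine, and file names are purely hypothetical):

package Monitor;

sub load_config {
    open FILE, '<', 'monitor.conf'
        or croak "Can't open 'monitor.conf': $OS_ERROR";

    # Read part of the configuration, leaving FILE open for later...
}

sub load_data {
    # Same package, same imaginative filehandle name...
    open FILE, '<', 'monitor.dat'
        or croak "Can't open 'monitor.dat': $OS_ERROR";

    # The config code's FILE has now been silently closed, so any
    # subsequent <FILE> in load_config(  ) reads data, not configuration.
}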

As if these pitfalls with bareword filehandles weren't bad enough, barewords are even more unreliable if there's a subroutine of the same name currently in scope. And worse still, under those circumstances they may fail silently. For example:

# Somewhere earlier in the same package (but perhaps in a different file)...
use POSIX;

# and later...

# Open filehandle to the external device...
open EXDEV, '<', $filename
    or croak "Can't open '$filename': $OS_ERROR";

# And process data stream...
while (my $next_reading = <EXDEV>) {
    process_reading($next_reading);
}

The POSIX module will have quietly exported a subroutine representing the POSIX error-code EXDEV into the package's namespace (just as if that constant had been declared in a use constant pragma). So the open statement is really:

open EXDEV(  ), '<', $filename
    or croak "Can't open '$filename': $OS_ERROR";

When that statement executes, it will first call the EXDEV( ) subroutine, which happens to return the value 18. The open statement then uses that value as a bareword filehandle name, opens an input stream to the requested file, and stores the resulting filehandle in the package's *18 symbol table entry [53].

Unfortunately, the EXDEV( ) subroutine isn't visible within the angle brackets of a subsequent input operation (i.e., <EXDEV>), because the input operator always treats an enclosed bareword as the direct name of the package filehandle that it's supposed to read from. As a result, the angle brackets attempt to read from *EXDEV, which results in a completely accurate, but highly confusing error message:

readline(  ) on unopened filehandle EXDEV

The usual conundrum at that point is: how can the filehandle possibly be unopened, when the open statement on the immediately preceding line didn't throw an exception??? And if the obvious culprit (the use POSIX) is off in another file somewhere, it can be very difficult to track down what's going wrong.

Curiously, the code would work as intended if it were rewritten like so:

# And process data stream...
while (my $next_reading = <18>) {
    process_reading($next_reading);
}

But that's hardly an ideal solution. The ideal solution is not to use bareword filehandles at all.

Indirect Filehandles

Use indirect filehandles.

Indirect filehandles provide a much cleaner and less error-prone alternative to bareword filehandles, and from Perl 5.6 onwards they're as easy to use as barewords. Whenever you call open with an undefined scalar variable as its first argument, open creates an anonymous filehandle (i.e., one that isn't stored in any symbol table), opens it, and puts a reference to it in the scalar variable you passed.

So you can open a file and store the resulting filehandle in a lexical variable, all in one statement, like so:

open my $FILE, '<', $filename
    or croak "Can't open '$filename': $OS_ERROR";

The my $FILE embedded in the open statement first declares a new lexical variable in the current scope. That variable is created in an undefined state, so the open fills it with a reference to the filehandle it's just created, as described earlier.

Under versions of Perl prior to 5.6, open isn't able to create the necessary filehandle automatically, so you have to do it yourself, using the gensym( ) subroutine from the standard Symbol module:

use Symbol qw( gensym );

# and later...

my $FILE = gensym(  );
open $FILE, '<', $filename    or croak "Can't open '$filename': $OS_ERROR";

Either way, once the open filehandle is safely stored in the variable, you can read from it like so:

$next_line = <$FILE>;

And now it doesn't matter that the name of that filehandle is $FILE (at least, not from the point of view of code robustness). Sure, it's still a lousy, lazy, unimaginative, uninformative name, but now it's a lousy, lazy, unimaginative, uninformative, lexical name, so it won't sabotage anyone else's lousy, lazy, unimaginative, uninformative name [54].

Even if there's already another $FILE variable in the same scope, the open won't clobber it; you'll merely get a warning:

"my" variable $FILE masks earlier declaration in same scope

and the new $FILE will hide the old one until the end of its scope.

Apart from avoiding the perils of global namespaces and the confusion of barewords that are sometimes subroutine calls, lexical filehandles have yet another advantage: they close themselves automatically when their lexical variable goes out of scope.

This feature ensures that a filehandle doesn't accidentally outlive the scope in which it's opened, it doesn't unnecessarily consume system resources, and it's also less likely to lose unflushed output if the program terminates unexpectedly. Of course, it's still preferable to close filehandles explicitly (see the "Cleanup" guideline later in this chapter), but lexical filehandles still improve the robustness of your code even when you forget.
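
For example, in a sketch like the following (the $log_file and @messages variables are illustrative only), the filehandle is flushed and closed for you as soon as the block ends, even if you forget the explicit close:

{
    open my $log, '>', $log_file
        or croak "Can't open '$log_file': $OS_ERROR";

    print {$log} @messages
        or croak "Couldn't write '$log_file': $OS_ERROR";

}   # $log goes out of scope here, so the filehandle is closed automatically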

Localizing Filehandles

If you have to use a package filehandle, localize it first.

Very occasionally, you simply have to use a package filehandle, rather than a lexical. For example, you might have existing code that relies on hard-wired bareword filehandle names.

In such cases, make sure that the symbol table entry involved is always referred to explicitly, with a leading asterisk. And, more importantly, always localize that typeglob within the smallest possible scope. For example:

# Wrap the Bozo::get_data(  ) subroutine cleanly.
# (Apparently this subroutine is hard-wired to only read from a filehandle
#  named DATA::SRC. And it's used in hundreds of places throughout our
#  buffoon-monitoring system, so we can't change it. At least we fired the
#  clown that wrote this, didn't we???)...
sub get_fool_stats {
    my ($filename) = @_;

    # Create a temporary version of the hardwired filehandle...
    local *DATA::SRC;

    # Open it to the specified file...
    open *DATA::SRC, '<', $filename
        or croak "Can't open '$filename': $OS_ERROR";

    # Call the legacy subroutine...
    return Bozo::get_data(  );
}

Applying local to the *DATA::SRC typeglob temporarily replaces that entry in the symbol table. Thereafter, the filehandle that is opened is stored in the temporary replacement typeglob, not in the original. And it's the temporary *DATA::SRC that Bozo::get_data( ) sees when it's called. Then, when the results of that call are returned, control passes back out of the body of get_fool_stats( ), at which point any localization within that scope is undone, and any pre-existing *DATA::SRC filehandle is restored.

Localization prevents most of the usual problems with bareword filehandles, because it ensures that the original *DATA::SRC is unaffected by the non-lexical open inside the call to get_fool_stats( ). It also guarantees that the filehandle is automatically closed at the end of the subroutine call. And using explicitly asterisked typeglobs instead of barewords avoids any confusion if there's also a DATA::SRC( ) subroutine.

Nonetheless, if you have a choice, lexical filehandles are still a better alternative. Unlike localized typeglobs, lexicals are strictly limited to the scope in which they are created. In contrast, localized package filehandles are available not only in their own scope, but—as the previous example illustrates—they can also be seen in any deeper scope that is called from their own scope. So a localized package filehandle can still potentially be pre-empted (i.e., broken) by another careless open in some nested scope.
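
For example, if the legacy subroutine (or anything it calls) happens to contain another careless package-filehandle open, the carefully localized *DATA::SRC is quietly replaced. A sketch of the failure (the quota file is purely hypothetical):

# Somewhere deep inside Bozo::get_data(  )...
open DATA::SRC, '<', 'clown_quota.txt'
    or croak "Can't open quota file: $OS_ERROR";

# ...which reopens the localized *DATA::SRC, so get_fool_stats(  )
# now unknowingly returns quota data instead of fool statistics.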

Opening Cleanly

Use either the IO::File module or the three-argument form of open.

You may have noticed that all of the examples so far use the three-argument form of open. This variant was introduced in Perl 5.6 and is more robust than the older two-argument version, which is susceptible to very rare, but subtle, failures:

# Log system uses a weird but distinctive naming scheme...
Readonly my $ACTIVE_LOG => '>temp.log<';
Readonly my $STATIC_LOG => '>perm.log<';

# and later...

open my $active,  "$ACTIVE_LOG"  or croak "Can't open '$ACTIVE_LOG': $OS_ERROR";
open my $static, ">$STATIC_LOG"  or croak "Can't open '$STATIC_LOG': $OS_ERROR";

This code executes successfully, but it doesn't do what it appears to. The $active filehandle is opened for output to a file named temp.log<, not for input from a file named >temp.log<. And the $static filehandle is opened for appending to a file named perm.log<, rather than overwriting a file named >perm.log<. That's because the two open statements are equivalent to:

open my $active, '>temp.log<'   or croak "Can't open '>temp.log<': $OS_ERROR";
open my $static, '>>perm.log<'  or croak "Can't open '>perm.log<': $OS_ERROR";

and the '>' and '>>' prefixes on the second arguments tell open to open the files whose names appear after the prefixes in the corresponding output modes.

Using a three-argument open instead ensures that the specified opening mode can never be subverted by bizarre filenames, since the second argument now specifies only the opening mode, and the filename is supplied separately and doesn't have to be decoded at all:

# Log system uses a weird but distinctive naming scheme...
Readonly my $ACTIVE_LOG => '>temp.log<';
Readonly my $STATIC_LOG => '>perm.log<';

# and later...

open my $active, '<', $ACTIVE_LOG  or croak "Can't open '$ACTIVE_LOG': $OS_ERROR";
open my $static, '>', $STATIC_LOG  or croak "Can't open '$STATIC_LOG': $OS_ERROR";

And, as a small side-benefit, each open becomes visually more explicit about the intended mode of the resulting filehandle, which improves the readability of the resulting code slightly.

The only time you should use the two-argument form of open is if you need to open a stream to or from the standard I/O streams:

open my $stdin,  '<-' or croak "Can't open stdin: $OS_ERROR";
open my $stdout, '>-' or croak "Can't open stdout: $OS_ERROR";

The three-argument forms:

open my $stdin,  '<', '-' or croak "Can't open '-': $OS_ERROR";
open my $stdout, '>', '-' or croak "Can't open '-': $OS_ERROR";

don't have the same special magic; they simply attempt to open a file named "-" for reading or writing.

As an alternative to using open at all, you can also use Perl's object-oriented I/O interface to open files via the standard IO::File module. For example, the earlier log system example could also be written:

# Log system uses a weird but distinctive naming scheme...
Readonly my $ACTIVE_LOG => '>temp.log<';
Readonly my $STATIC_LOG => '>perm.log<';

# and later...
use IO::File;

my $active = IO::File->new($ACTIVE_LOG, '<')
    or croak "Can't open '$ACTIVE_LOG': $OS_ERROR";
my $static = IO::File->new($STATIC_LOG, '>')
    or croak "Can't open '$STATIC_LOG': $OS_ERROR";

The resulting filehandles in $active and $static can still be used like any other filehandle. In fact, the only significant difference between using IO::File->new( ) and using open is that the OO version blesses the resulting filehandle into the IO::File class, whereas open produces raw filehandles that act like objects of the IO::Handle class (even though they're not actually blessed).
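
For instance, either the angle-bracket operator or the inherited IO::Handle methods can be used on such a handle. A hedged sketch, copying the active log into the static one (use whichever form suits the surrounding code):

# Operator syntax...
while (my $line = <$active>) {
    print {$static} $line
        or croak "Couldn't write '$STATIC_LOG': $OS_ERROR";
}

# ...or, equivalently, the object-oriented interface:
while (defined( my $line = $active->getline(  ) )) {
    $static->print($line)
        or croak "Couldn't write '$STATIC_LOG': $OS_ERROR";
}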

Error Checking

Never open, close, or print to a file without checking the outcome.

These three I/O functions are probably the ones that fail most often. They can fail because a path is bad, or a file is missing, or inaccessible, or has the wrong permissions, or a disk crashes, or the network fails, or the process runs out of file descriptors or memory, or the filesystem is read-only, or any of a dozen other problems.

So writing unguarded I/O statements like this:

open my $out,  '>', $out_file;
print {$out} @results;
close $out;

is sheer optimism, especially when it's not significantly harder to check that everything went to plan:

open my $out,  '>', $out_file  or croak "Couldn't open '$out_file': $OS_ERROR";
print {$out} @results          or croak "Couldn't write '$out_file': $OS_ERROR";
close $out                     or croak "Couldn't close '$out_file': $OS_ERROR";

Or, more forgivingly, as part of a larger interactive process:

SAVE:
while (my $save_file = prompt 'Save to which file? ') {
    # Open specified file and save results...
    open my $out, '>', $save_file  or next SAVE;
    print {$out} @results          or next SAVE;
    close $out                     or next SAVE;

    # Save succeeded, so we're done...
    last SAVE;
}

Also see the "Builtin Failures" guideline in Chapter 13 for a less intrusive way to ensure that every open, print, and close is properly checked.

Checking every print to a terminal device is also laudable, but not essential. Failure in such cases is much rarer, and usually self-evident. Besides, if your print statements can't reach the terminal, it's unlikely that your warnings or exceptions will either.

Cleanup

Close filehandles explicitly, and as soon as possible.

Lexical filehandles, and even localized package filehandles, automatically close as soon as their variable or localization goes out of scope. But, depending on the structure of your code, that can still be suboptimal:

sub get_config {
    my ($config_file) = @_;

    # Access config file or signal failure...
    open my $fh, '<', $config_file
        or croak "Can't open config file: $config_file";

    # Load file contents...
    my @lines = <$fh>;

    # Storage for config data...
    my %config;
    my $curr_section = $EMPTY_STR;

    # Decode config data...
    CONFIG:
    for my $line (@lines) {
        # Section markers change the second-level hash destination...
        if (my ($section_name) = $line =~ m/ \A \[ ([^\]]+) \] /xms) {
            $curr_section = $section_name;
            next CONFIG;
        }

        # Key/value pairs are stored in the current second-level hash...
        if (my ($key, $val) = $line =~ m/\A \s* (.*?) \s* : \s* (.*?) \s* \z/xms) {
            $config{$curr_section}{$key} = $val;
            next CONFIG;
        }

        # Ignore everything else
    }
    return \%config;
}

The problem here is that the input file remains open after it's used, and stays open for however long the decoding of the data takes.

The sooner a filehandle is closed, the sooner the internal and external resources it controls are freed up. The sooner it's closed, the less chance there is for accidental reuse or misuse. The sooner an output filehandle is closed, the sooner the written file is in a stable state.

The previous example would be more robust if it didn't rely on the scope boundary to close the lexical filehandle when the subroutine returns. It should have been written:

sub get_config {
    my ($config_file) = @_;

    # Access config file or signal failure...
    open my $fh, '<', $config_file
        or croak "Can't open '$config_file': $OS_ERROR";

    # Load file contents and close file...
    my @lines = <$fh>;
    close $fh
        or croak "Can't close '$config_file' after reading: $OS_ERROR";

    # [Decode config data and return, as before]
}

Input Loops

Use while (<>), not for (<>).

Programmers are occasionally tempted to write input loops using a for, like this:

use Regexp::Common;
Readonly my $EXPLETIVE => $RE{profanity};

for my $line (<>) {
    $line =~ s/$EXPLETIVE/[DELETED]/gxms;
    print $line;
}

That's presumably because for loops are inherently finite in their number of iterations, and hence intrinsically more robust. Or perhaps it's just that the keyword is two characters shorter.

Whatever the reason, using a for loop to iterate input is a very inefficient and brittle solution. The iteration list of a for loop is (obviously) a list context. So in the example, the <> operator is called in a list context. Evaluating <> in list context causes it to read in every line it can, building a temporary list as it does. Once the input is complete, that list becomes the list to be iterated by the for.

There are several problems with that approach. For a start, it means the for loop won't start to iterate until the entire input stream has been read and an end-of-file encountered. This means that the previous code can't be used interactively. Moreover, constructing a (potentially very long) list of the input lines is expensive, both in terms of the memory required to store the entire list and in terms of the time required to allocate that memory and to actually build the list.

Worst of all, the for input loop doesn't scale well. Its memory requirements are linearly proportional to the total size of the input, with something like a 200% overhead [55]. That means that a sufficiently large input might actually break the input loop with a memory allocation failure (Out of memory!), or at least slow it down intolerably with excessive memory allocation and swapping overheads.

In contrast, an equivalent while loop:

while (my $line = <>) {
    $line =~ s/$EXPLETIVE/[DELETED]/gxms;
    print $line;
}

reads and processes only one line at a time. This version can be used interactively, and never allocates more memory than is needed to accommodate the longest individual line. So use a while instead of a for when reading input.

By the way, the same problems don't arise when iterating large ranges:

for my $n (2..1_000_000_000) {
    my @factors = factors_of($n);

    if (@factors == 2) {
        print "$n is prime
";
    }
    else {
        print "$n is composite with factors: @factors
";
    }}

In modern versions of Perl, ranges are lazily evaluated, so the previous code doesn't first have to build a list of 999,999,999 consecutive integers before the for can start iterating.

Line-Based Input

Prefer line-based I/O to slurping.

Reading in an entire file in a single <> operation is colloquially known as "slurping". But the considerations of memory allocation discussed in the previous section mean that slurping the contents of a file and then manipulating those contents monolithically, like so:

# Slurp the entire file (see the next guideline)...
my $text = do { local $/; <> };

# Wash its mouth out...
$text =~ s/$EXPLETIVE/[DELETED]/gxms;

# Print it all back out...
print $text;

is generally slower, less robust, and less scalable than processing the contents a line at a time:

while (my $line = <>) {
    $line =~ s/$EXPLETIVE/[DELETED]/gxms;
    print $line;
}

Reading an entire file into memory makes sense only when the file is unstable in some way, or is being updated asynchronously and you need a "snapshot", or if your planned text processing is likely to cross line boundaries:

sub get_C_code {
    my ($filename) = @_;

    # Get a handle on the code...
    open my $in, '<', $filename
        or croak "Can't open C file '$filename': $OS_ERROR";

    # Read it all in...
    my $code = do { local $/; <$in> };

    # Convert any C-style comment to a single space...
    use Regexp::Common;   # See Chapter 12
    $code =~ s{ $RE{comment}{C} }{$SPACE}gxms;

    return $code;
}

Because C comments can span multiple lines, it's necessary to load the entire file into memory at once so the pattern can detect such cases.

Simple Slurping

Slurp a filehandle with a do block for purity.

Whenever you do need to read in an entire file at once, the syntax shown in the final example of the previous guideline is the right way to do it:

my $code = do { local $/; <$in> };

Localizing the global $/ variable (a.k.a. $RS or $INPUT_RECORD_SEPARATOR, under use English) temporarily replaces it with a version whose value is undef. But, if the input record separator is undefined, there is effectively no input record separator, so Perl treats the input as a single, unseparated record, and the single <> (or readline) reads in the entire input stream as a single "line".

Reading in a complete file or stream this way is much more efficient than "concatenative" approaches such as:

my $code;
while (my $line = <$in>) {
    $code .= $line;
}

or:

my $code = join $EMPTY_STR, <$in>;

That second alternative is particularly bad because, like the for (<>) discussed earlier, the join evaluates the read operation in a list context, constructs a list of individual lines, and then joins them back together to create a single string. This process requires about three times as much memory as:

my $code = do { local $/; <$in> };

It's also appreciably slower, and doesn't scale nearly as well as the size of the input text increases [56].

Note that it's important to put that localization-and-read inside a do {...} or in some other small block. A common mistake is to write this instead:

$/ = undef;
my $text = <$in>;

That works perfectly well, in itself, but it also undefines the global input record separator, rather than its temporary localized replacement. But the global input record separator controls the read behaviour of every filehandle—even those that are lexically scoped, or in other packages. So, if you don't localize the change in $/ to some small scope, you're dooming every subsequent read everywhere in your program to vile slurpitude.

Power Slurping

Slurp a stream with Perl6::Slurp for power and simplicity.

Reading in an entire input stream is common enough, and the do {...} idiom is ugly enough, that the next major version of Perl (Perl 6) will provide a built-in function to handle it directly. Appropriately, that builtin will be called slurp.

Perl 5 doesn't have an equivalent builtin, and there are no plans to add one, but the future functionality is available in Perl 5 today, via the Perl6::Slurp CPAN module. Instead of:

my $text = do { local $/; <$file_handle> };

you can just write:

use Perl6::Slurp;

my $text = slurp $file_handle;

which is cleaner, clearer, more concise, and consequently less error-prone.

The slurp( ) subroutine is also much more powerful. For example, if you have only the file's name, you would have to write:

my $text = do {
    open my $fh, '<', $filename or croak "$filename: $OS_ERROR";
    local $/;
    <$fh>;
};

which almost seems more trouble than it's worth. Or you can just give slurp( ) the filename directly:

my $text = slurp $filename;

and it will open the file and then read in its full contents for you.

In a list context, slurp( ) acts like a regular <> or readline, reading in every line separately and returning them all in a list:

my @lines = slurp $filename;

The slurp( ) subroutine also has a few useful features that <> and readline lack. For example, you can ask it to automatically chomp each line before it returns:

my @lines = slurp $filename, {chomp => 1};

or, instead of removing the line-endings, it can convert each one to some other character sequence (say, '[EOL]'):

my @lines = slurp $filename, {chomp => '[EOL]'};

or you can change the input record separator—just for that particular call to slurp( )—without having to monkey with the $/ variable:

# Slurp chunks...
my @paragraphs = slurp $filename, {irs => $EMPTY_STR};

Setting the input record separator to an empty string causes <> or slurp to read "paragraphs" instead of lines, where each "paragraph" is a chunk of text ending in two or more newlines.
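
If Perl6::Slurp isn't available, the same paragraph-mode behaviour is available from the built-in mechanism, by localizing $/ to the empty string within a small block. A minimal sketch (assuming $fh is an already-opened input filehandle):

# Slurp chunks using only the built-in input record separator...
my @paragraphs = do {
    local $/ = $EMPTY_STR;    # empty string => "paragraph" input mode
    <$fh>;
};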

You can even use a regular expression to specify the input record separator, instead of the plain string that Perl's standard $/ variable restricts you to:

# Read "human" paragraphs (separated by two or more whitespace-only lines)...
my @paragraphs = slurp $filename, {irs => qr/\n \s* \n/xms};

Standard Input

Avoid using *STDIN, unless you really mean it.

The *STDIN stream doesn't always mean " …from the tty". And it never means " …from the files specified on the command line", unless you go out of your way to arrange for it to mean that:

close *STDIN or croak "Can't close STDIN: $OS_ERROR";
for my $filename (@ARGV) {
    open *STDIN, '<', $filename or croak "Can't open STDIN: $OS_ERROR";
    while (<STDIN>) {
        print substr($_,2);
    }
}

which is, of course, so complicated and ugly that it constitutes its own punishment.

*STDIN is always attached to the zeroth file descriptor of your process. By default, that's bound to the terminal (if any), but you certainly can't rely on that default. For example, if data is being piped into your process, then *STDIN will be bound to file descriptor number 1 of the previous process in the pipeline. Or, if the input to your process is being redirected from a file, then *STDIN will be connected to that file.

To cope with these diverse possibilities and the possibility that the user just typed the desired input file(s) on the command line without bothering with any redirection arrows, it's much safer to use Perl's vastly cleverer alternative: *ARGV. The *ARGV stream is connected to wherever *STDIN is connected, unless there are filenames on the command line, in which case it's connected to the concatenation of those files.

So you can allow your program to cope with interactive input, shell-level pipes, file redirections, and command-line file lists by writing this instead:

while (my $line = <ARGV>) {
    print substr($line, 2);
}

In fact, you use this magic filehandle all the time, possibly without even realizing it. *ARGV is the filehandle that's used when you don't specify any other:

while (my $line = <>) {
    print substr($line, 2);
}

It's perfectly good practice to use that shorter—and more familiar—form. This guideline is intended mainly to prevent you from unintentionally "fixing it": trying to be explicit, but then using the wrong filehandle:

while (my $line = <STDIN>) {
    print substr($line, 2);
}

Printing to Filehandles

Always put filehandles in braces within any print statement.

It's easy to lose a lexical filehandle that's being used in the argument list of a print:

print $file $name, $rank, $serial_num, "
";

Putting braces around the filehandle helps it stand out clearly:

print {$file} $name, $rank, $serial_num, "
";

The braces also convey your intentions regarding that variable; namely, that you really did mean it to be treated as a filehandle, and didn't just forget a comma.
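
For example, if a comma does sneak in after the filehandle, the two forms fail very differently (a sketch; $file and the data variables are as above):

print $file,   $name, $rank, $serial_num, "\n";    # Compiles and runs, but quietly
                                                    # prints the filehandle's
                                                    # stringification (and the rest)
                                                    # to STDOUT instead

print {$file}, $name, $rank, $serial_num, "\n";    # Refuses to compile: Perl
                                                    # complains that no comma is
                                                    # allowed after a filehandle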

You should also use the braces if you need to print to a package-scoped filehandle:

print {*STDERR} $name, $rank, $serial_num, "
";

Another acceptable alternative is to load the IO::Handle module and then use Perl's object-oriented I/O interface:

use IO::Handle;

$file->print( $name, $rank, $serial_num, "
" );*STDERR->print( $name, $rank, $serial_num, "
" );

Simple Prompting

Always prompt for interactive input.

There are few things more frustrating than firing up a program and then sitting there waiting for it to complete its task, only to realize after a few minutes that it's actually been just sitting there too, silently waiting for you to start interacting with it:

# The quit command is case-insensitive and may be abbreviated...
Readonly my $QUIT => qr/\A q(?:uit)? \z/ixms;

# No command entered yet...
my $cmd = $EMPTY_STR;

# Until the q[uit] command is entered...
CMD:
while ($cmd !~ $QUIT) {
    # Get the next command...
    $cmd = <>;
    last CMD if not defined $cmd;

    # Clean it up and run it...
    chomp $cmd;
    execute($cmd)
        or carp "Unknown command: $cmd";
}

Interactive programs should always prompt for interaction whenever they're being run interactively:

# Until the q[uit] command is entered...
CMD:
while ($cmd !~ $QUIT) {
    # Prompt if we're running interactively...
    if (is_interactive(  )) {
        print get_prompt_str(  );
    }

    # Get the next command...
    $cmd = <>;
    last CMD if not defined $cmd;

    # Clean it up and run it...
    chomp $cmd;
    execute($cmd)
        or carp "Unknown command: $cmd";}

Interactivity

Don't reinvent the standard test for interactivity.

The is_interactive( ) subroutine used in the previous guideline is surprisingly difficult to implement. It sounds simple enough: just confirm that both input and output filehandles are connected to the terminal. If the input isn't, there's no need to prompt, as the user won't be entering the data directly. And if the output isn't, there's no need to prompt, because the user wouldn't see the prompt message anyway.

So most people just write:

sub is_interactive {
    return -t *ARGV && -t *STDOUT;
}

# and later...

if (is_interactive(  )) {
    print $PROMPT;
}

Unfortunately, even with the use of *ARGV instead of *STDIN (in accordance with the earlier "Standard Input" guideline), that implementation of is_interactive( ) doesn't work.

For a start, the *ARGV filehandle has the special property that it only opens the files in @ARGV when the filehandle is actually first read. So you can't just use the -t builtin on *ARGV:

-t *ARGV

That's because *ARGV won't be opened until you read from it; and you can't read from it until you know whether to prompt; and to know whether to prompt, you have to check where *ARGV is opened to; but *ARGV won't be opened until you read from it.

Several other magical properties of *ARGV can also prevent simple -t tests on the filehandle from providing the correct answer, even if the input stream is already open. In order to cope with all the special cases, you have to write:

use Scalar::Util qw( openhandle );

sub is_interactive {
    # Not interactive if output is not to terminal...
    return 0 if not -t *STDOUT;

    # If *ARGV is opened, we're interactive if...
    if (openhandle *ARGV) {
        # ...it's currently opened to the magic '-' file
        return -t *STDIN if $ARGV eq '-';

        # ...it's at end-of-file and the next file is the magic '-' file
        return @ARGV>0 && $ARGV[0] eq '-' && -t *STDIN if eof *ARGV;

        # ...it's directly attached to the terminal
        return -t *ARGV;
    }

    # If *ARGV isn't opened, it will be interactive if *STDIN is attached
    # to a terminal and either there are no files specified on the command line
    # or if there are one or more files and the first is the magic '-' file
    return -t *STDIN && (@ARGV==0 || $ARGV[0] eq '-');
}

That is not something you want to have to (re)write yourself for each interactive program you create. Nor something you're ever going to want to maintain yourself. Fortunately, it's already written for you and available from the CPAN, in the IO::Interactive module. Instead of the horrendous subroutine definition shown earlier, you can just write:

use IO::Interactive qw( is_interactive );

# and later...
if (is_interactive(  )) {
    print $PROMPT;
}

Alternatively, you could use the module's interactive( ) subroutine, which provides a special filehandle that sends output to *STDOUT only if the terminal is interactive (and just discards it otherwise):

use IO::Interactive qw( interactive );

# and later...

print {interactive} $PROMPT;

Power Prompting

Use the IO::Prompt module for prompting.

Because programs so often need to prompt for interactive input and then read that input, it's probably not surprising that there would be a CPAN module to make that process easier. It's called IO::Prompt and it exports only a single subroutine: prompt( ). At its simplest, you can just write:

use IO::Prompt;

my $line = prompt 'Enter a line: ';

The specified string will be printed (but only if the program is interactive), and then a single line will be read in. That line will also be automatically chomped[57], unless you specifically request it not be.

The prompt( ) subroutine can also control the echoing of characters. For example:

my $password = prompt 'Password: ', -echo => '*';

which echoes an asterisk for each character typed in:

> Password: ***********

You can even prevent echoing entirely (by echoing an empty string in place of each character):

my $password = prompt 'Password: ', -echo => $EMPTY_STR;

prompt( ) can return a single key-press (without requiring the Return key to be pressed as well):

my $choice = prompt 'Enter your choice [a-e]: ', -onechar;

It can ignore inputs that are not acceptable:

my $choice = prompt 'Enter your choice [a-e]: ', -onechar,
                    -require=>{ 'Must be a, b, c, d, or e: ' => qr/[a-e]/xms };

It can be restricted to certain kinds of common inputs (e.g., only integers, only valid filenames, only 'y' or 'n'):

CODE:
while (my $ord = prompt -integer, 'Enter a code (zero to quit): ') {
    if ($ord == 0) {
        exit if prompt -yn, 'Really quit? ';
        next CODE;
    }
    print qq{Character $ord is: '}, chr($ord), qq{'\n};
}

It has many more features, but the real power of prompt( ) is that it abstracts the ask-answer-verify sequence of operations into a single higher-level command, which can significantly reduce the amount of code you need to write. For example, the command-processing loop shown earlier in the "Simple Prompting" guideline:

# No command entered yet...
my $cmd = $EMPTY_STR;

# Until the q[uit] command is entered...
CMD:
while ($cmd !~ $QUIT) {
    # Prompt if we're running interactively...
    if (is_interactive(  )) {
        print get_prompt_str(  );
    }

    # Get the next command...
    $cmd = <>;
    last CMD if not defined $cmd;

    # Clean it up and run it...
    chomp $cmd;
    execute($cmd)
        or carp "Unknown command: $cmd";
}

can be reduced to:

# Until the q[uit] command is entered...
while ( my $cmd = prompt(get_prompt_str(  ), -fail_if => $QUIT) ) {
    # Run whatever else was...
    execute($cmd) or carp "Unknown command: $cmd";
}

Note especially that the $cmd variable no longer has to be defined outside the loop and can be more appropriately restricted in scope to the loop block itself.

Progress Indicators

Always convey the progress of long non-interactive operations within interactive applications.

As annoying as it is to sit like a mushroom whilst some mute program waits for your unprompted input, it's even more frustrating to tentatively start typing something into an interactive program, only to discover that the program is still busy initializing, or calculating, or connecting to a remote device:

# Initialize from any config files...
for my $possible_config ( @CONFIG_PATHS ) {
    init_from($possible_config);
}

# Connect to remote server...
my $connection;
TRY:
for my $try (1..$MAX_TRIES) {
    # Retry connection with increasingly tolerant timeout intervals...
    $connection = connect_to($REMOTE_SERVER, { timeout => fibonacci($try) });
    last TRY if $connection;
}
croak "Can't contact server ($REMOTE_SERVER)"
    if !$connection;

# Interactive portion of the program starts here...
while (my $cmd = prompt($prompt_str, -fail_if=>$QUIT)) {
    remote_execute($connection, $cmd)
        or carp "Unknown command: $cmd";
}

It's much better—and not much more onerous—to give an active indication that an interactive program is busy doing something non-interactive:

# Initialize from any config files...
print {*STDERR} 'Initializing...';
for my $possible_config ( @CONFIG_PATHS ) {
    print {*STDERR} '.';
    init_from($possible_config);
}
print {*STDERR} "done
";

# Connect to remote server...
print {*STDERR} 'Connecting to server...';
my $connection;
TRY:
for my $try (1..$MAX_TRIES) {
    print {*STDERR} '.';
    $connection = connect_to($REMOTE_SERVER, { timeout => fibonacci($try) });
    last TRY if $connection;
}
croak "Can't contact server ($REMOTE_SERVER)"
    if not $connection;
print {*STDERR} "done
";

# Interactive portion of the program starts here...

Better still, factor those messages out into a set of utility subroutines:

# Utility subs to provide progress reporting...

sub _begin_phase {
    my ($phase) = @_;
    print {*STDERR} "$phase...";
    return;
}
sub _continue_phase {
    print {*STDERR} '.';
    return;
}
sub _end_phase {
    print {*STDERR} "done
";
    return;
}

_begin_phase('Initializing');
for my $possible_config ( @CONFIG_PATHS ) {
    _continue_phase(  );
    init_from($possible_config);
}
_end_phase(  );

_begin_phase('Connecting to server');
my $connection;
TRY:
for my $try (1..$MAX_TRIES) {
    _continue_phase(  );
    $connection = connect_to($REMOTE_SERVER, { timeout => fibonacci($try) });
    last TRY if $connection;
}
croak "Can't contact server ($REMOTE_SERVER)"
    if not $connection;
_end_phase(  );

# Interactive portion of the program starts here...

Note that some of the comments have been dispensed with, as the _begin_phase( ) calls adequately document each non-interactive code paragraph.

Automatic Progress Indicators

Consider using the Smart::Comments module to automate your progress indicators.

As an alternative to coding the inline progress indicators or writing utility subroutines (as suggested in the previous guideline), you might prefer to use the Smart::Comments CPAN module, which keeps the comments about phases, and dispenses with the indicator code instead:

use Smart::Comments;
for my $possible_config ( @CONFIG_PATHS ) {  ### Initializing...  done
    init_from($possible_config);
}
my $connection;
TRY:
for my $try (1..$MAX_TRIES) {                ### Connecting to server...  done
    $connection = connect_to($REMOTE_SERVER, {timeout=>$TIMEOUT});
    last TRY if $connection;
}
croak "Can't contact server ($REMOTE_SERVER)"
    if not $connection;

# Interactive portion of the program starts here...

Smart::Comments allows you to put a specially marked comment (###) on the same line as any for or while loop. It then uses that comment as a template, from which it builds an automatic progress indicator for the loop. Other useful features of the Smart::Comments module are described under "Semi-Automatic Debugging" in Chapter 18.

Autoflushing

Avoid a raw select when setting autoflushes.

When it comes to maintainable code, it doesn't get much worse than this commonly used Perl idiom:

select((select($fh), $|=1)[0]);

The evil one-argument form of select[58] takes a filehandle and makes it the (global!) default destination for print statements from that point onwards. That is, after a select, instead of writing to *STDOUT, any print statement that isn't given an explicit filehandle will now write to the filehandle that was select'd.

This change of default happens even if the newly selected filehandle was formerly confined to a lexical scope:

for my $filename (@files) {
    # Open a lexical handle (will be automatically closed at end of iteration)
    open my $fh, '>', $filename
        or next;

    # Make it the default print target...
    select $fh;
    # Print to it...
    print "[This file intentionally left blank]
";
}

In actual applications, that last print statement would probably be replaced by a long series of separate print statements, controlled by some complex text-generation algorithm. Hence the desire to make the current $fh the default output filehandle, so as to avoid having to explicitly specify the filehandle in every print statement.

Unfortunately, because select makes its argument the global default for print, when the final iteration of the loop is finished, the last file that was successfully opened will remain the global print default. That filehandle won't be garbage-collected and auto-closed like all the other filehandles were, because the global default still refers to it. And for the remainder of your program, every print that isn't given an explicit filehandle will print to that final iterated filehandle, rather than to *STDOUT.

So don't use one-argument select. Ever.

And that appalling select statement shown at the start of this guideline?

select((select($fh), $|=1)[0]);

Well, that's the "classic" way to make the filehandle in $fh autoflush; that is, to write out its buffer on every print, not just when it sees a newline. First, you select the filehandle you want to autoflush (select($fh)). Then you set the punctuation variable that controls autoflushing of the currently selected filehandle ($|=1). The sneaky bit is that you do those two things in a list ((select($fh), $|=1)), so their return values become the two values of that list. Because select returns the previous default filehandle—the one that you just replaced—that previous filehandle must now be the first element of the list. So if you index back into the list, requesting the first element ((select($fh), $|=1)[0]), you'll get back the previously selected filehandle. Then all you need to do is pass that filehandle to select again (select((select($fh), $|=1)[0])) to restore the original default, and your journey to the Dark Side will be complete[59].

Fortunately, if you're using lexical filehandles, there's no need for this kind of necroselectomancy. Lexical filehandles act like fully-fledged objects of the IO::Handle class so, if you're willing to load the IO::Handle module, there's a much simpler method for setting their autoflush behaviour:

use IO::Handle;

# and later...

$fh->autoflush(  );

You can even use this same approach on the standard package-scoped filehandles:

use IO::Handle;

# and later...

*STDOUT->autoflush(  );


[51] Not that we should have too much sympathy for that code, as it's behaving just as badly by using the FILE bareword itself.

[52] The other Four Horsemen of the I/O-pocalypse being IN, OUT, FH, and HANDLE.

[53] Yes, it's a valid symbol name: the regex capture variable $18 lives there.

[54] Of course, it would be even better if you gave the variable a lucid, lyric, imaginative, informative name (see Chapter 2). But that may not be your forte: "Dammit, Jim, I'm a Perl hacker, not Walt Whitman!"

[55] Under Perl 5.8, for example, to read in 100,000 lines of 30 characters each (i.e., 3 MB of data) in a for loop requires just under 6 MB of allocated memory for the initial list. Reading in a file of one million such lines requires 59 MB of allocated memory, before the loop even starts. In contrast, the equivalent while loop never uses more than 55 bytes for either file.

[56] By the way, for all its virtues, the do {...} approach isn't the fastest way to slurp a file of known (and very large) length. The very quickest way to do that is with a low-level system read:

sysread $fh, $text, -s $fh;

But then, of course, you have to live with the cryptic syntax, and with any idiosyncrasies that low-level I/O might be subject to on your particular platform. If you do need to use this highest-speed approach to slurping files, at least consider using the File::Slurp CPAN module, which encapsulates that messy sysread in a tidy read_file( ) subroutine.

[57] How many times have you read in a line, then immediately had to chomp it? That sequence seems to be the rule, rather than the exception, so prompt makes chomping the default. This particular design is discussed further in Chapter 17.

[58] As opposed to the evil four-argument select (see Chapter 8).

[59] Once you start down that path, forever will it dominate your maintenance…confuse you, it will!
