CHAPTER 18

Text Processing, Documentation, and Reports

We have seen a lot of Perl's native capabilities in terms of interpolation and regular expressions. However, these are just the start of Perl's text processing capabilities.

Perl comes with a standard library of text processing modules that solve many common problems associated with manipulating text. These include such tasks as tab expansion, determining abbreviations, and paragraph formatting. While not necessarily advanced, they provide a simple way of performing useful functions without reinventing the wheel.

Another class of text processing modules is dedicated to understanding Perl's documentation syntax, known as POD, or Plain Old Documentation. The Pod:: family of Perl modules enables us to create, transform, and generally manipulate POD documentation in many different ways. We also look at processing a very special subclass of documents, source files, using Perl source filters.

The final part of this chapter deals with Perl's support for reports. These allow us to format text using special layout definitions. Formats are a built-in feature of the Perl interpreter that enable us to do many handy things with text, and we will explore, among other things, format structure, page control, and the format data type.

Text Processing

Perl's standard library contains several handy text processing modules that solve many common problems and can save a lot of time. These modules are often overlooked when considering Perl's text processing capabilities simply because the core language already provides such a rich set of functionality.

The main text processing modules are all members of the Text:: family of which those listed in Table 18-1 are the most common.

Table 18-1. Standard Text Processing Modules

Module Function
Text::Tabs Convert tabs to and from spaces.
Text::Abbrev Calculate unique abbreviations from a list of words.
Text::Balanced Match nested delimiters.
Text::ParseWords Parse text into words and phrases.
Text::Wrap Convert unformatted text into paragraphs.
Text::Soundex Convert similar sounding text into condensed codes.

Many of Perl's other standard modules have more than a little to do with text processing of one kind or another. We make a brief note of them and where they are covered at the end of this section. In addition, CPAN offers many more modules for handling text. Searching in the Text:: namespace will uncover many modules designed for basic processing requirements, while namespaces like XML:: or Parser:: offer more advanced task-specific modules.

Expanding and Contracting Tabs with Text::Tabs

The Text::Tabs module is the simplest of the text processing modules. It provides two subroutines, unexpand for converting sequences of spaces into tab characters and expand for converting tab characters into spaces. Here is how they work:

# convert spaces into tabs
$tabbed_text = unexpand($spaced_text);

# convert tabs into spaces
$spaced_text = expand($tabbed_text);

Both of these subroutines work on either single strings, as just shown, or lists of strings, as in

@tabbed_lines = unexpand(@spaced_lines);

Any tabs that already exist in the text are not affected by unexpand, and similarly existing spaces are not touched by expand. The gap between stops (the stop gap, so to speak) is determined by the variable $tabstop, which is set to the desired tab width, 8 by default. This is actually imported into our own package by default so we can set it with

$tabstop = 4;   # set a tab width of four characters

That said, it is better from a namespace pollution point of view to import only the subroutines and set $tabstop as a package variable:

use Text::Tabs qw(expand unexpand);
$Text::Tabs::tabstop = 4;
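As a quick illustration of both directions, here is a brief sketch; the example strings are invented, and the column arithmetic assumes the default tab width of 8:

```perl
use strict;
use warnings;
use Text::Tabs qw(expand unexpand);

$Text::Tabs::tabstop = 8;

# 'name' ends at column 4, so its tab expands to four spaces;
# 'address' ends at column 15, so its tab expands to a single space
my $spaced = expand("name\taddress\tphone");
print "[$spaced]\n";    # [name    address phone]

# runs of two or more spaces ending on a tab stop become tabs again;
# the single space before 'phone' is left alone
my $tabbed = unexpand($spaced);
print $tabbed eq "name\taddress phone" ? "round trip ok\n" : "differs\n";
```

Note that unexpand does not restore the original string exactly: a lone space that happens to end at a tab stop stays a space, since converting it would save nothing.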

Calculating Abbreviations with Text::Abbrev

It is occasionally useful to be able to quickly determine the unique abbreviations for a set of words, for instance, when implementing a command-line interface. Assuming we wish to create our own, rather than use an existing solution like Term::Complete or (sometimes) Term::Readline, we can make use of the Text::Abbrev module to precompute a table of abbreviations and their full-name equivalents.

The Text::Abbrev module supplies one function, abbrev, which works by taking a list of words and computing abbreviations for each of them in turn, removing one character at a time from each word and recording the resultant word stem in a hash table. If the abbreviation has already been seen, it must be because two words share that abbreviation, and it is removed from the table. If a supplied word is an abbreviation of another, it is recorded, and the longer abbreviations remain, pointing to the longer word. This short script shows the module in action:

#!/usr/bin/perl
# abbrev.pl
use warnings;
use strict;
use Text::Abbrev;

my $abbreviations = abbrev(@ARGV);
foreach (sort keys %{$abbreviations}) {
    print "$_ => $abbreviations->{$_}\n";
}

When run, this script produces a hash of unique abbreviations. In the output that follows, the abbreviations for gin, gang, and goolie are calculated. The single letter g is not present because it does not uniquely identify a word, but ga, gi, and go are:

> abbrev.pl gin gang goolie

        ga => gang
        gan => gang
        gang => gang
        gi => gin
        gin => gin
        go => goolie
        goo => goolie
        gool => goolie
        gooli => goolie
        goolie => goolie

The abbrev function returns either a list suitable for creating a hash or a hash reference, depending on whether it was called in list or scalar context:

%abbreviations = abbrev('gin', 'gang', 'goolie');
$abbreviations = abbrev('gin', 'gang', 'goolie');

We can also pass in a reference to a hash or a typeglob (deprecated) as the first argument. However, the original contents, if any, are not maintained:

# overwrite previous contents of $abbreviations
abbrev($abbreviations, 'ghost', 'ghast', 'ghoul');

Note that the Term::Complete module combines abbreviations with a command-line entry mechanism (although it does not use Text::Abbrev to determine abbreviations). If we don't need anything more complex, this is a simpler solution than rolling our own with Text::Abbrev. See Chapter 15 for more details.
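As a sketch of putting the abbreviation table to work, here is a hypothetical command resolver; the command set of help, list, and quit is invented for illustration:

```perl
use strict;
use warnings;
use Text::Abbrev;

# a made-up command set for a small interactive tool
my %resolve = abbrev(qw(help list quit));

foreach my $typed ('li', 'q', 'x') {
    # an unambiguous prefix resolves to the full command; anything
    # ambiguous or unrecognized is simply absent from the table
    my $command = $resolve{$typed};
    print "$typed => ", defined $command ? $command : '(unknown)', "\n";
}
# prints:
#   li => list
#   q => quit
#   x => (unknown)
```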

Parsing Words and Phrases with Text::ParseWords

Many applications that accept textual input need to be able to parse the text into distinct words for processing. In most simple cases, we can get away with using split. Since this is such a common requirement, split even uses whitespace as its default separator. For instance, this rather terse program carves up its input into a list of whitespace-separated words by calling split with no arguments:

#!/usr/bin/perl
# splitwords.pl
use warnings;
use strict;

my @words;
push @words, split foreach(<>);
print scalar(@words), " words: @words\n";

This approach falls short if we want to handle more advanced constructs like quotes. If two or more words are surrounded by quotes, we often want to treat them as a single word or phrase, in which case we can't easily use split. Instead we can use the Text::ParseWords module, which understands quotes and produces a list of words and phrases accordingly.

Parsing Space-Separated Text

The Text::ParseWords module supports the parsing of text into words and phrases, based on the presence of quotes in the input text. It provides four subroutines:

shellwords Process strings using whitespace as a delimiter, in the same manner as shells.
quotewords Handle more general cases where the word separator can be any arbitrary text.
nested_quotewords Similar to quotewords, but returns the words of each input line as a separate list.
parse_line A simpler version of quotewords, which handles a single line of text and which is actually the basis of the other three.

The first, shellwords, takes one or more lines of text and returns a list of words and phrases found within them. Since it is set to consider whitespace as the separator between words, it takes no other parameters:

@words = shellwords(@input);

Here is a short program that shows shellwords in action:

#!/usr/bin/perl
# shell.pl
use warnings;
use strict;

use Text::ParseWords qw(shellwords);

my @input = (
    'This is "a phrase"',
    'So is\ this',
    q('and this'),
    "This isn\\'t",
    'Neither \"is this\"',
);

print "Input: ", join(' ', @input), "\n";

my @words = shellwords(@input);
print scalar(@words), " words:\n";
print "\t$_\n" foreach @words;

When run, this program should produce the following output:

> shell.pl

Input: This is "a phrase" So is\ this 'and this' This isn\'t Neither \"is this\"
11 words:
        This
        is
        a phrase
        So
        is this
        and this
        This
        isn't
        Neither
        "is
        this"

This program demonstrates several points. First, we can define phrases with double quotes, or with single quotes if we use the q function. Second, we can also define phrases by escaping the spaces that we want shellwords to overlook. In order to have shellwords process these backslashes, we have to use single quotes (or q) around the string as a whole to prevent interpolation from evaluating them first. Finally, to have shellwords ignore a quote, we can escape it, but to escape a single quote, we have to use double quotes around the string and escape it twice (once for interpolation, once for shellwords). Of course, a lot of this is simpler if the text is coming from a variable rather than a literal string.

Parsing Arbitrarily Delimited Text

The quotewords subroutine is a more flexible version of shellwords that allows the word separator to be defined. It takes two additional parameters, a regular expression pattern describing the word separator itself and a keep flag that determines how quotes are handled. This is how we might use it to emulate and modify the result of shellwords. Note the value of the keep flag in each case:

# emulate 'shellwords' with 'quotewords'
@words = quotewords('\s+', 0, @lines);

# emulate 'shellwords' but keep quotes and backslashes
@words = quotewords('\s+', 1, @lines);

As a more complete example, here is a short program that parses a file of colon-delimited lines (like those found in /etc/passwd) into a long list of fields:

#!/usr/bin/perl
# readpw.pl
use warnings;
use strict;

use Text::ParseWords;

my (@users, @fields);
if (open PASSWD,"/etc/passwd") {
    @users = <PASSWD>;
    chomp @users;   # remove linefeeds
    @fields = quotewords(':', 0, @users);
    close PASSWD;
}
print "@fields\n";

The keep parameter determines whether quotes and backslashes are removed once their work is done, as real shells do, or whether they should be retained in the resulting list of words. If false, quotes are removed as they are parsed. If true, they are retained. The keep flag is almost but not quite Boolean: if set to the special value 'delimiters', both quotes and the characters that matched the word separator are kept:

# emulate 'shellwords' but keep quotes and backslashes and also store the
# matched whitespace as tokens too
@words = quotewords('\s+', 'delimiters', @lines);
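The effect of the keep flag is easiest to see side by side. A minimal sketch with an invented input line:

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

my $line = 'copy "My Documents" /backup';

my @stripped = quotewords('\s+', 0, $line);   # quotes do their work, then vanish
my @kept     = quotewords('\s+', 1, $line);   # quotes are retained in the result

print join('|', @stripped), "\n";   # copy|My Documents|/backup
print join('|', @kept), "\n";       # copy|"My Documents"|/backup
```

In both cases the quoted phrase survives as a single word; only the presence of the quote characters in the result differs.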

Batch-Parsing Multiple Lines

The preceding /etc/passwd example works, but it assembles all the resultant fields of each user into one huge list of words. Far better would be to keep each set of words found on each individual line in separate lists. We can do that with the nested_quotewords subroutine, which returns a list of lists, one list for each line passed in. Here is a short program that uses nested_quotewords to do just that:

#!/usr/bin/perl
# password.pl
use Text::ParseWords;

@ARGV = ('/etc/passwd');
my @users = nested_quotewords(':', 0, <>);

print scalar(@users), " users:\n";
print "\t${$_}[0] => ${$_}[2]\n" foreach @users;

This program prints out a list of all users found in /etc/passwd and their user ID. When run it should produce output that starts something like the following:

> perl password.pl
16 users:
        root => 0
        bin => 1
        daemon => 2
        adm => 3
        ...

In this case, we could equally well have used split with a split pattern of a colon since quotes do not usually appear in a password file. However, the principle still applies.
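To show the returned structure without depending on a real password file, here is a self-contained sketch over two invented records:

```perl
use strict;
use warnings;
use Text::ParseWords qw(nested_quotewords);

my @lines = ('root:x:0:0:superuser', 'daemon:x:2:2:background');
my @users = nested_quotewords(':', 0, @lines);

# one list reference per input line, one element per field
print scalar(@users), " records\n";        # 2 records
print "$users[0][0] => $users[0][2]\n";    # root => 0
print "$users[1][0] => $users[1][2]\n";    # daemon => 2
```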

Parsing a Single Line Only

The fourth function provided by Text::ParseWords is parse_line. It parses a single line only but is otherwise identical in operation to quotewords, and it takes the same parameters with the exception that the last can only be a scalar string value:

@words = parse_line('\s+', 0, $line);

The parse_line subroutine provides no functional benefit over quotewords, but if we only have one line to parse, for example, a command-line input, then we can save a subroutine call by calling it directly rather than via quotewords or shellwords.
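A minimal sketch of parse_line on a single invented command line:

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my @words = parse_line('\s+', 0, 'open "new file.txt" --mode append');
print join('|', @words), "\n";   # open|new file.txt|--mode|append
```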

Parsing Brackets and Delimiters with Text::Balanced

Added to the Perl standard library for Perl 5.6, Text::Balanced is also available for older Perls from CPAN. It provides comprehensive abilities to match delimiters and brackets with arbitrary levels of nesting. Matching nested delimiters is traditionally a hard problem to solve, so having a ready-made solution is very welcome.

Extracting an Initial Match

All of the routines provided by Text::Balanced work in essentially the same way, taking some input text and applying one or more delimiters, brackets, or tags to it in order to extract a match. We can also supply an initial prefix for the routine to skip before commencing. This prefix, by default set to skip whitespace, is a regular expression, so we can create quite powerful match criteria.

Quotes and Single-Character Delimiters

To match delimiters and brackets, we have the extract_delimited and extract_bracketed routines. These operate in substantially similar ways, the only difference being that the latter understands the concept of paired characters, where the opening and closing delimiters are different. Here is a simple example of extracting the first double-quoted expression from some input text:

#!/usr/bin/perl
# quotebalanced1.pl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $input=qq[The "quick" brown fox "jumped over" the lazy "dog"];

my ($extracted,$remainder)=extract_delimited($input,'"','The ');
print qq[Got $extracted, remainder <$remainder>\n];

The first argument to extract_delimited is the text to be matched. The second is the delimiter; only the first character is used if more than one is supplied. The third (optional) parameter is the prefix to skip before starting the extraction. Without it, only whitespace is skipped over. This program will generate the following output:

> quotebalanced1.pl
Got "quick", remainder < brown fox "jumped over" the lazy "dog">

The remainder starts with the space immediately following the second quote, and the extracted text includes the delimiters. If we don't care about the remainder, we do not have to ask for it. All the extract_ functions will notice when they are called in a scalar context and will return just the extracted text, so we can write

my $extracted=extract_delimited($input,'"','The ');

If we want to match on more than one kind of delimiter, for example, single and double quotes, we replace the delimiter with a character class, like this:

my $extracted=extract_delimited($input,q/["']/,'The ');

This reveals that the delimiter is actually a regular expression, and indeed we could also write qr/["']/ here, or use even more advanced patterns. Whichever quote character is found is looked for again to complete the match, so any number of intervening single quotes may occur between an initial double quote and its terminating twin.
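For instance, in this sketch the first quote found is a single quote, so the double quotes inside are treated as ordinary characters until the matching single quote is reached:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $input = q{He said 'the "whole" phrase' twice};

# skip anything that is not a quote, then match whichever quote comes first
my ($extracted) = extract_delimited($input, q/["']/, qr/[^"']+/);
print "$extracted\n";   # 'the "whole" phrase'
```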

We are not just limited to quotes as delimiters—any character can be used. We can also pass undef as a delimiter, in which case the standard Perl quotes are used. The following statements are equivalent:

my $extracted=extract_delimited($input,q/["'`]/,'The ');  # explicit

my $extracted=extract_delimited($input,undef,'The ');     # implicit ', " and `

In order to match the first set of quotes, we supplied the prefix 'The ' to extract_delimited. Given the position of the first quote, the routine then finds the second. This is not very flexible, however. What we would really like to do is specify a prefix that says "skip everything up to the first double quote." Luckily, this turns out to be very easy because the prefix is a regular expression, and this is simply expressed as [^"]+, or "anything but a double quote":

my ($extracted,$remainder)=extract_delimited($input,'"','[^"]+');

Substituting this line for the original will generate exactly the same output, but now it is no longer bound to the specific prefix of the input text. If we are curious to know what the prefix actually matched, we can get it from the third value returned:

my ($extracted,$remainder,$prefix)=extract_delimited($input,'"','[^"]+');

We can supply a precompiled regular expression for the third parameter as well:

my ($extracted,$remainder)=extract_delimited($input,'"',qr/[^"]+/);

This emphasizes the regular expression, which improves legibility. It also allows us to specify trailing pattern match modifiers, letting us write this alternative regular expression, which is closer to the literal meaning of "skip everything up to the first quote":

my ($extracted,$remainder)=extract_delimited($input,'"',qr/.*?(?=")/s);

This pattern starts with a nongreedy match for anything. Since a dot does not ordinarily match a newline, the /s modifier is required to permit the prefix to match an arbitrary number of initial lines without double quotes in them. This pattern also makes use of a positive look-ahead assertion (?=") to spot a quote without absorbing it. Combined with the nongreedy pattern, this will match all text up to the first quote.

By default, the backslash character, \, escapes delimiters so that they will not be considered as delimiters in the input text. A fourth parameter to extract_delimited allows us to change the escape character or individually nominate a different escape character for each delimiter. For example:

extract_delimited($input,q/['"`]/,undef,'');       # no escape character
extract_delimited($input,q/['"`]/,undef,"\033");   # ASCII 27 (ESC)
extract_delimited($input,q/['"`]/,undef,q/'"`/);   # escape is delimiter

The last example defines a list of quote characters that is identical to (that is, in the same order as) the delimiters specified in the character class of the second parameter. If more than one escape character is specified, then each delimiter is escaped with the corresponding escape character. If not enough escape characters are supplied, then the last one is used for all remaining delimiters. In this example, the escape character for each delimiter is the same as the delimiter, so now we double up a character to escape it.
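With the escape character set equal to the delimiter, doubling a quote inside the string escapes it, much as in SQL string literals. A small sketch:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# delimiter and escape are both the single quote, so '' is a literal quote
my $input = q{'it''s balanced' and the rest};
my ($extracted) = extract_delimited($input, q/'/, undef, q/'/);
print "$extracted\n";   # 'it''s balanced'
```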

We can generate a customized regular expression that can take the place of extract_delimited for preset criteria. To do this, we make use of gen_delimited_pat, which takes delimiter and optional escape character arguments:


my $delimiter_re=gen_delimited_pat(q/['"`]/,q/'"`/);

The regular expression generated by this statement will match quote strings using any of the quote characters specified, each one of which is escapable by doubling it up. This is a convenient way to pregenerate regular expressions and does not even require that Text::Balanced is available in the final application:

$input =~ /($delimiter_re)/ and print "Found $1\n";

Or, to emulate extract_delimited more closely:

$input =~ /^($prefix)($delimiter_re)(.*)$/ and
  ($prefix,$extracted,$remainder)=($1,$2,$3);

The regular expression is in an optimized form that is much longer than one we might otherwise write but which is the most efficient at finding a match. It does not extract text though, so we need to add our own parentheses to get the matched text back.

Brackets and Braces

The extract_bracketed function is identical in use and return values, except that the delimiters are one or more of the matched brace characters (), [], {}, or <>. Here is an adapted version of the previous example that extracts bracketed text:

#!/usr/bin/perl
# bracebalanced.pl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $input=qq[The (<quick brown> fox) {jumped} over the (<lazy> dog)];

my ($extracted,$remainder)=extract_bracketed($input,'()<>{}',qr/[^()<>{}]+/);
print qq[Got "$extracted", remainder "$remainder"\n];

When run, this program will produce the following output:

> bracebalanced.pl
Got "(<quick brown> fox)", remainder " {jumped} over the (<lazy> dog)"

As before, the prefix is simply defined as text not containing any of the delimiters. If we removed ( and ) from the delimiter list, the text extracted would instead be <quick brown>. Interestingly, since Perl already knows about matched delimiter characters for quoting operators, we only need to specify the opening characters:

my ($extracted,$remainder)=extract_bracketed($input,'(<{',qr/[^()<>{}]+/);

It would be a mistake to think that extract_bracketed just looks for a closing delimiter character, though. One of its major benefits is that it understands nested braces and will only match on the corresponding closing brace. Take this adjusted example, where all the braces are round:

#!/usr/bin/perl
# nestedbracebalanced.pl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $input=qq[The ((quick brown) fox) (jumped) over the ((lazy) dog)];

my ($extracted,$remainder)=extract_bracketed($input,'(',qr/[^()]+/);
print qq[Got "$extracted", remainder "$remainder"\n];

When run, this matches the correct closing brace, after fox, not the first one after brown:

> nestedbracebalanced.pl
Got "((quick brown) fox)", remainder " (jumped) over the ((lazy) dog)"

In situations where more than one brace type is significant, all brackets must nest correctly for the text to be successfully extracted. For example, if () and <> are both considered delimiters, then this text will not match because there is no matching >:

my $input='(supply < demand)';

Only those characters listed as delimiters (and their corresponding closing characters) are managed this way, so if we do not consider < and > to be delimiters, then they are not considered and this text would match.

We can even handle quoted delimiters and ignore them rather than treating them as delimiters if we include a quote character in the list of delimiters. For example:

my ($extracted,$remainder)=extract_bracketed($input,q[("')],qr/[^()]+/);

This will match on round brackets only but disregard any round brackets found within single- or double-quoted strings. Interestingly, if the letter q is included, any character acceptable as a Perl quoting delimiter for quotelike operators such as // or {} is also recognized. If not specified as an acceptable brace delimiter, characters like { and } will instead operate like quotes, causing their contents to be skipped rather than processed for valid delimiters.
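Here is a sketch in which a closing parenthesis inside a double-quoted string is skipped rather than treated as a delimiter; the input text is invented:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $input = q{(print("a :-) smiley") and exit) tail};

# including " in the delimiter list means quoted text is skipped over,
# so the ) inside the string does not close anything
my ($extracted, $remainder) = extract_bracketed($input, q/("/);
print "$extracted\n";   # (print("a :-) smiley") and exit)
```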

XML and Other Tagged Delimiters

The extract_tagged function does for tags what extract_bracketed does for brace characters. It is broadly similar in use: it returns the same values of ($extracted,$remainder,$prefix), but it has a slightly expanded list of up to five parameters. Here are some examples of it:

extract_tagged($text);                             # match any XML tag
extract_tagged($text,"<FOO>");                     # match <FOO>...</FOO>
extract_tagged($text,"<[a-z]+>");                  # match any lowercase XML tag
extract_tagged($text,"/start","/end");             # match /start.../end
extract_tagged($text,"[FOO]","[/FOO]",$prefix);    # skip prefix then match
                                                   # [FOO]...[/FOO]

As these examples show, while XML tags are the default target, any kind of delimiter tags can be defined, including regular expression matches. We can also pass in undef for any parameters for which we just want to accept the defaults. For example:

extract_tagged($text,"{FOO}",undef,$prefix);         # skip prefix then match
                                                     # {FOO}...{/FOO}
extract_tagged($text,undef,undef,$prefix);           # skip prefix then match
                                                     # any XML tag

Tags must balance, just as braces must balance for extract_bracketed. The composition of the start tag and the end tag is specified and analyzed for punctuation versus alphanumeric characters, and the general rule for tags is inferred from it. The end tag, if not specified, is autogenerated by inserting a / before the alphanumeric part, so (abc) is automatically paired with (/abc) and sets (...) and (/...) as the general rule. Remarkably, this will even work for tags like <[a-z]+> as in the preceding example.

For example, the following program looks for and extracts the first lowercased XML tag:

#!/usr/bin/perl
# extractlctagged.pl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $input=qq[<TEXT>The quick brown <subject>fox</subject> jumped
over the lazy <object>dog<?object></TEXT>];
my ($extracted,$remainder)=
  extract_tagged($input,"<[a-z]+>",undef,".*?(?=<[a-z]+>)");
print qq[Got "$extracted"\nRemainder "$remainder"\n];

Running this program will produce

> extractlctagged.pl
Got "<subject>fox</subject>"
Remainder " jumped over the lazy <object>dog<?object></TEXT>"

To match any XML-compliant tags, we would replace the extract_tagged call in this example with

my ($extracted,$remainder)=extract_tagged($input,undef,undef,".*?<[^<>]+>");

This would try to match the entire input text because of the initial <TEXT>, but the <subject> and <object> tags both match the general rule for tags, and so this match would fail, because the closing tag for <object> was <?object>, and not the expected </object>. (We can see the reason for the failure by printing out $@, as we will see later. $@ is set by Text::Balanced whenever a match cannot be made.)

The fifth parameter to extract_tagged is a hash of configuration options that allow us to alter the match behavior to handle common alternate inputs. For instance, HTML (as opposed to XHTML) allows some tags to be left unclosed. If we want to be able to handle these, we can tell extract_tagged not to require the closing tag. For example, this will not only match paragraph tags <p>...</p>, but also end the paragraph if a second <p> is seen:

my $paragraph=extract_tagged($text,"<p>","</p>",".*?(?=<p>)", {
    reject => ["<p>"],
    fail   => "MAX"
});

Here the reject option instructs the routine to fail if a second <p> is seen. The criteria are specified within a list reference so that we can specify multiple reject criteria at the same time if we wish. The fail option "MAX" instructs it to succeed and return all the characters matched up until that point rather than fail. This caters for both a closing tag and a new opening one. Alternatively, we can say fail => "PARA" to cut off the returned characters at the next blank line.

We can also use the ignore option to have extract_tagged disregard character sequences that would otherwise match the start tag specification. This is generally useful when the start tag is a regular expression or the "match any XML tag" default. For example, to handle the <object> tag by simply ignoring it, we could put

my $paragraph=extract_tagged($text,undef,undef,".*?(?=<[^<>]+>)",{
    ignore => ["<object>"],
});

If we expect to use the same parameters to extract_tagged frequently, we can generate a custom match that is routine-specific to the match we want to make with gen_extract_tagged. This takes the exact same arguments, minus the first input text argument, and allows us to rewrite this:

my ($extracted,$remainder)=
  extract_tagged($text,"<TOP>","<BOTTOM>",qr/.*?(?=<TOP>)/);

as this:

my $topbottom_match=
  gen_extract_tagged("<TOP>","<BOTTOM>",qr/.*?(?=<TOP>)/);
my ($extracted,$remainder)=$topbottom_match->($input);

The returned value is actually a closure (that is, a code reference) blessed into the Text::Balanced::Extractor class, and hence is invoked as a method call and not a subroutine. These routines become very useful in multiple match scenarios using extract_multiple. We can also create a special tag matcher for the <object>...<?object> example and process it along with XML tags instead of ignoring it.

Perl-Style Quoted Text

The extract_quotelike function extracts Perl-style quoted strings from the input text. It is essentially a smarter version of extract_delimited that understands not just the standard quote characters, but also Perl's quotelike operators q, qq, and qr, hence its name. It will also parse any quotelike operator such as m//, s///, tr///, or any of their alternate quote syntaxes like s{}{}.

Since it is a specialized function, we cannot specify delimiters, and in fact we can only specify the text to be matched and the initial prefix. Perhaps to make up for this, this function returns up to ten values in list context, although not all of them will contain meaningful values depending on what kind of quoted text was found.

my ($extracted,$remainder,$matchedprefix,
    $typeofquote,
    $startdelim1,$text1,$enddelim1,
    $startdelim2,$text2,$enddelim2,
    $endflags) = extract_quotelike($input,$prefix);

The first three values here are the familiar extracted text, remainder, and matched prefix that we have already seen. The fourth value holds the operator, if any, that was found. For a regular quoted string it is undefined, but it will hold the appropriate value for a q, qq, qr, s, m, y, or tr operator. The next three values describe the delimiters and the text of the first found string. For a regular quoted string, this is the string and its delimiters. For a match or substitution, it is the delimiters and pattern of the match. The three values following this have the same meaning except for the second found string; this only has meaning for substitutions and transliterations, the only two operators to have the concept of a "second" string. Finally, any pattern match modifiers found are returned. This only has meaning for matches, substitution, or transliteration, of course.

Clever though it is, extract_quotelike is only really useful for parsing text containing Perl code. To handle ordinary quoted text, we actually want extract_delimited instead.
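As an illustration of which return slots are filled, this sketch parses a substitution; the index positions follow the return list just described:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

my @r = extract_quotelike(q{s/cat/dog/gi and more code});

print "operator:    $r[3]\n";    # s
print "pattern:     $r[5]\n";    # cat
print "replacement: $r[8]\n";    # dog
print "modifiers:   $r[10]\n";   # gi
```

For a plain quoted string, only the first block's slots would be filled and the operator and modifier slots would be empty.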

Variables and Code

The final two extraction routines are extract_variable and extract_codeblock. The first of these simply matches any kind of Perl variable, including method calls and subroutine calls, via a code reference. It has the usual semantics:

my ($extracted,$remainder,$matchedprefix)=extract_variable($input,$prefix);

Like extract_quotelike, this is only useful if we are parsing Perl code. Of course, in these cases it is very useful indeed.

The extract_codeblock function is essentially a combination of extract_quotelike and extract_bracketed that will correctly parse strings containing nested braces and quoted strings that may contain braces that should not be considered significant for the purposes of matching. Its usage is identical to extract_bracketed except that a fourth argument may be specified to describe an outer delimiter that is not included in the delimiter set passed as the second argument. This allows code to be "marked up" in enclosing text using a special delimiter that is not considered special within the code block itself. Only after the code block matches its initial delimiter again is the closing outer delimiter looked for once more.

For example, this statement matches on (), {}, or [] braces, but looks for an outer set of <> braces to mark the beginning and end of the code:

my ($extracted,$remainder,$matchedprefix)=
  extract_codeblock($input, q/{}()[]/ , qr/.*?(?=<)/, "<>");

Even though the code block might actually contain < and > characters, they are not recognized as delimiters. Whatever the first brace inside the outer <....> turns out to be, the closing > is only matched when that brace is matched again, taking into account any nesting of delimiters that takes place within the block.

The default bracket delimiter for extract_codeblock is just {}, so with only the input text as an argument, or with a second argument of undef, other kinds of brace are not recognized. This is different from the default {}()[] of extract_bracketed.

Both extract_codeblock and extract_quotelike may be more feature-full than we actually require. If we only need to deal with real quote characters and do not need the outer delimiter feature, we can achieve a similar effect more simply and with greater flexibility using extract_bracketed with a delimiter string containing the quote and brace characters we want to handle.
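To illustrate what extract_codeblock adds over extract_bracketed, here is a short sketch (input text ours) in which a closing brace inside a double-quoted string is correctly skipped, where a naive bracket matcher would stop too early:

```perl
#!/usr/bin/perl
# codeblock.pl - a sketch of extract_codeblock skipping quoted braces
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# the '}' inside the double-quoted string is not a real closing brace
my $input = '{ print "}"; } and trailing text';
my ($extracted, $remainder) = extract_codeblock($input);
print "Got '$extracted'\n";
```

The extracted text is the whole block, '{ print "}"; }', not the truncated '{ print "}'.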

Extracting Multiple Matches

All of the extractor functions we have looked at so far take note of and maintain the match position of the input text, which is stored with the variable. This is the same position that the pos function and the \G regular expression metacharacter reference, and that the /g modifier sets. As a result, all of the extract_ functions can be used in loops to extract multiple values. Here is a modified version of one of our earlier examples, now rewritten into a do...while loop:

#!/usr/bin/perl
# whilequotebalanced.pl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $input=qq[The "quick" brown fox "jumped over" the lazy "dog"];
my ($extracted,$remainder);

do {
    ($extracted,$remainder)=extract_delimited($input,'"',qr/[^"]+/);
    print qq[Got $extracted, remainder <$remainder>\n];
} while ($extracted and $remainder);

When run, this version of the program will print out


Got "quick", remainder < brown fox "jumped over" the lazy "dog">
Got "jumped over", remainder < the lazy "dog">
Got "dog", remainder <>

Note the criteria of the while condition. We terminate the loop if we run out of input text, indicated by an empty remainder, or we get an undef returned for the extracted text, indicating an error. In the latter case, we can look at $@ to see what the problem was.

For more advanced requirements, we can use the extract_multiple function. This wraps one or more extraction functions and applies the input text to each of them in turn. For example, this program applies extract_bracketed repeatedly to the input text we used earlier:

#!/usr/bin/perl
# bracebalanced1.pl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_bracketed);

my $input=qq[The (<quick brown> fox) {jumped} over the (<lazy> dog)];
my @matches=extract_multiple(
    $input,
    [ \&extract_bracketed ],
    undef, 1
);
print "Got ",scalar(@matches)," matches\n";
print $_+1,"='",$matches[$_],"'\n" foreach (0..$#matches);

When run, this program will output the following:


Got 3 matches
1='(<quick brown> fox)'
2='{jumped}'
3='(<lazy> dog)'

Of the four arguments supplied to extract_multiple, the first is the input text as usual. If it is undefined, $_ is used. The second we will pass over for a moment. The third parameter defines how many matches to make; it has the same meaning as the third argument of split. A positive number will cause only that many matches to take place (in scalar context, the number of matches is forced to 1, and a warning is issued if more are requested).

The fourth determines if unmatched text is discarded or returned. If true, it is discarded. If unspecified or false, it is returned. The unmatched text is able to take the place of prefix text in many cases; the unmatched text is assembled character by character each time extract_multiple is unable to advance through the input text using any of the supplied extractors. If we modified the last argument to 0, for example, the first string returned would be "The ".
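The effect of the fourth argument can be seen in this sketch (example text ours), which passes a false value so that the unmatched segments are returned interleaved with the bracketed matches:

```perl
#!/usr/bin/perl
# keepunmatched.pl - sketch of extract_multiple returning unmatched text
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_bracketed);

my $input = 'The (quick) brown (fox)';

# fourth argument 0: unmatched text is kept, not discarded
my @matches = extract_multiple($input, [ \&extract_bracketed ], undef, 0);
print "'$_'\n" foreach @matches;
```

Alongside '(quick)' and '(fox)', the list now also contains the text that no extractor could match.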


Note If we don't need to specify the fourth argument and don't want to limit matches, then we can omit both the third and fourth arguments and just specify the input text.


The second argument is the most complex. It is a reference to an array of extraction functions. Here we have only one. Each extractor is passed only the input text, so if we are happy to use an extraction function with its default settings, we can specify it simply as a code reference. An extraction function can also be a closure generated by gen_extract_tagged, a regular expression, or a literal text string. Each is automatically detected and handled correctly, as demonstrated in this modified call:

my @matches=extract_multiple(
    $input,
    [ \&extract_bracketed,
      qr/THE/i,
      'over'
    ],
    undef, 1
);

This will display the following:


Got 6 matches
1='The'
2='(<quick brown> fox)'
3='{jumped}'
4='over'
5='the'
6='(<lazy> dog)'

Note that there are no spaces before or after the extracted text. That is because the default prefix matched by extract_bracketed is whitespace. It is matched inside the extractor but not returned as matched text.

What if we want to use different parameters to the defaults? This is not a problem, but we need to use an anonymous subroutine to wrap our extractor subroutines with the arguments we want. The first argument should be $_[0], to get the input text passed in by extract_multiple. For example:

my @matches=extract_multiple(
    $input,
    [ sub { extract_bracketed($_[0],'()<>{}',qr/[^()<>{}]+/) } ],
    undef, 1
);

As a final nicety, if an extractor is specified as a hash reference to a hash of one key-value pair, the value is used as the extractor, and on a successful match the extracted text is returned in a reference blessed into the class named by the key:

my @matches=extract_multiple(
    $input, [
        { 'Parsed::String::QUOTED'    => \&extract_delimited },
        { 'Parsed::String::BRACKETED' => \&extract_bracketed },
    ]
);

This allows smart extraction algorithms to tokenize extracted text conveniently and is very useful for implementing parsers.

Handling Failed Matches

If any of Text::Balanced's functions fail, an undef is returned to the caller. In a list context, the undef is followed by the original input text, so the usual semantics of ($extracted,$remainder) remain consistent even in a failed match. To find out the actual reason for the failure, we look at the special variable $@, which is assigned an object of class Text::Balanced::ErrorMsg with two attributes:

$@->{error} A diagnostic error message indicating the reason for failure
$@->{pos} The position in the input text where the error occurred

In a string context, both values are combined into a single diagnostic error message. For example, in our earlier mismatched XML tag example where we wrote <object>dog<?object>, $@ would be set to


Found unbalanced nested tag: <object>, detected at offset 66

There are over 20 possible reasons for Text::Balanced to fail and set $@. On a successful match, $@ will always be undef, so any other value is an indication of failure.
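Putting this together, here is a sketch (input text ours) that deliberately fails a match and then inspects the error and pos attributes of $@:

```perl
#!/usr/bin/perl
# balancederror.pl - sketch of inspecting $@ after a failed extraction
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# the input has no closing bracket, so the match must fail
my ($extracted, $remainder) = extract_bracketed('(no closing bracket', '()');
unless (defined $extracted) {
    print "Failed: $@->{error} (at offset $@->{pos})\n";
}
```

The extracted text comes back undef, the remainder is the untouched input, and $@ carries the diagnostic object.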

Formatting Paragraphs with Text::Wrap

The Text::Wrap module provides text-formatting facilities to automate the task of turning irregular blocks of text into neatly formatted paragraphs, organized so that their lines fit within a specified width. Although not particularly powerful, it provides a simple and quick solution.

It provides two subroutines: wrap handles individual paragraphs and is ideally suited for formatting single lines into a more presentable form, and fill handles multiple paragraphs and will work on entire documents.

Formatting Single Paragraphs

The wrap subroutine formats single paragraphs, transforming one or more lines of text of indeterminate length and converting them into a single paragraph. It takes three parameters: an initial indent string, which is applied to the first line of the resulting paragraph; a following indent string, applied to the second and all subsequent lines; and finally a string or list of strings.

Here is how we could use wrap to generate a paragraph with an indent of five spaces on the first line and an indent of two spaces on all subsequent lines:

$para = wrap('     ', '  ', @lines);

Any indentation is permissible. Here is a paragraph formatted (crudely) with HTML tags to force the lines to conform to a given line length instead of following the browser's screen width:

$html = wrap("<p>&nbsp;&nbsp;", "<br>", $text);

If a list is supplied, wrap concatenates all the strings into one before proceeding—there is no essential difference between supplying a single string over a list. However, existing indentation, if there is any, is not eliminated, so we must take care to deal with this first if we are handling text that has already been formatted to a different set of criteria. For example:

s/^\s+// foreach @lines;   # strip leading whitespace from all lines

The list can be of any origin, not just an array variable. For example, take this one-line reformatting application:

print wrap(" ", "", <>);   # reformat standard input/ARGV

Text::Wrap expands tabs into spaces before formatting, a task it delegates to the Text::Tabs module documented previously. When formatting is complete, spaces are converted back into tabs, if possible and appropriate. See the section "Expanding and Contracting Tabs with Text::Tabs" for more information.
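Since Text::Wrap leans on Text::Tabs for this, a quick sketch of that module's expand and unexpand functions may be useful here; $tabstop is a package variable exported along with them:

```perl
#!/usr/bin/perl
# tabs.pl - a quick sketch of Text::Tabs' expand and unexpand
use strict;
use warnings;
use Text::Tabs;

$tabstop = 4;                          # columns per tab stop
my ($expanded) = expand("a\tb");       # the tab becomes spaces up to the next stop
my ($restored) = unexpand($expanded);  # qualifying runs of spaces collapse back
print "'$expanded'\n";
```

With a tab stop of 4, "a\tb" expands to "a" followed by three spaces and "b".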

Customized Wrapping

The Text::Wrap module defines several package variables to control its behavior, including the formatting width, the handling of long words, and the break text.

The number of columns to format is held in the package variable Text::Wrap::columns and has a default value of 76, which is the polite width for things like e-mail messages (to allow a couple of > quoting prefixes to be added in replies before 80 columns is reached). We can change the column width to 39 with

$Text::Wrap::columns = 39;

Words that are too long to fit the line are broken up (URLs in text documents are a common culprit). This behavior can be altered to a fatal error by setting the variable with the following line:

$Text::Wrap::huge = 'die';

Alternatively, long words can be left as-is, causing them to overflow the width, with

$Text::Wrap::huge = 'overflow';

We can also configure the break text, that is, the character or characters that separate words. The break text is a regular expression, defined in the package variable $Text::Wrap::break, and is by default \s, to match any whitespace character. To allow a comma or a colon, but not a space, to break text, we could redefine this to $Text::Wrap::break = '[:,]';.

A limited debugging mode can also be enabled by setting the variable $Text::Wrap::debug:

$Text::Wrap::debug = 1;

The columns, break, and huge variables can all be exported from the Text::Wrap package, if desired:

use Text::Wrap qw($columns $huge);
$columns = 39;
$huge = 'overflow';

As with any module symbols we import, this is fine for simple scripts but is probably unwarranted for larger applications—use the fully qualified package variables instead.
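Putting these variables together, here is a minimal sketch (example text ours) that wraps to a narrow width and lets any overlong words overflow rather than break:

```perl
#!/usr/bin/perl
# narrowwrap.pl - sketch of customizing Text::Wrap
use strict;
use warnings;
use Text::Wrap qw(wrap);

$Text::Wrap::columns = 20;          # wrap to a 20-column width
$Text::Wrap::huge    = 'overflow';  # leave long words intact

my $text = "the quick brown fox jumped over the lazy dog";
my $para = wrap('', '', $text);
print $para, "\n";
```

The result is the same sentence reflowed onto several short lines, none wider than the configured column count.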

Formatting Whole Documents

Whole documents can be formatted with the fill subroutine. This will chop the supplied text into paragraphs first by looking for lines that are indented, indicating the start of a new paragraph, and blank lines, indicating the end of one paragraph and the start of another. Having determined where each paragraph starts and ends, it then feeds the resulting lines to wrap, before merging the resulting wrapped paragraphs back together.

The arguments passed to fill are the same as those for wrap. Here is how we would use it to reformat paragraphs into unindented and spaced paragraphs:

$formatted_document = fill("\n", "", @lines);

If the two indents are identical, fill automatically adds a blank line to separate each paragraph from the previous one. Therefore, the preceding could also be achieved with

$formatted_document = fill("", "", @lines);

If the indents are not identical, then we need to add the blank line ourselves:

$formatted_document = fill("\t", "", @lines);
# indent each new paragraph with a tab, paragraphs are continuous

$formatted_document = fill("\n\t", "", @lines);
# indent each new paragraph with a tab, paragraphs are separated

All the configurable variables that affect the operation of wrap also apply to fill, of course, since fill uses wrap to do most of the actual work. It is not possible to configure how fill splits text into paragraphs.

Note that if fill is passed lines already indented by a previous wrap operation, then it will incorrectly detect each new line as a new paragraph (because it is indented). Consequently, we must remove misleading indentation from the lines we want to reformat before we pass them to fill.

Formatting on the Command Line

Text::Wrap's usage is simple enough for it to be used on the command line:

> perl -MText::Wrap -e "print fill('','',<>)" -- textfile ...

Here we have used the special argument -- to separate Perl's arguments from the file names to be fed to the formatter. We can supply any number of files at once and redirect the output to a file if we wish. A related module that may be worth investigating is Text::Autoformat, which is specifically tailored for command-line uses like this.

Matching Similar Sounding Words with Text::Soundex

The Text::Soundex module is different in nature from the other modules in the Text:: family. While modules such as Text::Abbrev and Text::ParseWords are simple solutions to common problems, Text::Soundex tackles a different area entirely. It implements a version of the Soundex algorithm developed for the U.S. Census in the latter part of the 19th century as an aid for phonetically indexing surnames and popularized by Donald Knuth of TeX fame.

The Soundex algorithm takes words and converts them into tokens that approximate the sound of the word. Similar-sounding words produce tokens that are either the same or close together. Using this, we can generate Soundex tokens for a predetermined list of words, say a dictionary or a list of surnames, and match queries against it. If the query is close to a word in the list, we can return the match even if the query is not exactly right, misspelled, for example.

Tokenizing Single Words

The Text::Soundex module provides exactly one subroutine, soundex, that transforms one word into its Soundex token. It can also accept a list of words and will return a list of the tokens, but it will not deal with multiple words in one string:

print soundex "hello";               # produces 'H400'
print soundex "goodbye";             # produces 'G310'
print soundex "hilo";                # produces 'H400' - same as 'Hello'
print join ',', soundex qw(Hello World);
                                     # produces 'H400,W643'
print soundex "Hello World";         # produces 'H464'

The following short program shows the Soundex algorithm being used to look up a name from a list given an input from the user. Since we are using Soundex, the input doesn't have to be exact, just similar:

#!/usr/bin/perl
# surname.pl
use warnings;
use strict;

use Text::Soundex;

# define an ABC of names (as a hash for 'exists')
my %abc = (
    "Hammerstein" => 1,
    "Pineapples" => 1,
    "Blackblood" => 1,
    "Deadlock" => 1,
    "Mekquake" => 1,
    "Rojaws" => 1,
);

# create a token-to-name table
my %tokens;
foreach (keys %abc) {
    $tokens{soundex $_} = $_;
}
# test input against known names
print "Name? ";
while (<>) {
    chomp;
    if (exists $abc{$_}) {
        print "Yes, we have a '$_' here. Another? ";
    } else {
        my $token = soundex $_;
        if (exists $tokens{$token}) {
            print "Did you mean $tokens{$token}? ";
        } else {
            print "Sorry, who again? ";
        }
    }
}

We can try out this program with various different names, real and imaginary, and produce different answers. The input can be quite different from the name if it sounds approximately right:

> perl surname.pl


Name? Hammerstone
Did you mean Hammerstein? Hammerstein
Yes, we have a 'Hammerstein' here. Another? Blockbleed
Did you mean Blackblood? Mechwake
Did you mean Mekquake? Nemesis
Sorry, who again?

Tokenizing Lists of Words and E-Mail Addresses

We can produce a string of tokens from a string of words by splitting up the string before feeding it to soundex. Here is a simple query program that takes input from the user and returns a list of tokens:

#!/usr/bin/perl
# soundex.pl
use warnings;
use strict;

use Text::Soundex;

while (<>) {
   chomp;                # remove trailing linefeed
   s/\W/ /g;             # zap punctuation, e.g. '.', '@'
   print "'$_' => '@{[soundex(split)]}'\n";
}

We can try this program out with phrases to illustrate that accuracy does not have to be all that great, as a guide:

> perl soundex.pl


definitively inaccurate
'definitively inaccurate' => 'D153 I526'
devinatovli inekurat
'devinatovli inekurat' => 'D153 I526'

As well as handling spaces, we have also added a substitution that converts punctuation into spaces first. This allows us to generate a list of tokens for an e-mail address, for example.

The Soundex Algorithm

As the previous examples illustrate, Soundex tokens consist of an initial letter, which is the same as that of the original word, followed by three digits that represent the sound of the first, second, and third syllables, respectively. The word 'one' produces 'O500': since it has only one syllable, only the first digit is non-zero. On the other hand, 'seven' has two syllables, so its token, 'S150', gets two non-zero digits. Comparing the two results, we notice that both tokens contain a 5, which corresponds to the syllable containing "n" in each word.

The Soundex algorithm has some obvious limitations though. In particular, it only resolves words up to the first three syllables. However, this is generally more than enough for simple "similar sounding" type matches, such as surname matching, for which it was designed.

Handling Untokenizable Words

In some rare cases, the Soundex algorithm cannot find any suitable token for the supplied word. In these cases, it usually returns nothing (or, to be more accurate, undef). We can change this behavior by setting the variable $Text::Soundex::soundex_nocode:

$Text::Soundex::soundex_nocode = 'Z000';   # a common 'failed' token
print soundex "=>";   # produces 'Z000'

If we change the value of this variable, we must be sure to set it to something that is not likely to genuinely occur. The value of Z000 is a common choice, but it matches many words, including Zoo. A better choice might be Q999, but no code is absolutely guaranteed not to occur. If we do not need to conform to the Soundex code system (that is, if we are not passing the results to something else that expects valid Soundex tokens as input), then we can simply define an impossible value like _NOCODE_ or ?000, which soundex cannot generate.

Other Text Processing Modules

As well as the modules in the Text:: family, several other Perl modules outside the Text:: hierarchy involve text processing or combine text processing with other functions.

Several of the Term:: modules involve text processing in relation to terminals. For instance, Term::Cap involves generating ANSI escape sequences from capability codes, while Term::ReadLine provides input line text processing support. These modules are all covered in Chapter 15.

Writing sed Scripts with Perl

Many Unix shell scripts make use of the sed command to carry out text processing. The name is short for stream editor, and a typical sed command might look like this:

sed 5q file.txt

This prints out the first five lines of a file, rather like the head command.

Perl comes with a script called psed that provides a complete implementation of sed written in Perl. As it has no dependency on the real sed, it will work on platforms like Windows for which sed is not available (short of installing a Unix shell environment like Cygwin):

psed 5q file.txt

When invoked under the alternate name s2p, this script instead takes the supplied sed arguments and generates a stand-alone Perl script that performs the same operation:

s2p 5q file.txt > printtop5.pl

Of course, the script generated is not terribly efficient compared to simply reimplementing the script in Perl to start with. Under either name, the script can also be given the option -f to parse and process a sed script in a file rather than a directly typed command and -e to specify additional commands (with or without -f).

Documenting Perl

Documentation is a good idea in any programming language. Like most programming languages, Perl supports simple comments. However, it also attempts to combine the onerous duties of commenting code and documenting software into one slightly less arduous task through the prosaically named POD or Plain Old Documentation syntax.

Comments

Anything after a # is a comment and ignored by the Perl interpreter. Comments may be placed on a line of their own or after existing Perl code. They can even be placed in the middle of multiline statements:

print 1 *   # depth
      4 *   # width
      9;    # height

Perl will not interpret a # inside a string as the start of a comment, but because of this we cannot place comments inside HERE documents. While we cannot comment multiple lines at once, like C-style /*...*/ comments, POD offers this ability indirectly.
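As a quick sketch of this distinction, the # characters inside the quoted string and the here-document body below are literal text; only the trailing # on the assignment starts a comment:

```perl
#!/usr/bin/perl
# hashcomment.pl - '#' inside strings is literal, not a comment
use strict;
use warnings;

my $color = "#ff0000";   # only this '#' begins a comment
print <<"END";
Color: $color # this is part of the here-document, not a comment
END
```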

POD: Plain Old Documentation

POD is a very simple markup syntax for integrating documentation with source code. It consists of a series of special one-line tokens that distinguish POD from source code and also allows us to define simple structures like headings and lists.

In and of itself, POD does nothing more than give us the ability to write multiline comments. However, its simple but flexible syntax also makes it very simple to convert into user-friendly document formats. Perl comes with the following POD translator tools as standard:

pod2text Render POD in plain text format.
pod2html Render POD into HTML.
pod2man Render POD into Unix manual page (nroff) format.
pod2latex Render POD into Latex format.

Many more translators are available from CPAN, of course, including translators for RTF, XML, PostScript, PDF, DocBook, OpenOffice, as well as alternate translators for HTML, text, and so on. The perldoc utility is just a friendlier and more specialized interface to the same translation process, as are pod2usage and podselect, all of which are covered in this section.

POD Paragraphs

POD allows us to define documentation paragraphs, which we can insert into other documents—most usually, but by no means exclusively, Perl code. The simplest sequence is =pod ... =cut. The =pod token states that all of the following text is to be taken as POD paragraphs, until the next =cut or the end of the file:

...
$scalar = "value";
=pod

This is a paragraph of POD text embedded into some Perl code. It is not indented,
so it will be treated as normal text and word wrapped by POD translators.
=cut

print do_something($scalar);
...

Within the delimited section, text is divided into paragraphs, which are simply blocks of continuous text (potentially including linefeeds). A paragraph ends and a new one begins only when a completely empty line is encountered. All POD tokens will absorb a paragraph that immediately follows them, which is why there is a blank line after the =cut in the preceding example. While the blank line preceding the =cut is not necessary, maintaining blank lines on both sides helps to visually discriminate the POD directive.

Some tokens such as =item and =head1, covered shortly, use the attached paragraph for display purposes. Others, like =pod and =cut, ignore it. This lets us document the POD itself, as the text immediately following =pod or =cut is not rendered by POD processors.

=pod this is just a draft document

mysubname - this subroutine doesn't do much of anything
at all and is just serving as an example of how to document it with
a POD paragraph or two

=cut end of draft bit

Since nonblank following lines are included in the text attached to the token, the preceding =pod token could also be written, with identically equivalent meaning, as follows:

=pod this is
just a draft
document

The same rule applies to all POD tokens except =cut. While POD translators will absorb text on the line or lines immediately following a =cut, the Perl interpreter itself will only ignore text following a =cut on the same line. As a consequence, we cannot spread a "cut comment" across more than one line and expect code to compile.

Paragraphs and Paragraph Types

If a paragraph is indented, then we consider it to be preformatted, much in the same way that the HTML <pre> tag works. The following example shows three paragraphs, two of which are indented:

=pod

         This paragraph is indented, so it is taken as
         is and not reformatted by translators like:
         pod2text - the text translator
         pod2html - the HTML translator
         pod2man - the Unix manual page translator

          Note that 'as is' also means that escaping does not work, and that
          interpolation doesn't happen. What we see is what we get.

This is a second paragraph in the same =pod...=cut section. Since it is
not indented it will be reformatted by translators.

=cut

Headings

Section headings can be added with the =head1 and =head2 tokens, plus =head3 and =head4 in more recent releases of the POD translators. The heading text is the paragraph immediately following the token and may start (and end) on the same line:

=head1 This is a level one heading

Or:

=head2 This is
a level
two heading

Or:

=head2
As is this

Heading tokens start a POD section in the same way that =pod does if one is not already active. POD sections do not nest, so we only need one =cut to get back to Perl code. In general the first form is used, but it's important to leave an empty line if we do not want the heading to absorb the following paragraph:

=head1 ERROR: This heading has accidentally swallowed up
       this paragraph, because there is no separating line.
=head2 ERROR: Worse than that, it will absorb this second level heading too
       so this becomes one long level one heading.
=cut

This is how we should really do it:

=head1 This is a level one heading

This is a paragraph following the level one heading.

=head2 This is a level two heading

    This is a preformatted paragraph following the level two heading.

=cut

How the headings are actually rendered is entirely up to the translator. By default the text translator pod2text indents paragraphs by four spaces, level two headings by two, and level one headings by none—crude, but effective. The HTML translator pod2html uses tags like <h1> and <h2> as we might expect.

Lists

Lists can be defined with the =over ... =back and =item tokens. The =over token starts a list and can be given a value such as 4, which many formatters use to determine how much indentation to apply. The list is ended by =back, which is optional if the list falls at the end of the document. =item defines the actual list items, of which there should be at least one, and we should not use this token outside of an =over ... =back section. Here is an example three-item list:

=over 4

=item 1

This is item number one on the list

=item 2

This is item number two on the list

=item 3

This is the third item

=back

Like =pod and =headn, =over will start a POD section if one is not active.

The numbers after the =item tokens are purely arbitrary; we can use anything we like for them, including meaningful text. However, to make the job of POD translators easier, we should stick to a consistent scheme. For example, if we number them, we should do it consistently, and if we want to use bullet points, then we should use something like an asterisk. If we want named items, we can do that too. For example, a bullet-pointed list with paragraphs:

=over 4

=item *
This is a bullet pointed list

=item *
With two items

=back

A named items list:

=over 4

=item The First Item

This is the description of the first item

=item The Second Item

This is the description of the second item

=back

A named items list without paragraphs:

=over 4

=item Stay Alert

=item Trust No one

=item Keep Your Laser Handy

=back

POD translators will attempt to do the best they can with lists, depending on what they think we are trying to do and the constraints of the document format into which they are converting. The pod2text tool will just use the text after the item name. The pod2html tool is subject to the rules of HTML, which has different tags for ordered, unordered, and descriptive lists (<ol>, <ul>, and <dl>) so it makes a guess based on what the items look like. A consistent item naming style will help it make a correct guess.

Although =over will start a new POD section, =back will end the list but not the POD section. We therefore also need a =cut to return to Perl code:

=over 4

=item * Back to the source

=back

=cut

Character Encodings

If documentation is written in a different character set than the default Latin-1, POD translators can be told to render it using an alternate character encoding with the =encoding token. For example:

=encoding utf8

Typically, this is used to state the encoding of a whole document and should be placed near the top, before any renderable text.

Translator-Specific Paragraphs

The final kinds of POD token are the =for and =begin ... =end tokens. The =for token takes the name of a specific translator and an immediately following paragraph (that is, not with an intervening blank line), which is rendered only if that translator is being used. The paragraph should be in the output format of the translator, that is, already formatted for output. Other translators will entirely ignore the paragraph:

=for text
This is a paragraph that will appear in documents produced by the pod2text translator.

=for html <font color=red>
<p>But this paragraph will appear in <b>HTML</b> documents
</font>

Again, like the headings and item tokens, the paragraph can start on the next line, as in the first example, or immediately following the format name, as in the second.

Since it is annoying to have to type =for format for every paragraph in a collection of paragraphs, we can also use the pair of =begin..=end markers. These operate much like =pod...=cut but mark the enclosed paragraphs as being specific to a particular format:

=begin html

<p>Paragraph1

<p><table>......
......</table>

<p>Paragraph2

=end html

If =begin is used outside an existing POD section, then it starts one. The =end ends the format-specific section but not the POD, so we also need to add a =cut to return to Perl code, just as for lists.

=begin html

<p>A bit of <b>HTML</b> document
=end html

=cut

The =begin and =end tokens can also be used to create multiline comments, simply by providing =begin with a name that does not correspond to any translator. We can even comment out blocks of code this way:

=begin disabled_dump_env

foreach (sort keys %ENV) {
    print STDERR "$_ => $ENV{$_}\n";
}

=end disabled_dump_env

=begin comment

This is an example of how we can use POD tokens to create comments.
Since 'comment' is not a POD translator type, this section is never
used in documents created by 'pod2text', 'pod2html', etc.

=end comment

Some extension modules also understand specific format names. For example, the Test::Inline module looks for the special name test to mark the location of in-line tests.

Using POD with __DATA__ and __END__

If we are using either a __DATA__ or a __END__ token in a Perl script, then we need to take special care with POD paragraphs that lie adjacent to them. POD translators require that there must be at least one empty line between the end of the data and a POD directive for the directive to be seen (rather like POD directives themselves, in fact); otherwise, it is missed by the translation tools. In other words, write this:

...
__END__

=head1

...

and not this:

...
__END__
=head1

...

Interior Sequences

We mentioned earlier that POD paragraphs could either be preformatted (indicated by indenting) or normal. Normal paragraphs are reformatted by translators to remove extraneous spaces, newlines, and tabs. Then the resulting paragraph is rendered to the desired width if necessary.

In addition to basic reformatting, normal paragraphs may also contain interior sequences. Each sequence consists of a single capital letter, followed by the text to treat specially within angle brackets. For example:

=pod

This is a B<paragraph> that uses I<italic> and B<bold> markup using the
BE<lt>textE<gt> and IE<lt>textE<gt> interior sequences. Here is an example
code fragment: C<substr $text,0,1> and here is a filename: F</usr/bin/perl>.
All these things are of course represented in a style entirely up to the
translator. See L<perlpod> for more information.

=cut

To specify a real < and >, we have to use the E<lt> and E<gt> sequences, reminiscent of the &lt; and &gt; HTML entities. The full, loosely categorized list of interior sequences supported by POD follows:

Style:

Sequence Formatting
B<text> Bold/Strong text (options, switches, program names).
I<text> Italic/Emphasized text (variables, emphasis).
S<text> Text contains nonbreaking spaces and cannot be word-wrapped.
C<code> Code/Example text (listings, command examples).
F<file> File names.

Cross-references and hyperlinks:

Sequence Formatting
L<name> A cross-reference link to a named manual page and/or section
L<page> Other manual page
L<page/name> Section or list item in other manual page
L<page/"name"> The same as preceding entry
L</name> Section or list item in current manual page
L<"name"> The same as preceding entry

A section title is the text after a =head POD directive, with any spaces replaced with underscores. A list item title is the text after an =item POD directive. In case of conflicts, the first match will usually be linked to. Markup (if the title contains any) may be omitted in the link name.

Either a leading / or quotes are necessary to distinguish a section or list item name from manual page names.

L<text|name> Equivalent to the L<name> sequence, but with an alternative text description for the link

The descriptive text is given first. The original link name is given second, after a pipe symbol, and describes the nature of the link. For example:

L<text|name>
L<text|name/item>
L<text|name/"section">
L<text|"section">
L<text|/"section">

Under this syntax we cannot use explicit / or | characters, but see E<escape> later.

Miscellaneous:

X<index> An index entry. Ignored by most formatters, it may be used by indexing programs.
Z<> A zero-width character. Useful for breaking up sequences that would otherwise be recognized as POD directives.

Special characters and escape sequences:

E<escape> A named or numbered entity, styled on the &entity; syntax of HTML.

These escapes are usually only necessary inside another sequence or immediately after a capital letter representing an escape sequence; for instance, B<text> is written literally as BE<lt>textE<gt>. In particular, the following special names are supported:

E<lt> <
E<gt> >
E<sol> /
E<verbar> |

Otherwise, a generic number or name can be specified:

E<number> ASCII character code
E<html> HTML entity (for example, "copy")

Most translators will handle the preceding four named entities but are not necessarily going to support generic entities. The obvious exception to this is, of course, the HTML translator, which doesn't have to do any work other than add a & and ; to the name.

POD Tools and Utilities

Perl provides a collection of modules in the Pod:: family that perform translations from POD into other formats and also provides utility modules for checking syntax. Most of these modules are wrapped by utility scripts that Perl provides as standard. The pod2html tool, for example, is merely a wrapper for the Pod::Html module.

Translator Tools

We have already mentioned pod2text and pod2html. Perl also comes with other translators and some POD utilities too. All the translators take an input and optional output file as arguments, plus additional options to control their output format. Without either, they take input from standard input and write it to standard output.

The list of POD translators supplied with Perl is as follows:

pod2text Translates POD into plain text. If the -c option is used and Term::ANSIColor is installed (see Chapter 15), colors will also be used.
pod2html Translates POD into HTML, optionally recursing and processing directories and integrating cross-links between pages.
pod2latex Translates POD into LaTeX, either a single document or a collection of related documents.
pod2man Translates POD into Unix manual pages (compatible with nroff/troff).

For more details on these translators, we can consult the relevant perldoc page using the now familiar command line:

> perldoc <translatorname>

In addition to the standard translators, there are many POD translation tools available from CPAN, including translators for RTF/Word, LaTeX, PostScript, and plenty of other formats. Even a mildly popular format will likely have a POD translator.

Retrieval Tools

In addition to the translators, Perl provides three tools for extracting information from PODs selectively. Of these, perldoc is by far the most accomplished. Although not strictly a translator, perldoc is a utility that makes use of translators to provide a convenient Perl documentation lookup tool.

To attempt to retrieve usage information about the given Perl script, we can use the pod2usage tool. For example:

> pod2usage myscript.pl

The tool searches for a SYNOPSIS heading within the file and prints it out using pod2text. A verbosity flag may be specified to increase the returned information:

-v 1 (default) SYNOPSIS only
-v 2 SYNOPSIS plus OPTIONS and ARGUMENTS (if present)
-v 3 All POD documentation

A verbosity of 3 is equivalent to using pod2text directly. If the file is not given with an absolute pathname, then -pathlist can be used to provide a list of directory paths to search for the file in.

A simpler and more generic version of pod2usage is podselect. This tool attempts to locate a level 1 heading with the specified section title and extracts the subdocument from under that title in each file it is passed:

> podselect -s='How to boil an egg' *.pod

Note that podselect does not do any translation, so it needs to be directed to a translator for rendering into reasonable documentation.

POD Verification

It is easy to make simple mistakes with POD, omitting empty lines or forgetting =cut, for example. Fortunately, POD is simple enough to be easy to verify as well. The podchecker utility scans a file looking for problems:

> podchecker poddyscript.pl

If all is well, then it will return the following:


poddyscript.pl pod syntax OK.

Otherwise, it will produce a list of problems, which we can then go and fix, for example:


*** WARNING: file does not start with =head at line N in file poddyscript.pl

This warning indicates that we have started POD documentation with something other than a =head1 or =head2, which the checker considers to be suspect. Likewise:


*** WARNING: No numeric argument for =over at line N in file poddyscript.pl
*** WARNING: No items in =over (at line 17) / =back list at line N in file
poddyscript.pl

This indicates that we have an =over ... =back pair, which not only does not have a number after the over, but also does not even contain any items. The first is probably an omission. The second indicates that we might have bunched up our items so they all run into the =over token. If we had left out the space before =back, we would instead have got this error:


*** ERROR: =over on line N without closing =back at line EOF in file poddyscript.pl

The module underlying podchecker is Pod::Checker, and we can also use it in code:

# function syntax
$ok = podchecker($podfile, $checklog, %options);

# object syntax
$checker = new Pod::Checker %options;
$checker->parse_from_file($podpath, $checklog);

Both file arguments can be either file names or filehandles. By default, the POD file defaults to STDIN and the check log to STDERR, so a very simple checker script could be

use Pod::Checker;
print podchecker() ? "OK" : "Fail";

The options hash, if supplied, allows one option to be defined: enable or disable the printing of warnings. The default is on, so we can get a verification check without a report using STDIN and STDERR:

$ok = podchecker(*STDIN, *STDERR, -warnings => 0);

The actual podchecker script is more advanced than this, but not by all that much.
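
For instance, here is a small sketch of our own (the script name and report wording are invented, not part of the standard toolset) that reports the POD status of each file named on the command line using the functional interface:

```perl
#!/usr/bin/perl
# checkpods.pl - report POD status for each file given as an argument
use strict;
use warnings;
use Pod::Checker;

foreach my $file (@ARGV) {
    # podchecker returns the number of errors found, or -1 if the
    # file contains no POD at all; diagnostics go to the log filehandle
    my $errors = podchecker($file, \*STDERR, -warnings => 1);

    if ($errors < 0) {
        print "$file: no POD found\n";
    } elsif ($errors > 0) {
        print "$file: $errors error(s)\n";
    } else {
        print "$file: POD ok\n";
    }
}
```

Run over a directory of modules, this gives a quick one-line verdict per file while the details accumulate on standard error.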

Creating Usage Info and Manual Pages from POD

The pod2usage tool allows us to dump out just the SYNOPSIS section from a POD document, the SYNOPSIS plus OPTIONS and ARGUMENTS (if either are present), or the whole manual page. We can make use of the Pod::Usage module to provide the same capabilities within our own scripts.

The pod2usage subroutine is automatically exported when using Pod::Usage and is the single interface to its features. While it has a number of different calling conventions, it is typically used with Getopt::Std or Getopt::Long, as in this example:

#!/usr/bin/perl
# podusagedemo.pl
use strict;
use warnings;
use Pod::Usage;
use Getopt::Long qw(:config bundling no_ignore_case);
=head1 NAME

A demonstration of Pod::Usage

=head1 SYNOPSIS

  podusagedemo.pl -h | -H | -l | -r [<files>]

=head1 OPTIONS

  -h|--help      this help
  -H|--morehelp  extended help
  -l|--left      go left
  -r|--right     go right

=head1 ARGUMENTS

One or more files may be specified as arguments, otherwise
standard input is used. (Both this section and OPTIONS are
displayed by the -h option)

=head1 DESCRIPTION

This is the extended help displayed by the -H option

=cut

my %opts;

pod2usage(-verbose=>0) unless GetOptions(\%opts,qw[
    h|help H|m|morehelp l|left r|right
]);
pod2usage(-verbose=>1)  if  $opts{h};
pod2usage(-verbose=>2)  if  $opts{H} or $opts{m};
pod2usage(-verbose=>0, -message=>"Cannot go both left and right")
  if $opts{l} and $opts{r};

# ...

A verbose level of 0 corresponds to the SYNOPSIS only, which just displays the command line preceded by Usage:. A verbose level of 1 prints out the OPTIONS and ARGUMENTS sections as well, similarly preceded by Options: and Arguments: respectively. This happens when the -h option is used. An -H or a -m will generate the whole documentation, using highlighting and a pager in the manner of perldoc or the pod2usage tool. The help output, verbose level 1, looks like this:

> ./podusagedemo.pl -h


Usage:
      podusagedemo.pl -h | -H | -l | -r [<files>]

Options:
      -h|--help      this help
      -H|--morehelp  extended help
      -l|--left      go left
      -r|--right     go right
Arguments:
    One or more files may be specified as arguments, otherwise standard
    input is used. (Both this section and OPTIONS are displayed by -h)

We can also build the calls to pod2usage directly into the call to GetOptions:

my %opts;
GetOptions(
    'h|help'       => sub { pod2usage(-verbose=>1) },
    'H|m|morehelp' => sub { pod2usage(-verbose=>2) },
    'l|left'       => \$opts{l},
    'r|right'      => \$opts{r},
);

Although it is usually clearer to call pod2usage with named arguments like -verbose and -message, we can also call it with a single numeric or string argument. A numeric argument will be treated as an exit status and will cause the program to exit displaying the synopsis (that is, verbose level 0). A string argument will be used as the message, as if -message had been specified, again with a verbose level of 0.
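
As an illustrative sketch (the guard variable here is imaginary, standing in for real option parsing, since pod2usage exits the program when called):

```perl
use strict;
use warnings;
use Pod::Usage;

my $show_help = 0;    # imagine this set by option parsing

# a numeric argument is treated as an exit status: print the
# synopsis (verbose level 0) and exit with that status
pod2usage(1) if $show_help;

# a string argument is treated as a message, as if -message had
# been given: print it, then the synopsis, and exit
pod2usage("No input file specified") if $show_help;
```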

The pod2usage subroutine also understands the options and defaults shown in Table 18-2.

Table 18-2. pod2usage Options

Option Purpose
-msg Alias for -message.
-exitval Set an explicit exit status for pod2usage. Otherwise, the default is 2 if the verbose level is 0 or 1, and 1 if the verbose level is 2 or more. The special value NOEXIT causes pod2usage to return control to the program rather than exiting.
-input File name or filehandle to get POD documentation from. For example, *DATA. Otherwise, the source file is used.
-output File name or filehandle to write generated documentation to. Otherwise, standard output is used if the exit status is 0 or 1, and standard error if the exit status is 2 or higher. Note the default exit status is 2.
-pathlist Search path to locate the file name specified to -input if it is not locally present. May be specified as a reference to an array or a colon-delimited path. Defaults to $ENV{PATH}. This option allows programs to self-document themselves even when the documentation is located in an external POD file.

There is no requirement to specify separate OPTIONS or ARGUMENTS sections. If desired, either section can be bundled into the SYNOPSIS to have them appear even for verbose level 0. In this case, level 1 simply becomes identical to level 0.

Recent versions of Getopt::Long provide the HelpMessage and VersionMessage subroutines. HelpMessage is essentially a wrapper around pod2usage, while VersionMessage emulates pod2usage syntax and options, but it is fully contained within Getopt::Long. See Chapter 14 for more information.
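
A minimal sketch of their use, assuming a version of Getopt::Long recent enough to export them (the option names here are our own choices):

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptions HelpMessage VersionMessage);

our $VERSION = '1.0';    # reported by VersionMessage

GetOptions(
    'help'    => sub { HelpMessage(-exitval => 0) },
    'version' => sub { VersionMessage(-exitval => 0) },
) or HelpMessage(-exitval => 2);
```

HelpMessage accepts the same named arguments as pod2usage, so -verbose and -message work here too.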

Programming POD

Perl provides a number of modules for processing POD documentation. These modules form the basis for all the POD utilities, and they are described briefly in Table 18-3.

Table 18-3. POD Modules

Module Action
Pod::Checker The basis of the podchecker utility. See earlier.
Pod::Find Search for and return a hash of POD documents. See the section "Locating Pods."
Pod::Functions A categorized summary of Perl's functions, exported as a hash.
Pod::Html The basis for the pod2html utility.
Pod::LaTeX The basis for the pod2latex utility.
Pod::Man The basis for both the pod2man and the functionally identical pod2roff utilities.
Pod::Parser The POD parser. This is the basis for all the translation modules and most of the others too. New parsers can be implemented by inheriting from this module.
Pod::ParseLink A module containing the logic for converting L<...> POD links into URLs.
Pod::ParseUtils A module containing utility subroutines for retrieving information about and organizing the structure of a parsed POD document, as created by Pod::InputObjects.
Pod::InputObjects The implementation of the POD syntax, describing the nature of paragraphs and so on. In-memory POD documents can be created on the fly using the methods in this module.
Pod::Perldoc The basis for the perldoc utility. Also incorporates a family of plug-in submodules handling format conversions, some of which require Pod::Simple (available from CPAN).
Pod::PlainText An older Pod::Parser-based implementation of POD-to-text conversion, superseded by Pod::Text.
Pod::Plainer A compatibility module for converting new-style POD into old-style POD.
Pod::Select A subclass of Pod::Parser and the basis of the podselect utility, Pod::Select extracts selected parts of POD documents by searching for their heading titles. Any translator that inherits from Pod::Select rather than Pod::Parser will be able to support the Pod::Usage module automatically.
Pod::Text The basis of the pod2text utility.
Pod::Text::Color Convert POD to text using ANSI color sequences. The basis of the -color option to pod2text. Subclassed from Pod::Text. This uses Term::ANSIColor, which must be installed (see Chapter 15).
Pod::Text::Overstrike Convert POD to text using overstrike escape sequences, where different effects are created by printing a character, issuing a backspace, and then printing another.
Pod::Text::Termcap Convert POD to text using escape sequences suitable for the current terminal. Subclassed from Pod::Text. Requires termcap support (see Chapter 15).
Pod::Usage The basis of the pod2usage utility; this uses Pod::Select to extract usage-specific information from POD documentation by searching for specific sections, for example, NAME, SYNOPSIS.

In addition to the modules listed here, the Pod::Simple family of modules on CPAN is also worthy of attention. Pod::Simple provides a revised and refactored toolkit for writing and using POD translators with a flexible and extensible interface.

Another module of interest to developers working on ensuring that documentation for a module is complete is Pod::Coverage. This module can be used to test whether or not POD documentation fully covers all the subroutines defined within it. The Devel::Cover module, covered in Chapter 17, will automatically invoke Pod::Coverage if available. This is generally a more convenient interface, and it analyzes the coverage of our tests at the same time.

Using POD Parsers

Translator modules, which is to say any module based directly or indirectly on Pod::Parser, may be used programmatically by creating a parser object and then calling one of the parsing methods:

parse_from_filehandle($fh, %options);

Or:

parse_from_file($infile, $outfile, %options);

For example, assuming we have Term::ANSIColor installed, we can create ANSIColor text documents using this short script:

#!/usr/bin/perl
# parseansi.pl
use Pod::Text::Color;

my $parser = new Pod::Text::Color(
    width => 56,
    loose => 1,
    sentence => 1,
);

if (@ARGV) {
    $parser->parse_from_file($_, '-') foreach @ARGV;
} else {
    $parser->parse_from_filehandle(*STDIN);
}

We can generate HTML pages, plain text documents, and manual pages using exactly the same process from their respective modules.
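
For example, a plain-text variant of the same approach needs only a different module (the width and indent values here are arbitrary choices):

```perl
#!/usr/bin/perl
# parsetext.pl - render POD files as plain text on standard output
use strict;
use warnings;
use Pod::Text;

my $parser = Pod::Text->new(width => 72, indent => 4);

# '-' directs the formatted output to standard output
$parser->parse_from_file($_, '-') foreach @ARGV;
```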

Writing a POD Parser

Writing a POD parser is surprisingly simple. Most of the hard work is already done by Pod::Parser, so all that's left is to override the methods we need to replace in order to generate the kind of document we are interested in. Particularly, there are four methods we may want to override:

command Render and output POD commands.
verbatim Render and output verbatim paragraphs.
textblock Render and output regular (nonverbatim) paragraphs.
interior_sequence Return rendered interior sequence.

By overriding these and other methods, we can customize the document that the parser produces. Note that the first three methods display their result, whereas interior_sequence returns it. Here is a short example of a POD parser that turns POD documentation into an XML document (albeit without a DTD):

#!/usr/bin/perl
# parser.pl
use warnings;
use strict;

{
    package My::Pod::Parser;

    use Pod::Parser;
    our @ISA = qw(Pod::Parser);

    sub command {
        my ($parser, $cmd, $para, $line) = @_;
        my $fh = $parser->output_handle;

        $para =~ s/\s+$//;
        my $output = $parser->interpolate($para, $line);
        print $fh "<pod:$cmd>$output</pod:$cmd>\n";
    }

    sub verbatim {
        my ($parser, $para, $line) = @_;
        my $fh = $parser->output_handle;

        $para =~ s/\s+$//;
        print $fh "<pod:verbatim>$para</pod:verbatim>\n";
    }

    sub textblock {
        my ($parser, $para, $line) = @_;
        my $fh = $parser->output_handle;

        print $fh $parser->interpolate($para, $line);
    }

    sub interior_sequence {
        my ($parser, $cmd, $arg) = @_;

        return "<pod:int cmd=\"$cmd\">$arg</pod:int>";
    }
}

my $parser = new My::Pod::Parser();

if (@ARGV) {
    $parser->parse_from_file($_) foreach @ARGV;
} else {
    $parser->parse_from_filehandle(*STDIN);
}

To implement this script, we need the output filehandle, which we can get from the output_handle method. We also take advantage of Pod::Parser to do the actual rendering work by using the interpolate method, which in turn calls our interior_sequence method. Pod::Parser provides plenty of other methods too, some of which we can override as well as or instead of the ones we used in this parser; see the following for a complete list:

> perldoc Pod::Parser

The Pod::Parser documentation also covers more methods that we might want to override, such as begin_input, end_input, preprocess_paragraph, and so on. Each of these gives us the ability to customize the parser in increasingly detailed ways.
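
As a sketch of what such an override might look like (the class name and behavior are invented for illustration), here is a subclass that overrides preprocess_paragraph to strip trailing whitespace from every line before any further processing:

```perl
package My::Pod::Stripper;
use strict;
use warnings;
use Pod::Parser;
our @ISA = qw(Pod::Parser);

# A hypothetical override: remove trailing spaces and tabs from each
# line of every paragraph before the parser processes it further
sub preprocess_paragraph {
    my ($parser, $text, $line) = @_;
    $text =~ s/[ \t]+$//mg;
    return $text;    # returning an empty string discards the paragraph
}

1;
```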

We have placed the Parser package inside the script in this instance, though we could equally have had it in a separate module file. To see the script in action, we can feed it with any piece of Perl documentation—the POD documentation itself, for example. On a typical Unix installation of Perl version 5.6 or higher, we can do that with

> perl parser.pl /usr/lib/perl5/5.8.6/pod/perlpod.pod

This generates an XML version of perlpod that starts like this:


<pod:head1>NAME</pod:head1>
perlpod - plain old documentation

<pod:head1>DESCRIPTION</pod:head1>
A pod-to-whatever translator reads a pod file paragraph by paragraph,
and translates it to the appropriate output format. There are
three kinds of paragraphs:
<pod:int cmd="L">verbatim|/"Verbatim Paragraph"</pod:int>,
<pod:int cmd="L">command|/"Command Paragraph"</pod:int>, and
<pod:int cmd="L">ordinary text|/"Ordinary Block of Text"</pod:int>.

...

By comparing this with the original document, we can see how the parser is converting POD tokens into XML tags.

Locating PODs

The Unix-specific Pod::Find module searches for POD documents within a list of supplied files and directories. It provides one subroutine of importance, pod_find, which is not imported by default. This subroutine takes one main argument—a reference to a hash of options including default search locations. Subsequent arguments are additional files and directories to look in. The following script implements a more or less fully featured POD search based around Pod::Find and Getopt::Long, which we cover in detail in Chapter 14.

#!/usr/bin/perl
# findpod.pl
use warnings;
use strict;

use Pod::Find qw(pod_find);
use Getopt::Long;

# default options
my ($verbose,$include,$scripts);
my $display = 1;

# allow files/directories and options to mix, and allow bundled
# single-letter options like -iv
Getopt::Long::Configure('permute', 'bundling');

# get options
GetOptions('v|verbose!' => \$verbose,
    'i|include!' => \$include,
    's|scripts!' => \$scripts,
    'd|display!' => \$display,
);

# if no directories specified, default to @INC
$include = 1 unless defined($include) or @ARGV or $scripts;

# perform scan
my %pods = pod_find({
    -verbose => $verbose,
    -inc => $include,
    -script => $scripts,
    -perl => 1
}, @ARGV);

# display results if required
if ($display) {
    if (%pods) {
        print "Found '$pods{$_}' in $_\n" foreach sort keys %pods;
    } else {
        print "No pods found\n";
    }
}

We can invoke this script with no arguments to search @INC or pass it a list of directories and files to search. It also supports four options to enable verbose messages, disable the final report, and enable Pod::Find's two default search locations. Here is one way we can use it, assuming we call the script findpod.pl:

> perl findpod.pl -iv /my/perl/lib 2> dup.log

This command tells the script to search @INC in addition to /my/perl/lib (-i), produce extra messages during the scan (-v), and redirect error output to dup.log. This will capture details of any duplicate modules that the module finds during its scan. If we only want to see duplicate modules, we can disable the output and view the error output on screen with this command:

> perl findpod.pl -i --nodisplay /my/perl/lib

The options passed in the hash reference to pod_find are all Boolean and all default to 0 (off). They have the meanings listed in Table 18-4.

Table 18-4. pod_find Options

Option Action
-verbose Print out progress during scan, reporting all files scanned that did not contain POD information.
-inc Scan all the paths contained in @INC. Implies -perl.
-script Search the installation directory and subdirectories for POD files. If Perl was installed as /usr/bin/perl, then this will be /usr/bin for example. This implies -perl.
-perl Apply Perl naming conventions for finding POD files. This strips Perl file extensions (.pod, .pm, etc.), skips over numeric directory names that are not the current Perl release, and so on.

The hash generated by findpod.pl contains the file in which each POD document was found as the key and the document title (usually the module package name) as the value.

Source Filters

An intriguing feature of Perl is the ability to preprocess source code before it is even compiled. This capability is provided by the Filter::Util::Call module, which uses an underlying C-based interface to the Perl interpreter itself to intercept source code after it is read in but before it is compiled.

Here is an example of it in use to implement a filter that carries out a simple substitution. The filter itself is implemented by the filter method, while the import method performs the task of installing the filter with a specified pair of match and replacement strings:

package Class::Filter::Replace;
use strict;
use Carp qw(croak);
use Filter::Util::Call;

sub import {
    my ($self,$replace,$with)=@_;
    unless ($replace) {
        croak("use ".__PACKAGE__." 'original' [, 'replacement'];");
    }
    $with ||= ""; #replace with nothing

    my $filter={
        replace => $replace,
        with    => $with,
    };

    filter_add($filter);
}

sub filter {
    my $status=filter_read(); # set $_ from input
    s/$_[0]->{replace}/$_[0]->{with}/go if $status > 0;
    return $status; # 0 = end of file, <0 = error
}

1;

We can now use this filter to preprocess source code. Here, we set the filter to replace all instances of the word Goodbye with Hello. Since the filter is installed at compile time by virtue of use, it affects the code immediately following it:

#!/usr/bin/perl
use strict;
use warnings;

use Class::Filter::Replace Goodbye => 'Hello';

my $Goodbye="so long";
print "Goodbye, I must be going, $Hello\n";

Running this program prints out the following:


Hello, I must be going, so long

We can also use a filter directly on any source code from the command line:

> perl -MClass::Filter::Replace=Goodbye,Hello unfiltered.pl

While this filter might look like we could easily register multiple objects in the same class, this is not so. In fact, the class is a singleton, because Filter::Util::Call will permit only one filter per class. What actually happens here is that the class is extracted from the context of filter_add, and the hash reference is passed as a hash reference to the filter method, which is called as a class method, not as an object instance method. This is why we did not bother to bless the hash reference $filter into the class before passing it on.

A filter consists of either an object class that inserts itself by name into the filter interface with filter_add and provides a filter method for Filter::Util::Call to call back, as earlier, or a simple subroutine that carries out the same task as the filter method and is inserted by code reference. Here is the code reference version of the preceding filter:

package Closure::Filter::Replace;
use strict;
use Carp qw(croak);
use Filter::Util::Call;

sub import {
    my ($self,$replace,$with)=@_;
    croak("use ".__PACKAGE__." 'original' [, 'replacement'];")
      unless $replace;
    $with ="" unless $with;

    my $filter=sub {
        my $status=filter_read(); #populates $_
        s/$replace/$with/g if $status > 0;
        return $status;
    };

    filter_add($filter);
}

1;

To read source in other than a line-by-line basis, we can either supply a size argument to filter_read or make use of filter_read_exact. Both uses cause the filter to try to read a block of the requested number of bytes; filter_read may come back with less if it cannot initially read enough, while filter_read_exact will block and not stop trying until the end of file or an error is encountered:

my $status=filter_read($size); #block mode, nonblocking

my $status=filter_read_exact($size); #block mode, blocking

Note that both filter_read and filter_read_exact append to the current value of $_, so multiple calls to filter_read within the filter subroutine will not reset it to an entirely new value each time. The status is always returned with a value of 0 for end of file, greater than 0 for a successful read, and less than 0 for an error. Hence in this example we perform the substitution only if $status > 0.

If a filter wishes to disable itself, perhaps because it should only apply to a certain part of the source, it can do so by calling filter_del. For example:

if (/__(DATA|END)__/) {
    filter_del();
}
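
Putting this together, here is a hypothetical filter module (the module name and the debug() convention are invented for illustration) that comments out debug(...) calls and uninstalls itself once the data section begins:

```perl
package Filter::NoDebug;
use strict;
use warnings;
use Filter::Util::Call;

sub import {
    filter_add(sub {
        my $status = filter_read();    # populates $_ with one line
        if ($status > 0) {
            # stop filtering for the rest of the file once the
            # data section begins
            filter_del() if /^__(DATA|END)__/;
            s/^(\s*)debug\(/$1# debug(/;
        }
        return $status;
    });
}

1;
```

Any script that says use Filter::NoDebug then has its debug(...) lines commented out before the interpreter ever compiles them.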

The Filter::Simple module provides a third way to define a filter. While not as flexible, it is a lot simpler to use and will suit many applications. For instance, we can rewrite the preceding examples as follows:

package Simple::Filter::Replace;
use strict;
use Carp qw(croak);
use Filter::Simple;

my ($replace,$with);

sub import {
    $replace = $_[1];
    unless ($replace) {
        croak("use ".__PACKAGE__." 'original' [, 'replacement'];");
    }
    $with = $_[2] || "";
}

FILTER { s/$replace/$with/g };

1;

The key to this module is the special FILTER block. This is processed by Filter::Simple using its own internal filter to generate a filter out of our code. We can get a lot smarter too, because the module colludes with Text::Balanced to give us the ability to register filters to process only code, only quoted strings, or a number of other selections with a FILTER_ONLY specification:

use Filter::Simple;

FILTER_ONLY
    code   => sub { s/ucfirst/lcfirst/g },
    string => sub { s/Goodbye/Hello/g };

The full list of filter types is offered in Table 18-5.

Table 18-5. Filter::Simple Filter Types

Filter Type Effect
all Everything, same as FILTER.
code Filter code, excluding quotelike operators.
executable Code plus quotelike operators.
quotelike Filter quotelike operators q, qq, qr.
regex Filter regular expression patterns.
string Filter literal strings in quotes or quotelike text.

The all filter is identical to FILTER, so we could previously have written

FILTER all => sub { s/$replace/$with/g };

We can specify all of these filters except code more than once, with cumulative effect:

use Filter::Simple;

FILTER_ONLY
    code   => sub { s/ucfirst/lcfirst/g },
    string => sub { s/Goodbye/Ciao/g },
    string => sub { s/Ciao/Au Revoir/g },
    string => sub { s/Au Revoir/Hello/g };

While Perl modules are the primary type of source filter, we can also use external commands. The Filter::exec module (which is available in the Filter distribution on CPAN, but not as standard) is one way we can invoke an external program to filter our code. For instance, if we happened to have a gzipped Perl script, we could run it on a Unix platform with this command:

> perl -MFilter::exec=gunzip,-c myscript.pl.gz

The Filter::sh module is similar, but it takes a single string as the command, invoking an intermediate shell to execute it:

> perl -MFilter::sh='gunzip -c' myscript.pl.gz

Although functional, these modules mostly serve as examples of how to implement filters. As further demonstration, the Filter::cpp module provides support for C-style preprocessor macros, Filter::tee outputs the post-processed source code for inspection, and Filter::decrypt provides support for running encrypted source files. Each of these modules uses an underlying factory module that subclasses Filter::Util::Call to register the filter, for instance, Filter::exec invokes Filter::Util::Exec.

The Filter::Util::Call interface is used by several modules in the standard distribution. The Switch module uses it to implement new semantics and keywords in Perl, by translating them into real Perl keywords before the interpreter gets to look at them. The ByteLoader module uses a filter to convert compiled code saved in binary form back into a parsed opcode tree.

Reports: The "R" in Perl

Reports are a useful but often overlooked feature of Perl. They provide a way to generate structured text such as tables or forms using a special layout description called a format. Superficially similar in intent to the print and sprintf functions, the strength of formats comes from their ability to describe layouts in physical terms, making it much easier to see how the resulting text will look and making it possible to design page layouts visually rather than resorting to character counting with printf.

Formats and the Format Data Type

Intriguingly, formats are an entirely separate data type with their own typeglob slot, separate from scalars, arrays, hashes, and filehandles. Like filehandles, they have no prefix or other syntax to express themselves and as a consequence often look like bareword filehandles, which can occasionally be confusing.                    

A format is compiled from a format definition: a series of formatting or picture lines containing literal text and placeholders, interspersed with data lines that supply the information used to fill the placeholders, plus optional comment lines. As a simple example, here is a format definition that defines a single picture line consisting mainly of literal text and a single placeholder, followed by a data line that fills that placeholder with some more literal text:

This is a @<<<<< justified field

"left"

To turn a format definition into a format, we need to use the format function, which takes a format name and a multiline format definition, strongly reminiscent of a here document, and turns it into a compiled format. A single full stop on its own defines the end of the format. To define the very simple format example earlier, we would write something like this:

format MYFORMAT =
This is a @<<<<< justified field
"left"
.

The trailing period is very important. It is the end token that defines the end of the implicit HERE document. A format definition will happily consume the entire contents of a source file if left unchecked.

To use a format, we use the write function on the filehandle with the same name as the format. For the MYFORMAT example earlier, we would write the following:

# write formatted output to the filehandle 'MYFORMAT'
write MYFORMAT;

This requires that we actually have an open filehandle called MYFORMAT and want to use the format to print to it. More commonly we want to print to standard output, which we can do by either defining a format called STDOUT or assigning a format name to the special variable $~ ($FORMAT_NAME with the English module). In this case, we can omit the filehandle, and write will use the currently selected output filehandle, just like print:

$~ = 'MYFORMAT';
write;

We can also use methods from the IO:: family of modules, if we are using them. Given an IO::Handle-derived filehandle called $fh, we can assign and use a format on it like this:

$fh->format_name ('MYFORMAT');
$fh->format_write();

The write function (or its IO::Handle counterpart format_write) generates filled-out formats by combining the picture lines with the current values of the items in the data lines to fill in any placeholder present.
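To see the whole cycle in one place, here is a short, self-contained sketch. It writes to an in-memory filehandle so the result can be inspected as a string (writing to STDOUT works identically); the format name DEMO and the variable $word are illustrative choices, not part of the write interface.

```perl
#!/usr/bin/perl
# format_demo.pl - define a format, associate it with a filehandle, write it
use strict;
use warnings;

our $word = "left";   # supplies the placeholder in the data line

format DEMO =
This is a @<<<<< justified field
$word
.

# write to an in-memory filehandle so we can inspect the result
open my $fh, '>', \my $output or die "open: $!";
my $old = select $fh;   # make $fh the currently selected filehandle
$~ = 'DEMO';            # associate the DEMO format with it
write;                  # fill out the picture line and print it
select $old;            # restore the previous filehandle
close $fh;

print $output;   # This is a left   justified field
```

The "left" value is padded to the six-character width of the placeholder, which is why three spaces appear before "justified".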

Format Syntax

Formats consist of a collection of picture and data lines, interspersed with optional comments, combined into a HERE-style document that is ended with a single full stop.

Of the three, comments are by far the simplest to explain. They resemble conventional Perl comments and simply start with a # symbol, as this example demonstrates:

format FORMNAME =
# this is a comment. The next line is a picture line
This is a picture line with one @<<<<<<<<<<.
# this is another comment.
# the next line is a data line
"placeholder"
# and don't forget to end the format with a '.':
.

Picture and data lines take a little more explaining. Since they are the main point of using formats at all, we will start with picture lines.

Picture Lines and Placeholders

Picture lines consist of literal text intermingled with placeholders, which the write function fills in with data at the point of output. If a picture line does not contain any placeholders at all, it is treated as literal text and can be printed out. Since it does not require any data to fill it out, it is not followed by a data line. This means that several picture lines can appear one after the other, as this static top-of-page format illustrates:

format STATIC_TOP =
This header was generated courtesy of Perl formatting
See Chapter 18 of Pro Perl for details
-------------------------------------------
.

Placeholders are defined by either an @ or a ^, followed by a number of <, |, >, or # characters that define the width of the placeholder. Picture lines that contain placeholders must be followed by a data line (possibly with comments in between) that defines the data to be placed into the placeholder when the format is written.

Formats do not support the concept of a variable-width placeholder. The resulting text will always reserve the defined number of characters for the substituted value irrespective of the actual length of the value, even if it is undefined. It is this feature that makes formats so useful for defining structured text output—we can rely on the resulting text exactly conforming to the layout defined by the picture lines. For example, to define a ten-character field that is left justified, we would use

This is a ten character placeholder: @<<<<<<<<<
$value_of_placeholder

Note that the @ itself counts as one of the characters, so there are nine < characters in the example, not ten. To specify multiple placeholders, we just use multiple instances of @ and supply enough values in the data line to fill them. This example has a left-, center-, and right-justified placeholder:

This picture line has three placeholders: @<<<@|||@>>>
$first, $second, $third

The second example defines three four-character-wide placeholders. The <, |, and > characters define the justification for fields more than one character wide; we can define different justifications using different characters, as we will see in a moment.

Programmers new to formats are sometimes confused by the presence of @ symbols. In this case, @ has nothing to do with interpolation; it indicates a placeholder. For the same reason, we cannot produce a literal @ symbol by escaping it with a backslash, since escaping is an interpolation feature. In fact, the only way to get an actual @ (or indeed ^) into the resulting string is to substitute it from the data line:

# the '@' below is actually a placeholder:
This is a literal '@'
# but we can make it a literal '@' by substituting one in on the data line:
'@'

Simple placeholders are defined with the @ symbol. The caret ^ or "continuation" placeholder, however, has special properties that allow it to be used to spread values across multiple output lines. When Perl sees a ^ placeholder, it fills out the placeholder with as much text as it reasonably can and then truncates the text it used from the start of the string. It follows from this that the original variable is altered and that to use a caret placeholder we cannot supply literal text. Further uses of the same variable can then fill in further caret placeholders. For example, this format reformats text into 38 columns with a > prefix on each line:

format QUOTE_MESSAGE =
> ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
.

This creates a format that processes the text in the variable $message into four lines, fitting as many words as possible into each line. When write comes to process this format, it uses the special variable $: to determine how and where to truncate the line. By default it is set to " \n-", which breaks on spaces, newlines, or hyphens and works fine for most plain text.

There are a number of problems with this format—it only handles four lines, and it always fills them out even if the message is shorter than four lines after reformatting. We will see how to suppress redundant lines and automatically repeat picture lines to generate extra ones with the special ~ and ~~ strings shortly.

Justification

It frequently happens that the width of a field exceeds that of the data to be placed in it. In these cases, we need to decide how the format will deal with the excess, since a fixed-width field cannot shrink (or grow) to fit the size of the data; a structured layout is the entire point of formats. If the data that fills the placeholder is only one character wide, we need no other syntax. As an extreme case, to insert six single-character items into a format, we can use

The code is '@@@@@@'
# use first six elements of digits, assumed to be from 0 to 9.
@digits
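We can check this behavior directly with formline, which is covered at the end of this chapter: it fills a single picture line and leaves the result in the accumulator $^A. The @digits contents here are made-up sample values:

```perl
use strict;
use warnings;

my @digits = (4, 8, 1, 5, 9, 3, 7);     # made-up sample values
$^A = '';                               # formline appends to $^A, so clear it
formline q{The code is '@@@@@@'}, @digits;
print $^A, "\n";   # The code is '481593'
```

Each single @ placeholder consumes one value from the list; the seventh value, 7, is simply discarded.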

For longer fields, we need to choose how text will be aligned in the field through one of four justification methods, listed in Table 18-6, depending on which character we use to define the width of the placeholder.

Table 18-6. Placeholder Justification Styles

Placeholder Alignment Example
< Left justified @<<<<
> Right justified @>>>>
| Center justified @||||
# Right-justified numeric @####

The <, |, and > justification styles are mostly self-explanatory; they align values shorter than the placeholder width to the left, center, or right of the placeholder. They pad the rest of the field with spaces. (Note that padding with other characters is not supported. If we want to do that, we will have to generate the relevant value by hand before it is substituted.) If the value is the right length in any case, then no justification occurs. If it is longer, then it is truncated on the right irrespective of the justification direction.

The numeric # justification style is more interesting. With only # characters present, it will insert an integer based on the supplied value—for an integer number it substitutes in its actual value, but for a string or the undefined value it substitutes in 0, and for a floating-point number it substitutes in the integer part. To produce a percentage placeholder, for example, we can use the following:

Percentage: @##%
$value * 100

If, however, we use a decimal point character within the placeholder, then the placeholder becomes a decimal placeholder, with floating-point values point-justified to align themselves around the position of the decimal point:

Result (2 significant places): @####.##
$result

This provides a very simple and powerful way to align columns of figures, automatically truncating them to the desired level of accuracy at the same time.

If the supplied result is not a floating-point number, then the fractional places are filled in with 0, and for strings and undefined values the ones column is also filled in with 0.
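These rules are easy to verify with formline, introduced at the end of this chapter, which fills one picture line into $^A. The fill helper and the sample values are inventions for this sketch:

```perl
use strict;
use warnings;

sub fill {   # small helper: format one picture line and return the result
    my ($picture, @values) = @_;
    $^A = '';                  # formline appends to $^A, so clear it first
    formline $picture, @values;
    return $^A;
}

print fill('[@<<<<<<<][@|||||||][@>>>>>>>]', 'one', 'two', 'three'), "\n";
# [one     ][  two   ][   three]
print fill('[@###.##]', 3.14159), "\n";   # [   3.14]
print fill('[@###.##]', 'oops'),   "\n";  # [   0.00]
```

The last line shows the string-to-zero behavior described above: a non-numeric value produces 0 in the ones column and 0s in the fractional places.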

The actual character used by the decimal placeholder to represent the decimal point is defined by the locale, specifically the LC_NUMERIC value of the locale. In Germany, for instance, the conventional symbol to separate the integer and fractional parts is a comma, not a full stop. Formats are in fact the only part of Perl that directly accesses the locale in this way, possibly because of their long history; all other parts of the language adhere to the use locale directive. Although deprecated in modern Perl, we can also use the special variable $# to set the point character.

The final placeholder format is the * placeholder. This creates a raw output placeholder, producing a complete multiple-line value in one go and consequently can only be placed after an @ symbol; it makes no sense in the context of a continuation placeholder since there will never be a remainder for a continuation to make use of. For example:

> @* <
$multiline_message

In this format definition, the value of $multiline_message is output in its entirety when the format is written. The first line is prefixed with a >, and the last is suffixed with <. No other formatting of any kind is done. Since this placeholder has variable width (and indeed, variable height), it is not often used, since it is effectively just a poor version of print that happens to handle line and page numbering correctly.

Data Lines

Whenever a picture line contains one or more placeholders it must be immediately followed by a data line consisting of one or more expressions that supply the information to fill them. Expressions can be numbers, string values, variables, or compound expressions:

format NUMBER =
Question: What do you get if you multiply @ by @?
6, 9
Answer: @#
6*9
.

Multiple values can be given either as an array or a comma-separated list:

The date is: @###/@#/@#
$year, $month, $day

If insufficient values are given to fill all the placeholders in the picture line, then the remaining placeholders are undefined and padded out with spaces. Conversely, if too many values are supplied, then the excess ones are discarded. This behavior changes if the picture line contains ~~ however, as shown later.

If we generate a format using conventional quoted strings rather than the HERE document syntax, we must take special care not to interpolate the data lines. This is made more awkward because in order for the format to compile, we need to use "\n" to create newlines at the end of each line of the format, including the data lines, and these do need to be interpolated. Separating the format out onto separate lines is probably the best approach, though as this example shows, even then it can be a little hard to follow:

# define page width and output filehandle
$page_width = 80;
$output = "STDOUT_TOP";
# construct a format statement from concatenated strings
$format_st = "format $output =\n".
'Page @<<<'. "\n".
'$%'. "\n".
('-'x$page_width). "\n".
".\n";   # don't forget the trailing '.'

# define the format - note we do not interpolate, to preserve '$%'
eval $format_st;

Note that continuation placeholders (defined by a leading caret) need to be able to modify the original string in order to truncate the start. For this reason, an assignable value such as a scalar variable, array element, or hash value must be used with these fields.

Suppressing Redundant Lines

The format and write functions support two special picture strings that alter the behavior of the placeholders in the same picture line, both of which are applied if the placeholders are all continuation (caret) placeholders.

The first is a single tilde, or ~ character. When this occurs anywhere in a picture line containing caret placeholders, the line is suppressed if there is no value to plug into the placeholder. For example, we can modify the quoting format we gave earlier to suppress the extra lines if the message is too short to fill them:

format QUOTE_MESSAGE =
> ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<...~
$message
.

In this example, the bottom three picture lines have a ~ suffix, so they will only be used if $message contains sufficient text to fill them after it has been broken up according to the break characters in $:. When the format is written, the tildes are replaced with spaces. Since they are at the end of the line in this case, we will not see them, which is why conventionally they are placed there. If we have spaces elsewhere in the picture line, we can replace one of them with the tilde and avoid the trailing space.

We modify the last picture line to indicate that the message may have been truncated because we know that it will only be used if the message fills out all the subsequent lines. In this case, we have replaced the last three < characters with dots.

The ~ character can be thought of as a zero-or-one modifier for the picture line, in much the same way that ? works in regular expressions. The line will be used if Perl needs it, but it can also be ignored if necessary.

Autorepeating Pattern Lines

If two adjacent tildes appear in a pattern line, then write will automatically repeat the line while there is still input. If ~ can be likened to the ? zero-or-one metacharacter of regular expressions, ~~ can be likened to *, zero-or-more. For instance, to format text into a paragraph of a set width but an unknown number of lines, we can use a format like this:

format STDOUT =
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$text
.

Calling write with this format will take the contents of $text and reformat it into a column 30 characters wide, repeating the pattern line as many times as necessary until the contents of $text are exhausted. Anything else in the pattern line is also repeated, so we can create a more flexible version of the quoting pattern we gave earlier that handles a message of any size:

format QUOTE =
>~~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
.

Like ~, the ~~ itself is converted into a space when it is output. It also does not matter where it appears, so in this case we have put it between the > quote mark and the text, to suppress the extra space on the end of the line it would otherwise create.

Note that ~~ only makes sense when used with a continuation placeholder, since it relies on the continuation to truncate the text. Indeed, if we try to use it with a normal @ placeholder, Perl will return a syntax error, since this would effectively be an infinite loop that repeats the first line. Since write cannot generate infinite quantities of text, Perl prevents us from trying.
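As a quick, self-contained check of the autorepeating behavior, the sketch below wraps a short string into a narrow column; the format name WRAP, the sample text, and the in-memory filehandle are all invented for the example:

```perl
use strict;
use warnings;

our $text = "one two three four five six seven";

format WRAP =
^<<<<<<<<<~~
$text
.

open my $fh, '>', \my $out or die "open: $!";
my $old = select $fh;
$~ = 'WRAP';
write;            # repeats the picture line until $text is consumed
select $old;
close $fh;

print $out;       # prints the text wrapped into short lines
```

Note that the continuation placeholder destroys $text as it goes: after the write, the variable is empty.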

Formats and Filehandles

Formats are directly associated with filehandles. All we have to do is write to the filehandle, and the associated format is invoked. It might seem strange that we associate a format with a filehandle and then write to the filehandle, rather than specifying which format we want to use when we do the writing, but there is a certain logic behind this mechanism. There are in fact two formats that may be associated with a filehandle. The main one is used by write, but we can also install a top-of-page format that is used whenever Perl runs out of room on the current page and is forced to start a new one. Since this is associated with the filehandle, Perl can use it automatically when we use write rather than needing to be told.

Defining the Top-of-Page Format

Perl allows two formats to be associated with a filehandle. The main format is used whenever we issue a write statement. The top-of-page format, if defined, is issued at the start of the first page and at the top of each new page. This is determined by the special variables $= (the length of the page) and $- (the number of lines left). Each time we use write, the value of $- decreases. When there is no longer sufficient room to fit the results of the next write, a new page is started, a new top-of-page format is written, and only then is the result of the last write issued.

The main format is automatically associated with the filehandle of the same name, so that the format MYFORMAT is automatically used when we use write on the filehandle MYFORMAT. The top-of-page format is associated in a similar way, by giving it the name of the filehandle with the text _TOP appended. For instance, to assign a main and a top-of-page format to the filehandle MYFORMAT, we would use something like this:

format MYFORMAT =
...main format definition...
.

# define a format that gives the current page number
format MYFORMAT_TOP =
This is page @<<<
$%
------------------------
.
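Putting the two together, here is a small self-contained sketch; the format names, the tiny page length of 3, and the in-memory filehandle are all just choices for the demonstration:

```perl
use strict;
use warnings;

our $item;

format REPORT =
Item: @<<<<<<<<<
$item
.

format REPORT_TOP =
-- Page @< --
$%
.

open my $fh, '>', \my $out or die "open: $!";
my $old = select $fh;
$~  = 'REPORT';       # main format
$^  = 'REPORT_TOP';   # top-of-page format
$=  = 3;              # tiny pages: one header line plus two item lines
$^L = '';             # suppress the formfeed between pages for readability
for my $i (1 .. 3) {
    $item = "thing$i";
    write;
}
select $old;
close $fh;

print $out;   # the page header appears twice: before thing1 and before thing3
```

The first write triggers the top-of-page format because $- starts at 0; the third write no longer fits on the three-line page, so a second page (and a second header, with $% incremented) is started automatically.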

Assigning Formats to Standard Output

Since standard output is the filehandle most usually associated with formats, we can omit the format name when defining formats.

format STDOUT=
The magic word is "@<<<<<<<<";
$word
.

format STDOUT_TOP=
Page @>
$%
-----------
.

We can also omit STDOUT for the main format and simply write

format =
The magic word is "@<<<<<<<<";
$word
.

This works because standard output is the default output filehandle. If we change the filehandle with select, format creates a format with the same name as that filehandle instead. The write function also allows us to omit the filehandle; to write out the formats assigned to whatever filehandle is currently selected, we can simply put

write;

Determining and Assigning Formats to Other Filehandles

We are not constrained to defining formats with the same name as a filehandle in order to associate them. We can also find their names and assign new ones using the special variables $~ and $^.

The special variable $~ ($FORMAT_NAME with use English) defines the name of the main format associated with the currently selected filehandle. For example:

$format = $~;

Likewise, to set the current format we can assign to $~:

# set standard output format to 'MYFORMAT';
$~ = 'MYFORMAT';

use English;
$FORMAT_NAME = 'MYFORMAT';   # more legibly

The variable $~ is set to the name of the format, not to the format itself, hence the quotes.

The special variable $^ ($FORMAT_TOP_NAME with use English) performs the identical role for the top-of-page format:

# save name of current top-of-page format
$topform = $^;
# assign new top-of-page format
$^ = 'MYFORMAT_TOP';
# write out main format associated with standard out,
# (using top-of-page format if necessary)
write;
# restore original top-of-page format
$^ = $topform;

Setting formats on other filehandles using the variables $~ and $^ requires special maneuvering with select to temporarily make the target filehandle the current filehandle:

# set formats on a different filehandle
$oldfh = select MYHANDLE;
$~ = 'MYFORMAT';
$^ = 'MYFORMAT_TOP';
select $oldfh;

The IO::Handle module (and subclasses like IO::File) provide a simpler object-oriented way of setting reports on filehandles:

$fh = new IO::File ("> $outputfile");
...
$fh->format_name ('MYFORMAT');
$fh->format_top_name ('MYFORMAT_TOP');
...
write $fh;   # or $fh->format_write ();

Page Control

Perl's reporting system uses several special variables to keep track of line and page numbering. We can use these variables to produce line and page numbers and set them to control how pages are generated. There are four variables of particular interest, and these are listed in Table 18-7.

Table 18-7. Format Page Control Variables

Variable Corresponds To
$= The page length
$% The page number
$- The number of lines remaining
$^L The formfeed string

$= (or $FORMAT_LINES_PER_PAGE with use English) holds the page length and by default is set to 60 lines. To change the page length, we can assign a new value:

$= = 80;   # set page length to 80 lines

Or more legibly:

use English;
$FORMAT_LINES_PER_PAGE = 80;

If we want to generate reports without pages, we can set $= to a very large number. Alternatively, we can redefine $^L as an empty string and avoid defining (or subsequently redefine to nothing) the top-of-page format.

$% (or $FORMAT_PAGE_NUMBER with use English) holds the number of the current page. It starts at 1 and is incremented by one every time a new page is started, which in turn happens whenever write runs out of room on the current page. We can change the page number explicitly by modifying $%, for example:

$% = 1;   # reset page count to 1

$- (or $FORMAT_LINES_LEFT with use English) holds the number of lines remaining on the current page. Whenever write generates output, it decrements this value by the number of lines in the format. If there are insufficient lines left (the size of the output is greater than the number of lines left), then $- is set to 0, the value of $% is incremented by one, and a new page is started, starting with the value of $^L and followed immediately by the top-of-page format, if one is defined. We can force a page break on the next write by setting $- to 0:

$- = 0;   # force a new page on the next 'write'

Finally, $^L (or $FORMAT_FORMFEED with use English) is output before the top-of-page format by write when a new page is started. By default it is set to a formfeed character, \f. See the section "Creating Footers" for a creative use of $^L.

As an example of using the page control variables, here is a short program that paginates its input file, adding the name of the file and a page number to the top of each page. It also illustrates creating a format dynamically with eval so we can define not only the height of the resulting pages, but also their width.

#!/usr/bin/perl
# paginate.pl
use warnings;
use strict;
no strict 'refs';
use Getopt::Long;

# get parameters from the user
my $height = 60;  # length of page
my $width = 80;   # width of page
my $quote = "";   # optional quote prefix
GetOptions ('height|size|length:i' => \$height,
    'width:i' => \$width, 'quote:s' => \$quote);
die "Must specify input file" unless @ARGV;

# get the input text into one line, for continuation
undef $/;
my $text = <>;

# set the page length
$= = $height;

# if we're quoting, take that into account
$width -= length($quote);

# define the main page format - a single autorepeating continuation field
my $main_format = "format STDOUT =\n".
                  $quote.'^'.('<' x ($width-1))."~~\n".
                  '$text'. "\n".
                  ".\n";
eval $main_format;

# define the top of page format
my $page_format = "format STDOUT_TOP =\n".
                  '@'.('<' x ($width/2-6)). ' page @<<<'. "\n".
                  '$ARGV,$%'. "\n".
                  ('-'x$width). "\n".
                  ".\n";
eval $page_format;

# write out the result
write;

To use this program, we can feed it an input file and one or more options to control the output, courtesy of the Getopt::Long module, for example:


> perl paginate.pl input.pl -w 50 -h 80

Creating Footers

Footers are not supported as a concept by the formatting system; there is no "bottom-of-page" format. However, with a little effort we can improvise our own footers. The direct and obvious way is to keep an eye on $- and issue the footer when we get close to the bottom of the page. If the footer is smaller in lines than the output of the main format, we can use something like the following, assuming that we know what the size of output is:

print "\n Page $% \n" if $- < $size_of_format;

This is all we need to do, since the next attempt to write will not have sufficient space to fit and will automatically trigger a new page. If we want to make sure that we start a new page on the next write, we can set $- to 0 to force it:

print ("\n Page $% \n"), $- = 0 if $- < $size_of_format;

A more elegant and subtle way of creating a footer is to redefine $^L. This is a lot simpler to arrange but suffers in terms of flexibility since the footer is fixed once it is defined, so page numbering is not possible unless we redefine the footer on each new page.

For example, if we want to put a two-line footer on the bottom of 60-line pages, we can do so by putting the footer into $^L (suffixed with the original formfeed) and then reducing the page length by the size of the footer, in this case to 58 lines:

# define a footer.
$footer = ('-'x80). "\n End of Page \n";
# redefine the format formfeed to be the footer plus a formfeed
$^L = $footer. "\f";

# reduce page length from default 60 to 58 lines
# if we wanted to be creative we could count the instances of '\n' instead.
$= -= 2;

Now every page will automatically get a footer without any tracking or examination of the line count. We still have to add a footer to the last page manually. The number of lines remaining to fill on the last page is held by $-, so this turns out to be trivial:

print ("\n" x $-);   # fill out the rest of the page (to 58 lines)
print $footer;       # print the final footer

As mentioned earlier, arranging for a changing footer such as a page number is slightly trickier, but it can be done by remembering and checking the value of $- after each write:

$lines = $-;
write;
redefine_footer() if $- > $lines;

This will work for many cases but will not always work when using ˜˜, since it may cause write to generate more lines than the page has left before we get a chance to check it.

Combining Reports and Regular Output

It is possible to print both unformatted and formatted output on the same filehandle.

However, while write and print can be freely mixed together, print knows nothing about the special formatting variables such as $=, $-, and $% that track pagination and trigger the top-of-page format. Consequently, we must take care to track line counts ourselves if we want pages to be of even length, by adjusting $- ourselves.

For instance:

write;
foreach (@extra_lines) {
    print $_, "\n";
    --$-;   # decrement $-.
}

Unfortunately, this solution does not take into account that $- might become negative if there is not enough room left on the current page. Due to the complexities of managing mixtures of write and print, it is often simpler to either use formline or create a special format that is simply designed to print out the information we were using print for.

Generating Report Text with formline

The formline function is a lower-level interface to the same formatting system used by write. formline generates text from a single picture line and a list of values, the result of which is placed into the special variable $^A. For example, this is how we could create a formatted string containing the current time using formline:

($sec, $min, $hour) = localtime;
$^A = '';
formline '@#/@#/@#', $hour, $min, $sec;
$time = $^A;
print "The time is: $time\n";

In this case, it would probably be easier to use sprintf, but we can also use formline to create text from more complex patterns. For instance, to format a line of text into an array of text lines wrapped at 20 characters, we could use formline like this:

$text = get_text();   # get a chunk of text from somewhere

my @lines;
while ($text) {
    $^A = '';   # clear the accumulator - formline appends to $^A
    formline '^<<<<<<<<<<<<<<<<<<<', $text;
    push @lines, $^A;
}

The formline function is only designed to handle single lines, so it ignores newlines and treats the picture text as a single line. This means that we cannot feed formline a complete format definition and expect it to produce the correct result in $^A.

Strangely, there is no simple way to generate text from write, other than by redirecting filehandles, since write sends its results to a filehandle. However, we can produce a version of write that returns its result instead.

sub swrite ($@) {
    my $picture = shift;
    $^A = '';   # clear the accumulator before formatting
    formline ($picture, @_);
    return $^A;
}

This function is a friendly version of formline, but it is not a direct replacement for write, since it only operates on a single picture line and expects a conventional list of values as an argument. However, it is convenient and simple to use.
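For example (repeating the helper here so the snippet stands alone, with the $^A reset that formline requires between calls):

```perl
use strict;
use warnings;

sub swrite ($@) {
    my $picture = shift;
    $^A = '';              # clear the accumulator before formatting
    formline ($picture, @_);
    return $^A;
}

my $line = swrite('@<<<<<<<<< @>>>>', 'total:', 42);
print "[$line]\n";   # [total:        42]
```

The label is left-justified in a ten-character field and the number right-justified in a five-character field, so the returned string always has the same sixteen-character layout regardless of the values supplied.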

Summary

This chapter dealt with text processing in depth, building on the concepts of regular expressions and interpolation to carry out advanced text manipulation. To begin with, we looked at text processing modules, including Text::Tabs, Text::Abbrev, Text::ParseWords, and the versatile Text::Balanced. We also looked at rewrapping text with Text::Wrap and generating phonetic codes with Text::Soundex.

Source code is an important subclass of text document. We covered Perl's Plain Old Documentation (POD) syntax, and saw how to construct it, format it, render it, and write our own tools to parse it. From here we went on to look at preprocessing source files using a source filter. We covered the Filter::Util::Call module and also saw how to simplify some aspects of filter development with the Filter::Simple module.

Finally, we looked at reports, the "R" in Perl, which provide us with a way to create simple templates to format the way output is rendered. We looked at the format data type, formats and filehandles, format structure (including justification), and page control.
