We have seen a lot of Perl's native capabilities in terms of interpolation and regular expressions. However, these are just the start of Perl's text processing capabilities.
Perl comes with a standard library of text processing modules that solve many common problems associated with manipulating text. These include such tasks as tab expansion, determining abbreviations, and paragraph formatting. While not necessarily advanced, they provide a simple way of performing useful functions without reinventing the wheel.
Another class of text processing modules is dedicated to understanding Perl's documentation syntax known as POD, or Plain Old Documentation. The Pod::
family of Perl modules enables us to create, transform, and generally manipulate POD documentation in many different ways. We also look at processing a very special subclass of documents, source files, using Perl source filters.
The final part of this chapter deals with Perl's support for reports. These allow us to format text using special layout definitions. Formats are a built-in feature of the Perl interpreter that enable us to do many handy things with text, and we will explore, among other things, format structure, page control, and the format data type.
Perl's standard library contains several handy text processing modules that solve many common problems and can save a lot of time. These modules are often overlooked when considering Perl's text processing capabilities simply because the core language already provides such a rich set of functionality.
The main text processing modules are all members of the Text::
family, of which those listed in Table 18-1 are the most common.
Table 18-1. Standard Text Processing Modules
Module | Function |
Text::Tabs |
Convert tabs to and from spaces. |
Text::Abbrev |
Calculate unique abbreviations from a list of words. |
Text::Balanced |
Match nested delimiters. |
Text::ParseWords |
Parse text into words and phrases. |
Text::Wrap |
Convert unformatted text into paragraphs. |
Text::Soundex |
Convert similar sounding text into condensed codes. |
Many of Perl's other standard modules have more than a little to do with text processing of one kind or another. We make a brief note of them and where they are covered at the end of this section. In addition, CPAN offers many more modules for handling text. Searching in the Text::
namespace will uncover many modules designed for basic processing requirements, while namespaces like XML::
or Parse::
offer more advanced task-specific modules.
The Text::Tabs
module is the simplest of the text processing modules. It provides two subroutines, unexpand
for converting sequences of spaces into tab characters and expand
for converting tab characters into spaces. Here is how they work:
# convert spaces into tabs
$tabbed_text = unexpand($spaced_text);
# convert tabs into spaces
$spaced_text = expand($tabbed_text);
Both of these subroutines work on either single strings, as just shown, or lists of strings, as in
@tabbed_lines = unexpand(@spaced_lines);
Any tabs that already exist in the text are not affected by unexpand
, and similarly existing spaces are not touched by expand
. The gap between stops (the stop gap, so to speak) is determined by the variable $tabstop
, which is set to the desired tab width, 8 by default. This is actually imported into our own package by default, so we can set it with
$tabstop = 4; # set a tab width of four characters
That said, it is better from a namespace pollution point of view to import only the subroutines and set $tabstop
as a package variable:
use Text::Tabs qw(expand unexpand);
$Text::Tabs::tabstop = 4;
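To see both directions in action, here is a minimal sketch; the strings and the four-column tab stop are purely illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Tabs qw(expand unexpand);

# use a four-column tab stop instead of the default eight
$Text::Tabs::tabstop = 4;

my $tabbed = "one\ttwo\tthree";
my $spaced = expand($tabbed);    # pad each tab out to the next stop
my $back   = unexpand($spaced);  # collapse the padding back into tabs

print "expanded:   [$spaced]\n";
print "unexpanded: [$back]\n";
```

Each tab is replaced by just enough spaces to reach the next multiple of `$tabstop`, so `expand` produces `one two three` here, and `unexpand` restores the original tabs.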
It is occasionally useful to be able to quickly determine the unique abbreviations for a set of words, for instance, when implementing a command-line interface. Assuming we wish to create our own, rather than use an existing solution like Term::Complete
or (sometimes) Term::ReadLine
, we can make use of the Text::Abbrev
module to precompute a table of abbreviations and their full-name equivalents.
The Text::Abbrev
module supplies one function, abbrev
, which works by taking a list of words and computing abbreviations for each of them in turn by removing one character at a time from the end of each word and recording the resultant stem in a hash table. If the abbreviation has already been seen, it must be because two words share that abbreviation, and it is removed from the table. If a supplied word is an abbreviation of another, it is recorded, and the longer abbreviations remain, pointing to the longer word. This short script shows the module in action:
#!/usr/bin/perl
# abbrev.pl
use warnings;
use strict;
use Text::Abbrev;
my $abbreviations = abbrev(@ARGV);
foreach (sort keys %{$abbreviations}) {
print "$_ => $abbreviations->{$_}\n";
}
When run, this script produces a hash of unique abbreviations. In the output that follows, the abbreviations for gin, gang
, and goolie
are calculated. The single letter g
is not present because it does not uniquely identify a word, but ga, gi
, and go
are:
> abbrev.pl gin gang goolie
ga => gang
gan => gang
gang => gang
gi => gin
gin => gin
go => goolie
goo => goolie
gool => goolie
gooli => goolie
goolie => goolie
The abbrev
function returns either a list suitable for creating a hash or a hash reference, depending on whether it was called in list or scalar context:
%abbreviations = abbrev('gin', 'gang', 'goolie');
$abbreviations = abbrev('gin', 'gang', 'goolie');
We can also pass in a reference to a hash or a typeglob (deprecated) as the first argument. However, the original contents, if any, are not maintained:
# overwrite previous contents of $abbreviations
abbrev($abbreviations, 'ghost', 'ghast', 'ghoul');
Note that the Term::Complete
module combines abbreviations with a command-line entry mechanism (although it does not use Text::Abbrev
to determine abbreviations). If we don't need anything more complex, this is a simpler solution than rolling our own with Text::Abbrev
. See Chapter 15 for more details.
Many applications that accept textual input need to be able to parse the text into distinct words for processing. In most simple cases, we can get away with using split
. Since this is such a common requirement, split
even splits using whitespace as a default. For instance, this rather terse program carves up its input text into a list of whitespace-separated words by calling split
with no arguments:
#!/usr/bin/perl
# splitwords.pl
use warnings;
use strict;
my @words;
push @words, split foreach(<>);
print scalar(@words), " words: @words\n";
This approach falls short if we want to handle more advanced constructs like quotes. If two or more words are surrounded by quotes, we often want to treat them as a single word or phrase, in which case we can't easily use split
. Instead we can use the Text::ParseWords
module, which handles quotes and produces a list of words and phrases using them.
Parsing Space-Separated Text
The Text::ParseWords
module supports the parsing of text into words and phrases, based on the presence of quotes in the input text. It provides four subroutines:
shellwords |
Process strings using whitespace as a delimiter, in the same manner as shells. |
quotewords |
Handle more general cases where the word separator can be any arbitrary text. |
nested_quotewords |
Similar to quotewords , but processes each input line separately and returns a list of lists, one list of words per line. |
parse_line |
A simpler version of quotewords , which handles a single line of text and which is actually the basis of the other three. |
The first, shellwords
, takes one or more lines of text and returns a list of words and phrases found within them. Since it is set to consider whitespace as the separator between words, it takes no other parameters:
@words = shellwords(@input);
Here is a short program that shows shellwords
in action:
#!/usr/bin/perl
# shell.pl
use warnings;
use strict;
use Text::ParseWords qw(shellwords);
my @input = (
'This is "a phrase"',
'So is\ this',
q('and this'),
"This isn\\'t",
'Neither \"is this\"',
);
print "Input: ", join(' ', @input), "\n";
my @words = shellwords(@input);
print scalar(@words), " words:\n";
print " $_\n" foreach @words;
When run, this program should produce the following output:
> shell.pl
Input: This is "a phrase" So is\ this 'and this' This isn\'t Neither \"is this\"
11 words:
This
is
a phrase
So
is this
and this
This
isn't
Neither
"is
this"
This program demonstrates several points. First, we can define phrases with double quotes, or single quotes if we use the q
function. Second, we can also define phrases by escaping spaces that we want shellwords
to overlook. In order to have shellwords
process these backslashes, we have to use single quotes (or q
) around the string as a whole to avoid interpolation from evaluating them first. Finally, to have shellwords
ignore a quote, we can escape it, but to escape a single quote, we have to use double quotes around the string and escape it twice (once for interpolation, once for shellwords
). Of course, a lot of this is simpler if the text is coming from a variable rather than a literal string.
Parsing Arbitrarily Delimited Text
The quotewords
subroutine is a more flexible version of shellwords
that allows the word separator to be defined. It takes two additional parameters, a regular expression pattern describing the word separator itself and a keep
flag that determines how quotes are handled. This is how we might use it to emulate and modify the result of shellwords
. Note the value of the keep
flag in each case:
# emulate 'shellwords' with 'quotewords'
@words = quotewords('\s+', 0, @lines);
# emulate 'shellwords' but keep quotes and backslashes
@words = quotewords('\s+', 1, @lines);
As a more complete example, here is a short program that parses a file of colon-delimited lines (like those found in /etc/passwd
) into a long list of fields:
#!/usr/bin/perl
# readpw.pl
use warnings;
use strict;
use Text::ParseWords;
my (@users, @fields);
if (open PASSWD,"/etc/passwd") {
@users = <PASSWD>;
chomp @users; # remove linefeeds
@fields = quotewords(':', 0, @users);
close PASSWD;
}
print "@fields";
The keep
parameter determines whether quotes and backslashes are removed once their work is done, as real shells do, or whether they should be retained in the resulting list of words. If false, quotes are removed as they are parsed. If true, they are retained. The keep
flag is almost but not quite Boolean. If set to the special value 'delimiters'
, both quotes and the characters that matched the word separator are kept:
# emulate 'shellwords' but keep quotes and backslashes and also store the
# matched whitespace as tokens too
@words = quotewords('\s+', 'delimiters', @lines);
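As a concrete sketch of the three keep settings side by side (the input string here is invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

my $line = 'copy "my file.txt" backup';

# keep = 0: quotes group words, then disappear
my @plain = quotewords('\s+', 0, $line);
# ('copy', 'my file.txt', 'backup')

# keep = 1: quotes group words and are retained
my @kept = quotewords('\s+', 1, $line);
# ('copy', '"my file.txt"', 'backup')

# keep = 'delimiters': the matched separators become tokens too
my @tokens = quotewords('\s+', 'delimiters', $line);
# ('copy', ' ', '"my file.txt"', ' ', 'backup')
```

Note that in the last case the single spaces matched by the separator pattern appear in the list as tokens in their own right.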
Batch-Parsing Multiple Lines
The preceding /etc/passwd
example works, but it assembles all the resultant fields of each user into one huge list of words. Far better would be to keep each set of words found on each individual line in separate lists. We can do that with the nested_quotewords
subroutine, which returns a list of lists, one list for each line passed in. Here is a short program that uses nested_quotewords
to do just that:
#!/usr/bin/perl
# password.pl
use Text::ParseWords;
@ARGV = ('/etc/passwd');
my @users = nested_quotewords(':', 0, <>);
print scalar(@users), " users:\n";
print " ${$_}[0] => ${$_}[2]\n" foreach @users;
This program prints out a list of all users found in /etc/passwd
and their user ID. When run it should produce output that starts something like the following:
> perl password.pl
16 users:
root => 0
bin => 1
daemon => 2
adm => 3
...
In this case, we could equally well have used split
with a split pattern of a colon since quotes do not usually appear in a password file. However, the principle still applies.
Parsing a Single Line Only
The fourth function provided by Text::ParseWords
is parse_line
. It parses a single line only but is otherwise identical in operation to quotewords
, and it takes the same parameters with the exception that the last can only be a scalar string value:
@words = parse_line('\s+', 0, $line);
The parse_line
subroutine provides no functional benefit over quotewords
, but if we only have one line to parse, for example, a command-line input, then we can save a subroutine call by calling it directly rather than via quotewords
or shellwords
.
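For instance, a single line of command input can be parsed like this; the command string is invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# parse one line of command input: the quoted filename
# is returned as a single word
my $command = 'open "some file.txt" readonly';
my @words = parse_line('\s+', 0, $command);
print "$_\n" foreach @words;
```

This yields the three words open, some file.txt, and readonly, with the quotes stripped.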
Added to the Perl standard library for Perl 5.6, Text::Balanced
is also available for older Perls from CPAN. It provides comprehensive abilities to match delimiters and brackets with arbitrary levels of nesting. Matching nested delimiters is traditionally a hard problem to solve, so having a ready-made solution is very welcome.
Extracting an Initial Match
All of the routines provided by Text::Balanced
work in essentially the same way, taking some input text and applying one or more delimiters, brackets, or tags to it in order to extract a match. We can also supply an initial prefix for the routine to skip before commencing. This prefix, by default set to skip whitespace, is a regular expression, so we can create quite powerful match criteria.
Quotes and Single-Character Delimiters
To match delimiters and brackets, we have the extract_delimited
and extract_bracketed
routines. These operate in substantially similar ways, the only difference being that the latter understands the concept of paired characters, where the opening and closing delimiters are different. Here is a simple example of extracting the first double-quoted expression from some input text:
#!/usr/bin/perl
# quotebalanced1.pl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);
my $input=qq[The "quick" brown fox "jumped over" the lazy "dog"];
my ($extracted,$remainder)=extract_delimited($input,'"','The ');
print qq[Got $extracted, remainder <$remainder>\n];
The first argument to extract_delimited
is the text to be matched. The second is the delimiter; only the first character is used if more than one is supplied. The third (optional) parameter is the prefix to skip before starting the extraction. Without it, only whitespace is skipped over. This program will generate the following output:
Got "quick", remainder < brown fox "jumped over" the lazy "dog">
The remainder starts with the space immediately following the second quote, and the extracted text includes the delimiters. If we don't care about the remainder, we do not have to ask for it. All the extract_
functions will notice when they are called in a scalar context and will return just the extracted text, so we can write
my $extracted=extract_delimited($input,'"','The ');
If we want to match on more than one kind of delimiter, for example, single and double quotes, we replace the delimiter with a character class, like this:
my $extracted=extract_delimited($input,q/["']/,'The ');
This reveals that the delimiter is actually a regular expression, and indeed we could also write qr/["']/
here, or use even more advanced patterns. Whichever quote character is found is looked for again to complete the match, so any number of intervening single quotes may occur between an initial double quote and its terminating twin.
We are not just limited to quotes as delimiters—any character can be used. We can also pass undef
as a delimiter, in which case the standard Perl quotes are used. The following statements are equivalent:
my $extracted=extract_delimited($input,q/["'`]/,'The '); # explicit
my $extracted=extract_delimited($input,undef,'The ');    # implicit ', " and `
In order to match the first set of quotes, we supplied the prefix 'The '
to extract_delimited
. Given the position of the first quote, the routine then finds the second. This is not very flexible, however. What we would really like to do is specify a prefix that says "skip everything up to the first double quote." Luckily, this turns out to be very easy because the prefix is a regular expression, and this is simply expressed as [^"]+
, or "anything but a double quote":
my ($extracted,$remainder)=extract_delimited($input,'"','[^"]+');
Substituting this line for the original will generate exactly the same output, but now it is no longer bound to the specific prefix of the input text. If we are curious to know what the prefix actually matched, we can get it from the third value returned:
my ($extracted,$remainder,$prefix)=extract_delimited($input,'"','[^"]+');
We can supply a precompiled regular expression for the third parameter as well:
my ($extracted,$remainder)=extract_delimited($input,'"',qr/[^"]+/);
This emphasizes the regular expression, which improves legibility. It also allows us to specify trailing pattern match modifiers, as in this alternative regular expression, which is closer to the literal meaning of "skip everything up to the first quote":
my ($extracted,$remainder)=extract_delimited($input,'"',qr/.*?(?=")/s);
This pattern starts with a nongreedy match for anything. Since a dot does not ordinarily match a newline, the /s
modifier is required to permit the prefix to match an arbitrary number of initial lines without double quotes in them. This pattern also makes use of a positive look-ahead assertion (?=")
to spot a quote without absorbing it. Combined with the nongreedy pattern, this will match all text up to the first quote.
By default, the backslash character, \, escapes delimiters so that they will not be considered as delimiters in the input text. A fourth parameter to
extract_delimited
allows us to change the escape character or individually nominate a different escape character for each delimiter. For example:
extract_delimited($input,q/['"`]/,undef,'');     # no escape character
extract_delimited($input,q/['"`]/,undef,"\33");  # ASCII 27 (ESC)
extract_delimited($input,q/['"`]/,undef,q/'"`/); # escape is delimiter
The last example defines a list of quote characters that is identical to (that is, in the same order as) the delimiters specified in the character class of the second parameter. If more than one escape character is specified, then each delimiter is escaped with the corresponding escape character. If not enough escape characters are supplied, then the last one is used for all remaining delimiters. In this example, the escape character for each delimiter is the same as the delimiter, so now we double up a character to escape it.
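As a sketch of that last case, with the double quote acting as its own escape, a doubled quote inside the string no longer terminates it (the input text is invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# the doubled "" inside the string is an escaped quote, not a terminator
my $input = q[say "a ""quoted"" word" now];
my ($extracted, $remainder) =
    extract_delimited($input, '"', qr/[^"]+/, '"');
print qq[Got $extracted\n];
```

The extraction runs past both doubled quotes and stops only at the genuine closing quote, so the extracted text is `"a ""quoted"" word"`.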
We can generate a customized regular expression that can take the place of extract_delimited
for preset criteria. To do this, we make use of gen_delimited_pat
, which takes delimiter and optional escape character arguments:
my $delimiter_re=gen_delimited_pat(q/['"`]/,q/'"`/);
The regular expression generated by this statement will match quote strings using any of the quote characters specified, each one of which is escapable by doubling it up. This is a convenient way to pregenerate regular expressions and does not even require that Text::Balanced
is available in the final application:
$input =~ /($delimiter_re)/ and print "Found $1\n";
Or, to emulate extract_delimited
more closely:
$input =~ /^($prefix)($delimiter_re)(.*)$/ and
($prefix,$extracted,$remainder)=($1,$2,$3);
The regular expression is in an optimized form that is much longer than one we might otherwise write but which is the most efficient at finding a match. It does not extract text though, so we need to add our own parentheses to get the matched text back.
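As a sketch, here is a pregenerated pattern for double-quoted strings with the default backslash escape, capturing a match with our own parentheses (the input string is invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# a pattern matching double-quoted strings, backslash as the escape
my $delimiter_re = gen_delimited_pat('"');

my $input = 'say "hello \"world\"" loudly';
my $found;
$found = $1 if $input =~ /($delimiter_re)/;
print "Found $found\n" if defined $found;
```

The escaped quotes inside the string do not end the match, so the captured text is `"hello \"world\""` including its delimiters.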
Brackets and Braces
The extract_bracketed
function is identical in use and return values, except that the delimiters are one or more of the matched brace characters (), [], {}
, or <>
. Here is an adapted version of the previous example that extracts bracketed text:
#!/usr/bin/perl
# bracebalanced.pl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);
my $input=qq[The (<quick brown> fox) {jumped} over the (<lazy> dog)];
my ($extracted,$remainder)=extract_bracketed($input,'()<>{}',qr/[^()<>{}]+/);
print qq[Got "$extracted", remainder "$remainder"\n];
When run, this program will produce the following output:
Got "(<quick brown> fox)", remainder " {jumped} over the (<lazy> dog)"
As before, the prefix is simply defined as text not containing any of the delimiters. If we changed the delimiter to not detect (
, the text extracted would instead be <quick brown>
. Interestingly, since Perl already knows about matched delimiter characters for quoting operators, we only need to specify the opening characters:
my ($extracted,$remainder)=extract_bracketed($input,'(<{',qr/[^()<>{}]+/);
It would be a mistake to think that extract_bracketed
just looks for a closing delimiter character, though. One of its major benefits is that it understands nested braces and will only match on the corresponding closing brace. Take this adjusted example, where all the braces are round:
#!/usr/bin/perl
# nestedbracebalanced.pl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);
my $input=qq[The ((quick brown) fox) (jumped) over the ((lazy) dog)];
my ($extracted,$remainder)=extract_bracketed($input,'(',qr/[^()]+/);
print qq[Got "$extracted", remainder "$remainder"\n];
When run, this matches the correct closing brace, after fox
, not the first one after brown
:
Got "((quick brown) fox)", remainder " (jumped) over the ((lazy) dog)"
In situations where more than one brace type is significant, all brackets must nest correctly for the text to be successfully extracted. For example, if ()
and <>
are both considered delimiters, then this text will not match because there is no matching >
:
my $input='(supply < demand)';
Only those characters listed as delimiters (and their corresponding closing characters) are managed this way, so if we do not consider <
and >
to be delimiters, then they are not considered and this text would match.
We can even handle quoted delimiters and ignore them rather than treating them as delimiters if we include a quote character in the list of delimiters. For example:
my ($extracted,$remainder)=extract_bracketed($input,q[("')],qr/[^()]+/);
This will match on round brackets only but disregard any round brackets found within single- or double-quoted strings. Interestingly, if the letter q
is included, any character acceptable as a Perl quoting delimiter for quotelike operators such as //
or {}
is also recognized. If not specified as an acceptable brace delimiter, characters like {
and }
will instead operate like quotes, causing their contents to be skipped rather than processed for valid delimiters.
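Here is a sketch of that quote-skipping behavior; the bracket character inside the double-quoted string is disregarded (the input text is invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# the ')' inside the quoted string is not treated as a closing bracket
my $input = '(print "a smiley :)" twice) and more';
my ($extracted, $remainder) = extract_bracketed($input, q[("')]);
print qq[Got $extracted\n];
```

Without the quote characters in the delimiter list, the extraction would stop at the smiley's round bracket; with them, it correctly matches the outer pair.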
XML and Other Tagged Delimiters
The extract_tagged
function does for tags what extract_bracketed
does for brace characters. It is broadly similar in use: it returns the same values of ($extracted,$remainder,$prefix)
, but it has a slightly expanded list of up to five parameters. Here are some examples of it:
extract_tagged($text);                          # match any XML tag
extract_tagged($text,"<FOO>");                  # match <FOO>...</FOO>
extract_tagged($text,"<[a-z]+>");               # match any lowercase XML tag
extract_tagged($text,"/start","/end");          # match /start.../end
extract_tagged($text,"[FOO]","[/FOO]",$prefix); # skip prefix then match
                                                # [FOO]...[/FOO]
As these examples show, while XML tags are the default target, any kind of delimiter tags can be defined, including regular expression matches. We can also pass in undef
for any parameters for which we just want to accept the defaults. For example:
extract_tagged($text,"{FOO}",undef,$prefix); # skip prefix then match
                                             # {FOO}...{/FOO}
extract_tagged($text,undef,undef,$prefix);   # skip prefix then match
                                             # any XML tag
Tags must balance, just as braces must balance for extract_bracketed
. The composition of the start tag and the end tag is specified and analyzed for punctuation versus alphanumeric characters, and the general rule for tags is inferred from it. The end tag, if not specified, is autogenerated by inserting a /
before the alphanumeric part, so (abc)
is automatically paired with (/abc)
and sets (...)
and (/...)
as the general rule. Remarkably, this will even work for tags like <[a-z]+>
as in the preceding example.
For example, the following program looks for and extracts the first lowercased XML tag:
#!/usr/bin/perl
# extractlctagged.pl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);
my $input=qq[<TEXT>The quick brown <subject>fox</subject> jumped
over the lazy <object>dog<?object></TEXT>];
my ($extracted,$remainder)=
extract_tagged($input,"<[a-z]+>",undef,".*?(?=<[a-z]+>)");
print qq[Got "$extracted"\nRemainder "$remainder"\n];
Running this program will produce
Got "<subject>fox</subject>"
Remainder " jumped over the lazy <object>dog<?object></TEXT>"
To match any XML-compliant tags, we would replace the extract_tagged
call in this example with
my ($extracted,$remainder)=extract_tagged($input,undef,undef,".*?(?=<[^<>]+>)");
This would try to match the entire input text because of the initial <TEXT>
, but the <subject>
and <object>
tags both match the general rule for tags, and so this match would fail, because the closing tag for <object>
was <?object>
, and not the expected </object>
. (We can see the reason for the failure by printing out $@
, as we will see later. $@
is set by Text::Balanced
whenever a match cannot be made.)
The fifth parameter to extract_tagged
is a hash of configuration options that allow us to alter the match behavior to handle common alternate inputs. For instance, HTML (as opposed to XHTML) allows some tags to be left unclosed. If we want to be able to handle these, we can tell extract_tagged
not to require the closing tag. For example, this will not only match paragraph tags <p>...</p>
, but also end the paragraph if a second <p>
is seen:
my $paragraph=extract_tagged($text,"<p>","</p>",".*?(?=<p>)", {
    reject => ["<p>"],
    fail => "MAX"
});
Here the reject
option instructs the routine to fail if a second <p>
is seen. The criteria are specified within a list reference so that we can specify multiple reject criteria at the same time if we wish. The fail
option "MAX"
instructs it to succeed and return all the characters matched up to that point rather than fail. This caters for both a closing tag and a new opening one. Alternatively, we can say fail => "PARA"
to cut off the returned characters at the next blank line.
We can also use the ignore
option to have extract_tagged
disregard character sequences that would otherwise match the start tag specification. This is generally useful when the start tag is a regular expression or the "match any XML tag" default. For example, to handle the <object>
tag by simply ignoring it, we could put
my $paragraph=extract_tagged($text,undef,undef,".*?(?=<[^<>]+>)",{
ignore => ["<object>"],
});
If we expect to use the same parameters to extract_tagged
frequently, we can generate a custom match routine, specific to the match we want to make, with gen_extract_tagged
. This takes the exact same arguments, minus the first input text argument, and allows us to rewrite this:
my ($extracted,$remainder)=
extract_tagged($text,"<TOP>","<BOTTOM>",qr/.*?(?=<TOP>)/);
as this:
my $topbottom_match=
gen_extract_tagged("<TOP>","<BOTTOM>",qr/.*?(?=<TOP>)/);
my ($extracted,$remainder)=$topbottom_match->($input);
The returned value is actually a closure (that is, a code reference) blessed into the Text::Balanced::Extractor
class, and hence is invoked as a method call and not a subroutine. These routines become very useful in multiple match scenarios using extract_multiple
. We can also create a special tag matcher for the <object>...<?object>
example and process it along with XML tags instead of ignoring it.
Perl-Style Quoted Text
The extract_quotelike
function extracts Perl-style quoted strings from the input text. It is essentially a smarter version of extract_delimited
that understands not just the standard quote characters, but also Perl's quotelike operators q, qq
, and qr
, hence its name. It will also parse any quotelike operator such as m//, s///, tr///
, or any of their alternate quote syntaxes like s{}{}
.
Since it is a specialized function, we cannot specify delimiters, and in fact we can only specify the text to be matched and the initial prefix. Perhaps to make up for this, this function returns up to eleven values in list context, although not all of them will contain meaningful values depending on what kind of quoted text was found.
my ($extracted,$remainder,$matchedprefix,
$typeofquote,
$startdelim1,$text1,$enddelim1,
$startdelim2,$text2,$enddelim2,
$endflags) = extract_quotelike($input,$prefix);
The first three values here are the familiar extracted text, remainder, and matched prefix that we have already seen. The fourth value holds the operator, if any, that was found. For a regular quoted string it is undefined, but it will hold the appropriate value for a q, qq, qr, s, m, y
, or tr
operator. The next three values describe the delimiters and the text of the first found string. For a regular quoted string, this is the string and its delimiters. For a match or substitution, it is the delimiters and pattern of the match. The three values following this have the same meaning except for the second found string; this only has meaning for substitutions and transliterations, the only two operators to have the concept of a "second" string. Finally, any pattern match modifiers found are returned. This only has meaning for matches, substitution, or transliteration, of course.
Clever though it is, extract_quotelike
is only really useful for parsing text containing Perl code. To handle ordinary quoted text, we actually want extract_delimited
instead.
Variables and Code
The final two extraction routines are extract_variable
and extract_codeblock
. The first of these simply matches any kind of Perl variable, including method calls and subroutine calls, via a code reference. It has the usual semantics:
my ($extracted,$remainder,$matchedprefix)=extract_variable($input,$prefix);
Like extract_quotelike
, this is only useful if we are parsing Perl code. Of course, in these cases it is very useful indeed.
The extract_codeblock
function is essentially a combination of extract_quotelike
and extract_bracketed
that will correctly parse strings containing nested braces and quoted strings that may contain braces that should not be considered significant for the purposes of matching. Its usage is identical to extract_bracketed
except that a fourth argument may be specified to describe an outer delimiter that is not included in the delimiter set passed as the second argument. This allows code to be "marked up" in enclosing text using a special delimiter that is not considered special within the code block itself. Only after the code block matches its initial delimiter again is the closing outer delimiter looked for once more.
For example, this statement matches on (), {}
, or []
braces, but looks for an outer set of <>
braces to mark the beginning and end of the code:
my ($extracted,$remainder,$matchedprefix)=
extract_codeblock($input, q/{}()[]/ , qr/.*?(?=<)/, "<>");
Even though the code block might actually contain <
and >
characters, they are not recognized as delimiters. Whatever the first brace inside the outer <....>
turns out to be, the closing >
is only matched when that brace is matched again, taking into account any nesting of delimiters that takes place within the block.
The default bracket delimiter for extract_codeblock
is just {}
, so with only the input text as an argument, or with a second argument of undef
, other kinds of brace are not recognized. This is different from the default {}()[]
of extract_bracketed
.
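The following sketch relies on that default: the brace inside the quoted string does not close the block, and the nested block must balance (the input text is invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# the '}' inside the double-quoted string is quoted content,
# so only the genuinely matching brace ends the block
my $input = '{ if ($x) { print "}" } } trailing';
my ($extracted, $remainder) = extract_codeblock($input);
print qq[Got $extracted\n];
```

The extraction correctly returns the whole outer block, leaving only the trailing text in the remainder.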
Both extract_codeblock
and extract_quotelike
may be more feature-full than we actually require. If we only need to deal with real quote characters and do not need the outer delimiter feature, we can achieve a similar effect more simply and with greater flexibility using extract_bracketed
with a delimiter string containing the quote and brace characters we want to handle.
Extracting Multiple Matches
All of the extractor functions we have looked at so far take note of and maintain the match position of the input text stored in a variable. This is the same position that the pos function and the \G regular expression metacharacter reference, and that the /g modifier sets. As a result, all of the extract_ functions can be used in loops to extract multiple values. Here is a modified version of one of our earlier examples, now rewritten into a do...while loop:
#!/usr/bin/perl
# whilequotebalanced.pl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);
my $input=qq[The "quick" brown fox "jumped over" the lazy "dog"];
my ($extracted,$remainder);
do {
($extracted,$remainder)=extract_delimited($input,'"',qr/[^"]+/);
print qq[Got $extracted, remainder <$remainder>\n];
} while ($extracted and $remainder);
When run, this version of the program will print out
Got "quick", remainder < brown fox "jumped over" the lazy "dog">
Got "jumped over", remainder < the lazy "dog">
Got "dog", remainder <>
Note the criteria of the while
condition. We terminate the loop if we run out of input text, indicated by an empty remainder, or we get an undef
returned for the extracted text, indicating an error. In the latter case, we can look at $@
to see what the problem was.
For more advanced requirements, we can use the extract_multiple
function. This wraps one or more extraction functions and applies the input text to each of them in turn. For example, this program applies extract_bracketed
repeatedly to the input text we used earlier:
#!/usr/bin/perl
# bracebalanced1.pl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_bracketed);
my $input=qq[The (<quick brown> fox) {jumped} over the (<lazy> dog)];
my @matches=extract_multiple(
$input,
[ \&extract_bracketed ],
undef, 1
);
print "Got ",scalar(@matches)," matches\n";
print $_+1,"='",$matches[$_],"'\n" foreach (0..$#matches);
When run, this program will output the following:
Got 3 matches
1='(<quick brown> fox)'
2='{jumped}'
3='(<lazy> dog)'
Of the four arguments supplied to extract_multiple
, the first is the input text as usual. If it is undefined, $_
is used. The second we will pass over for a moment. The third parameter defines how many matches to make; it has the same meaning as the third argument of split
. A positive number will cause only that many matches to take place (in scalar context, the number of matches is forced to 1, and a warning is issued if more are requested).
The fourth determines if unmatched text is discarded or returned. If true, it is discarded. If unspecified or false, it is returned. The unmatched text is able to take the place of prefix text in many cases; the unmatched text is assembled character by character each time extract_multiple
is unable to advance through the input text using any of the supplied extractors. If we modified the last argument to 0
, for example, the first string returned would be "The ".
Note If we don't need to specify the fourth argument and don't want to limit matches, then we can omit both the third and fourth arguments and just specify the input text.
The second argument is the most complex. It is a reference to an array of extraction functions; here we have only one. Only the input text is passed to each extractor, so if we are happy to use an extraction function with its default settings, we can just specify it as a code reference. An extraction function can also be a closure generated by gen_extract_tagged
, a regular expression, or a literal text string. Each is automatically detected and handled correctly, as demonstrated in this modified call:
my @matches=extract_multiple(
$input,
[ \&extract_bracketed,
qr/THE/i,
'over'
],
undef, 1
);
This will display the following:
Got 6 matches
1='The'
2='(<quick brown> fox)'
3='{jumped}'
4='over'
5='the'
6='(<lazy> dog)'
Note that there are no spaces before or after the extracted text. That is because the default prefix matched by extract_bracketed
is whitespace. It is matched inside the extractor but not returned as matched text.
What if we want to use different parameters to the defaults? This is not a problem, but we need to use an anonymous subroutine to wrap our extractor subroutines with the arguments we want. The first argument should be $_[0]
, to get the input text passed in by extract_multiple
. For example:
my @matches=extract_multiple(
$input,
[ sub { extract_bracketed($_[0],'()<>{}',qr/[^()<>{}]+/) } ],
undef, 1
);
As a final nicety, if an extractor is specified as a hash reference to a hash of one key-value pair, the value is used as the extractor, and on a successful match the extracted text is returned in a reference blessed into the class named by the key:
my @matches=extract_multiple(
$input, [
{ 'Parsed::String::QUOTED' => \&extract_delimited },
{ 'Parsed::String::BRACKETED' => \&extract_bracketed },
]
);
This allows smart extraction algorithms to tokenize extracted text conveniently and is very useful for implementing parsers.
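A minimal tokenizer sketch along these lines might look as follows. The Parsed::String::* class names are arbitrary labels rather than real modules, and the extracted text is assumed to come back as a blessed scalar reference, per the description above:

```perl
#!/usr/bin/perl
# tokenize.pl - blessed extraction results as parser tokens
# (the Parsed::String::* names are arbitrary labels, not real modules)
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

my $input = q['quoted text' (bracketed text) leftover];

my @tokens = extract_multiple($input, [
    { 'Parsed::String::QUOTED'    => \&extract_delimited },
    { 'Parsed::String::BRACKETED' => \&extract_bracketed },
]);

foreach my $token (@tokens) {
    if (ref $token) {
        # matched text comes back as a blessed scalar reference
        print ref($token), ": $$token\n";
    } else {
        # unmatched text is returned as a plain string
        print "unmatched: $token\n";
    }
}
```

Each token carries its own type in the class name, which is exactly what a simple parser needs to dispatch on.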
Handling Failed Matches
If any of Text::Balanced
's functions fail, an undef
is returned to the caller. In a list context, the undef
is followed by the original input text, so the usual semantics of ($extracted,$remainder)
remain consistent even in a failed match. To find out the actual reason for the failure, we look at the special variable $@
, which is assigned an object of class Text::Balanced::ErrorMsg
with two attributes:
$@->{error} |
A diagnostic error message indicating the reason for failure |
$@->{pos} |
The position in the input text where the error occurred |
In a string context, both values are combined into a single diagnostic error message. For example, in our earlier mismatched XML tag example where we wrote <object>dog<?object>, $@
would be set to
Found unbalanced nested tag: <object>, detected at offset 66
There are over 20 possible reasons for Text::Balanced
to fail and set $@
. On a successful match, $@
will always be undef
, so any other value is an indication of failure.
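A short sketch (with an invented, deliberately unbalanced input) shows these failure semantics in action:

```perl
#!/usr/bin/perl
# balancederror.pl - examining $@ after a failed extraction
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $input = '(this bracket is never closed';
my ($extracted, $remainder) = extract_bracketed($input, '()');

unless (defined $extracted) {
    print "Failed: $@\n";           # stringified diagnostic
    print "Reason: $@->{error}\n";  # the 'error' attribute
    print "Offset: $@->{pos}\n";    # the 'pos' attribute
}
```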
The Text::Wrap
module provides text-formatting facilities to automate the task of turning irregular blocks of text into neatly formatted paragraphs, organized so that their lines fit within a specified width. Although not particularly powerful, it provides a simple and quick solution.
It provides two subroutines: wrap
handles individual paragraphs and is ideally suited for formatting single lines into a more presentable form, and fill
handles multiple paragraphs and will work on entire documents.
The wrap
subroutine formats single paragraphs, transforming one or more lines of text of indeterminate length and converting them into a single paragraph. It takes three parameters: an initial indent string, which is applied to the first line of the resulting paragraph; a following indent string, applied to the second and all subsequent lines; and finally a string or list of strings.
Here is how we could use wrap
to generate a paragraph with an indent of five spaces on the first line and an indent of two spaces on all subsequent lines:
$para = wrap('     ', '  ', @lines);
Any indentation is permissible. Here is a paragraph formatted (crudely) with HTML tags to force the lines to conform to a given line length instead of following the browser's screen width:
$html = wrap("<p> ", "<br>", $text);
If a list is supplied, wrap
concatenates all the strings into one before proceeding; there is no essential difference between supplying a single string and supplying a list. However, existing indentation, if there is any, is not eliminated, so we must take care to deal with this first if we are handling text that has already been formatted to a different set of criteria. For example:
s/^\s+// foreach @lines; # strip leading whitespace from all lines
The list can be of any origin, not just an array variable. For example, take this one-line reformatting application:
print wrap(" ", "", <>); # reformat standard input/ARGV
Text::Wrap expands tabs into spaces before formatting, a function it delegates to Text::Tabs, documented previously. When formatting is complete, spaces are converted back into tabs, if possible and appropriate. See the section "Expanding and Contracting Tabs with Text::Tabs" for more information.
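Putting the pieces together, here is a small self-contained sketch (filename and sample text invented) that wraps a long line to 40 columns with the five-space/two-space indents described earlier:

```perl
#!/usr/bin/perl
# wrapdemo.pl - wrap a long line to 40 columns
use strict;
use warnings;
use Text::Wrap qw(wrap);

$Text::Wrap::columns = 40;

my $text = "The quick brown fox jumped over the lazy dog. " x 4;

# five-space indent on the first line, two spaces thereafter
print wrap('     ', '  ', $text), "\n";
```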
Customized Wrapping
The Text::Wrap
module defines several package variables to control its behavior, including the formatting width, the handling of long words, and the break text.
The number of columns to format is held in the package variable Text::Wrap::columns
and has a default value of 76, which is the polite width for things like e-mail messages (to allow a couple of >
quoting prefixes to be added in replies before 80 columns is reached). We can change the column width to 39 with
$Text::Wrap::columns = 39;
Words that are too long to fit the line are broken up (URLs in text documents are a common culprit). This behavior can be altered to a fatal error by setting the variable with the following line:
$Text::Wrap::huge = 'die';
Alternatively, long words can be left as-is, causing them to overflow the width, with
$Text::Wrap::huge = 'overflow';
We can also configure the break text, that is, the character or characters that separate words. The break text is a regular expression, defined in the package variable $Text::Wrap::break
, and is by default \s
, to match any whitespace character. To allow a comma or a colon, but not a space to break text, we could redefine this to $Text::Wrap::break = '[:,]';
.
A limited debugging mode can also be enabled by setting the variable $Text::Wrap::debug
:
$Text::Wrap::debug = 1;
The columns, break
, and huge
variables can all be exported from the Text::Wrap
package, if desired:
use Text::Wrap qw($columns $huge);
$columns = 39;
$huge = 'overflow';
As with any module symbols we import, this is fine for simple scripts but is probably unwarranted for larger applications—use the fully qualified package variables instead.
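The effect of these settings can be seen in a short sketch (the URL is a placeholder); with huge set to 'overflow', the long "word" is left intact even though it exceeds the 20-column limit:

```perl
#!/usr/bin/perl
# hugewrap.pl - long 'words' overflow rather than being broken
use strict;
use warnings;
use Text::Wrap qw(wrap);

$Text::Wrap::columns = 20;
$Text::Wrap::huge    = 'overflow';

my $text = 'see http://www.example.com/a/very/long/path for details';
print wrap('', '', $text), "\n";
```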
Formatting Whole Documents
Whole documents can be formatted with the fill
subroutine. This will chop the supplied text into paragraphs first by looking for lines that are indented, indicating the start of a new paragraph, and blank lines, indicating the end of one paragraph and the start of another. Having determined where each paragraph starts and ends, it then feeds the resulting lines to wrap
, before merging the resulting wrapped paragraphs back together.
The arguments passed to fill
are the same as those for wrap
. Here is how we would use it to reformat paragraphs into unindented and spaced paragraphs:
$formatted_document = fill("\n", "", @lines);
If the two indents are identical, fill
automatically adds a blank line to separate each paragraph from the previous one. Therefore, the preceding could also be achieved with
$formatted_document = fill("", "", @lines);
If the indents are not identical, then we need to add the blank line ourselves:
$formatted_document = fill("\t", "", @lines);
# indent each new paragraph with a tab, paragraphs are continuous
$formatted_document = fill("\n\t", "", @lines);
# indent each new paragraph with a tab, paragraphs are separated
All the configurable variables that affect the operation of wrap
also apply to fill
, of course, since fill
uses wrap
to do most of the actual work. It is not possible to configure how fill
splits text into paragraphs.
Note that if fill
is passed lines already indented by a previous wrap
operation, then it will incorrectly detect each new line as a new paragraph (because it is indented). Consequently, we must remove misleading indentation from the lines we want to reformat before we pass them to fill
.
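As a self-contained illustration (sample text invented), this sketch reflows two rough paragraphs with identical empty indents, so fill separates them with a blank line:

```perl
#!/usr/bin/perl
# filldemo.pl - reflow two rough paragraphs
use strict;
use warnings;
use Text::Wrap qw(fill);

$Text::Wrap::columns = 50;

my @lines = (
    'This is the first paragraph. It arrives as',
    'several short lines of differing',
    'lengths.',
    '',
    'This is the second paragraph, also broken up.',
);

# identical indents, so fill inserts a blank line between paragraphs
print fill('', '', @lines), "\n";
```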
Formatting on the Command Line
Text::Wrap
's usage is simple enough for it to be used on the command line:
> perl -MText::Wrap -e "print fill('','',<>)" -- textfile ...
Here we have used the special argument --
to separate Perl's arguments from the file names to be fed to the formatter. We can supply any number of files at once and redirect the output to a file if we wish. A related module that may be worth investigating is Text::Autoformat
, which is specifically tailored for command-line uses like this.
The Text::Soundex
module is different in nature from the other modules in the Text::
family. While modules such as Text::Abbrev
and Text::ParseWords
are simple solutions to common problems, Text::Soundex
tackles a different area entirely. It implements a version of the Soundex algorithm developed for the U.S. Census in the latter part of the 19th century as an aid for phonetically indexing surnames and popularized by Donald Knuth of TeX fame.
The Soundex algorithm takes words and converts them into tokens that approximate the sound of the word. Similar-sounding words produce tokens that are either the same or close together. Using this, we can generate Soundex tokens for a predetermined list of words, say a dictionary or a list of surnames, and match queries against it. If the query is close to a word in the list, we can return the match even if the query is not exactly right, misspelled, for example.
Tokenizing Single Words
The Text::Soundex
module provides exactly one subroutine, soundex
, that transforms one word into its Soundex token. It can also accept a list of words and will return a list of the tokens, but it will not deal with multiple words in one string:
print soundex "hello"; # produces 'H400'
print soundex "goodbye"; # produces 'G310'
print soundex "hilo"; # produces 'H400' - same as 'Hello'
print join ',', soundex qw(Hello World);
# produces 'H400,W643'
print soundex "Hello World"; # produces 'H464'
The following short program shows the Soundex algorithm being used to look up a name from a list given an input from the user. Since we are using Soundex
, the input doesn't have to be exact, just similar:
#!/usr/bin/perl
# surname.pl
use warnings;
use strict;
use Text::Soundex;
# define an ABC of names (as a hash for 'exists')
my %abc = (
"Hammerstein" => 1,
"Pineapples" => 1,
"Blackblood" => 1,
"Deadlock" => 1,
"Mekquake" => 1,
"Rojaws" => 1,
);
# create a token-to-name table
my %tokens;
foreach (keys %abc) {
$tokens{soundex $_} = $_;
}
# test input against known names
print "Name? ";
while (<>) {
chomp;
if (exists $abc{$_}) {
print "Yes, we have a '$_' here. Another? ";
} else {
my $token = soundex $_;
if (exists $tokens{$token}) {
print "Did you mean $tokens{$token}? ";
} else {
print "Sorry, who again? ";
}
}
}
We can try out this program with various different names, real and imaginary, and produce different answers. The input can be quite different from the name if it sounds approximately right:
> perl surname.pl
Name? Hammerstone
Did you mean Hammerstein? Hammerstein
Yes, we have a 'Hammerstein' here. Another? Blockbleed
Did you mean Blackblood? Mechwake
Did you mean Mekquake? Nemesis
Sorry, who again?
Tokenizing Lists of Words and E-Mail Addresses
We can produce a string of tokens from a string of words by splitting up the string before feeding it to soundex
. Here is a simple query program that takes input from the user and returns a list of tokens:
#!/usr/bin/perl
# soundex.pl
use warnings;
use strict;
use Text::Soundex;
while (<>) {
chomp; #remove trailing linefeed
s/\W/ /g; #zap punctuation, e.g. '.', '@'
print "'$_' => '@{[soundex(split)]}'\n";
}
We can try this program out with phrases to illustrate that accuracy does not have to be all that great, as a guide:
> perl soundex.pl
definitively inaccurate
'definitively inaccurate' => 'D153 I526'
devinatovli inekurat
'devinatovli inekurat' => 'D153 I526'
As well as handling spaces, we have also added a substitution that converts punctuation into spaces first. This allows us to generate a list of tokens for an e-mail address, for example.
The Soundex Algorithm
As the previous examples illustrate, Soundex tokens consist of an initial letter, which is the same as that of the original word, followed by three digits that represent the sound of the first, second, and third syllable, respectively. Since one has one syllable, only the first digit is non-zero. On the other hand, seven has two syllables, so it gets two non-zero digits. Comparing the two results, we can notice that both one and seven contain a 5, which corresponds to the syllable containing "n" in each word.
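We can verify this comparison directly; the tokens below follow from the algorithm as described (the script name is invented):

```perl
#!/usr/bin/perl
# onesevensoundex.pl - compare tokens for 'one' and 'seven'
use strict;
use warnings;
use Text::Soundex;

print soundex('one'), "\n";    # produces 'O500'
print soundex('seven'), "\n";  # produces 'S150'
```

Both tokens contain a 5, the digit assigned to the "n" sound in each word.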
The Soundex algorithm has some obvious limitations though. In particular, it only resolves words up to the first three syllables. However, this is generally more than enough for simple "similar sounding" type matches, such as surname matching, for which it was designed.
Handling Untokenizable Words
In some rare cases, the Soundex algorithm cannot find any suitable token for the supplied word. In these cases, it usually returns nothing (or to be more accurate, undef
). We can change this behavior by setting the variable $Text::Soundex::soundex_nocode:
$Text::Soundex::soundex_nocode = 'Z000'; # a common 'failed' token
print soundex "=>"; # produces 'Z000'
If we change the value of this variable, we must be sure to set it to something that is not likely to genuinely occur. The value of Z000
is a common choice, but it matches many words including Zoo
. A better choice in this case might be Q999
, but there is no code that is absolutely guaranteed not to occur. If we do not need to conform to the Soundex code system (we might want to pass the results to something else that expects valid Soundex tokens as input), then we can simply define an impossible value like _NOCODE_
or ?000
, which soundex
cannot generate.
As well as the modules in the Text::
family, several other Perl modules outside the Text::
hierarchy involve text processing or combine text processing with other functions.
Several of the Term::
modules all involve text processing in relation to terminals. For instance, Term::Cap
involves generating ANSI escape sequences from capability codes, while Term::ReadLine
provides input line text processing support. These modules are all covered in Chapter 15.
Many Unix shell scripts make use of the sed
command to carry out text processing. The name is short for stream editor, and a typical sed
command might look like this:
sed 5q file.txt
This prints out the first five lines of a file, rather like the head
command.
Perl comes with a script called psed
that provides a complete implementation of sed
written in Perl. As it has no dependency on the real sed
, it will work on platforms like Windows for which sed
is not available (short of installing a Unix shell environment like Cygwin):
psed 5q file.txt
When invoked under the alternate name s2p
, this script instead takes the supplied sed
arguments and generates a stand-alone Perl script that performs the same operation:
s2p 5q file.txt > printtop5.pl
Of course, the script generated is not terribly efficient compared to simply reimplementing the script in Perl to start with. Under either name, the script can also be given the option -f
to parse and process a sed
script in a file rather than a directly typed command and -e
to specify additional commands (with or without -f
).
Documentation is a good idea in any programming language. Like most programming languages, Perl supports simple comments. However, it also attempts to combine the onerous duties of commenting code and documenting software into one slightly less arduous task through the prosaically named POD or Plain Old Documentation syntax.
Anything after a #
is a comment and ignored by the Perl interpreter. Comments may be placed on a line of their own or after existing Perl code. They can even be placed in the middle of multiline statements:
print 1 * # depth
      4 * # width
      9;  # height
Perl will not interpret a #
inside a string as the start of a comment, but because of this we cannot place comments inside HERE documents. While we cannot comment multiple lines at once, like C-style /*...*/
comments, POD offers this ability indirectly.
POD is a very simple markup syntax for integrating documentation with source code. It consists of a series of special one-line tokens that distinguish POD from source code and also allows us to define simple structures like headings and lists.
In and of itself, POD does nothing more than give us the ability to write multiline comments. However, its simple but flexible syntax also makes it very simple to convert into user-friendly document formats. Perl comes with the following POD translator tools as standard:
pod2text |
Render POD in plain text format. |
pod2html |
Render POD into HTML. |
pod2man |
Render POD into Unix manual page (nroff ) format. |
pod2latex |
Render POD into Latex format. |
Many more translators are available from CPAN, of course, including translators for RTF, XML, PostScript, PDF, DocBook, and OpenOffice, as well as alternate translators for HTML, text, and so on. The perldoc
utility is just a friendlier and more specialized interface to the same translation process, as are pod2usage
and podselect
, all of which are covered in this section.
POD Paragraphs
POD allows us to define documentation paragraphs, which we can insert into other documents—most usually, but by no means exclusively, Perl code. The simplest sequence is =pod ... =cut
. The =pod
token states that all of the following text is to be taken as POD paragraphs, until the next =cut
or the end of the file:
...
$scalar = "value";
=pod

This is a paragraph of POD text embedded into some Perl code. It is not indented,
so it will be treated as normal text and word wrapped by POD translators.

=cut

print do_something($scalar);
...
Within the delimited section, text is divided into paragraphs, which are simply blocks of continuous text (potentially including linefeeds). A paragraph ends and a new one begins only when a completely empty line is encountered. All POD tokens will absorb a paragraph that immediately follows them, which is why there is a blank line after the =cut
in the preceding example. While the blank line preceding the =cut
is not necessary, maintaining blank lines on both sides helps to visually discriminate the POD directive.
Some tokens such as =item
and =head1
, covered shortly, use the attached paragraph for display purposes. Others, like =pod
and =cut
, ignore it. This lets us document the POD itself, as the text immediately following =pod
or =cut
is not rendered by POD processors.
=pod this is just a draft document

mysubname - this subroutine doesn't do much of anything
at all and is just serving as an example of how to document it with
a POD paragraph or two

=cut end of draft bit
Since nonblank following lines are included in the text attached to the token, the preceding =pod
token could also be written, with identically equivalent meaning, as follows:
=pod this is
just a draft
document
The same rule applies to all POD tokens except =cut
. While POD translators will absorb text on the line or lines immediately following a =cut
, the Perl interpreter itself will only ignore text following a =cut
on the same line. As a consequence, we cannot spread a "cut comment" across more than one line and expect code to compile.
Paragraphs and Paragraph Types
If a paragraph is indented, then we consider it to be preformatted, much in the same way that the HTML <pre>
tag works. The following example shows three paragraphs, two of which are indented:
=pod

    This paragraph is indented, so it is taken as
    is and not reformatted by translators like:
    pod2text - the text translator
    pod2html - the HTML translator
    pod2man - the Unix manual page translator

    Note that 'as is' also means that escaping does not work, and that
    interpolation doesn't happen. What we see is what we get.

This is a second paragraph in the same =pod...=cut section. Since it is
not indented it will be reformatted by translators.
=cut
Section headings can be added with the =head1
and =head2
tokens, plus =head3
and =head4
in more recent releases of the POD translators. The heading text is the paragraph immediately following the token and may start (and end) on the same line:
=head1 This is a level one heading
Or:
=head2 This is
a level
two heading
Or:
=head2
As is this
Heading tokens start a POD section in the same way that =pod
does if one is not already active. POD sections do not nest, so we only need one =cut
to get back to Perl code. In general the first form is used, but it's important to leave an empty line if we do not want the heading to absorb the following paragraph:
=head1 ERROR: This heading has accidentally swallowed up
this paragraph, because there is no separating line.
=head2 ERROR: Worse than that, it will absorb this second level heading too
so this becomes one long level one heading.
=cut
This is how we should really do it:
=head1 This is a level one heading

This is a paragraph following the level one heading.

=head2 This is a level two heading

    This is a preformatted paragraph following the level two heading.

=cut
How the headings are actually rendered is entirely up to the translator. By default the text translator pod2text
indents paragraphs by four spaces, level two headings by two, and level one headings by none—crude, but effective. The HTML translator pod2html
uses tags like <h1>
and <h2>
as we might expect.
Lists
Lists can be defined with the =over ... =back
and =item
tokens. The =over
token starts a list and can be given a value such as 4
, which many formatters use to determine how much indentation to use. The list is ended by =back
and is optional if the POD paragraph is at the end of the document. =item
defines the actual list items of which there should be at least one, and we should not use this token outside of an =over ... =back
section. Here is an example three-item list:
=over 4

=item 1

This is item number one on the list

=item 2

This is item number two on the list

=item 3

This is the third item

=back
Like =pod
and =head
n, =over
will start a POD section if one is not active.
The numbers after the =item
tokens are purely arbitrary; we can use anything we like for them, including meaningful text. However, to make the job of POD translators easier, we should stick to a consistent scheme. For example, if we number them, we should do it consistently, and if we want to use bullet points, then we should use something like an asterisk. If we want named items, we can do that too. For example, a bullet-pointed list with paragraphs:
=over 4

=item *

This is a bullet pointed list

=item *

With two items

=back
A named items list:
=over 4

=item The First Item

This is the description of the first item

=item The Second Item

This is the description of the second item

=back
A named items list without paragraphs:
=over 4

=item Stay Alert

=item Trust No one

=item Keep Your Laser Handy

=back
POD translators will attempt to do the best they can with lists, depending on what they think we are trying to do and the constraints of the document format into which they are converting. The pod2text
tool will just use the text after the item name. The pod2html
tool is subject to the rules of HTML, which has different tags for ordered, unordered, and descriptive lists (<ol>, <ul>
, and <dl>
) so it makes a guess based on what the items look like. A consistent item naming style will help it make a correct guess.
Although =over
will start a new POD section, =back
will end the list but not the POD section. We therefore also need a =cut
to return to Perl code:
=over 4

=item * Back to the source

=back

=cut
Character Encodings
If documentation is written in a different character set than the default Latin-1, POD translators can be told to render it using an alternate character encoding with the =encoding
token. For example:
=encoding utf8
Typically, this is used to state the encoding of a whole document and should be placed near the top, before any renderable text.
Translator-Specific Paragraphs
The final kinds of POD token are the =for
and =begin ... =end
tokens. The =for
token takes the name of a specific translator and an immediately following paragraph (that is, not with an intervening blank line), which is rendered only if that translator is being used. The paragraph should be in the output format of the translator, that is, already formatted for output. Other translators will entirely ignore the paragraph:
=for text
This is a paragraph that will appear in documents produced by the pod2text translator.

=for html <font color="red">
<p>But this paragraph will appear in <b>HTML</b> documents</p>
</font>
Again, like the headings and item tokens, the paragraph can start on the next line, as in the first example, or immediately following the format name, as in the second.
Since it is annoying to have to type =for format
for every paragraph in a collection of paragraphs, we can also use the pair of =begin..=end
markers. These operate much like =pod...=cut
but mark the enclosed paragraphs as being specific to a particular format:
=begin html

<p>Paragraph1

<p><table>......
......</table>

<p>Paragraph2

=end html
If =begin
is used outside an existing POD section, then it starts one. The =end
ends the format-specific section but not the POD, so we also need to add a =cut
to return to Perl code, just as for lists.
=begin html

<p>A bit of <b>HTML</b> document

=end html

=cut
The =begin
and =end
tokens can also be used to create multiline comments, simply by providing =begin
with a name that does not correspond to any translator. We can even comment out blocks of code this way:
=begin disabled_dump_env

foreach (sort keys %ENV) {
    print STDERR "$_ => $ENV{$_}\n";
}

=end disabled_dump_env

=begin comment

This is an example of how we can use POD tokens to create comments.
Since 'comment' is not a POD translator type, this section is never
used in documents created by 'pod2text', 'pod2html', etc.

=end comment
Some extension modules also understand specific format names. For example, the Test::Inline
module looks for the special name test
to mark the location of in-line tests.
Using POD with __DATA__ and __END__
If we are using either a __DATA__
or a __END__
token in a Perl script, then we need to take special care with POD paragraphs that lie adjacent to them. POD translators require that there must be at least one empty line between the end of the data and a POD directive for the directive to be seen (rather like POD directives themselves, in fact); otherwise, it is missed by the translation tools. In other words, write this:
...
__END__

=head1
...
and not this:
...
__END__
=head1
...
Interior Sequences
We mentioned earlier that POD paragraphs could either be preformatted (indicated by indenting) or normal. Normal paragraphs are reformatted by translators to remove extraneous spaces, newlines, and tabs. Then the resulting paragraph is rendered to the desired width if necessary.
In addition to basic reformatting, normal paragraphs may also contain interior sequences. Each sequence consists of a single capital letter, followed by the text to treat specially within angle brackets. For example:
=pod

This is a B<paragraph> that uses I<italic> and B<bold> markup using the
BE<lt>textE<gt> and IE<lt>textE<gt> interior sequences. Here is an example
code fragment: C<substr $text,0,1> and here is a filename: F</usr/bin/perl>.
All these things are of course represented in a style entirely up to the
translator. See L<perlpod> for more information.

=cut
To specify a real <
and >
, we have to use the E<lt>
and E<gt>
sequences, reminiscent of the &lt; and &gt; HTML entities. The full, loosely categorized list of interior sequences supported by POD follows:
Style:
Sequence | Formatting |
B<text> |
Bold/Strong text (options, switches, program names). |
I<text> |
Italic/Emphasized text (variables, emphasis). |
S<text> |
Text contains nonbreaking spaces and cannot be word-wrapped. |
C<code> |
Code/Example text (listings, command examples). |
F<file> |
File names. |
Cross-references and hyperlinks:
Sequence | Formatting |
L<name> |
A cross-reference link to a named manual page and/or section |
L<page> |
Other manual page |
L<page/name> |
Section or list item in other manual page |
L<page/"name"> |
The same as preceding entry |
L</name> |
Section or list item in current manual page |
L<"name"> |
The same as preceding entry |
A section title is the text after a =head
POD directive, with any spaces replaced with underscores. A list item title is the text after an =item
POD directive. In case of conflicts, the first match will usually be linked to. Markup (if the title contains any) may be omitted in the link name.
Either a leading /
or quotes are necessary to distinguish a section or list item name from manual page names.
L<text|name> |
Equivalent to the L<name> sequence, but with an alternative text description for the link |
The descriptive text is given first. The original link name is given second, after a pipe symbol, and describes the nature of the link. For example:
L<text|name>
L<text|name/item>
L<text|name/"section">
L<text|"section">
L<text|/"section">
Under this syntax we cannot use explicit / or | characters, but see the E<escape> sequences later.
Miscellaneous:
X<index> |
An index entry. Ignored by most formatters, it may be used by indexing programs. |
Z<> |
A zero-width character. Useful for breaking up sequences that would otherwise be recognized as POD directives. |
Special characters and escape sequences:
E<escape> |
A named or numbered entity, styled on the &entity; syntax of HTML. |
These escapes are usually only necessary inside another sequence or immediately after a capital letter representing an escape sequence; for instance, B<text> is written literally as BE<lt>textE<gt>. In particular, the following special names are supported:
E<lt> |
< |
E<gt> |
> |
E<sol> |
/ |
E<verbar> |
| |
Otherwise, a generic number or name can be specified:
E<number> |
ASCII character code |
E<html> |
HTML entity (for example, "copy") |
Most translators will handle the preceding four named entities but are not necessarily going to support generic entities. The obvious exception to this is, of course, the HTML translator, which doesn't have to do any work other than add a & and ; to the name.
Perl provides a collection of modules in the Pod::
family that perform translations from POD into other formats and also provides utility modules for checking syntax. Most of these modules are wrapped by utility scripts that Perl provides as standard. The pod2html
tool, for example, is merely a wrapper for the Pod::Html
module.
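As a sketch of what such a wrapper does, the Pod::Html module can be driven directly from code. The file names in.pod and out.html here are hypothetical examples, so this sketch creates the input file itself to stay self-contained:

```perl
#!/usr/bin/perl
# A minimal sketch of driving Pod::Html directly; in.pod and
# out.html are invented names for this example.
use strict;
use warnings;
use Pod::Html;

# create a tiny POD file to convert
open my $fh, '>', 'in.pod' or die "open: $!";
print $fh "=head1 NAME\n\ndemo - a Pod::Html demonstration\n\n=cut\n";
close $fh;

# pod2html() accepts the same switches as the command-line tool
pod2html("--infile=in.pod", "--outfile=out.html", "--title=Demo");
```

The pod2html utility itself does little more than pass its command line to this function.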
Translator Tools
We have already mentioned pod2text
and pod2html
. Perl also comes with other translators and some POD utilities too. All the translators take an input and optional output file as arguments, plus additional options to control their output format. Without either, they take input from standard input and write it to standard output.
The list of POD translators supplied with Perl is as follows:
pod2text |
Translates POD into plain text. If the -c option is used and Term::ANSIColor is installed (see Chapter 15), colors will also be used. |
pod2html |
Translates POD into HTML, optionally recursing and processing directories and integrating cross-links between pages. |
pod2latex |
Translates POD into LaTeX, either a single document or a collection of related documents. |
pod2man |
Translates POD into Unix manual pages (compatible with nroff/troff). |
For more details on these translators, we can consult the relevant perldoc
page using the now familiar command line:
> perldoc <translatorname>
In addition to the standard translators, there are many POD translation tools available from CPAN, including translators for RTF/Word, LaTeX, PostScript, and plenty of other formats. Even a mildly popular format will likely have a POD translator.
Retrieval Tools
In addition to the translators, Perl provides three tools for extracting information from PODs selectively. Of these, perldoc
is by far the most accomplished. Although not strictly a translator, perldoc
is a utility that makes use of translators to provide a convenient Perl documentation lookup tool.
To attempt to retrieve usage information about the given Perl script, we can use the pod2usage
tool. For example:
> pod2usage myscript.pl
The tool searches for a SYNOPSIS
heading within the file and prints it out using pod2text
. A verbosity flag may be specified to increase the returned information:
-v 1 |
(default) SYNOPSIS only |
-v 2 |
SYNOPSIS plus OPTIONS and ARGUMENTS (if present) |
-v 3 |
All POD documentation |
A verbosity of 3 is equivalent to using pod2text
directly. If the file is not given with an absolute pathname, then -pathlist
can be used to provide a list of directory paths to search for the file in.
A simpler and more generic version of pod2usage
is podselect
. This tool attempts to locate a level 1 heading with the specified section title and extracts the subdocument from under that title in each file it is passed:
> podselect -s='How to boil an egg' *.pod
Note that podselect
does not do any translation, so it needs to be directed to a translator for rendering into reasonable documentation.
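The Pod::Select module that underlies podselect can also be used from code. Here is a minimal sketch; the file names and POD content are invented for the example, and the input file is created on the fly so the sketch is self-contained:

```perl
#!/usr/bin/perl
# Sketch: extract only the NAME section from a POD file with Pod::Select.
# recipe.pod and sections.pod are hypothetical file names.
use strict;
use warnings;
use Pod::Select;

# create a small POD file to work on
open my $fh, '>', 'recipe.pod' or die "open: $!";
print $fh <<'POD';
=head1 NAME

recipe - how to boil an egg

=head1 DESCRIPTION

Boil water, add egg, wait.

=cut
POD
close $fh;

# extract only the NAME section into sections.pod
podselect(
    { -output   => 'sections.pod',   # where to write the extract
      -sections => ['NAME'] },       # level 1 headings to keep
    'recipe.pod',
);
```

The extracted output is still POD, which is why it then needs a translator to become finished documentation.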
POD Verification
It is easy to make simple mistakes with POD, omitting empty lines or forgetting =cut
, for example. Fortunately, POD is simple enough to be easy to verify as well. The podchecker
utility scans a file looking for problems:
> podchecker poddyscript.pl
If all is well, then it will return the following:
poddyscript.pl pod syntax OK.
Otherwise, it will produce a list of problems, which we can then go and fix, for example:
*** WARNING: file does not start with =head at line N in file poddyscript.pl
This warning indicates that we have started POD documentation with something other than a =head1
or =head2
, which the checker considers to be suspect. Likewise:
*** WARNING: No numeric argument for =over at line N in file poddyscript.pl
*** WARNING: No items in =over (at line 17) / =back list at line N in file
poddyscript.pl
This indicates that we have an =over ... =back
pair, which not only does not have a number after the over
, but also does not even contain any items. The first is probably an omission. The second indicates that we might have bunched up our items so they all run into the =over
token. If we had left out the space before =back
, we would instead have got this error:
*** ERROR: =over on line N without closing =back at line EOF in file poddyscript.pl
The module underlying podchecker
is Pod::Checker
, and we can also use it in code:
# function syntax
$ok = podchecker($podfile, $checklog, %options);
# object syntax
$checker = new Pod::Checker %options;
$checker->parse_from_file($podpath, $checklog);
Both file arguments can be either file names or filehandles. By default, the POD file defaults to STDIN
and the check log to STDERR
, so a very simple checker script could be
use Pod::Checker;
print podchecker() ? "OK" : "Fail";
The options hash, if supplied, allows one option to be defined: enable or disable the printing of warnings. The default is on, so we can get a verification check without a report using STDIN
and STDERR
:
$ok = podchecker(*STDIN, *STDERR,'warnings' => 0);
The actual podchecker
script is more advanced than this, but not by all that much.
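Beyond the functional interface, the Pod::Checker object interface also keeps counts of what it found. A minimal sketch, with a deliberately broken POD file invented for the example:

```perl
#!/usr/bin/perl
# Sketch: count POD problems with the Pod::Checker object interface.
# broken.pod and its contents are invented for this example.
use strict;
use warnings;
use Pod::Checker;

# create a POD file with an unclosed =over
open my $fh, '>', 'broken.pod' or die "open: $!";
print $fh "=over 4\n\nno items here, and no =back\n";
close $fh;

my $checker = Pod::Checker->new(-warnings => 1);
$checker->parse_from_file('broken.pod', \*STDERR);

# num_errors() and num_warnings() report what the parse found
printf "%d error(s), %d warning(s)\n",
    $checker->num_errors, $checker->num_warnings;
```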
Creating Usage Info and Manual Pages from POD
The pod2usage
tool allows us to dump out just the SYNOPSIS
section from a POD document, the SYNOPSIS
plus OPTIONS
and ARGUMENTS
(if either are present), or the whole manual page. We can make use of the Pod::Usage
module to provide the same capabilities within our own scripts.
The pod2usage
subroutine is automatically exported when using Pod::Usage
and is the single interface to its features. While it has a number of different calling conventions, it is typically used with Getopt::Std
or Getopt::Long
, as in this example:
#!/usr/bin/perl
# podusagedemo.pl
use strict;
use warnings;
use Pod::Usage;
use Getopt::Long qw(:config bundling no_ignore_case);
=head1 NAME
A demonstration of Pod::Usage
=head1 SYNOPSIS
podusagedemo.pl -h | -H | -l | -r [<files>]
=head1 OPTIONS
-h|--help this help
-H|--morehelp extended help
-l|--left go left
-r|--right go right
=head1 ARGUMENTS
One or more files may be specified as arguments, otherwise
standard input is used. (Both this section and OPTIONS
are displayed by the -h option)
=head1 DESCRIPTION
This is the extended help displayed by the -H option
=cut
my %opts;
pod2usage(-verbose=>0) unless GetOptions(\%opts,qw[
h|help H|m|morehelp l|left r|right
]);
pod2usage(-verbose=>1) if $opts{h};
pod2usage(-verbose=>2) if $opts{H} or $opts{m};
pod2usage(-verbose=>0, -message=>"Cannot go both left and right")
if $opts{l} and $opts{r};
# ...
A verbose level of 0 corresponds to the SYNOPSIS
only, which just displays the command line preceded by Usage:
. A verbose level of 1 prints out the OPTIONS
and ARGUMENTS
sections as well, similarly preceded by Options:
and Arguments:
respectively. This happens when the -h
option is used. An -H
or a -m
will generate the whole documentation, using highlighting and a pager in the manner of perldoc
or the pod2usage
tool. The help output, verbose level 1, looks like this:
> ./podusagedemo.pl -h
Usage:
podusagedemo.pl -h | -H | -l | -r [<files>]
Options:
-h|--help this help
-H|--morehelp extended help
-l|--left go left
-r|--right go right
Arguments:
One or more files may be specified as arguments, otherwise standard
input is used. (Both this section and OPTIONS are displayed by -h)
We can also build the calls to pod2usage
directly into the call to GetOptions
:
my %opts;
GetOptions(
'h|help' => sub { pod2usage(-verbose=>1) },
'H|m|morehelp' => sub { pod2usage(-verbose=>2) },
'l|left' => \$opts{l},
'r|right' => \$opts{r},
);
Although it is usually clearer to call pod2usage
with named arguments like -verbose
and -message
, we can also call it with a single numeric or string argument. A numeric argument will be treated as an exit status and will cause the program to exit displaying the synopsis (that is, verbose level 0). A string argument will be used as the message, as if -message
had been specified, again with a verbose level of 0.
The pod2usage
subroutine also understands the options and defaults shown in Table 18-2.
Option | Purpose |
-msg |
Alias for -message. |
-exitval |
Set an explicit exit status for pod2usage . Otherwise, 2 if verbose level is 1, or 1 if verbose level is 2 or 3. The special value NOEXIT causes pod2usage to return control to the program rather than exiting. |
-input |
File name or filehandle to get POD documentation from. For example, *DATA . Otherwise, the source file is used. |
-output |
File name or filehandle to write generated documentation to. Otherwise, standard output is used if the exit status is 0 or 1, and standard error if the exit status is 2 or higher. Note the default exit status is 2. |
-pathlist |
Search path to locate the file name specified to -input if it is not locally present. May be specified as a reference to an array or a colon-delimited path. Defaults to $ENV{PATH} . This option allows programs to self-document themselves even when the documentation is located in an external POD file. |
There is no requirement to specify separate OPTIONS
or ARGUMENTS
sections. If desired, either section can be bundled into the SYNOPSIS
to have them appear even for verbose level 0. In this case, level 1 simply becomes identical to level 0.
Recent versions of Getopt::Long
provide the HelpMessage
and VersionMessage
subroutines. HelpMessage
is essentially a wrapper around pod2usage
, while VersionMessage
emulates pod2usage
syntax and options, but it is fully contained within Getopt::Long
. See Chapter 14 for more information.
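A minimal sketch of this route, assuming a Getopt::Long recent enough to export HelpMessage and VersionMessage on request; the script name and options are invented for the example:

```perl
#!/usr/bin/perl
# Sketch: --help and --version handled entirely by Getopt::Long.
use strict;
use warnings;
use Getopt::Long qw(GetOptions HelpMessage VersionMessage);

our $VERSION = '1.0';

=head1 NAME

helpdemo.pl - demonstrate HelpMessage

=head1 SYNOPSIS

helpdemo.pl [--help] [--version]

=cut

GetOptions(
    'help'    => sub { HelpMessage(-exitval => 0) },     # prints SYNOPSIS
    'version' => sub { VersionMessage(-exitval => 0) },  # prints $VERSION
) or HelpMessage(-exitval => 2);

print "running normally\n";
```

HelpMessage accepts the same named arguments as pod2usage, which is what it wraps.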
Perl provides a number of modules for processing POD documentation. These modules form the basis for all the POD utilities, and they are described briefly in Table 18-3.
Module | Action |
Pod::Checker |
The basis of the podchecker utility. See earlier. |
Pod::Find |
Search for and return a hash of POD documents. See the section "Locating Pods." |
Pod::Functions |
A categorized summary of Perl's functions, exported as a hash. |
Pod::Html |
The basis for the pod2html utility. |
Pod::LaTeX |
The basis for the pod2latex utility. |
Pod::Man |
The basis for both the pod2man and the functionally identical pod2roff utilities. |
Pod::Parser |
The POD parser. This is the basis for all the translation modules and most of the others too. New parsers can be implemented by inheriting from this module. |
Pod::ParseLink |
A module containing the logic for converting L<...> POD links into URLs. |
Pod::ParseUtils |
A module containing utility subroutines for retrieving information about and organizing the structure of a parsed POD document, as created by Pod::InputObjects . |
Pod::InputObjects |
The implementation of the POD syntax, describing the nature of paragraphs and so on. In-memory POD documents can be created on the fly using the methods in this module. |
Pod::Perldoc |
The basis for the perldoc utility. Also incorporates a family of plug-in submodules handling format conversions, some of which require Pod::Simple (available from CPAN). |
Pod::PlainText |
The basis for the pod2text utility. |
Pod::Plainer |
A compatibility module for converting new-style POD into old-style POD. |
Pod::Select |
A subclass of Pod::Parser and the basis of the podselect utility, Pod::Select extracts selected parts of POD documents by searching for their heading titles. Any translator that inherits from Pod::Select rather than Pod::Parser will be able to support the Pod::Usage module automatically. |
Pod::Text |
The basis of the pod2text utility. |
Pod::Text::Color |
Convert POD to text using ANSI color sequences. The basis of the -color option to pod2text . Subclassed from Pod::Text . This uses Term::ANSIColor , which must be installed (see Chapter 15). |
Pod::Text::Overstrike |
Convert POD to text using overstrike escape sequences, where different effects are created by printing a character, issuing a backspace, and then printing another. |
Pod::Text::Termcap |
Convert POD to text using escape sequences suitable for the current terminal. Subclassed from Pod::Text . Requires termcap support (see Chapter 15). |
Pod::Usage |
The basis of the pod2usage utility; this uses Pod::Select to extract usage-specific information from POD documentation by searching for specific sections, for example, NAME, SYNOPSIS . |
In addition to the modules listed here, the Pod::Simple
family of modules on CPAN is also worthy of attention. Pod::Simple
provides a revised and refactored toolkit for writing and using POD translators with a flexible and extensible interface.
Another module of interest to developers working on ensuring that documentation for a module is complete is Pod::Coverage
. This module can be used to test whether or not POD documentation fully covers all the subroutines defined within it. The Devel::Cover
module, covered in Chapter 17, will automatically invoke Pod::Coverage
if available. This is generally a more convenient interface, and it analyzes the coverage of our tests at the same time.
Using POD Parsers
Translator modules, which is to say any module based directly or indirectly on Pod::Parser
, may be used programmatically by creating a parser object and then calling one of the parsing methods:
parse_from_filehandle($fh, %options);
Or:
parse_from_file($infile, $outfile, %options);
For example, assuming we have Term::ANSIColor
installed, we can create ANSIColor
text documents using this short script:
#!/usr/bin/perl
# parseansi.pl
use Pod::Text::Color;
my $parser = new Pod::Text::Color(
width => 56,
loose => 1,
sentence => 1,
);
if (@ARGV) {
$parser->parse_from_file($_, '-') foreach @ARGV;
} else {
$parser->parse_from_filehandle(*STDIN);
}
We can generate HTML pages, plain text documents, and manual pages using exactly the same process from their respective modules.
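For instance, a manual page can be produced with Pod::Man in just the same way. A minimal sketch; the file names and POD content are invented, and the input file is created here so the sketch runs as-is:

```perl
#!/usr/bin/perl
# Sketch: generate a Unix manual page with Pod::Man.
# demo.pod and demo.1 are hypothetical file names.
use strict;
use warnings;
use Pod::Man;

# create a small POD file to convert
open my $fh, '>', 'demo.pod' or die "open: $!";
print $fh "=head1 NAME\n\ndemo - a demonstration\n\n=cut\n";
close $fh;

my $parser = Pod::Man->new(
    section => 1,       # manual section number
    release => 'v1.0',  # text for the page footer
);
$parser->parse_from_file('demo.pod', 'demo.1');
```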
Writing a POD Parser
Writing a POD parser is surprisingly simple. Most of the hard work is already done by Pod::Parser
, so all that's left is to override the methods we need to replace in order to generate the kind of document we are interested in. Particularly, there are four methods we may want to override:
command |
Render and output POD commands. |
verbatim |
Render and output verbatim paragraphs. |
textblock |
Render and output regular (nonverbatim) paragraphs. |
interior_sequence |
Return rendered interior sequence. |
By overriding these and other methods, we can customize the document that the parser produces. Note that the first three methods display their result, whereas interior_sequence
returns it. Here is a short example of a POD parser that turns POD documentation into an XML document (albeit without a DTD):
#!/usr/bin/perl
# parser.pl
use warnings;
use strict;
{
package My::Pod::Parser;
use Pod::Parser;
our @ISA = qw(Pod::Parser);
sub command {
    my ($parser, $cmd, $para, $line) = @_;
    my $fh = $parser->output_handle;
    $para =~ s/\n+$//;
    my $output = $parser->interpolate($para, $line);
    print $fh "<pod:$cmd> $output </pod:$cmd>\n";
}
sub verbatim {
    my ($parser, $para, $line) = @_;
    my $fh = $parser->output_handle;
    $para =~ s/\n+$//;
    print $fh "<pod:verbatim>\n$para\n</pod:verbatim>\n";
}
sub textblock {
    my ($parser, $para, $line) = @_;
    my $fh = $parser->output_handle;
    print $fh $parser->interpolate($para, $line);
}
sub interior_sequence {
    my ($parser, $cmd, $arg) = @_;
    return "<pod:int cmd=\"$cmd\"> $arg </pod:int>";
}
}
my $parser = new My::Pod::Parser();
if (@ARGV) {
$parser->parse_from_file($_) foreach @ARGV;
} else {
$parser->parse_from_filehandle(*STDIN);
}
To implement this script, we need the output filehandle, which we can get from the output_handle
method. We also take advantage of Pod::Parser
to do the actual rendering work by using the interpolate
method, which in turn calls our interior_sequence
method. Pod::Parser
provides plenty of other methods too, some of which we can override as well as or instead of the ones we used in this parser; see the following for a complete list:
> perldoc Pod::Parser
The Pod::Parser
documentation also covers more methods that we might want to override, such as begin_input, end_input, preprocess_paragraph
, and so on. Each of these gives us the ability to customize the parser in increasingly detailed ways.
We have placed the Parser package inside the script in this instance, though we could equally have had it in a separate module file. To see the script in action, we can feed it with any piece of Perl documentation—the POD documentation itself, for example. On a typical Unix installation of Perl version 5.6 or higher, we can do that with
> perl parser.pl /usr/lib/perl5/5.8.6/pod/perlpod.pod
This generates an XML version of perlpod
that starts like this:
<pod:head1>NAME</pod:head1>
perlpod - plain old documentation
<pod:head1>DESCRIPTION</pod:head1>
A pod-to-whatever translator reads a pod file paragraph by paragraph,
and translates it to the appropriate output format. There are
three kinds of paragraphs:
<pod:int cmd="L">verbatim|/"Verbatim Paragraph"</pod:int>,
<pod:int cmd="L">command|/"Command Paragraph"</pod:int>, and
<pod:int cmd="L">ordinary text|/"Ordinary Block of Text"</pod:int>.
...
By comparing this with the original document, we can see how the parser is converting POD tokens into XML tags.
Locating PODs
The Unix-specific Pod::Find
module searches for POD documents within a list of supplied files and directories. It provides one subroutine of importance, pod_find
, which is not imported by default. This subroutine takes one main argument—a reference to a hash of options including default search locations. Subsequent arguments are additional files and directories to look in. The following script implements a more or less fully featured POD search based around Pod::Find
and Getopt::Long
, which we cover in detail in Chapter 14.
#!/usr/bin/perl
# findpod.pl
use warnings;
use strict;

use Pod::Find qw(pod_find);
use Getopt::Long;

# default options
my ($verbose, $include, $scripts);
my $display = 1;

# allow files/directories and options to mix
Getopt::Long::Configure('permute');

# get options
GetOptions(
    'verbose!' => \$verbose,
    'include!' => \$include,
    'scripts!' => \$scripts,
    'display!' => \$display,
);

# if no directories specified, default to @INC
$include = 1 unless defined($include) or @ARGV or $scripts;

# perform scan
my %pods = pod_find({
    -verbose => $verbose,
    -inc     => $include,
    -script  => $scripts,
    -perl    => 1,
}, @ARGV);

# display results if required
if ($display) {
    if (%pods) {
        print "Found '$pods{$_}' in $_\n" foreach sort keys %pods;
    } else {
        print "No pods found\n";
    }
}
We can invoke this script with no arguments to search @INC
or pass it a list of directories and files to search. It also supports four arguments to enable verbose messages, disable the final report, and enable Pod::Find
's two default search locations. Here is one way we can use it:
> perl findpod.pl -iv /my/perl/lib 2> dup.log
This command tells the script to search @INC
in addition to /my/perl/lib
(-i
), produce extra messages during the scan (-v
), and redirect error output to dup.log
. This will capture details of any duplicate modules that the module finds during its scan. If we only want to see duplicate modules, we can disable the output and view the error output on screen with this command:
> perl findpod.pl -i --nodisplay /my/perl/lib
The options passed in the hash reference to pod_find
are all Boolean and all default to 0
(off). They have the meanings listed in Table 18-4.
Option | Action |
-verbose |
Print out progress during scan, reporting all files scanned that did not contain POD information. |
-inc |
Scan all the paths contained in @INC . Implies -perl . |
-script |
Search the installation directory and subdirectories for POD files. If Perl was installed as /usr/bin/perl , then this will be /usr/bin for example. This implies -perl . |
-perl |
Apply Perl naming conventions for finding POD files. This strips Perl file extensions (.pod, .pm , etc.), skips over numeric directory names that are not the current Perl release, and so on. |
The hash generated by findpod.pl
contains the file in which each POD document was found as the key and the document title (usually the module package name) as the value.
An intriguing feature of Perl is the ability to preprocess source code before it is even compiled. This capability is provided by the Filter::Util::Call
module, which uses an underlying C-based interface to the Perl interpreter itself to intercept source code after it is read in but before it is compiled.
Here is an example of it in use to implement a filter that carries out a simple substitution. The filter itself is implemented by the filter
method, while the import
method performs the task of installing the filter with a specified pair of match and replacement strings:
package Class::Filter::Replace;
use strict;
use Carp qw(croak);
use Filter::Util::Call;
sub import {
my ($self,$replace,$with)=@_;
unless ($replace) {
croak("use ".__PACKAGE__." 'original' [, 'replacement'];");
}
$with ||= ""; #replace with nothing
my $filter={
replace => $replace,
with => $with,
};
filter_add($filter);
}
sub filter {
my $status=filter_read(); # set $_ from input
s/$_[0]->{replace}/$_[0]->{with}/go if $status > 0;
return $status; # 0 = end of file, <0 = error
}
1;
We can now use this filter to preprocess source code. Here, we set the filter to replace all instances of the word Goodbye
with Hello
. Since the filter is installed at compile time by virtue of use
, it affects the code immediately following it:
#!/usr/bin/perl
use strict;
use warnings;
use Class::Filter::Replace Goodbye => 'Hello';
my $Goodbye="so long";
print "Goodbye, I must be going, $Hello\n";
Running this program prints out the following:
Hello, I must be going, so long
We can also use a filter directly on any source code from the command line:
> perl -MClass::Filter::Replace=Goodbye,Hello unfiltered.pl
While this filter might look like we could easily register multiple objects in the same class, this is not so. In fact, the class is a singleton, because Filter::Util::Call
will permit only one filter per class. What actually happens here is that the class is extracted from the context of filter_add
, and the hash reference is passed unaltered to the filter
method, which is called as a class method, not as an object instance method. This is why we did not bother to bless the hash reference $filter
into the class before passing it on.
A filter consists of either an object class that inserts itself by name into the filter interface with filter_add
and provides a filter method for Filter::Util::Call
to call back, as earlier, or a simple subroutine that carries out the same task as the filter method and is inserted by code reference. Here is the code reference version of the preceding filter:
package Closure::Filter::Replace;
use strict;
use Carp qw(croak);
use Filter::Util::Call;
sub import {
my ($self,$replace,$with)=@_;
croak("use ".__PACKAGE__." 'original' [, 'replacement'];")
unless $replace;
$with ="" unless $with;
my $filter = sub {
    my $status = filter_read(); # populates $_
    s/$replace/$with/g if $status > 0;
    return $status;
};
filter_add($filter);
}
1;
To read source in other than a line-by-line basis, we can either supply a size argument to filter_read
or make use of filter_read_exact
. Both uses cause the filter to try to read a block of the requested number of bytes; filter_read
may come back with less if it cannot initially read enough, while filter_read_exact
will block and not stop trying until the end of file or an error is encountered:
my $status=filter_read($size); #block mode, nonblocking
my $status=filter_read_exact($size); #block mode, blocking
Note that both filter_read and filter_read_exact append to the current value of $_, so multiple calls to filter_read within the filter subroutine will not reset it to an entirely new value each time. The status is always returned with a value of 0 for end of file, greater than 0 for a successful read, and less than 0 for an error. Hence in this example we perform the substitution only if $status > 0.
If a filter wishes to disable itself, perhaps because it should only apply to a certain part of the source, it can do so by calling filter_del
. For example:
if (/__(DATA|END)__/) {
filter_del();
}
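Building on the earlier replacement filter, here is a sketch of a filter class that removes itself once it sees an __END__ or __DATA__ token, leaving the rest of the source untouched. The package name is hypothetical:

```perl
package Filter::Replace::UntilEnd;
# Sketch: like the earlier replacement filter, but calls filter_del()
# to stop filtering at the first __END__ or __DATA__ token.
use strict;
use warnings;
use Filter::Util::Call;

sub import {
    my ($self, $replace, $with) = @_;
    die "usage: use " . __PACKAGE__ . " 'original'[,'replacement']"
        unless defined $replace;
    $with = "" unless defined $with;
    filter_add(sub {
        my $status = filter_read();    # populates $_
        if ($status > 0) {
            if (/^__(DATA|END)__/) {
                filter_del();          # stop filtering from here on
            } else {
                s/$replace/$with/g;
            }
        }
        return $status;
    });
}
1;
```

As before, the filter is installed from import, so it takes effect on the source immediately following the use statement that loads it.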
The Filter::Simple
module provides a third way to define a filter. While not as flexible, it is a lot simpler to use and will suit many applications. For instance, we can rewrite the preceding examples as follows:
package Simple::Filter::Replace;
use strict;
use Carp qw(croak);
use Filter::Simple;
my ($replace,$with);
sub import {
$replace = $_[1];
unless ($replace) {
croak("use ".__PACKAGE__." 'original' [, 'replacement'];");
}
$with = $_[2] || "";
}
FILTER { s/$replace/$with/g };
1;
The key to this module is the special FILTER
block. This is processed by Filter::Simple
using its own internal filter to generate a filter out of our code. We can get a lot smarter too, because the module colludes with Text::Balanced
to give us the ability to register filters to process only code, only quoted strings, or a number of other selections with a FILTER_ONLY
specification:
use Filter::Simple;
FILTER_ONLY
code => sub { s/ucfirst/lcfirst/g },
string => sub { s/Goodbye/Hello/g };
The full list of filter types is offered in Table 18-5.
Table 18-5. Filter::Simple Filter Types
Filter Type | Effect |
all |
Everything, same as FILTER . |
code |
Filter code, excluding quotelike operators. |
executable |
code plus quotelike . |
quotelike |
Filter quotelike operators q, qq, qr . |
regex |
Filter regular expression patterns. |
string |
Filter literal strings in quotes or quotelike text. |
The all
filter is identical to FILTER
, so we could previously have written
FILTER_ONLY all => sub { s/$replace/$with/g };
We can specify all of these filters except code
more than once, with cumulative effect:
use Filter::Simple;
FILTER_ONLY
code => sub { s/ucfirst/lcfirst/g },
string => sub { s/Goodbye/Ciao/g },
string => sub { s/Ciao/Au Revoir/g },
string => sub { s/Au Revoir/Hello/g };
While Perl modules are the primary type of source filter, we can also use external commands. The Filter::exec
module (which is available in the Filter
distribution on CPAN, but not as standard) is one way we can invoke an external program to filter our code. For instance, if we happened to have a gzipped Perl script, we could run it on a Unix platform with this command:
> perl -MFilter::exec=gunzip,-c myscript.pl.gz
The Filter::sh
module is similar, but it takes a single string as the command, invoking an intermediate shell to execute it:
> perl -MFilter::sh='gunzip -c' myscript.pl.gz
Although functional, these modules mostly serve as examples of how to implement filters. As further demonstration, the Filter::cpp
module provides support for C-style preprocessor macros, Filter::tee
outputs the post-processed source code for inspection, and Filter::decrypt
provides support for running encrypted source files. Each of these modules uses an underlying factory module that subclasses Filter::Util::Call
to register the filter, for instance, Filter::exec
invokes Filter::Util::Exec
.
The Filter::Util::Call
interface is used by several modules in the standard distribution. The Switch
module uses it to implement new semantics and keywords in Perl, by translating them into real Perl keywords before the interpreter gets to look at them. The ByteLoader
module uses a filter to convert compiled code saved in binary form back into a parsed opcode tree.
Reports are a useful but often overlooked feature of Perl. They provide a way to generate structured text such as tables or forms using a special layout description called a format. Superficially similar in intent to the print
and sprintf
functions, the strength of formats comes from their ability to describe layouts in physical terms, making it much easier to see how the resulting text will look and making it possible to design page layouts visually rather than resorting to character counting with printf
.
Intriguingly, formats are an entirely separate data type with their own typeglob slot, separate from scalars, arrays, hashes, and filehandles. Like filehandles, they have no prefix or other syntax to express themselves and as a consequence often look like bareword filehandles, which can occasionally be confusing.
A format is compiled from a format definition, a series of formatting or picture lines containing literal text and placeholders, interspersed with data lines that describe the information used to fill placeholder and comment lines. As a simple example, here is a format definition that defines a single pattern line consisting mainly of literal text and a single placeholder, followed by a data line that fills that placeholder with some more literal text:
This is a @<<<<< justified field
"left"
To turn a format definition into a format, we need to use the format keyword, which takes a format name and a multiline format definition, strongly reminiscent of a here-document, and turns it into a compiled format. A single full stop on its own defines the end of the format. To define the very simple format example earlier, we would write something like this:
format MYFORMAT =
This is a @<<<<< justified field
"left"
.
The trailing period is very important. It is the end token that defines the end of the implicit HERE document. A format definition will happily consume the entire contents of a source file if left unchecked.
To use a format, we use the write
function on the filehandle with the same name as the format. For the MYFORMAT
example earlier, we would write the following:
# write the format 'MYFORMAT' to the filehandle of the same name
write MYFORMAT;
This requires that we actually have an open filehandle called MYFORMAT
and want to use the format to print to it. More commonly we want to print to standard output, which we can do by either defining a format called STDOUT
or assigning a format name to the special variable $˜
($FORMAT_NAME
with the English
module). In this case, we can omit the filehandle, and write
will use the currently selected output filehandle, just like print
:
$~ = 'MYFORMAT';
write;
We can also use methods from the IO::
family of modules, if we are using them. Given an IO::Handle
-derived filehandle called $fh
, we can assign and use a format on it like this:
$fh->format_name('MYFORMAT');
$fh->format_write();
The write
function (or its IO::Handle
counterpart format_write
) generates filled-out formats by combining the picture lines with the current values of the items in the data lines to fill in any placeholder present.
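Putting these pieces together, here is a complete sketch that writes a small two-column report to standard output; the names and values are invented for the example:

```perl
#!/usr/bin/perl
# Sketch: a complete report written with a format; the data is invented.
use strict;
use warnings;

# package variables, visible to the format when write() fills it in
our ($name, $age);

format STDOUT =
Name: @<<<<<<<<<< Age: @>>
$name, $age
.

for my $row (['Alice', 30], ['Bob', 7]) {
    ($name, $age) = @$row;
    write;    # fill the picture line with the current values
}
```

Each call to write emits one filled-in copy of the picture line, left justifying the name into an 11-character field and right justifying the age into a 3-character field.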
Format Syntax
Formats consist of a collection of picture and data lines, interspersed with optional comments, combined into a HERE-style document that is ended with a single full stop.
Of the three, comments are by far the simplest to explain. They resemble conventional Perl comments and simply start with a #
symbol, as this example demonstrates:
format FORMNAME =
# this is a comment. The next line is a picture line
This is a pattern line with one @<<<<<<<<<<.
# this is another comment.
# the next line is a data line
"placeholder"
# and don't forget to end the format with a '.':
.
Picture and data lines take a little more explaining. Since they are the main point of using formats at all, we will start with picture lines.
Picture Lines and Placeholders
Picture lines consist of literal text intermingled with placeholders, which the write
function fills in with data at the point of output. If a picture line does not contain any placeholders at all, it is treated as literal text and can be printed out. Since it does not require any data to fill it out, it is not followed by a data line. This means that several picture lines can appear one after the other, as this static top-of-page format illustrates:
format STATIC_TOP =
This header was generated courtesy of Perl formatting
See Chapter 18 of Pro Perl for details
-------------------------------------------
.
Placeholders are defined by either an @
or a ^
, followed by a number of <, |, >
, or #
characters that define the width of the placeholder. Picture lines that contain placeholders must be followed by a data line (possibly with comments in between) that defines the data to be placed into the placeholder when the format is written.
Formats do not support the concept of a variable-width placeholder. The resulting text will always reserve the defined number of characters for the substituted value irrespective of the actual length of the value, even if it is undefined. It is this feature that makes formats so useful for defining structured text output—we can rely on the resulting text exactly conforming to the layout defined by the picture lines. For example, to define a ten-character field that is left justified, we would use
This is a ten character placeholder: @<<<<<<<<<
$value_of_placeholder
Note that the @
itself counts as one of the characters, so there are nine <
characters in the example, not ten. To specify multiple placeholders, we just use multiple instances of @
and supply enough values in the data line to fill them. This example has a left-, center-, and right-justified placeholder:
This picture line has three placeholders: @<<<@|||@>>>
$first, $second, $third
The second example defines three four-character-wide placeholders. The <, |
, and >
characters define the justification for fields more than one character wide; we can define different justifications using different characters, as we will see in a moment.
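To see the three justification styles side by side, here is a runnable sketch; the square brackets are just literal text marking the field edges, and the variable names are illustrative:

```perl
# left, center, and right justification of the same short value
use warnings;
use strict;

our ($first, $second, $third) = ('ab') x 3;

format JUSTIFY =
[@<<<<<][@|||||][@>>>>>]
$first, $second, $third
.

open(my $out, '>', \ my $buffer) or die "open: $!";
my $old = select $out;
$~ = 'JUSTIFY';
write;
select $old;
close $out;

print $buffer;   # [ab    ][  ab  ][    ab]
```

Each field is six characters wide, so the two-character value is padded with four spaces on the right, evenly on both sides, or on the left, respectively.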
Programmers new to formats are sometimes confused by the presence of @
symbols. In this case, @
has nothing to do with interpolation; it indicates a placeholder. Because of this, we also cannot define a literal @
symbol by escaping it with a backslash, that is, an interpolation feature. In fact, the only way to get an actual @
(or indeed ^
) into the resulting string is to substitute it from the data line:
# the '@' below is actually a placeholder:
This is a literal '@'
# but we can make it a literal '@' by substituting one in on the data line:
'@'
Simple placeholders are defined with the @
symbol. The caret ^
or "continuation" placeholder, however, has special properties that allow it to be used to spread values across multiple output lines. When Perl sees a ^
placeholder, it fills out the placeholder with as much text as it reasonably can and then removes the text it used from the start of the string. It follows from this that the original variable is altered, and that to use a caret placeholder we cannot supply literal text. Further uses of the same variable can then fill in further caret placeholders. For example, this format reformats text into a narrow column with a >
prefix on each line:
format QUOTE_MESSAGE =
> ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
.
This creates a format that processes the text in the variable $message
into four lines, fitting as many words as possible into each line. When write
comes to process this format, it uses the special variable $:
to determine how and where to truncate the line. By default it is set to " \n-" to break on spaces, newlines, or hyphens, which works fine for most plain text.
There are a number of problems with this format: it only handles four lines, and it always fills them out even if the message is shorter than four lines after reformatting. We will see how to suppress redundant lines and automatically repeat picture lines to generate extra ones with the special ~ and ~~ strings shortly.
Justification
It frequently occurs that the width of a field exceeds that of the data to be placed in it. In these cases, we need to decide how the format will deal with the excess, since a fixed-width field cannot shrink (or grow) to fit the size of the data. A structured layout is the entire point of formats. If the data we want to place in the placeholder is only one character wide, we need no other syntax. As an extreme case, to insert six single-character items into a format, we can use
The code is '@@@@@@'
# use first six elements of digits, assumed to be from 0 to 9.
@digits
For longer fields, we need to choose how text will be aligned in the field through one of four justification methods, listed in Table 18-6, depending on which character we use to define the width of the placeholder.
Table 18-6. Placeholder Justification Styles
Placeholder | Alignment | Example |
< |
Left justified | @<<<< |
> |
Right justified | @>>>> |
| |
Center justified | @|||| |
# |
Right-justified numeric | @#### |
The <, |
, and >
justification styles are mostly self-explanatory; they align values shorter than the placeholder width to the left, center, or right of the placeholder. They pad the rest of the field with spaces. (Note that padding with other characters is not supported. If we want to do that, we will have to generate the relevant value by hand before it is substituted.) If the value is the right length in any case, then no justification occurs. If it is longer, then it is truncated on the right irrespective of the justification direction.
The numeric #
justification style is more interesting. With only #
characters present, it will insert an integer based on the supplied value—for an integer number it substitutes in its actual value, but for a string or the undefined value it substitutes in 0
, and for a floating-point number it substitutes in the integer part. To produce a percentage placeholder, for example, we can use the following:
Percentage: @##%
$value * 100
If, however, we use a decimal point character within the placeholder, then the placeholder becomes a decimal placeholder, with floating-point values point-justified to align themselves around the position of the decimal point:
Result (2 significant places): @####.##
$result
This provides a very simple and powerful way to align columns of figures, automatically truncating them to the desired level of accuracy at the same time.
If the supplied result is not a floating-point number, then the fractional places are filled in with 0
, and for strings and undefined values the ones column is also filled in with 0
.
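As a runnable sketch of point justification, here is a column of mixed-magnitude values pushed through the same decimal placeholder (the format name and values are illustrative):

```perl
# point-justified numeric output with a '@####.##' placeholder
use warnings;
use strict;

our $result;

format NUMBERS =
@####.##
$result
.

open(my $out, '>', \ my $buffer) or die "open: $!";
my $old = select $out;
$~ = 'NUMBERS';
for (3.14159, 1234.5, 7) {
    $result = $_;
    write;    # one line per value, all aligned on the point
}
select $out ne $old ? $old : $old;
close $out;

print $buffer;
#     3.14
#  1234.50
#     7.00
```

Every line is eight characters wide, with two fixed decimal places, so the decimal points line up regardless of the magnitude of each value.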
The actual character used by the decimal placeholder to represent the decimal point is defined by the locale, specifically the LC_NUMERIC
value of the locale. In Germany, for instance, the conventional symbol to separate the integer and fractional parts is a comma, not a full stop. Formats are in fact the only part of Perl that directly accesses the locale in this way, possibly because of their long history; all other parts of the language adhere to the use locale
directive. Although deprecated in modern Perl, we can also use the special variable $#
to set the point character.
The final placeholder format is the *
placeholder. This creates a raw output placeholder, producing a complete multiple-line value in one go and consequently can only be placed after an @
symbol; it makes no sense in the context of a continuation placeholder since there will never be a remainder for a continuation to make use of. For example:
> @* <
$multiline_message
In this format definition, the value of $multiline_message
is output in its entirety when the format is written. The first line is prefixed with a >
, and the last is suffixed with <
. No other formatting of any kind is done. Since this placeholder has variable width (and indeed, variable height), it is not often used, since it is effectively just a poor version of print
that happens to handle line and page numbering correctly.
Data Lines
Whenever a picture line contains one or more placeholders it must be immediately followed by a data line consisting of one or more expressions that supply the information to fill them. Expressions can be numbers, string values, variables, or compound expressions:
format NUMBER =
Question: What do you get if you multiply @ by @?
6, 9
Answer: @#
6*9
.
Multiple values can be given either as an array or a comma-separated list:
The date is: @###/@#/@#
$year, $month, $day
If insufficient values are given to fill all the placeholders in the picture line, then the remaining placeholders are undefined and padded out with spaces. Conversely, if too many values are supplied, then the excess ones are discarded. This behavior changes if the picture line contains ~~
however, as shown later.
If we generate a format using conventional quoted strings rather than the HERE document syntax, we must take special care not to interpolate the data lines. This is made more awkward because in order for the format to compile, we need to use "\n" to create newlines at the end of each line of the format, including the data lines, and these do need to be interpolated. Separating the format out onto separate lines is probably the best approach, though as this example shows, even then it can be a little hard to follow:
# define page width and output filehandle
$page_width = 80;
$output = "STDOUT_TOP";
# construct a format statement from concatenated strings
$format_st = "format $output =\n".
             'Page @<<<'. "\n".
             '$%'. "\n".
             ('-' x $page_width). "\n".
             ".\n"; # don't forget the trailing '.'
# define the format - note we do not interpolate, to preserve '$%'
eval $format_st;
Note that continuation placeholders (defined by a leading caret) need to be able to modify the original string in order to truncate the start. For this reason, an assignable value such as a scalar variable, array element, or hash value must be used with these fields.
Suppressing Redundant Lines
The format
and write
functions support two special picture strings that alter the behavior of the placeholders in the same picture line, both of which are applied if the placeholders are all continuation (caret) placeholders.
The first is a single tilde, or ~
character. When this occurs anywhere in a picture line containing caret placeholders, the line is suppressed if there is no value to plug into the placeholder. For example, we can modify the quoting format we gave earlier to suppress the extra lines if the message is too short to fill them:
format QUOTE_MESSAGE =
> ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<...~
$message
.
In this example, the bottom three picture lines have a ~
suffix, so they will only be used if $message
contains sufficient text to fill them after it has been broken up according to the break characters in $:
. When the format is written, the tildes are replaced with spaces. Since they are at the end of the line in this case, we will not see them, which is why conventionally they are placed here. If we have spaces elsewhere in the picture line, we can replace one of them with the tilde and avoid the trailing space.
We modify the last picture line to indicate that the message may have been truncated because we know that it will only be used if the message fills out all the subsequent lines. In this case, we have replaced the last three <
characters with dots.
The ~
character can be thought of as a zero-or-one modifier for the picture line, in much the same way that ?
works in regular expressions. The line will be used if Perl needs it, but it can also be ignored if necessary.
Autorepeating Pattern Lines
If two adjacent tildes appear in a pattern line, then write
will automatically repeat the line while there is still input. If ~
can be likened to the ?
zero-or-one metacharacter of regular expressions, ~~
can be likened to *
, zero-or-more. For instance, to format text into a paragraph of a set width but an unknown number of lines, we can use a format like this:
format STDOUT =
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$text
.
Calling write
with this format will take the contents of $text
and reformat it into a column 30 characters wide, repeating the pattern line as many times as necessary until the contents of $text
are exhausted. Anything else in the pattern line is also repeated, so we can create a more flexible version of the quoting pattern we gave earlier that handles a message of any size:
format QUOTE =
>~~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
.
Like ~, the ~~
itself is converted into a space when it is output. It also does not matter where it appears, so in this case we have put it between the >
quote mark and the text, to suppress the extra space on the end of the line it would otherwise create.
Note that ~~
only makes sense when used with a continuation placeholder, since it relies on the continuation to truncate the text. Indeed, if we try to use it with a normal @
placeholder, Perl will return a syntax error, since this would effectively be an infinite loop that repeats the first line. Since write
cannot generate infinite quantities of text, Perl prevents us from trying.
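The following runnable sketch shows ~~ in action, wrapping an arbitrary-length message into a 15-column column; the format name and sample text are illustrative:

```perl
# autorepeating continuation line: wrap a string at 15 columns
use warnings;
use strict;

our $text = "the quick brown fox jumps over the lazy dog";

format WRAP =
^<<<<<<<<<<<<<<~~
$text
.

open(my $out, '>', \ my $buffer) or die "open: $!";
my $old = select $out;
$~ = 'WRAP';
write;    # repeats the picture line until $text is consumed
select $old;
close $out;

print $buffer;   # the message, one chunk of at most 15 characters per line
```

Note that because the field is a continuation placeholder, $text is empty after the write; keep a copy if the original is still needed.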
Formats are directly associated with filehandles. All we have to do is write
to the filehandle, and the associated format is invoked. It might seem strange that we associate a format with a filehandle and then write
to the filehandle, rather than specifying which format we want to use when we do the writing, but there is a certain logic behind this mechanism. There are in fact two formats that may be associated with a filehandle. The main one is used by write
, but we can also install a top-of-page format that is used whenever Perl runs out of room on the current page and is forced to start a new one. Since this is associated with the filehandle, Perl can use it automatically when we use write
rather than needing to be told.
Defining the Top-of-Page Format
Perl allows two formats to be associated with a filehandle. The main format is used whenever we issue a write
statement. The top-of-page format, if defined, is issued at the start of the first page and at the top of each new page. This is determined by the special variables $= (the length of the page) and $- (the number of lines left). Each time we use write, the value of $- decreases. When there is no longer sufficient room to fit the results of the next write
, a new page is started, a new top-of-page format is written, and only then is the result of the last write
issued.
The main format is automatically associated with the filehandle of the same name so that the format MYFORMAT
is automatically used when we use write
on the filehandle MYFORMAT
. The top-of-page format can similarly be associated by giving it the name of the filehandle with _TOP
appended. For instance, to assign a main and top-of-page format to the filehandle MYFORMAT
, we would use something like this:
format MYFORMAT =
...main format definition...
.
# define a format that gives the current page number
format MYFORMAT_TOP =
This is page @<<<
$%
------------------------
.
Assigning Formats to Standard Output
Since standard output is the filehandle most usually associated with formats, we can omit the format name when defining formats.
format STDOUT=
The magic word is "@<<<<<<<<";
$word
.
format STDOUT_TOP=
Page @>
$%
-----------
.
We can also omit STDOUT
for the main format and simply write
format =
The magic word is "@<<<<<<<<";
$word
.
This works because standard output is the default output filehandle. If we change the filehandle with select
, format
creates a format with the same name as that filehandle instead. The write
function also allows us to omit the filehandle; to write out the formats assigned to whatever filehandle is currently selected, we can simply put
write;
Determining and Assigning Formats to Other Filehandles
We are not constrained to defining formats with the same name as a filehandle in order to associate them. We can also find their names and assign new ones using the special variables $~
and $^
.
The special variable $~
($FORMAT_NAME
with use English
) defines the name of the main format associated with the currently selected filehandle. For example:
$format = $~;
Likewise, to set the current format we can assign to $~
:
# set standard output format to 'MYFORMAT';
$~ = 'MYFORMAT';
use English;
$FORMAT_NAME = 'MYFORMAT'; # more legibly
The variable is set to the name of the format, not to the format itself, hence the quotes.
The special variable $^
($FORMAT_TOP_NAME
with use English
) performs the identical role for the top-of-page format:
# save name of current top-of-page format
$topform = $^;
# assign new top-of-page format
$^ = 'MYFORMAT_TOP';
# write out main format associated with standard out,
# (using top-of-page format if necessary)
write;
# restore original top-of-page format
$^ = $topform;
Setting formats on other filehandles using the variables $~
and $^
requires special maneuvering with select
to temporarily make the target filehandle the current filehandle:
# set formats on a different filehandle
$oldfh = select MYHANDLE;
$~ = 'MYFORMAT';
$^ = 'MYFORMAT_TOP';
select $oldfh;
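The select dance is easy to get wrong, so it is worth wrapping in a small utility. Here is a sketch; the sub name is our own invention, not a standard function:

```perl
# a helper that sets main and top-of-page formats on any filehandle
use warnings;
use strict;

sub set_formats {
    my ($fh, $main, $top) = @_;
    my $old = select $fh;        # temporarily select the target
    $~ = $main if defined $main; # main format name
    $^ = $top  if defined $top;  # top-of-page format name
    select $old;                 # always restore the old selection
}

# usage: assign both formats in one call
open(my $report, '>', \ my $buffer) or die "open: $!";
set_formats($report, 'MYFORMAT', 'MYFORMAT_TOP');
```

The format names are only looked up when write is actually called, so the helper works even before the formats themselves are compiled.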
The IO::Handle
module (and subclasses like IO::File
) provides a simpler object-oriented way of setting formats on filehandles:
$fh = new IO::File ("> $outputfile");
...
$fh->format_name ('MYFORMAT');
$fh->format_top_name ('MYFORMAT_TOP');
...
write $fh; # or $fh->format_write ();
Perl's reporting system uses several special variables to keep track of line and page numbering. We can use these variables to produce line and page numbers and set them to control how pages are generated. There are four variables of particular interest, and these are listed in Table 18-7.
Table 18-7. Format Page Control Variables
Variable | Corresponds To |
$= |
The page length |
$% |
The page number |
$- |
The number of lines remaining |
$^L |
The formfeed string |
$=
(or $FORMAT_LINES_PER_PAGE
with use English
) holds the page length and by default is set to 60 lines. To change the page length, we can assign a new value:
$= = 80; # set page length to 80 lines
Or more legibly:
use English;
$FORMAT_LINES_PER_PAGE = 80;
If we want to generate reports without pages, we can set $=
to a very large number. Alternatively, we can redefine $^L
to an empty string and either avoid defining a top-of-page format or redefine it to nothing.
$%
(or $FORMAT_PAGE_NUMBER
with use English
) holds the number of the current page. It starts at 1 and is incremented by one every time a new page is started, which in turn happens whenever write
runs out of room on the current page. We can change the page number explicitly by modifying $%
, for example:
$% = 1; # reset page count to 1
$-
(or $FORMAT_LINES_LEFT
with use English
) holds the number of lines remaining on the current page. Whenever write
generates output, it decrements this value by the number of lines in the format. If there are insufficient lines left (the size of the output is greater than the number of lines left), then $-
is set to 0, the value of $%
is incremented by one, and a new page is started, starting with the value of $^L
and followed immediately by the top-of-page format, if one is defined. We can force a page break on the next write
by setting $-
to 0:
$- = 0; # force a new page on the next 'write'
Finally, $^L
(or $FORMAT_FORMFEED
with use English
) is output before the top-of-page format by write
when a new page is started. By default it is set to a formfeed character, "\f"
. See the section "Creating Footers" for a creative use of $^L
.
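As a runnable sketch of these variables working together, here is a tiny report with a three-line page (one header line plus two detail lines), so the page number advances automatically; all the names are illustrative:

```perl
# page control: $= sets the page length, $% counts pages,
# and $^L is emptied so no formfeeds appear between pages
use warnings;
use strict;

our $item;

format PAGED =
item: @<<<<<<<<<
$item
.

format PAGED_TOP =
-- page @< --
$%
.

open(my $out, '>', \ my $buffer) or die "open: $!";
my $old = select $out;
$~  = 'PAGED';
$^  = 'PAGED_TOP';
$=  = 3;     # 3 lines per page: 1 header + 2 detail lines
$^L = '';    # suppress the formfeed between pages
for my $i (1 .. 4) {
    $item = "number $i";
    write;
}
my $pages = $%;   # 4 one-line writes at 2 per page => 2 pages
select $old;
close $out;

print $buffer;
```

With a one-line header and two detail lines per page, the four writes produce two pages, each starting with its own "-- page N --" header.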
As an example of using the page control variables, here is a short program that paginates its input file, adding the name of the file and a page number to the top of each page. It also illustrates creating a format dynamically with eval
so we can define not only the height of the resulting pages, but also their width.
#!/usr/bin/perl
# paginate.pl
use warnings;
use strict;
no strict 'refs';
use Getopt::Long;
# get parameters from the user
my $height = 60; # length of page
my $width = 80; # width of page
my $quote = ""; # optional quote prefix
GetOptions ('height|size|length:i' => \$height,
            'width:i' => \$width, 'quote:s' => \$quote);
die "Must specify input file" unless @ARGV;
# get the input text into one line, for continuation
undef $/;
my $text = <>;
# set the page length
$= = $height;
# if we're quoting, take that into account
$width -= length($quote);
# define the main page format - a single autorepeating continuation field
my $main_format = "format STDOUT =\n".
                  $quote.'^'.('<' x ($width-1))."~~\n".
                  '$text'."\n".
                  ".\n";
eval $main_format;
# define the top of page format
my $page_format = "format STDOUT_TOP =\n".
                  '@'.('<' x ($width/2-6)).' page @<<<'."\n".
                  '$ARGV,$%'."\n".
                  ('-' x $width)."\n".
                  ".\n";
eval $page_format;
# write out the result
write;
To use this program, we can feed it an input file and one or more options to control the output, courtesy of the Getopt::Long
module, for example:
> perl paginate.pl input.pl -w 50 -h 80
Creating Footers
Footers are not supported as a concept by the formatting system; there is no "bottom-of-page" format. However, with a little effort we can improvise our own footers. The direct and obvious way is to keep an eye on $-
and issue the footer when we get close to the bottom of the page. If the footer is smaller in lines than the output of the main format, we can use something like the following, assuming that we know what the size of output is:
print "\nPage $%\n" if $- < $size_of_format;
This is all we need to do, since the next attempt to write
will not have sufficient space to fit and will automatically trigger a new page. If we want to make sure that we start a new page on the next write, we can set $-
to 0
to force it:
print ("\nPage $%\n"), $- = 0 if $- < $size_of_format;
A more elegant and subtle way of creating a footer is to redefine $^L
. This is a lot simpler to arrange but suffers in terms of flexibility since the footer is fixed once it is defined, so page numbering is not possible unless we redefine the footer on each new page.
For example, if we want to put a two-line footer on the bottom of 60-line pages, we can do so by putting the footer into $^L
(suffixed with the original formfeed) and then reducing the page length by the size of the footer, in this case to 58 lines:
# define a footer
$footer = ('-' x 80). "\nEnd of Page\n";
# redefine the format formfeed to be the footer plus a formfeed
$^L = $footer. "\f";
# reduce page length from default 60 to 58 lines
# if we wanted to be creative we could count the instances of "\n" instead
$= -= 2;
Now every page will automatically get a footer without any tracking or examination of the line count. We still have to add a footer to the last page manually. The number of lines remaining to fill on the last page is held by $-
, so this turns out to be trivial:
print "\n" x $-; # fill out the rest of the page (to 58 lines)
print $footer;   # print the final footer
As mentioned earlier, arranging for a changing footer such as a page number is slightly trickier, but it can be done by remembering and checking the value of $-
after each write
:
$lines = $-;
write;
redefine_footer() if $- > $lines;
This will work for many cases but will not always work when using ~~
, since it may cause write
to generate more lines than the page has left before we get a chance to check it.
It is possible to print both unformatted and formatted output on the same filehandle.
However, while write
and print
can be freely mixed together, print
knows nothing about the special formatting variables such as $=, $-
, and $%
that track pagination and trigger the top-of-page format. Consequently, we must take care to track line counts ourselves if we want pages to be of even length, by adjusting $-
ourselves.
For instance:
write;
foreach (@extra_lines) {
    print $_, "\n";
    --$-; # decrement $- by hand
}
Unfortunately, this solution does not take into account that $-
might become negative if there is not enough room left on the current page. Due to the complexities of managing mixtures of write
and print
, it is often simpler to either use formline
or create a special format that is simply designed to print out the information we were using print
for.
Generating Report Text with formline
The formline
function is a lower-level interface to the same formatting system used by write
. formline
generates text from a single picture line and a list of values, the result of which is placed into the special variable $^A
. For example, this is how we could create a formatted string containing the current time using formline
:
($sec, $min, $hour) = localtime;
formline '@#/@#/@#', $hour, $min, $sec;
$time = $^A;
print "The time is: $time\n";
In this case, it would probably be easier to use sprintf
, but we can also use formline
to create text from more complex patterns. For instance, to format a line of text into an array of text lines wrapped at 20 characters, we could use formline
like this:
$text = get_text(); # get a chunk of text from somewhere
my @lines;
while ($text) {
    formline '^<<<<<<<<<<<<<<<<<<<', $text;
    push @lines, $^A;
    $^A = ''; # reset the accumulator, or the lines pile up in it
}
The formline
function is only designed to handle single lines, so it ignores newlines and treats the picture text as a single line. This means that we cannot feed formline
a complete format definition and expect it to produce the correct result in $^A
.
Strangely, there is no simple way to generate text from write
, other than by redirecting filehandles, since write
sends its results to a filehandle. However, we can produce a version of write
that returns its result instead.
sub swrite ($@) {
    my $picture = shift;
    $^A = ''; # empty the accumulator first, or results pile up between calls
    formline ($picture, @_);
    return $^A;
}
This function is a friendly version of formline
, but it is not a direct replacement for write
, since it only operates on a single picture line and expects a conventional list of values as an argument. However, it is convenient and simple to use.
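For instance, here is a self-contained sketch of the same idea used as a one-line row formatter; the picture and values are illustrative:

```perl
# a self-contained swrite-style helper used as a row formatter
use warnings;
use strict;

sub swrite {
    my $picture = shift;
    local $^A = '';            # keep results from accumulating
    formline($picture, @_);
    return $^A;
}

my $row = swrite('@<<<<<<<<< @>>>>>', 'Total:', 42);
print $row, "\n";   # 'Total:' left-justified in 10 columns, '42' right-justified in 6
```

Because the picture line contains no newline, neither does the result, which makes the helper convenient for building up larger strings piece by piece.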
This chapter dealt with text processing in depth, building on the concepts of regular expressions and interpolation to carry out advanced text manipulation. To begin with, we looked at text processing modules, including Text::Tabs, Text::Abbrev, Text::ParseWords
, and the versatile Text::Balanced
. We also looked at rewrapping text with Text::Wrap
and computing phonetic codes with Text::Soundex
.
Source code is an important subclass of text document. We covered Perl's Plain Old Documentation (POD) syntax, and saw how to construct it, format it, render it, and write our own tools to parse it. From here we went on to look at preprocessing source files using a source filter. We covered the Filter::Util::Call
module and also saw how to simplify some aspects of filter development with the Filter::Simple
module.
Finally, we looked at reports, the "R" in Perl, which provide us with a way to create simple templates to format the way output is rendered. We looked at the format data type, formats and filehandles, format structure (including justification), and page control.