Chapter 12. Regular Expressions

Some people, when confronted with a problem, think: "I know, I'll use regular expressions". Now they have two problems.

Jamie Zawinski

Regular expressions are one of the signature features of Perl, providing it with most of the practical extraction facilities for which it is famous. Many of those who are new to Perl (and many who aren't so new) approach regexes with mistrust, trepidation, or outright fear.

And with some justification. Regexes are specified in a compact and sometimes baroque syntax that is, all by itself, responsible for much of Perl's "executable line noise" reputation. Moreover, in the right hands, patterns are capable of performing mystifying feats of text recognition, analysis, transformation, and computation[62].

It's no wonder they scare so many otherwise stalwart Perl hackers.

And no surprise that they also figure heavily in many suboptimal programming practices, especially of the "cut-and-paste" variety. Or, more often, of the "cut-and-paste-and-modify-slightly-and-oh-now-it-doesn't-work-at-all-so-let's-modify-it-some-more-and-see-if-that-helps-no-it-didn't-but-we're-committed-now-so-maybe-if-we-change-that-bit-instead-hmmmm-that's-closer-but-still-not-quite-right-maybe-if-I-made-that-third-repetition-non-greedy-instead-oops-now-it's-back-to-not-matching-at-all-perhaps-I-should-just-post-it-to-PerlMonks.org-and-see-if-they-know-what's-wrong" variety.

Yet the secret to taming regular expressions is remarkably simple. You merely have to recognize them for what they really are, and treat them accordingly.

And what are regular expressions really? They're subroutines. Text-matching subroutines. Text-matching subroutines that are coded in an embedded programming language that's nearly entirely unrelated to Perl.

Once you realize that regexes are just code, it becomes obvious that regex best practices will, for the most part, simply be adaptations of the universal coding best practices described in other chapters: consistent and readable layout, sensible naming conventions, decomposition of complex code, refactoring of commonly used constructs, choosing robust defaults, table-based techniques, code reuse, and test-driven development.

This chapter illustrates how those approaches can be applied to improving the readability, robustness, and efficiency of your regular expressions.

Extended Formatting

Always use the /x flag.

Because regular expressions are really just programs, all the arguments in favour of careful code layout that were advanced in Chapter 2 must apply equally to regexes. And possibly more than equally, since regexes are written in a language much "denser" than Perl itself.

At very least, it's essential to use whitespace to make the code more readable, and comments to record your intent[63]. Writing a pattern like this:

m{'[^\\']*(?:\\.[^\\']*)*'}

is no more acceptable than writing a program like this:

sub'x{local$_=pop;sub'_{$_>=$_[0
]?$_[1]:$"}_(1,'*')._(5,'-')._(4,'*').$/._(6,'|').($_>9?'X':$_>8
?'/':$")._(8,'|').$/._(2,'*')._(
7,'-')._(3,'*').$/}print$/x($=).
x(10)x(++$x/10).x($x%10)while<>;

And no more readable, or maintainable.

The /x mode allows regular expressions to be laid out and annotated in a maintainable manner. Under /x mode, whitespace in your regex is ignored (i.e., it no longer matches the corresponding whitespace character), so you're free to use spaces and newlines for indentation and layout, as you do in regular Perl code. The # character is also special under /x. Instead of matching a literal '#', it introduces a normal Perl comment.

For example, the pattern shown previously could be rewritten like so:

# Match a single-quoted string efficiently...

m{ '             # an opening single quote
   [^\\']*       # any non-special chars (i.e., not backslash or single quote)
   (?:           # then all of...
       \\ .      #    any explicitly backslashed char
       [^\\']*   #    followed by any non-special chars
   )*            # ...repeated zero or more times
   '             # a closing single quote
 }x

That may still not be pretty, but at least it's now survivable.
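A quick way to convince yourself that the /x layout changes nothing but the readability is to run both forms over the same inputs. The following is only a sketch: the sample strings and the same_match() helper are invented for illustration.

```perl
use strict;
use warnings;

# The compact pattern from earlier in this section...
my $COMPACT = qr{'[^\\']*(?:\\.[^\\']*)*'};

# ...and the same pattern laid out under /x.
my $EXTENDED = qr{
    '             # an opening single quote
    [^\\']*       # any non-special chars
    (?:           # then all of...
        \\ .      #    any explicitly backslashed char
        [^\\']*   #    followed by any non-special chars
    )*            # ...repeated zero or more times
    '             # a closing single quote
}x;

# Hypothetical helper: do both patterns match the same substring of $text?
sub same_match {
    my ($text) = @_;

    my ($compact)  = $text =~ m{($COMPACT)}x;
    my ($extended) = $text =~ m{($EXTENDED)}x;

    my $compact_result  = defined $compact  ? $compact  : '<no match>';
    my $extended_result = defined $extended ? $extended : '<no match>';

    return $compact_result eq $extended_result;
}

# Hypothetical sample inputs, chosen only for illustration.
for my $sample ( q{'plain'}, q{'it\'s escaped'}, q{'unterminated}, q{none} ) {
    printf "%-18s %s\n", $sample, same_match($sample) ? 'same' : 'DIFFERENT';
}
```

If the two forms ever disagreed, the layout would have altered the pattern, which usually means a literal space or # was left unescaped.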

Some people argue that the /x flag should be used only when a regular expression exceeds some particular threshold of complexity, such as only when it won't fit on a single line. But, as with all forms of code, regular expressions tend to grow in complexity over time. So even "simple" regexes will eventually need a /x, which will most likely not be retrofitted when the pattern reaches the particular complexity threshold you are using.

Besides, setting some arbitrary threshold of complexity makes both coding and maintenance harder. If you always use /x, then you can train your fingers to type it automatically for you, and you never need to think about it again. That's much more efficient and reliable than having to consciously[64] assess each regex you write to determine whether it merits the flag. And when you're maintaining the code, if you can rely on every regex having a /x flag, then you never have to check whether a particular regex is or isn't using the flag, and you never have to mentally switch regex "dialects".

In other words, it's perfectly okay to use the /x flag only when a regular expression exceeds some particular threshold of complexity…so long as you set that particular threshold at zero.

Line Boundaries

Always use the /m flag.

In addition to always using the /x flag, always use the /m flag. In every regular expression you ever write.

The normal behaviour of the ^ and $ metacharacters is unintuitive to most programmers, especially if they're coming from a Unix background. Almost all of the Unix utilities that feature regular expressions (e.g., sed, grep, awk) are intrinsically line-oriented. So in those utilities, ^ and $ naturally mean "match at the start of any line" and "match at the end of any line", respectively.

But they don't mean that in Perl.

In Perl, ^ and $ mean "match at the start of the entire string" and "match at the end of the entire string". That's a crucial difference, and one that leads to a very common type of mistake:

# Find the end of a Perl program...
$text =~ m{ [^\0]*?       # match the minimal number of non-null chars
            ^__END__$     # until a line containing only an end-marker
          }x;

In fact, what that code really does is:

$text =~ m{ [^\0]*?       # match the minimal number of non-null chars
            ^             # until the start of the string
            __END__       # then match the end-marker
            $             # then match the end of the string
          }x;

The minimal number of characters until the start of the string is, of course, zero[65]. Then the regex has to match '__END__'. And then it has to be at the end of the string. So the only strings that this pattern matches are those that consist of '__END__'. That is clearly not what was intended.

The /m mode makes ^ and $ work "naturally"[66]. Under /m, ^ no longer means "match at the start of the string"; it means "match at the start of any line". Likewise, $ no longer means "at end of string"; it means "at end of any line".

The previous example could be fixed by making those two metacharacters actually mean what the original developer thought they meant, simply by adding a /m:

# Find the end of a Perl program...
$text =~ m{ [^\0]*?      # any non-nulls
            ^__END__$    # until an end-marker line
          }xm;

Which now really means:

$text =~ m{ [^\0]*?      # match the minimal number of chars
            ^            # until the start of any line (/m mode)
            __END__      # then match the end-marker
            $            # then match the end of a line (/m mode)
          }xm;

Consistently using the /m on every regex makes Perl's behaviour consistently conform to your unreasonable expectations. So you don't have to unreasonably change your expectations to conform to Perl's behaviour[67].
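The difference is easy to see on a small sample. This sketch assumes a made-up three-line program text and applies the same anchored pattern with and without /m; the variable names are purely illustrative.

```perl
use strict;
use warnings;

# A hypothetical three-line "program", used only for illustration.
my $text = "print 1;\n__END__\nnotes\n";

# Without /m, ^ and $ anchor to the whole string, so the marker line is missed.
my $without_m = $text =~ m{ ^ __END__ $ }x  ? 'match' : 'no match';

# With /m, ^ and $ anchor to any line, so the marker line is found.
my $with_m    = $text =~ m{ ^ __END__ $ }xm ? 'match' : 'no match';

print "without /m: $without_m\n";
print "with /m:    $with_m\n";
```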

String Boundaries

Use \A and \z as string boundary anchors.

Even if you don't adopt the previous practice of always using /m, using ^ and $ with their default meanings is a bad idea. Sure, you know what ^ and $ actually mean in a Perl regex. But will those who read or maintain your code know? Or is it more likely that they will misinterpret those metacharacters in the ways described earlier?

Perl provides markers that always (and unambiguously) mean "start of string" and "end of string": \A and \z (capital A, but lowercase z). They mean "start/end of string" regardless of whether /m is active. They mean "start/end of string" regardless of what the reader thinks ^ and $ mean.

They also stand out well. They're unusual. They're likely to be unfamiliar to the readers of your code, in which case those readers will have to look them up, rather than blithely misunderstanding them.

So rather than:

# Remove leading and trailing whitespace...
$text =~ s{^ \s* | \s* $}{}gx;

use:

# Remove leading and trailing whitespace...
$text =~ s{\A \s* | \s* \z}{}gxm;

And when you later need to match line boundaries as well, you can just use ^ and $ "naturally":

# Remove leading and trailing whitespace, and any -- line...
$text =~ s{\A \s* | ^-- [^\n]* $ | \s* \z}{}gxm;

The alternative (in which ^ and $ each have three distinct meanings in different contexts) is unnecessarily cruel:

# Remove leading and trailing whitespace, and any -- line...
$text =~ s{^ \s* | (?m: ^-- [^\n]* $) | \s* $}{}gx;
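To see \A and \z coexisting with line-oriented ^ and $ under /m, it may help to wrap the two recommended substitutions in small subroutines. The subroutine names and the sample strings in this sketch are assumptions made for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: strip whitespace from both ends of the whole string.
sub trim_ends {
    my ($text) = @_;
    $text =~ s{\A \s* | \s* \z}{}gxm;       # \A and \z ignore /m
    return $text;
}

# Hypothetical helper: as above, but also blank out any line starting with --
sub trim_and_drop_dash_lines {
    my ($text) = @_;
    $text =~ s{\A \s* | ^-- [^\n]* $ | \s* \z}{}gxm;
    return $text;
}

my $trimmed = trim_ends("  \n  alpha  \n  beta  \n\n");
my $dropped = trim_and_drop_dash_lines("  keep\n-- drop me\nalso keep  \n");

print "[$trimmed]\n";
print "[$dropped]\n";
```

Note that the second substitution removes the contents of the -- line but not the newline that ended it, so an empty line is left behind in the result.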

End of String

Use \z, not \Z, to indicate "end of string".

Perl provides a variant of the \z marker: \Z. Whereas lowercase \z means "match at end of string", capital \Z means "match an optional newline, then at end of string". This variant can occasionally be convenient, if you're working with line-based input, as you don't have to worry about chomping the lines first:

# Print contents of lines starting with --...
LINE:
while (my $line = <>) {
    next LINE if $line !~ m/ \A -- ([^\n]+) \Z/xm;
    print $1;
}

But using \Z introduces a subtle distinction that can be hard to detect when displayed in some fonts. It's safer to be more explicit: to stick with using \z, and say precisely what you mean:

# Print contents of lines starting with --...
LINE:
while (my $line = <>) {
    next LINE if $line !~ m/ \A -- ([^\n]+) \n? \z/xm;  # Might be newline at end
    print $1;
}

especially if what you actually meant was:

# Print contents of lines starting with -- (including any trailing newline!)...
LINE:
while (my $line = <>) {
    next LINE if $line !~ m/ \A -- ([^\n]* \n?) \z/xm;
    print $1;
}

Using \n? \z instead of \Z forces you to decide whether the newline is part of the output or merely part of the scenery.
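The three variants can be compared side by side on a single unchomped line. In this sketch the pattern names and the sample line are invented for illustration; the patterns themselves are the ones shown above.

```perl
use strict;
use warnings;

my $IMPLICIT_NEWLINE = qr/ \A -- ([^\n]+) \Z     /xm;  # \Z quietly skips a final newline
my $NEWLINE_EXCLUDED = qr/ \A -- ([^\n]+) \n? \z /xm;  # newline matched, but not captured
my $NEWLINE_INCLUDED = qr/ \A -- ([^\n]* \n?) \z /xm;  # newline captured with the text

# A hypothetical unchomped input line.
my $line = "--alpha\n";

my ($implicit) = $line =~ $IMPLICIT_NEWLINE;
my ($excluded) = $line =~ $NEWLINE_EXCLUDED;
my ($included) = $line =~ $NEWLINE_INCLUDED;

# Show each capture with its newline made visible.
for my $captured ($implicit, $excluded, $included) {
    (my $visible = $captured) =~ s{\n}{\\n}gxms;
    print "[$visible]\n";
}
```

The first two patterns capture the same text; only the third keeps the newline, and only the explicit forms say so on the page.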

Matching Anything

Always use the /s flag.

At this point, you might be starting to detect a pattern. Once again, the problem is that the dot metacharacter (.) doesn't mean what most people think it means. Most people—even those who actually know better—habitually think of it as meaning: "match any character".

It's easy to forget that it doesn't really mean that, and accidentally write something like:

# Capture the source of a Perl program...
$text =~ m{\A             # From start of string...
           (.*?)          # ...match and capture any characters
           ^__END__$      # ...until the first __END__ line
          }xm;

But the dot metacharacter doesn't match newlines, so the only strings this regex will match are those that start with '__END__'. That's because the ^ (start-of-line) metacharacter can match only at the start of the string or after a newline. But the preceding dot metacharacter can never match a newline, so the only way the ^ can match is if the preceding dot matches a start-of-string. But the dot metacharacter never matches start-of-string, because dot always matches exactly one character and start-of-string isn't a character.

In other words, as with ^ and $, the default behaviour of the dot metacharacter fails to be unreasonable (i.e., to be what most people expect). Fortunately, however, dot can be made to conform to the typical programmer's unreasonable expectations, simply by adding the /s flag. Under /s mode, a dot really does match every character, including newline:

# Capture the source of a Perl program...

$text =~ m{\A             # From start of string...
           (.*?)          # ...match and capture any characters (including newlines!)
           ^__END__$      # ...until the first __END__ line
          }xms;

Of course, the question then becomes: if you always use /s, how do you get the normal "any char but newline" meaning of dot when you actually need it? As with many of these guidelines, you do it by saying explicitly what you mean. If you need to match any character that isn't a newline, that's just the complemented character class [^\n]:

# Delete comments....

$source_code =~ s{               # Substitute...
                   \#            # ...a literal octothorpe
                   [^\n]*        # ...followed by any number of non-newlines
                 }
                 {$SPACE}gxms;   # Replacing it with a single space
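Both halves of this guideline can be checked on a small sample: the dot that does or doesn't cross newlines, and the explicit [^\n] that replaces the old meaning of dot. The sample program text and the variable names below are assumptions made for this sketch.

```perl
use strict;
use warnings;

# A hypothetical program text, for illustration only.
my $text = "use strict;\nprint 'hi';\n__END__\nnotes\n";

# Without /s the dot cannot cross a newline, so nothing is captured.
my ($without_s) = $text =~ m{\A (.*?) ^__END__$ }xm;

# With /s the dot matches newlines too, so the source is captured.
my ($with_s)    = $text =~ m{\A (.*?) ^__END__$ }xms;

print defined $without_s ? "without /s: captured\n" : "without /s: no match\n";
print defined $with_s    ? "with /s:    captured\n" : "with /s:    no match\n";

# Under /s, "any char but newline" is spelled out as [^\n].
my $SPACE       = q{ };
my $source_code = "x = 1; # set x\ny = 2;\n";
$source_code =~ s{ \# [^\n]* }{$SPACE}gxms;

print $source_code;
```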

Lazy Flags

Consider mandating the Regexp::DefaultFlags module.

It takes about a week to accustom your fingers to automatically typing /xms at the end of every regex. But, realistically, some programmers will still not have the discipline required to develop and foster that good habit.

An alternative is to allow (that is, require) them to use the Regexp::DefaultFlags CPAN module instead, at the start of every source file they create. That module will then automatically turn on /xms mode in every regex they write.

That is, if they put:

use Regexp::DefaultFlags;

at the start of their file, from that point on they can write regexes like:

$text =~ m{\A             # From start of string...
           (.*?)          # ...match and capture any characters (including newlines!)
           ^__END__$      # ...until the first __END__ line
          };

and:

$source_code =~ s{               # Substitute...
                   \#            # ...a literal octothorpe
                   [^\n]*        # ...followed by any number of non-newlines
                 }
                 {$SPACE}g;      # Replacing it with a single space

They won't have to remember to append the all-important /xms flags, because the Regexp::DefaultFlags module will have automatically applied them.

Of course, this merely replaces the need for one kind of discipline (always use /xms) with the requirement for another (always use Regexp::DefaultFlags). However, it's much easier to check whether a single module has been loaded at least once than it is to verify that the right regex flags have been used everywhere.

Brace Delimiters

Use m{…} in preference to /.../ in multiline regexes.

You might have noticed that every regex in this book that spans more than a single line is delimited with braces rather than slashes. That's because it's much easier to identify the boundaries of the brace-delimited form, both by eye and from within an editor[68].

That ability is especially important in regexes where you need to match a literal slash, or in regexes which use many escape characters. For example, this:

Readonly my $C_COMMENT => qr{
    / \*   # Opening C comment delimiter
    .*?    # Smallest number of characters (C comments don't nest)
    \* /   # Closing delimiter
}xms;

is a little easier to read than the more heavily backslashed:

Readonly my $C_COMMENT => qr/
    \/ \*  # Opening C comment delimiter
    .*?    # Smallest number of characters (delims don't nest)
    \* \/  # Closing delimiter
/xms;

Using braces as delimiters can also be advantageous in single-line regexes that are heavily laden with slash characters. For example:

$source_code =~ s/ \/ \* (.*?) \* \/ //gxms;

is considerably harder to unravel than:

$source_code =~ s{ / \* (.*?) \* / }{}gxms;

In particular, a final empty {} as the replacement text is much easier to detect and decipher than a final empty //. Though, of course, it would be better still to write that substitution as:

$source_code =~ s{$C_COMMENT}{$EMPTY_STR}gxms;

to ensure maximum maintainability.
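Put together, the brace-delimited pieces above might be used something like this. The sketch uses plain lexical variables rather than Readonly (so that it runs without any CPAN modules), and the strip_c_comments() name and sample input are invented for the example.

```perl
use strict;
use warnings;

# Plain lexicals stand in for the Readonly constants used in the text.
my $EMPTY_STR = q{};

my $C_COMMENT = qr{
    / \*   # Opening C comment delimiter
    .*?    # Smallest number of characters (C comments don't nest)
    \* /   # Closing delimiter
}xms;

# Hypothetical helper: remove every C-style comment from the source text.
sub strip_c_comments {
    my ($source_code) = @_;
    $source_code =~ s{$C_COMMENT}{$EMPTY_STR}gxms;
    return $source_code;
}

my $stripped = strip_c_comments("a /* one\n two */ b /* three */ c");
print "[$stripped]\n";
```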

Using braces as regex delimiters has two other advantages. Firstly, in a substitution, the two "halves" of the operation can be placed on separate lines, to further distinguish them from each other. For example:

$source_code =~ s{$C_COMMENT}
                 {$EMPTY_STR}xms;

The second advantage is that raw braces "nest" correctly within brace delimiters, whereas raw slashes don't nest at all within slash-delimited patterns. This is a particular problem under the /x flag, because it means that a seemingly straightforward regex like:

# Parse a 'set' command in our mini-language...
m/
    set      \s+  # Keyword
    ($IDENT) \s*  # Name of file/option/mode
    =        \s*  # literal =
    ([^\n]*)      # Value of file/option/mode
/xms;

is seriously (and subtly) broken. It's broken because the compiler first determines that the regex delimiter is a slash, so it looks ahead to locate the next unescaped slash in the source, and treats the intervening characters as the pattern. Then it looks for any trailing regex flags, after which it continues parsing the next part of the current expression.

Unfortunately, in the previous example, the next unescaped slash in the source is the first unescaped slash in the line:

    ($IDENT) \s*  # Name of file/option/mode

which means that the regex finishes at that point, causing the code to be parsed as if it were something like:

m/
    set      \s+ # Keyword
    ($IDENT) \s* # Name of file/o
ption(  ) / mode(  ) = \s*         # literal =
                  ([^\n(  )]*)     # Value of file/option/mode
                  /xms(  )
               );

whereupon it complains bitterly about the illegal call to the (probably non-existent) ption( ) subroutine, when it was expecting an operator or semicolon after the end of that nice m/.../o pattern. It probably won't be too pleased about the incomplete s*...*...* substitution with the weird asterisk delimiters either, or the dodgy assignment to mode( ).

The problem is that programmers expect comments to have no compile-time semantics[69]. But, within a regex, a comment becomes a comment only after the parser has decided where the surrounding regex finishes. So a slash character that seems to be within a regex comment may actually be a slash delimiter in your code.

Using braces as delimiters significantly reduces the likelihood of encountering this problem:

m{
    set       \s+  # Keyword
    ($IDENT)  \s*  # Name of file/option/mode
    =         \s*  # literal =
    ([^\n]*)       # Value of file/option/mode
}xms;

because the slashes are no longer special to the parser, which consequently parses the entire regex correctly. Furthermore, as matching braces may be nested inside a brace-delimited regex, this variation is okay too:

m{
    set       \s+  # Keyword
    ($IDENT)  \s*  # Name of file/option/mode
    =         \s*  # literal =
    \{             # literal {
    ([^\n]*)       # Value of file/option/mode
    \}             # literal }
}xms;

Of course, unbalanced raw braces still cause problems within regex comments:

m{
    set       \s+  # Keyword
    ($IDENT)  \s*  # Name of file/option/mode
    =         \s*  # literal =
    ([^\n]*)       # Value of file/option/mode
    \}             # literal }
}xms;

However, unlike /, unbalanced raw braces are not a valid English punctuation form, and hence they're far rarer within comments than slashes. Besides which, the error message that's generated by that particular mistake:

Unmatched right curly bracket at demo.pl line 49, at end of line
(Might be a runaway multi-line {} string starting on line 42)

is much clearer than the sundry lamentations the equivalent slash-delimited version would produce:

Bareword found where operator expected at demo.pl line 46,
near "($IDENT)     # File/option"
    (Might be a runaway multi-line // string starting on line 42)
    (Missing operator before ption?)
Backslash found where operator expected at demo.pl line 49,
near ")     # File/option/mode value "
    (Missing operator before ?)
syntax error at demo.pl line 46, near "($IDENT)     # File/option"
Unmatched right curly bracket at demo.pl line 7, at end of line

So use m{...}xms in preference to /.../xms wherever possible. Indeed, the only reason to ever use slashes to delimit regexes is to improve the comprehensibility of short, embedded patterns. For example, within the blocks of list operations:

my @counts = map { m/(\d{4,8})/xms } @count_reports;

slashes are better than braces. A brace-delimited version of the same regex would be using braces to denote "code block", "regex boundary", and "repetition count", all within the space of 20 characters:

my @counts = map { m{(\d{4,8})}xms } @count_reports;

Using the slashes as the regex delimiters in this case increases the visual distinctiveness of the regex and thereby improves the overall readability of the code.

Other Delimiters

Don't use any delimiters other than /.../ or m{...}.

Although Perl allows you to use any non-whitespace character you like as a regex delimiter, don't. Because leaving some poor maintenance programmer to take care of (valid) code like this:

last TRY if !$!!~m!/pattern/!;

or this:

$same=m={===m=}=;

or this:

harry s truman was the 33rd u.s. president;

is just cruel.

Even with more reasonable delimiter choices:

last TRY if !$OS_ERROR !~ m!/pattern/!;

$same = m#{# == m#}#;

harry s|ruman was |he 33rd u.s. presiden|;

the boundaries of the regexes don't stand out well.

By sticking with the two recommended delimiters (and other best practices), you make your code more predictable, so it is easier for future readers to identify and understand your regexes:

last TRY if !$OS_ERROR !~ m{ /pattern/ }xms;

$same = ($str =~ m/{/xms  ==  $str =~ m/}/xms);

harry( $str =~ s{ruman was }{he 33rd u.s. presiden}xms );

Note that the same advice also applies to substitutions and transliterations: stick to s/.../.../xms or s{...}{...}xms, and tr/.../.../ or tr{...}{...}.

Metacharacters

Prefer singular character classes to escaped metacharacters.

Escaped metacharacters are harder to decipher, and harder to distinguish from their unescaped originals:

m/ \{ . \. \d{2} \} /xms;

The alternative is to put each metacharacter in its own tiny, one-character character class, like so:

m/ [{] . [.] \d{2} [}] /xms;

Once you're familiar with this convention, it's very much easier to see the literal metacharacters when they're square-bracketed. That's particularly true for spaces under the /x flag. For example, the literal spaces to be matched in:

$name =~ m{ harry [ ] s [ ] truman
          | harry [ ] j [ ] potter
          }ixms;

stand out much better than those in:

$name =~ m{ harry \ s \ truman
          | harry \ j \ potter
          }ixms;

Note, however, that this approach can reduce the optimizer's ability to accelerate pattern matching under some versions of Perl. If benchmarking (see Chapter 19) indicates that this may be a problem for you, try the alternative approach suggested in the next guideline.
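The practical difference between a bracketed space and a bare one under /x is easy to demonstrate. The following sketch uses invented pattern names and test strings, and adds \A and \z anchors (not present in the patterns above) so that each test is against the whole string.

```perl
use strict;
use warnings;

# A bracketed space is a real space, even under /x...
my $BRACKETED = qr{ \A harry [ ] s [ ] truman \z }ixms;

# ...whereas bare spaces under /x are only layout, and match nothing.
my $BARE      = qr{ \A harry   s   truman \z }ixms;

# Bracketed metacharacters are literal: a brace, any char, a dot, two digits, a brace.
my $BRACED    = qr/ \A [{] . [.] \d{2} [}] \z /xms;

for my $name ('Harry S Truman', 'HarrySTruman') {
    printf "%-16s bracketed: %-3s bare: %-3s\n",
        $name,
        ($name =~ $BRACKETED ? 'yes' : 'no'),
        ($name =~ $BARE      ? 'yes' : 'no');
}

print '{x.42} ', ('{x.42}' =~ $BRACED ? 'matches' : 'does not match'), "\n";
```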

Named Characters

Prefer named characters to escaped metacharacters.

As an alternative to the previous guideline, Perl 5.6 (and later) supports named characters in regexes. As previously discussed[70], this mechanism is much better for "unprintable" components of a regex. For example, instead of:

if ($escape_seq =~ /\177 \006 \030 Z/xms) {   # Octal DEL-ACK-CAN-Z
    blink(182);
}

use:

use charnames qw( :full );

if ($escape_seq =~ m/\N{DELETE} \N{ACKNOWLEDGE} \N{CANCEL} Z/xms) {
    blink(182);
}

Note, however, that named whitespace characters are treated like ordinary whitespace (i.e., they're ignored) under the /x flag:

use charnames qw( :full );
# and later...
$name =~ m{ harry \N{SPACE} s \N{SPACE} truman     # harrystruman
          | harry \N{SPACE} j \N{SPACE} potter     # harryjpotter
          }ixms;

You would still need to put them in character classes to make them match:

use charnames qw( :full );

# and later...

$name =~ m{ harry [\N{SPACE}] s [\N{SPACE}] truman     # harry s truman
          | harry [\N{SPACE}] j [\N{SPACE}] potter     # harry j potter
          }ixms;

Properties

Prefer properties to enumerated character classes.

Explicit character classes are frequently used to match character ranges, especially alphabetics. For example:

# Alphabetics-only identifier...
Readonly my $ALPHA_IDENT => qr/ [A-Z] [A-Za-z]* /xms;

However, a character class like that doesn't actually match all possible alphabetics. It matches only ASCII alphabetics. It won't recognize the common Latin-1 variants, let alone the full gamut of Unicode alphabetics.

That result might be okay, if you're sure your data will never be other than parochial, but in today's post-modern, multicultural, outsourced world it's rather déclassé for an überhacking rōnin to create identifier regexes that won't even match 'déclassé' or 'überhacking' or 'rōnin'.

Regular expressions in Perl 5.6 and later[71] support the use of the \p{...} escape, which allows you to use full Unicode properties. Properties are Unicode-compliant named character classes and are both more general and more self-documenting than explicit ASCII character classes. The perlunicode manpage explains the mechanism in detail and lists the available properties.

So, if you're ready to concede that ASCII-centrism is a naïve façade that's gradually fading into Götterdämmerung, you might choose to bid it adiós and open your regexes to the full Unicode smörgåsbord, by changing the previous identifier regex to:

Readonly my $ALPHA_IDENT => qr/ \p{Uppercase}  \p{Alphabetic}* /xms;

There are even properties to help create identifiers that follow the normal Perl conventions but are still language-independent. Instead of:

Readonly my $PERL_IDENT => qr/ [A-Za-z_] \w*/xms;

you can use:

Readonly my $PERL_IDENT => qr/ \p{ID_Start} \p{ID_Continue}* /xms;

One other particularly useful property is \p{Any}, which provides a more readable alternative to the normal dot (.) metacharacter. For example, instead of:

m/ [{] . [.] \d{2} [}] /xms;

you could write:

m/ [{] \p{Any} [.] \d{2} [}] /xms;

and leave the reader in no doubt that the second character to be matched really can be anything at all—an ASCII alphabetic, a Latin-1 superscript, an Extended Latin diacritical, a Devanagari number, an Ogham rune, or even a Bopomofo symbol.
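As a rough check of the difference between the enumerated and property-based identifier patterns, consider the sketch below. It anchors both patterns with \A and \z (which the versions in the text don't do) so that whole names are tested, uses plain lexicals instead of Readonly, and writes the non-ASCII letters as \x{...} escapes so that the example doesn't depend on the source file's encoding. The sample names are invented.

```perl
use strict;
use warnings;

my $ASCII_IDENT = qr/ \A [A-Z] [A-Za-z]* \z /xms;
my $ALPHA_IDENT = qr/ \A \p{Uppercase} \p{Alphabetic}* \z /xms;
my $UNI_IDENT   = qr/ \A \p{ID_Start} \p{ID_Continue}* \z /xms;

# "Ronin" with U+014D (small o with macron) as its second letter.
my $ronin = "R\x{14D}nin";

# "alpha_1" with U+03B1 (Greek small alpha) as its first letter.
my $alpha = "\x{3B1}lpha_1";

my $ascii_verdict = $ronin =~ $ASCII_IDENT ? 'accepted' : 'rejected';
my $alpha_verdict = $ronin =~ $ALPHA_IDENT ? 'accepted' : 'rejected';
my $ident_verdict = $alpha =~ $UNI_IDENT   ? 'accepted' : 'rejected';

print "ASCII class:       $ascii_verdict\n";
print "Unicode property:  $alpha_verdict\n";
print "ID_Start/Continue: $ident_verdict\n";
```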

Whitespace

Consider matching arbitrary whitespace, rather than specific whitespace characters.

Unless you're matching regular expressions against fixed-format machine-generated data, avoid matching specific whitespace characters exactly. Because if humans were directly involved anywhere in the data acquisition, then the notion of "fixed" will probably have been more honoured in the breach than in the observance.

If, for example, the input is supposed to consist of a label, followed by a single space, followed by an equals sign, followed by a single space, followed by a value . . . don't bet on it. Most users nowadays will, quite reasonably, assume that whitespace is negotiable; nothing more than an elastic formatting medium. So, in a configuration file, you're just as likely to get something like:

name       = Yossarian, J
rank       = Captain
serial_num = 3192304

The whitespaces in that data might be single tabs, multiple tabs, multiple spaces, single spaces, or any combination thereof. So matching that data with a pattern that insists on exactly one space character at the relevant points is unlikely to be uniformly successful:

$config_line =~ m{ ($IDENT)  [\N{SPACE}]  =  [\N{SPACE}]  (.*) }xms

Worse still, it's also unlikely to be uniformly unsuccessful. For instance, in the example data, it might only match the serial number. And that kind of intermittent success will make your program much harder to debug. It might also make it difficult to realize that any debugging is required.

Unless you're specifically vetting data to verify that it conforms to a required fixed format, it's much better to be very liberal in what you accept when it comes to whitespace. Use \s+ for any required whitespace and \s* for any optional whitespace. For example, it would be far more robust to match the example data against:

$config_line =~ m{ ($IDENT)  \s*  =  \s*  (.*) }xms
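Running both patterns over the sample configuration data shows the intermittent success described above. In this sketch the definition of $IDENT is an assumption (the text doesn't give one), [ ] stands in for [\N{SPACE}], the second sample line is assumed to use tabs, and the two subroutine names are invented.

```perl
use strict;
use warnings;

# Assumed definition of an identifier; the text does not specify one.
my $IDENT = qr/ [A-Za-z_] \w* /xms;

# Hypothetical helper: insist on exactly one space either side of the =
sub parse_strict {
    my ($config_line) = @_;
    return $config_line =~ m{ ($IDENT)  [ ]  =  [ ]  (.*) }xms;
}

# Hypothetical helper: accept any amount of whitespace around the =
sub parse_liberal {
    my ($config_line) = @_;
    return $config_line =~ m{ ($IDENT)  \s*  =  \s*  (.*) }xms;
}

my @config_lines = (
    'name       = Yossarian, J',
    "rank\t=\tCaptain",
    'serial_num = 3192304',
);

for my $config_line (@config_lines) {
    my @strict  = parse_strict($config_line);
    my @liberal = parse_liberal($config_line);

    printf "strict: %-3s liberal: %-3s (%s)\n",
        (@strict  ? 'ok' : 'no'),
        (@liberal ? 'ok' : 'no'),
        join q{ => }, @liberal;
}
```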

Unconstrained Repetitions

Be specific when matching "as much as possible".

The .* construct is a particularly blunt and ponderous weapon, especially under /s. For example, consider the following parser for some very simple language, in which source code, data, and configuration information are separated by % and & characters (which are otherwise illegal):

# Format is: <statements> % <data> & <config>...

if ($source =~ m/\A  (.*)  %  (.*)  &  (.*) /xms) {
    my ($statements, $data, $config) = ($1, $2, $3);

    my $prog = compile($statements, {config=>$config});
    my $res  = execute($prog, {data=>$data, config=>$config});
}
else {
    croak 'Invalid program';
}

Under /s, the first .* will successfully match the entire string in $source. Then it will attempt to match a %, and immediately fail (because there's none of the string left to match). At that point the regex engine will backtrack one character from the end of the string and try to match a % again, which will probably also fail. So it will backtrack one more character, try again, backtrack once more, try again, et cetera, et cetera, et cetera.

Eventually it will backtrack far enough to successfully match %, whereupon the second .* will match the remainder of the string, then fail to match &, backtrack one character, try again, fail again, and the entire "one-step-forward-two-steps-back" sequence will be played out again. Sequences of unconstrained matches like this can easily cause regular expression matches to become unacceptably slow.

Using a .*? can help in such cases:

if ($source =~ m/\A  (.*?)  %  (.*?)  &  (.*) /xms) {
    my ($statements, $data, $config) = ($1, $2, $3);

    my $prog = compile($statements, {config=>$config});
    my $res  = execute($prog, {data=>$data, config=>$config});
}
else {
    croak 'Invalid program';
}

since the "parsimonious repetitions" will then consume as little of the string as possible. But, to do this, they effectively have to do a look-ahead at every character they match, which can also become expensive if the terminator is more complicated than just a single character.

More importantly, both .* and .*? can also mask logical errors in the parsing process. For example, if the program incorrectly had an extra % or & in it, that would simply be consumed by one of the .* or .*? constructs, and therefore treated as part of the code or data, rather than as an error.

If you know precisely what character (or characters) the terminator of a "match anything" sequence will be, then it's very much more efficient—and clearer—to use a complemented character class instead:

# Format is: <source> % <data> & <config>...

if ($source =~ m/\A  ([^%]*)  %  ([^&]*)  &  (.*) /xms) {
    my ($statements, $data, $config) = ($1, $2, $3);

    my $prog = compile($statements, {config=>$config});
    my $res  = execute($prog, {data=>$data, config=>$config});
}
else {
    croak 'Invalid program';
}

This version matches every non-% (using [^%]*), followed by a %, followed by every non-& (via [^&]*), followed by a &, followed by the rest of the string (.*). The principal advantage is that the complemented character classes don't have to do per-character look-ahead like .*?, nor per-character backtracking like .*. Nor will this version allow an extra % in the source or & in the data. Once again, you're encoding your exact intentions.

Note that the .* at the end of the regex is still perfectly okay. When it finally gets its chance and gobbles up the rest of the source, the match will then be finished, so no backtracking will ever occur. On the other hand, putting a .*? at the end of a regular expression is always a mistake, as it will always successfully match nothing, at which point the pattern match will succeed and then terminate. A final .*? is either redundant, or it's not doing what you intended, or you forgot a \z anchor.
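One way to see how differently these patterns carve up an input is to feed them a source string containing a stray %. The sketch below only reports where each pattern puts the separators; the subroutine names and the sample input are assumptions, and no claim is made here about timing.

```perl
use strict;
use warnings;

# Hypothetical helper: split using unconstrained greedy repetitions.
sub split_greedy {
    my ($source) = @_;
    return $source =~ m/\A  (.*)  %  (.*)  &  (.*) /xms;
}

# Hypothetical helper: split using complemented character classes.
sub split_specific {
    my ($source) = @_;
    return $source =~ m/\A  ([^%]*)  %  ([^&]*)  &  (.*) /xms;
}

# A made-up input with an extra % in it.
my $source = 'a;b % 1%2 & cfg';

my @greedy   = split_greedy($source);
my @specific = split_specific($source);

print 'greedy:   ', join('|', @greedy),   "\n";
print 'specific: ', join('|', @specific), "\n";

# A trailing .*? is satisfied by matching nothing at all.
my ($tail) = 'abc' =~ m/\A a (.*?) /xms;
print "trailing .*? captured [$tail]\n";
```

With the greedy version the stray % is silently absorbed into the first capture; with the complemented classes the first capture always stops at the first %, so a stray separator ends up where a later validity check on the data can see it.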

Capturing Parentheses

Use capturing parentheses only when you intend to capture.

It's a waste of processor cycles to capture a substring you don't need. More importantly, it's misleading to do so. When the unfortunates who have to maintain the following code see:

if ( $cmd =~ m/\A (q | quit | bye | exit) \n? \z/xms ) {
    perform_cleanup(  );
    exit;
}

they will almost certainly start casting around to determine where $1 is used (perhaps for an exit confirmation request, or inside perform_cleanup( )).

They'll be rightly annoyed when they eventually discover that $1 isn't used anywhere. Because now they can't be sure whether that indicates a bug, or was just laziness on the part of the original coder. Hence, they'll probably have to re-examine the logic of perform_cleanup( ) to determine whether that unused capture is actually A.W.O.L. And that's a waste of maintainer cycles.

Perl provides a form of regex parentheses that deliberately don't capture: the (?:...) parentheses. If the previous example had been written:

if ( $cmd =~ m/\A (?:q | quit | bye | exit) \n? \z/xms ) {
    perform_cleanup(  );
    exit;
}

then there would be no doubt that the parentheses were being used simply to group the four alternative "exit" commands, rather than to capture the particular "exit" command used.

Use non-capturing parentheses by default, and reserve capturing parentheses for when you need to make use of some part of a matched string. That way, your coded instructions will also encode your intentions, which is a much more robust and effective style of programming.

Captured Values

Use the numeric capture variables only when you're sure that the preceding match succeeded.

Pattern matches that fail never assign anything to $1, $2, etc., nor do they leave those variables undefined. After an unsuccessful pattern match, the numeric capture variables remain exactly as they were before the match was attempted. Often, that means that they retain whatever values some earlier successful pattern match gave them.

So you can't test whether a pattern has matched by testing the numeric capture variables directly. A common mistake along those lines is to write something like:

$full_name =~ m/\A (Mrs?|Ms|Dr) \s+ (\S+) \s+ (\S+) \z/xms;
if (defined $1) {
    ($title, $first_name, $last_name) = ($1, $2, $3);
}

The problem is that, if the match fails, $1 may still have been set by some earlier successful match in the same scope, in which case the three variables would be assigned capture values left over from that previous match.
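The effect is easy to reproduce. In the following sketch (the names being matched are hypothetical), a failed match leaves behind the captures of the earlier, successful match in the same scope:

```perl
use strict;
use warnings;

my @leftovers;

# An earlier, successful match sets $1 and $2...
if ('Dr Jekyll' =~ m/\A (Dr) \s+ (\S+) \z/xms) {
    # ...then a later match in the same scope fails (no recognised title)...
    'Edward Hyde esq' =~ m/\A (Mrs?|Ms|Dr) \s+ (\S+) \s+ (\S+) \z/xms;

    # ...yet $1 and $2 still hold the captures from the earlier success
    @leftovers = ($1, $2);
}

print "Stale captures: @leftovers\n";    # Stale captures: Dr Jekyll
```

A defined $1 therefore says nothing about whether the most recent match succeeded.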

Captured values should be used only when it's certain they actually were captured. The easiest way to ensure that is to always put capturing matches inside some kind of preliminary boolean test. For example:

if ($full_name =~ m/\A (Mrs?|Ms|Dr) \s+ (\S+) \s+ (\S+) \z/xms) {
    ($title, $first_name, $last_name) = ($1, $2, $3);
}

or:

next NAME if $full_name !~ m/\A (Mrs?|Ms|Dr) \s+ (\S+) \s+ (\S+) \z/xms;

($title, $first_name, $last_name) = ($1, $2, $3);

Capture Variables

Always give captured substrings proper names.

$1, $2, etc. are dreadful names for variables. Like the parameter variables $_[0], $_[1], etc. (see "Named Arguments" in Chapter 9), they convey absolutely nothing about the values they store, except the order in which they occurred. They produce unreadable code like this:

CONFIG_LINE:
while (my $config = <>) {
    # Ignore lines that are unrecognisable...
    next CONFIG_LINE
        if $config !~ m/ \A  (\S+)  \s* = \s*  ([^;]+) ;  \s* \# (.*)/xms;

    # Verify the option makes sense...
    debug($3);
    croak "Unknown option ($1)"
        if not exists $option{$2};

    # Record the configuration option...
    $option{$2} = $1;
}

As the capture variables don't have meaningful names, it's much harder to work out what this code is actually doing, and to verify that it's correct. (It's not.)

Because numbered variables suffer from the same drawbacks as numbered arguments, it's not surprising that the solution is the same, too: simply unpack $1, $2, etc. into sensibly named variables immediately after a successful match. Doing that makes the purpose—and the errors—much more obvious:

CONFIG_LINE:
while (my $config = <>) {
    # Ignore lines that are unrecognisable...
    next CONFIG_LINE
        if $config !~ m/ \A  (\S+)  \s* = \s*  ([^;]+) ;  \s* \# (.*)/xms;

    # Name captured components...

    my ($opt_name, $opt_val, $comment) = ($1, $2, $3);

    # Verify the option makes sense...
    debug($comment);
    croak "Unknown option ($opt_name)"
        if not exists $option{$opt_val};   # Oops: value used as key

    # Record the configuration option...
    $option{$opt_val} = $opt_name;         # Oops*2: value as key; name as value
}

That, in turn, makes the code far easier to correct:

CONFIG_LINE:
while (my $config = <>) {
    # Ignore lines that are unrecognisable...
    next CONFIG_LINE
        if $config !~ m/ \A  (\S+)  \s* = \s*  ([^;]+) ;  \s* \# (.*)/xms;

    # Name captured components...

    my ($opt_name, $opt_val, $comment) = ($1, $2, $3);

    # Verify that the option makes sense...
    debug($comment);
    croak "Unknown option ($opt_name)"
        if not exists $option{$opt_name};  # Name used as key

    # Record the configuration option...
    $option{$opt_name} = $opt_val;         # Name as key; value as value
}

Naming the captures improves maintainability in another way too. If it later became necessary to capture some other piece of the match, some of the numbered variables might change number. For example, suppose you needed to support appending to an option as well as assigning. Then you'd need to capture the operator as well. The original code would have become:

CONFIG_LINE:
while (my $config = <>) {
    # Ignore lines that are unrecognisable...
    next CONFIG_LINE
        if $config !~ m/\A (\S+) \s* (=|[+]=) \s* ([^;]+) ; \s* \# (.*)/xms;

    # Verify that the option makes sense...
    debug($4);
    croak "Unknown option ($1)"
        if not exists $option{$1};

    # Replace or append value depending on specified operator...
    if ($2 eq '=') {
        $option{$1} = $3;
    }
    else {
        $option{$1} .= $3;
    }
}

The Variable Formerly Known As $2 is now $3, and the old $3 is now $4. The odds of correctly managing that code change diminish rapidly as the size of the if block—or the number of captures—increases. But, if the captures are unpacked into named variables, then none of the previous names needs to change when a new capture is added:

CONFIG_LINE:
while (my $config = <>) {
    # Ignore lines that are unrecognisable...
    next CONFIG_LINE
        if $config !~ m/\A (\S+) \s* (=|[+]=) \s* ([^;]+) ; \s* \# (.*)/xms;

    # Unpack the components of the config line...
    my ($opt_name, $operator, $opt_val, $comment) = ($1, $2, $3, $4);

    # Verify that the option makes sense...
    debug($comment);
    croak "Unknown option ($opt_name)"
        if not exists $option{$opt_name};

    # Replace or append value depending on specified operator...
    if ($operator eq '=') {
        $option{$opt_name} = $opt_val;
    }
    else {
        $option{$opt_name} .= $opt_val;
    }
}

Better still, Perl provides a way to assign captured substrings directly to named variables, without ever mentioning the numbered variables explicitly. If a regex match is performed in a list context, the list it returns is the list of captures that it made. That is, a match in a list context returns the list ($1, $2, $3, etc.). Those captures can then be unpacked directly, like so:

CONFIG_LINE:
while (my $config = <>) {
    # Match config line in list context, capturing components into named vars...
    my ($opt_name, $operator, $opt_val, $comment)
        = $config =~ m/\A (\S+) \s* (=|[+]=) \s* ([^;]+) ; \s* \# (.*)/xms;

    # Process line only if it was recognizable...
    next CONFIG_LINE if !defined $opt_name;

    # Verify that the option makes sense...
    debug($comment);
    croak "Unknown option ($opt_name)"
        if not exists $option{$opt_name};

    # Replace or append value depending on specified operator...
    if ($operator eq '=') {
        $option{$opt_name} = $opt_val;
    }
    else {
        $option{$opt_name} .= $opt_val;
    }
}

Capturing directly to named variables in this way avoids the possibility of introducing subtle unpacking mistakes such as:

# Ignore lines that are unrecognisable...
next CONFIG_LINE
    if $config !~ m/ \A  (\S+)  \s* (=|[+]=) \s*  ([^;]+) ;  \s* \# (.*)/xms;

# Unpack the components of the config line...
my ($opt_name, $operator, $opt_val, $comment) = ($1, $2, $3);    # Missing $4!

because a match in a list context always returns all of its captures, not just the ones you remembered to specify explicitly.

List-context captures are the least error-prone way of extracting information from pattern matches and, hence, strongly recommended. Note, however, that list-context captures aren't appropriate for regexes that use the /gc modifier (see the following guideline, "Piecewise Matching").
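The behaviour of a list-context match can be checked in isolation. The sketch below wraps the config-line pattern in a hypothetical helper, parse_config_line( ), which is not part of the original example; the sample lines are likewise made up:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (name, operator, value, comment),
# or an empty list if the line isn't recognisable...
sub parse_config_line {
    my ($line) = @_;
    return $line =~ m/\A (\S+) \s* (=|[+]=) \s* ([^;]+) ; \s* \# (.*)/xms;
}

my @parsed  = parse_config_line('colour += red; # append a colour');
my @invalid = parse_config_line('this line is not a config setting');

print scalar(@parsed),  " captures from the valid line\n";      # 4 captures...
print scalar(@invalid), " captures from the invalid line\n";    # 0 captures...
```

A successful match hands back every capture, and a failed match hands back none, which is why testing the first unpacked variable for definedness is sufficient.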

Piecewise Matching

Tokenize input using the /gc flag.

The typical approach to breaking an input string into individual tokens is to "nibble" at it, repeatedly biting off the start of the input string with successive substitutions:

while (length $input > 0) {
    if ($input =~ s{\A ($KEYWORD)}{}xms) {
        my $keyword = $1;
        push @tokens, start_cmd($keyword);
    }
    elsif ($input =~ s{\A ($IDENT)}{}xms) {
        my $ident = $1;
        push @tokens, make_ident($ident);
    }
    elsif ($input =~ s{\A ($BLOCK)}{}xms) {
        my $block = $1;
        push @tokens, make_block($block);
    }
    else {
        my ($context) = $input =~ m/ \A ([^\n]*) /xms;
        croak "Error near: $context";
    }
}

But this approach requires a modification to the $input string on every successful match, which makes it expensive to start with, and then causes it to scale badly as well. Nibbling away at strings is slow and gets slower as the strings get bigger.

In Perl 5.004 and later, there's a much better way to use regexes for tokenizing an input: you can just "walk" the string, using the /gc flag. The /gc flag tells a regex to track where each successful match finishes matching. You can then access that "end-of-the-last-match" position via the built-in pos( ) function. There is also a \G metacharacter, which is a positional anchor, just like \A is. However, whereas \A tells the regex to match only at the start of the string, \G tells it to match only where the previous successful /gc match finished. If no previous /gc match was successful, \G acts like \A and matches only at the start of the string.

All of which means that, instead of using a regex substitution to lop each token off the start of the string (s{\A...}{}), you can simply use a regex match to start looking for the next token at the point where the previous token match finished (m{\G...}gc).
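The interaction between /gc, \G, and pos( ) is easiest to see on a tiny string. This is only an illustrative sketch; the input and the variable names are hypothetical:

```perl
use strict;
use warnings;

my $input = 'aaabbb';
my @trace;

# Walk over the leading a's, one /gc match at a time,
# recording where each match finished...
while ($input =~ m{ \G a }gcxms) {
    push @trace, pos $input;
}

# The failed match that ended the loop did not reset pos( ), thanks to /c,
# so \G resumes exactly where the last 'a' finished...
my $rest;
if ($input =~ m{ \G (b+) }gcxms) {
    $rest = $1;
}

print "positions: @trace\n";    # positions: 1 2 3
print "rest: $rest\n";          # rest: bbb
```

Without the /c, the failed fourth attempt to match an 'a' would have reset the position to the start of the string, and the subsequent \G (b+) would have failed.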

So the previous tokenizer could be rewritten more efficiently as:

# Reset the matching position of $input to the beginning of the string...

pos $input = 0;

# ...and continue until the matching position is past the last character...
while (pos $input < length $input) {
    if ($input =~ m{ \G ($KEYWORD) }gcxms) {
        my $keyword = $1;
        push @tokens, start_cmd($keyword);
    }
    elsif ($input =~ m{ \G ( $IDENT) }gcxms) {
        my $ident = $1;
        push @tokens, make_ident($ident);
    }
    elsif ($input =~ m{ \G ($BLOCK) }gcxms) {
        my $block = $1;
        push @tokens, make_block($block);
    }
    else {
        $input =~ m/ \G ([^\n]*) /gcxms;
        my $context = $1;
        croak "Error near: $context";
    }
}

Of course, because this style of parsing inevitably spawns a series of cascaded if statements that all feed the same @tokens array, it's even better practice to use the ternary operator and create a "parsing table" (see "Tabular Ternaries" in Chapter 6):

while (pos $input < length $input) {
    push @tokens,  (
                       # For token type...      #  Build token...
             $input =~ m{ \G ($KEYWORD) }gcxms  ?  start_cmd($1)
           : $input =~ m{ \G ( $IDENT ) }gcxms  ?  make_ident($1)
           : $input =~ m{ \G ( $BLOCK ) }gcxms  ?  make_block($1)
           : $input =~ m{ \G ( [^\n]* ) }gcxms  ?  croak "Error near:$1"
           :                                       die 'Internal error'
    );
}

Note that these examples don't use direct list capturing to rename the capture variables (as recommended in the preceding guideline). Instead they pass $1 into a token-constructing subroutine immediately after the match. That's because a list capture would cause the regex to match in list context, which would force the /g component of the flag to incorrectly match every occurrence of the pattern, rather than just the next one.
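The difference between the two contexts can be demonstrated with a deliberately simple word pattern. The input string and variable names below are assumptions made for the sake of the sketch:

```perl
use strict;
use warnings;

my $input = 'alpha beta gamma';

# Scalar context: /gc matches only the next word and advances pos( )...
pos($input) = 0;
my ($first_word, $pos_after_one);
if ($input =~ m{ \G \s* (\w+) }gcxms) {
    ($first_word, $pos_after_one) = ($1, pos $input);
}

# List context: the same pattern is applied repeatedly,
# consuming every word in a single operation...
pos($input) = 0;
my @all_words     = $input =~ m{ \G \s* (\w+) }gcxms;
my $pos_after_all = pos $input;

print "scalar: $first_word (pos $pos_after_one)\n";   # scalar: alpha (pos 5)
print "list:   @all_words (pos $pos_after_all)\n";    # list:   alpha beta gamma (pos 16)
```

In a tokenizer, that list-context behaviour would swallow every consecutive token of one type before the other token types were ever tried.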

Tabular Regexes

Build regular expressions from tables.

Tables like the one shown at the end of the previous guideline are a cleaner way of structuring regex matches, but they can also be a cleaner way of building a regex in the first place—especially when the resulting regex will be used to extract keys for the table.

Don't duplicate existing table information as part of a regular expression:

# Table of irregular plurals...
my %irregular_plural_of = (
    'child'       => 'children',
    'brother'     => 'brethren',
    'money'       => 'monies',
    'mongoose'    => 'mongooses',
    'ox'          => 'oxen',
    'cow'         => 'kine',
    'soliloquy'   => 'soliloquies',
    'prima donna' => 'prime donne',
    'octopus'     => 'octopodes',
    'tooth'       => 'teeth',
    'toothfish'   => 'toothfish',
);

# Pattern matching any of those irregular plurals...

my $has_irregular_plural = qr{
    child     | brother     | mongoose
  | ox        | cow         | monkey
  | soliloquy | prima donna | octopus
  | tooth(?:fish)?
}xms;

# Form plurals...
while (my $word = <>) {
    chomp $word;
    if ($word =~ m/\A ($has_irregular_plural) \z/xms) {
        print $irregular_plural_of{$word}, "\n";
    }
    else {
        print form_regular_plural_of($word), "\n";
    }
}

Apart from the annoying redundancy of specifying each key twice, this kind of duplication is a prime opportunity for mistakes to creep in. As they did—twice—in the previous example[72].

It's much easier to ensure consistency between a look-up table and the regex that feeds it if the regex is automatically constructed from the table itself. That's relatively easy to achieve, by replacing the regex definition with:

# Build a pattern matching any of those irregular plurals...
my $has_irregular_plural
    = join '|', map {quotemeta $_} reverse sort keys %irregular_plural_of;

The assignment statement starts by extracting the keys from the table (keys %irregular_plural_of), then sorts them in reverse order (reverse sort keys %irregular_plural_of). Sorting them is critical because the order in which hash keys are returned is unpredictable, so there's a 50/50 chance that the key 'tooth' will appear in the key list before the key 'toothfish'. That would be unfortunate, because the list of keys is about to be converted to a list of alternatives, and regexes always match the left-most alternative first. In that case, the word "toothfish" would always be matched by the alternative 'tooth', rather than by the later alternative 'toothfish'.

Once the keys are in a reliable order, the map operation escapes any metacharacters within the keys (map {quotemeta $_} keys %irregular_plural_of). This step ensures, for example, that 'prima donna' becomes 'prima\ donna', and so behaves correctly under the /x flag. The various alternatives are then joined together with standard "or" markers to produce the full pattern.
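Both effects (the ordering of alternatives and the escaping of spaces) can be observed with just three keys. This is a minimal sketch with hypothetical variable names, and it uses an unanchored match so that the first successful alternative is visible:

```perl
use strict;
use warnings;

my @keys = ('tooth', 'toothfish', 'prima donna');

# Shorter key first: 'tooth' wins before 'toothfish' is ever tried...
my $unsorted       = join '|', map {quotemeta $_} @keys;
my ($hit_unsorted) = 'toothfish' =~ m/($unsorted)/xms;

# Reverse-sorted: the longer key is now the earlier alternative...
my $sorted       = join '|', map {quotemeta $_} reverse sort @keys;
my ($hit_sorted) = 'toothfish' =~ m/($sorted)/xms;

# quotemeta escapes the space, so it survives the /x flag...
my $matches_spaced = 'prima donna' =~ m/\A (?:$sorted) \z/xms ? 1 : 0;

print "unsorted: $hit_unsorted\n";      # unsorted: tooth
print "sorted:   $hit_sorted\n";        # sorted:   toothfish
print "spaced:   $matches_spaced\n";    # spaced:   1
```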

Setting up this automated process takes a little extra effort, but it significantly improves the robustness of the resulting code. Not only does it eliminate the possibility of mismatches between the table keys and the regex alternatives, it also makes extending the table a one-step operation: just add the new singular/plural pair to the initialization of %irregular_plural_of; the pattern in $has_irregular_plural will automatically reconfigure itself accordingly.

About the only way the code could be further improved would be to factor out the hairy regex-building statements into a subroutine:

# Build a pattern matching any of the arguments given...
sub regex_that_matches {
    return join '|', map {quotemeta $_} reverse sort @_;
}

# and later...

my $has_irregular_plural
    = regex_that_matches(keys %irregular_plural_of);

Note that—as is so often the case—refactoring shaggy code in this way not only cleans up the source in which the statements were formerly used, but also makes the refactored statements themselves a little less hirsute.

Note that if you're in some strange locale where strings with common prefixes don't sort shortest-to-longest, then you may need to be more specific (but less efficient) about your sorting order, by including an explicit length comparison in your sort block:

# Build a pattern matching any of the arguments given...
sub regex_that_matches {
    return join '|',
                map {quotemeta $_}
                    # longest strings first, otherwise alphabetically...
                    sort { length($b) <=> length($a) or $a cmp $b }
                         @_;
}

# and later...

my $has_irregular_plural
    = regex_that_matches(keys %irregular_plural_of);

Constructing Regexes

Build complex regular expressions from simpler pieces.

Building a regular expression from the keys of a hash is a special case of a much more general best practice. Most worthwhile regexes—even those for simple tasks—are still too tedious or too complicated to code directly. For example, to extract the components of a number, you could write:

my ($number, $sign, $digits, $exponent)
    = $input =~ m{ (                          # Capture entire number
                     ( [+-]? )                # Capture leading sign (if any)
                     ( \d+ (?: [.] \d*)?      # Capture mantissa: NNN.NNN
                     | [.] \d+                #               or:    .NNN
                     )
                     ( (?:[Ee] [+-]? \d+)? )  # Capture exponent (if any)
                   )
                 }xms;

Even with the comments, that pattern is bordering on unreadable. And checking that it works as advertised is highly non-trivial.

But a regular expression is really just a program, so all the arguments in favour of program decomposition (see Chapter 9) apply to regexes too. In particular, it's often better to decompose a complex regular expression into manageable (named) fragments, like so:

# Build a regex that matches floating point representations...
Readonly my $DIGITS    => qr{ \d+ (?: [.] \d*)? | [.] \d+         }xms;
Readonly my $SIGN      => qr{ [+-]                                }xms;
Readonly my $EXPONENT  => qr{ [Ee] $SIGN? \d+                     }xms;
Readonly my $NUMBER    => qr{ ( ($SIGN?) ($DIGITS) ($EXPONENT?) ) }xms;

# and later...

my ($number, $sign, $digits, $exponent)
    = $input =~ $NUMBER;

Here, the full $NUMBER regex is built up from simpler components ($DIGITS, $SIGN, and $EXPONENT), much in the same way that a full Perl program is built from simpler subroutines. Notice that, once again, refactoring cleans up both the refactored code itself and the place that code is later used.

Note, however, that interpolating qr'd regexes inside other qr'd regexes (as in the previous example) may impose a performance penalty in some cases. That's because when the component regexes are interpolated, they are first decompiled back to strings, then interpolated, and finally recompiled. Unfortunately, the conversion of the individual components back to strings is not optimized, and will sometimes produce less efficient patterns, which are then recompiled into less efficient regexes.

The alternative is to use q{} or qq{} strings to specify the components. Using strings ensures that what you write in a component is exactly what's later interpolated from it:

# Build a regex that matches floating-point representations...
Readonly my $DIGITS    =>  q{ (?: \d+ (?: [.] \d*)? | [.] \d+   ) };
Readonly my $SIGN      =>  q{ (?: [+-]                          ) };
Readonly my $EXPONENT  => qq{ (?: [Ee] $SIGN? \\d+              ) };
Readonly my $NUMBER    => qr{ ( ($SIGN?) ($DIGITS) ($EXPONENT?) ) }xms;

However, using qr{} instead of strings is still the recommended practice here. Specifying subpatterns in a q{} or qq{} requires very careful attention to the use of escape characters (such as writing \\d in some, but not all, of the components). You must also remember to add an extra (?:…) around each subpattern, to ensure that the final interpolated string is treated as a single item (for example, so the ? in $EXPONENT? applies to the entire exponent subpattern). In contrast, the inside of a qr{} always behaves exactly like the inside of an m{} match, so no arcane metaquoting is required.
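The need for the extra (?:…) is easy to overlook, so here is a small sketch of what goes wrong without it. The two component names are hypothetical and the exponent pattern is simplified (no sign) to keep the example short:

```perl
use strict;
use warnings;

# String components: one bare, one wrapped in (?:...)
my $BARE_EXP    = q{[Ee] \d+};
my $WRAPPED_EXP = q{(?: [Ee] \d+ )};

# Here the ? attaches only to the trailing \d+ (making it non-greedy),
# so the [Ee] is still required and a plain integer is rejected...
my $bare_ok    = '42' =~ m/\A \d+ $BARE_EXP? \z/xms    ? 1 : 0;

# ...whereas here the ? makes the whole exponent optional, as intended
my $wrapped_ok = '42' =~ m/\A \d+ $WRAPPED_EXP? \z/xms ? 1 : 0;

print "bare:    $bare_ok\n";       # bare:    0
print "wrapped: $wrapped_ok\n";    # wrapped: 1
```

A qr{} component never has this problem, because an interpolated qr object always brings its own non-capturing group with it.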

If you need to build very complicated regular expressions, you should also look at the Regexp::Assemble CPAN module, which allows you to build regexes in an OO style, and then optimizes the resulting patterns to minimize backtracking. The module can also optionally insert debugging information into the regular expressions it builds, which can be invaluable for highly complex regexes.

Canned Regexes

Consider using Regexp::Common instead of writing your own regexes.

Regular expressions are wonderfully easy to code wrongly: to miss edge-cases, to include unexpected (and incorrect) matches, or to create a pattern that's correct but hopelessly inefficient. And even when you get your regex right, you still have to maintain the code that you used to build it.

It's a drag. Worse, it's everybody's drag. All around the world there are thousands of Perl programmers continually reinventing the same regexes: to match numbers, and URLs, and quoted strings, and programming language comments, and IP addresses, and Roman numerals, and zip codes, and Social Security numbers, and balanced brackets, and credit card numbers, and email addresses.

Fortunately there's a CPAN module named Regexp::Common, whose entire purpose is to generate these kinds of everyday regular expressions for you. The module installs a single hash (%RE), through which you can create thousands of commonly needed regexes.

For example, instead of building yourself a number-matcher:

# Build a regex that matches floating point representations...
Readonly my $DIGITS    => qr{ \d+ (?: [.] \d*)? | [.] \d+         }xms;
Readonly my $SIGN      => qr{ [+-]                                }xms;
Readonly my $EXPONENT  => qr{ [Ee] $SIGN? \d+                     }xms;
Readonly my $NUMBER    => qr{ ( ($SIGN?) ($DIGITS) ($EXPONENT?) ) }xms;

# and later...
my ($number)
    = $input =~ $NUMBER;

you can ask Regexp::Common to do it for you:

use Regexp::Common;

# Build a regex that matches floating point representations...
Readonly my $NUMBER => $RE{num}{real}{-keep};

# and later...

my ($number)
    = $input =~ $NUMBER;

And instead of beating your head against the appalling regex needed to match formal HTTP-style URIs:

# Build a regex that matches HTTP addresses...
Readonly my $HTTP => qr{
    (?:(?:http)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*
    (?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.]
    [0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?(?:/(?:(?:(?:(?:(?:(?:[a-zA-Z0-9
    -_.!~*'(  ):@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9
    -_.!~*'(  ):@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9
    -_.!~*'(  ):@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9
    -_.!~*'(  ):@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?]
    (?:(?:(?:[;/?:@&=+$,a-zA-Z0-9-_.!~*'(  )]+|(?:%[a-fA-F0-9][a-fA-F0-9
    ]))*)))?))?)
}xms;

# Find web pages...
URI:
while (my $uri = <>) {
    next URI if $uri !~ m/ $HTTP /xms;
    print $uri;
}

You can just use:

use Regexp::Common;

# Find web pages...
URI:
while (my $uri = <>) {
    next URI if $uri !~ m/ $RE{URI}{HTTP} /xms;
    print $uri;
}

The benefits are perhaps most noticeable when you need a slight variation on a common regex, such as one that matches numbers in base 12, with between six and nine duodecimal places:

use Regexp::Common;

# The alien hardware device requires duodecimal floating-point numbers...
Readonly my $NUMBER => $RE{num}{real}{-base=>12}{-places=>'6,9'}{-keep};

# and later...

my ($number)
    = $input =~ m/$NUMBER/xms;

or a regular expression to help expurgate potentially rude words:

use Regexp::Common;

# Clean up their [DELETED] language...
$text =~ s{ $RE{profanity}{contextual} }{[DELETED]}gxms;

or a pattern that checks Australian postcodes:

use Regexp::Common;
use IO::Prompt;

# Strewth, better find out where this bloke lives...
my $postcode
    = prompt 'Giz ya postcode, mate: ',
             -require=>{'Try again, cobber: ' => qr/\A $RE{zip}{Australia} /xms};

The regexes produced by Regexp::Common are reliable, robust, and efficient, because they're in wide and continual use (i.e., endlessly crash-tested), and they're regularly maintained and enhanced by some of the most competent developers in the Perl community. The module also has the most extensive test suite on the entire CPAN, with more than 175,000 tests.

Alternations

Always use character classes instead of single-character alternations.

Individually testing for single character alternatives:

if ($cmd !~ m{\A (?: a | d | i | q | r | w | x ) \z}xms) {
    carp "Unknown command: $cmd";
    next COMMAND;
}

may make your regex slightly more readable. But that gain isn't sufficient to compensate for the heavy performance penalty this approach imposes. Furthermore, the cost of testing separate alternatives this way increases linearly with the number of alternatives to be tested.

The equivalent character class:

if ($cmd !~ m{\A [adiqrwx] \z}xms) {
    carp "Unknown command: $cmd";
    next COMMAND;
}

does exactly the same job, but 10 times faster. And it costs the same no matter how many characters are later added to the set.
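If you want to reassure yourself that the two forms really do accept exactly the same commands before swapping one for the other, a quick check over some sample inputs will do. The inputs below are made up for illustration, and the sketch tests only equivalence, not speed:

```perl
use strict;
use warnings;

# Hypothetical sample commands: some valid, some not (including an empty one)...
my @commands = ('a', 'd', 'i', 'q', 'r', 'w', 'x', 'ad', 'z', 'A', q{});

# Collect any command on which the two patterns disagree...
my @disagreements = grep {
    my $by_alternation = m{\A (?: a | d | i | q | r | w | x ) \z}xms ? 1 : 0;
    my $by_char_class  = m{\A [adiqrwx] \z}xms                       ? 1 : 0;
    $by_alternation != $by_char_class;
} @commands;

print 'Disagreements: ', scalar(@disagreements), "\n";    # Disagreements: 0
```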

Sometimes a set of alternatives will contain both single- and multicharacter alternatives:

if ($quotelike !~ m{\A (?: qq | qr | qx | q | s | y | tr ) \z}xms) {
    carp "Unknown quotelike: $quotelike";
    next QUOTELIKE;
}

In that case, you can still improve the regex by aggregating the single characters:

if ($quotelike !~ m{\A (?: qq | qr | qx | [qsy] | tr ) \z}xms) {
    carp "Unknown quotelike: $quotelike";
    next QUOTELIKE;
}

Sometimes you can then factor out the commonalities of the remaining multicharacter alternatives into an additional character class:

if ($quotelike !~ m{\A (?: q[qrx] | [qsy] | tr ) \z}xms) {
    carp "Unknown quotelike: $quotelike";
    next QUOTELIKE;
}

Factoring Alternations

Factor out common affixes from alternations.

It's not just single character alternatives that are slow. Any alternation of subpatterns can be expensive. Especially if the resulting set of alternatives involves a repetition.

Every alternative that has to be tried requires the regex engine to backtrack up the string and re-examine the same sequence of characters it just rejected. And, if the alternatives are inside a repeated subpattern, the repetition itself may have to backtrack and retry every alternative from a different starting point. That kind of nested backtracking can easily produce an exponential increase in the time the complete match requires.

As if those problems weren't bad enough, alternations aren't very smart either. If one alternative fails, the matching engine just backs up and tries the next possibility, with absolutely no forethought as to whether that next alternative can possibly match.

For example, when a regular expression like:

m{
   with \s+ each \s+ $EXPR \s* $BLOCK
 | with \s+ each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
 | with \s+ [(] $LIST [)] \s* $BLOCK
}xms

is matching a string, it obviously tries the first alternative first. Suppose the string begins 'with er go est...'. In that case, the first alternative will successfully match with, then successfully match \s+, then successfully match e, but will then fail to match r (since it expected ach at that point). So the regex engine will backtrack to the start of the string and try the second alternative instead. Once again, it will successfully match with and \s+ and e, but then once again fail to match r. So it will backtrack to the start of the string once more and try the third alternative. Yet again it will successfully match with, then \s+, before failing to match the [(].

That's much less efficient than it could be. The engine had to backtrack twice and, in doing so, it had to retest and rematch the same with \s+ subpattern three times, and the longer with \s+ e subpattern twice.

A human in the same situation would notice that all three alternatives start the same way, remember that the first four characters of the string matched the first time, and simply skip that part of the rematch on the rest of the alternatives.

But Perl doesn't optimize regexes in that way. There's no theoretical reason why it couldn't do so, but there is an important practical reason: it's prohibitively expensive to analyze every regex every time it's compiled, just to identify these kinds of occasional opportunities for optimization. The extra time required to be that clever almost always far outweighs any performance gain that might be derived. So Perl sticks with a "dumb but fast" approach instead.

So if you want Perl to be smarter about matching regexes of this kind, you have to do the thinking for it; analyze and optimize the regex yourself. That's not particularly difficult. You just put the set of alternatives in non-capturing parentheses:

m{
   (?: with \s+ each \s+ $EXPR \s* $BLOCK
     | with \s+ each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
     | with \s+ [(] $LIST [)] \s* $BLOCK
   )
}xms

then grab the common prefix shared by every alternative and factor it out, placing it in front of the parentheses:

m{
   with \s+
   (?: each \s+ $EXPR \s* $BLOCK
     | each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
     | [(] $LIST [)] \s* $BLOCK
   )
}xms

This version of the regex does exactly what a human (or programmer) would do: it matches the with \s+ once only, then tries the three alternative "completions", without stupidly backtracking to the very start of the string to recheck that the initial 'with ' is still there.

Of course, having made that optimization, you might well find it opens up other opportunities to avoid backtracking and rechecking. For example, the first two alternatives in the non-capturing parentheses now both start with each \s+, so you could repeat the factoring-out on just those two alternatives, by first wrapping them in another set of parentheses:

m{
   with \s+
   (?:
       (?: each \s+ $EXPR \s* $BLOCK
         | each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
       )
     | [(] $LIST [)] \s* $BLOCK
   )
}xms

and then extracting the common prefix:

m{
   with \s+
   (?: each \s+
       (?: $EXPR \s* $BLOCK
         | $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
       )
     | [(] $LIST [)] \s* $BLOCK
   )
}xms

Likewise, if every alternative ends in the same sequence (in this case, \s* $BLOCK), then that common sequence can be factored out and placed after the alternatives:

m{
   with \s+
   (?: each \s+
       (?:$EXPR
         | $VAR  \s* in \s* [(] $LIST [)]
       )
     | [(] $LIST [)]
   )
   \s* $BLOCK
}xms

Note, however, that there is a significant price to be paid for these optimizations. Compared to the original:

m{
   with \s+ each \s+ $EXPR \s* $BLOCK
 | with \s+ each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
 | with \s+ [(] $LIST [)] \s* $BLOCK
}xms

the final version of the regex is considerably more efficient, but also considerably less readable. Of course, the original is no great beauty either, so that may not be a critical issue, especially if the refactored regex is appropriately commented and perhaps turned into a constant:

Readonly my $WITH_BLOCK => qr{
   with \s+                                # Always a 'with' keyword
   (?: each \s+                            #  If followed by 'each'
       (?:$EXPR                            #    Then expect an expression
         | $VAR  \s* in \s* [(] $LIST [)]  #    or a variable and list
       )
     | [(] $LIST [)]                       #  Otherwise, no 'each' and just a list
   )
   \s* $BLOCK                              # And the loop block always at the end
}xms;
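
If you want some reassurance that a refactoring like this hasn't changed what the regex accepts, you can run the original and the factored versions over the same inputs. What follows is only a sketch: the definitions of $EXPR, $VAR, $LIST, and $BLOCK are deliberately simplistic stand-ins invented for the illustration (they are not the real subpatterns), and the matches() helper and the sample inputs are likewise hypothetical.

```perl
use strict;
use warnings;

# Hypothetical, oversimplified stand-ins for the real subpatterns
my $EXPR  = qr{ \d+ (?: \s* [-+*] \s* \d+ )* }xms;   # e.g. 1 + 2
my $VAR   = qr{ \$ \w+ }xms;                         # e.g. $item
my $LIST  = qr{ \w+ (?: \s* , \s* \w+ )* }xms;       # e.g. a, b, c
my $BLOCK = qr{ [{] [^{}]* [}] }xms;                 # e.g. { say } (no nesting)

# The original, unfactored alternatives
my $ORIGINAL = qr{
     with \s+ each \s+ $EXPR \s* $BLOCK
   | with \s+ each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
   | with \s+ [(] $LIST [)] \s* $BLOCK
}xms;

# The fully factored version
my $FACTORED = qr{
    with \s+
    (?: each \s+
        (?: $EXPR
          | $VAR  \s* in \s* [(] $LIST [)]
        )
      | [(] $LIST [)]
    )
    \s* $BLOCK
}xms;

# Return 1 if the entire text matches the regex, 0 otherwise
sub matches {
    my ($regex, $text) = @_;
    return $text =~ m{\A $regex \z}xms ? 1 : 0;
}

my @inputs = (
    'with each 1 + 2 { say }',
    'with each $item in (a, b, c) { say }',
    'with (a, b) { say }',
    'with each $item in (a, b, c)',            # No block: should be rejected
    'without (a) { say }',                     # No 'with' keyword: should be rejected
);

for my $input (@inputs) {
    printf "%-38s original=%d factored=%d\n",
        $input, matches($ORIGINAL, $input), matches($FACTORED, $input);
}
```

Both columns should agree on every line. A check like this doesn't prove the two regexes are equivalent, but it's a cheap way to catch a prefix or suffix that was factored out incorrectly.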

Backtracking

Prevent useless backtracking.

In the final example of the previous guideline:

qr{
   with \s+
   (?: each \s+
       (?:$EXPR
         | $VAR  \s* in \s* [(] $LIST [)]
       )
     | [(] $LIST [)]
   )
   \s* $BLOCK
}xms

if the match successfully reaches the shared \s* $BLOCK suffix but subsequently fails to match the trailing block, then the regex engine will immediately backtrack. That backtracking will cause it to reconsider the various (nested) alternatives: first by backtracking within the previous successful alternative, and then by trying any remaining unexamined alternatives. That's potentially a lot of expensive matching, all of which is utterly useless. For a start, the syntaxes of the various options are mutually exclusive, so if one of them already matched, none of the subsequent candidates ever will.

Even if that weren't the case, the regex is backtracking only because there wasn't a valid block at the end of the loop specification. But backtracking and messing around with the other alternatives won't change that fact. Even if the regex does find another way to match the first part of the loop specification, there still won't be a valid block at the end of the string when matching reaches that point again.

This particular situation arises every time an alternation consists of mutually exclusive alternatives. The "dumb but fast" behaviour of the regex engine forces it to go back and mindlessly try every other possibility, even when—to an outside observer—that's provably a complete waste of time and the engine would do much better to just forget about backtracking into the alternation.

As before, you have to explicitly point that optimization out to Perl. In this case, that's done by enclosing the alternation in a special form of parentheses: (?>…). These are Perl's "don't-ever-backtrack-into-me" markers. They tell the regex engine that the enclosed subpattern can safely be skipped over during backtracking, because you're confident that re-matching the contents either won't succeed or, if it does succeed, won't help the overall match.

In practical terms, you just need to replace the (?:…) parentheses of any mutually exclusive set of alternatives with (?>…) parentheses. For example, like this:

m{
   with \s+
   (?> each \s+                              # (?> means:
       (?: $EXPR                             #     There can be only
         | $VAR  \s* in \s* [(] $LIST [)]    #     one way to match
       )                                     #     the enclosed set
     | [(] $LIST [)]                         #     of alternatives
   )                                         # )
   \s* $BLOCK
}xms;
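
To see what "don't-ever-backtrack-into-me" means in practice, and why the mutual-exclusivity condition matters, it can help to try a toy case. The patterns, strings, and the verdict() helper below are all invented for the illustration. The first pair of patterns has mutually exclusive alternatives, so the atomic version behaves identically; the second pair has overlapping alternatives, and there the atomic version changes the result.

```perl
use strict;
use warnings;

# Return 1 if $text matches $regex, 0 otherwise (hypothetical helper)
sub verdict {
    my ($text, $regex) = @_;
    return $text =~ $regex ? 1 : 0;
}

# Mutually exclusive alternatives: safe to make atomic
my $EXCLUSIVE_PLAIN  = qr{ \A (?: cat | dog ) s \z }xms;
my $EXCLUSIVE_ATOMIC = qr{ \A (?> cat | dog ) s \z }xms;

# Overlapping alternatives: 'foo' is a prefix of 'foobar'
my $OVERLAP_PLAIN    = qr{ \A (?: foo | foobar ) \z }xms;
my $OVERLAP_ATOMIC   = qr{ \A (?> foo | foobar ) \z }xms;

printf "dogs:   plain=%d atomic=%d\n",
    verdict('dogs', $EXCLUSIVE_PLAIN), verdict('dogs', $EXCLUSIVE_ATOMIC);
# prints "dogs:   plain=1 atomic=1"

printf "foobar: plain=%d atomic=%d\n",
    verdict('foobar', $OVERLAP_PLAIN), verdict('foobar', $OVERLAP_ATOMIC);
# prints "foobar: plain=1 atomic=0"
```

In the overlapping case, the atomic group commits to 'foo', the \z then fails, and the engine is no longer permitted to go back and try 'foobar'. That's exactly the behaviour you want when re-matching could never help, and exactly the behaviour you don't want when it could.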

This kind of optimization is even more important for repeated subpatterns (especially those containing alternations).

Suppose you wanted to write a regular expression to match a parenthesized list of comma-separated items. You might write:

$str =~ m{ [(]               # A literal opening paren
           $ITEM             # At least one item
           (?:               # Followed by...
               ,             #     a comma
               $ITEM         #     and another item
           )*                #  as many times as possible (but none is okay too)
           [)]               # A literal closing paren
         }xms;

That pattern works fine: it matches every parenthesized list you give it and fails to match everything else. But consider what actually happens when you give it nearly the right input. If, for example, $str contains a list of items that's missing its final closing parenthesis, then the regex engine would have to backtrack into the (?: , $ITEM)* and try to match one fewer comma-item sequence. But doing that will leave the matching position at the now-relinquished comma, which certainly won't match the required closing parenthesis. So the regex engine will backtrack again, giving back another comma-item sequence, leaving the matching position at the comma, where it will again fail to find a closing parenthesis. And so on and so on until every other possibility has failed.

There's no point in backtracking the repeated comma-item subpattern at all. Either it must succeed "all the way", or it can never succeed at all. So this is another ideal place to add a pair of non-backtracking parentheses. Like so:

m{ [(]  $ITEM  (?> (?: , $ITEM )* )  [)] }xms;

Note that the (?>…) have to go around the entire repeated grouping; don't use them simply to replace the existing parentheses:

m{ [(]  $ITEM      (?> , $ITEM )*    [)] }xms;     # A common mistake

This version is wrong because the repetition marker is still outside the (?>…) and hence it will still be allowed to backtrack (uselessly).

In summary, whenever two subpatterns X and Y are mutually exclusive in terms of the strings they match, then rewrite any instance of:

X | Y

as:

(?> X | Y )

and rewrite:

X* Y

as:

(?> X* ) Y
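
As a rough check that the second rewrite leaves the set of accepted strings unchanged, the sketch below compares the plain and atomic versions of the comma-separated list pattern. It assumes a trivial $ITEM (a run of word characters), which is a placeholder and not a definition from this chapter. It also assumes Perl 5.10 or later, where the possessive quantifier X*+ is available as a shorthand for (?> X* ).

```perl
use 5.010;
use strict;
use warnings;

my $ITEM = qr{ \w+ }xms;   # Hypothetical item pattern, for illustration only

my $PLAIN      = qr{ \A [(] $ITEM      (?: , $ITEM )*     [)] \z }xms;
my $ATOMIC     = qr{ \A [(] $ITEM  (?> (?: , $ITEM )* )   [)] \z }xms;
my $POSSESSIVE = qr{ \A [(] $ITEM      (?: , $ITEM )*+    [)] \z }xms;

# Return 1 if $text matches $regex, 0 otherwise
sub accepts {
    my ($regex, $text) = @_;
    return $text =~ $regex ? 1 : 0;
}

my @inputs = ( '(a,b,c)', '(a)', '(a,b,c', '(a,)', '()' );

for my $input (@inputs) {
    printf "%-8s plain=%d atomic=%d possessive=%d\n",
        $input,
        accepts($PLAIN,      $input),
        accepts($ATOMIC,     $input),
        accepts($POSSESSIVE, $input);
}
```

All three columns should agree for every input: only the first two strings are accepted. The difference between the versions lies in how much work the engine does before rejecting the malformed lists, not in which lists it accepts.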

String Comparisons

Prefer fixed-string eq comparisons to fixed-pattern regex matches.

If you're trying to compare a string against a fixed number of fixed keywords, the temptation is to put them all inside a single regex, as anchored alternatives:

# Quit command has several variants...
last COMMAND if $cmd =~ m{\A (?: q | quit | bye ) \z}xms;

The usual rationale for this is that a single, highly optimized regex match must surely be quicker than three separate eq tests:

# Quit command has several variants...
last COMMAND if $cmd eq 'q'
             || $cmd eq 'quit'
             || $cmd eq 'bye';

Unfortunately, that's not the case. Regex-matching against a series of fixed alternations is at least 20% slower than individually eq-matching the same strings—not to mention the fact that the eq-based version is significantly more readable.
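
Relative timings of this kind vary with the Perl version, the platform, and the mix of inputs, so it's worth measuring on your own system. The following sketch uses the standard Benchmark module; the two subroutine names and the sample commands are made up for the illustration, and no particular result is implied.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Hypothetical sample input: some commands that quit, some that don't
my @commands = qw( q quit bye help list save );

# Regex version: anchored alternatives
sub is_quit_regex {
    my ($cmd) = @_;
    return $cmd =~ m{\A (?: q | quit | bye ) \z}xms ? 1 : 0;
}

# eq version: one fixed-string comparison per keyword
sub is_quit_eq {
    my ($cmd) = @_;
    return ( $cmd eq 'q' || $cmd eq 'quit' || $cmd eq 'bye' ) ? 1 : 0;
}

# Run each version for about one CPU second and print a comparison table
cmpthese( -1, {
    regex => sub { is_quit_regex($_) for @commands },
    eq    => sub { is_quit_eq($_)    for @commands },
});
```

The table that cmpthese() prints shows the rate of each version and the percentage difference between them.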

Likewise, if you're doing a pattern match merely to get case insensitivity:

# Quit command is case-insensitive...
last COMMAND if $cmd =~ m{\A quit \z}ixms;

then it's more efficient, and arguably more readable, to write:

# Quit command is case-insensitive...
last COMMAND if lc($cmd) eq 'quit';

Sometimes, if there are a large number of possibilities to test:

Readonly my @EXIT_WORDS => qw(
    q  quit  bye  exit  stop  done  last  finish  aurevoir);

or the number of possibilities is indeterminate at compile time:

Readonly my @EXIT_WORDS
    => slurp $EXIT_WORDS_FILE, {chomp=>1};

then a regex might seem like a better alternative, because it can easily be built on the fly:

Readonly my $EXIT_WORDS => join '|', @EXIT_WORDS;
# Quit command has several variants...
last COMMAND if $cmd =~ m{\A (?: $EXIT_WORDS ) \z}xms;

But, even in these cases, eq offers a cleaner (though now slower) solution:

use List::MoreUtils  qw( any );

# Quit command has several variants...
last COMMAND if any { $cmd eq $_ } @EXIT_WORDS;

Of course, in this particular case, an even better solution would be to use table look-up instead:

Readonly my %IS_EXIT_WORD
    => map { ($_ => 1) } qw(
           q  quit  bye  exit  stop  done  last  finish  aurevoir
       );

# and later...

# Quit command has several variants...
last COMMAND if $IS_EXIT_WORD{$cmd};
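
The same table can be built at run time from a list whose contents aren't known in advance. Here is a minimal, self-contained sketch of that idea. It uses plain lexical variables instead of Readonly so that it runs without any extra modules, and the word list and the is_exit_command() helper are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical word list; it could equally be read from a file at run time
my @exit_words = qw( q quit bye exit stop done last finish aurevoir );

# Build the look-up table once: every exit word maps to a true value
my %is_exit_word = map { ($_ => 1) } @exit_words;

# Return 1 if the command is one of the exit words, 0 otherwise
sub is_exit_command {
    my ($cmd) = @_;
    return $is_exit_word{$cmd} ? 1 : 0;
}

for my $cmd (qw( quit bye help )) {
    printf "%-5s => %d\n", $cmd, is_exit_command($cmd);
}
```

However many exit words there are, each test is a single hash look-up, and the look-up is an exact, case-sensitive comparison, just like eq.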


[62] As anyone who has seen Abigail's virtuoso "prime number identifier" must surely agree:

sub is_prime {
    my ($number) = @_;
    return (1 x $number) !~ m/\A (?: 1? | (11+?) (?> \1+ ) ) \z/xms;
}

(Working out precisely how this regex works its wonders is left as a punishment for the reader.)

[63] Particularly as regular expressions so often fail precisely because the coder's intent is not accurately translated into their patterns.

[64] Or, worse still, unconsciously.

[65] "What part of 'the start' don't you understand???"

[66] That is, it makes them work in the unnatural way in which most programmers think they work.

[67] In Maxims for Revolutionists (1903), George Bernard Shaw observed: "The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man." That is an equally deep and powerful approach to programming.

[68] Most editors can be configured to jump to a matching brace (in vi it's %; in Emacs it's a little more complicated—see Appendix C). You can also set most editors to autohighlight matching braces as you type (set the blink-matching-paren variable in Emacs, or the showmatch option in vi).

[69] That sentence originally read: The problem is that programmers are used to ignoring the specific content of comments. Which is depressingly true, but not the relevant observation here.

[70] "Escaped Characters" in Chapter 4.

[71] Perl's Unicode support was still highly experimental in the 5.6 releases, and has improved considerably since then. If you're intending to make serious use of Unicode in production code, you really need to be running the latest 5.8.X release you can, and at very least Perl 5.8.1.

[72] The regular expression shown matches 'monkey', but the particular irregular noun it's supposed to match in that case is 'money'. The regex also matches 'primadonna' instead of 'prima donna', because the /x flag makes the intervening space non-significant within the regex.
