As any good manager knows, you shouldn't micromanage your employees. Just tell them what you want, and let them figure out the best way of doing it. Similarly, it's often best to think of a regular expression as a kind of specification: "Here's what I want; go find a string that fits the bill."
On the other hand, the best managers also understand the job their employees are trying to do. The same is true of pattern matching in Perl. The more thoroughly you understand of how Perl goes about the task of matching any particular pattern, the more wisely you'll be able to make use of Perl's pattern matching capabilities.
One of the most important things to understand about Perl's pattern-matching is when not to use it.
When people of a certain temperament first learn regular expressions, they're often tempted to see everything as a problem in pattern matching. And while that may even be true in the larger sense, pattern matching is about more than just evaluating regular expressions. It's partly about looking for your car keys where you dropped them, not just under the streetlamp where you can see better. In real life, we all know that it's a lot more efficient to look in the right places than the wrong ones.
Similarly, you should use Perl's control flow to decide which patterns to execute, and which ones to skip. A regular expression is pretty smart, but it's smart like a horse. It can get distracted if it sees too much at once. So sometimes you have to put blinders onto it. For example, you'll recall our earlier example of alternation:
/Gandalf|Saruman|Radagast/
That works as advertised, but not as well as it might, because it searches every position in the string for every name before it moves on to the next position. Astute readers of The Lord of the Rings will recall that, of the three wizards named above, Gandalf is mentioned much more frequently than Saruman, and Saruman is mentioned much more frequently than Radagast. So it's generally more efficient to use Perl's logical operators to do the alternation:
/Gandalf/ || /Saruman/ || /Radagast/
This is yet another way of defeating the "leftmost" policy of
the Engine. It only searches for Saruman
if
Gandalf
was nowhere to be seen. And it only
searches for Radagast
if
Saruman
is also absent.
Not only does this change the order in which things are searched, but it sometimes allows the regular expression optimizer to work better. It's generally easier to optimize searching for a single string than for several strings simultaneously. Similarly, anchored searches can often be optimized if they're not too complicated.
You don't have to limit your control of the control
flow to the ||
operator. Often you can control
things at the statement level. You should always think about weeding
out the common cases first. Suppose you're writing a loop to process
a configuration file. Many configuration files are
mostly comments. It's often best to discard comments and blank lines
early before doing any heavy-duty processing, even if the heavy duty
processing would throw out the comments and blank lines in the
course of things:
while (<CONF>) { next if /^#/; next if /^s*(#|$)/; chomp; munchabunch($_); }
Even if you're not trying to be efficient, you often need to alternate ordinary Perl expressions with regular expressions simply because you want to take some action that is not possible (or very difficult) from within the regular expression, such as printing things out. Here's a useful number classifier:
warn "has nondigits" if /D/; warn "not a natural number" unless /^d+$/; # rejects -3 warn "not an integer" unless /^-?d+$/; # rejects +3 warn "not an integer" unless /^[+-]?d+$/; warn "not a decimal number" unless /^-?d+.?d*$/; # rejects .2 warn "not a decimal number" unless /^-?(?:d+(?:.d*)?|.d+)$/; warn "not a C float" unless /^([+-]?)(?=d|.d)d*(.d*)?([Ee]([+-]?d+))?$/;
We could stretch this section out a lot longer, but really, that sort of thing is what this whole book is about. You'll see many more examples of the interplay of Perl code and pattern matching as we go along. In particular, see the later section Section 5.10.3. (It's okay to read the intervening material first, of course.)
Using Perl's control flow mechanisms to control regular expression matching has its limits. The main difficulty is that it's an "all or nothing" approach; either you run the pattern, or you don't. Sometimes you know the general outlines of the pattern you want, but you'd like to have the capability of parameterizing it. Variable interpolation provides that capability, much like parameterizing a subroutine lets you have more influence over its behavior than just deciding whether to call it or not. (More about subroutines in the next chapter).
One nice use of interpolation is to provide a little abstraction, along with a little readability. With regular expressions you may certainly write things concisely:
if ($num =~ /^[-+]?d+.?d*$/) { … }
But what you mean is more apparent when you write:
$sign = '[-+]?'; $digits = 'd+'; $decimal = '.?'; $more_digits = 'd*'; $number = "$sign$digits$decimal$more_digits"; … if ($num =~ /^$number$/o) { … }
We'll cover this use of interpolation more under
"Generated patterns" later in this chapter. We'll just point out
that we used the /o
modifier to suppress
recompilation because we don't expect $number
to
change its value over the course of the program.
Another cute trick is to turn your tests inside out and use the variable string to pattern-match against a set of known strings:
chomp($answer = <STDIN>); if ("SEND" =~ /^Q$answer/i) { print "Action is send " } elsif ("STOP" =~ /^Q$answer/i) { print "Action is stop " } elsif ("ABORT" =~ /^Q$answer/i) { print "Action is abort " } elsif ("LIST" =~ /^Q$answer/i) { print "Action is list " } elsif ("EDIT" =~ /^Q$answer/i) { print "Action is edit " }
This lets your user perform the "send" action by typing any of
S
, SE
, SEN
,
or SEND
(in any mixture of upper- and lowercase).
To "stop", they'd have to type at least ST
(or
St
, or sT
, or
st
).
When you think of double-quote interpolation, you
usually think of both variable and backslash interpolation. But as
we mentioned earlier, for regular expressions there are two
passes, and the interpolation pass defers most of the backslash
interpretation to the regular expression parser (which we discuss
later). Ordinarily, you don't notice the difference, because Perl
takes pains to hide the difference. (One sequence that's obviously
different is the metasymbol, which turns
into a word boundary assertion--outside of character classes,
anyway. Inside a character class where assertions make no sense,
it reverts to being a backspace, as it is normally.)
It's actually fairly important that the regex parser handle
the backslashes. Suppose you're searching for tab characters in a
pattern with a /x
modifier:
($col1, $col2) = /(.*?) + (.*?)/x;
If Perl didn't defer the interpretation of
to the regex parser, the
would have turned into whitespace, which the
regex parser would have ignorantly ignored because of the
/x
. But Perl is not so ignoble, or
tricky.
You can trick yourself though. Suppose you abstracted out the column separator, like this:
$colsep = " +"; # (double quotes) ($col1, $col2) = /(.*?) $colsep (.*?)/x;
Now you've just blown it, because the
turns into a real tab before it gets to the regex parser, which
will think you said /(.*?)+(.*?)/
after it
discards the whitespace. Oops. To fix, avoid
/x
, or use single quotes. Or better, use
qr//
. (See the next section.)
The only double-quote escapes that are processed as
such are the six translation escapes: U
,
u
, L
,
l
, Q
, and
E
. If you ever look into the inner workings of
the Perl regular expression compiler, you'll find code for
handling escapes like
for tab,
for newline, and so on. But you won't find
code for those six translation escapes. (We only listed them in
Table 5.7 because
people expect to find them there.) If you somehow manage to sneak
any of them into the pattern without going through double-quotish
evaluation, they won't be recognized.
How could they find their way in? Well, you can
defeat interpolation by using single quotes as your pattern
delimiter. In m'…
', qr'…
',
and s'…'…
', the single quotes suppress variable
interpolation and the processing of translation escapes, just as
they would in a single-quoted string. Saying
m'ufrodo
' won't find a capitalized version of
poor frodo. However, since the "normal" backslash characters
aren't really processed on that level anyway,
m' d
' still matches a real tab followed by
any digit.
Another way to defeat interpolation is through interpolation itself. If you say:
$var = 'U'; /${var}frodo/;
poor frodo remains uncapitalized. Perl won't redo the interpolation pass for you just because you interpolated something that looks like it might want to be reinterpolated. You can't expect that to work any more than you'd expect this double interpolation to work:
$hobbit = 'Frodo'; $var = '$hobbit'; # (single quotes) /$var/; # means m'$hobbit', not m'Frodo'.
Here's another example that shows how most backslashes are interpreted by the regex parser, not by variable interpolation. Imagine you have a simple little grep-style program written in Perl:[10]
#!/usr/bin/perl $pattern = shift; while (<>) { print if /$pattern/o; }
If you name that program pgrep and call it this way:
% pgrep ' d' *.c
then you'll find that it prints out all lines of all your C
source files in which a digit follows a tab. You didn't have to do
anything special to get Perl to realize that
was a tab. If Perl's patterns were just
double-quote interpolated, you would have; fortunately, they
aren't. They're recognized directly by the regex parser.
The real grep program has a
-i
switch that turns off case-sensitive
matching. You don't have to add such a switch to your
pgrep program; it can already handle that
without modification. You just pass it a slightly fancier pattern,
with an embedded /i
modifier:
% pgrep '(?i)ring' LotR*.pod
That now searches for any of "Ring", "ring", "RING", and so
on. You don't see this feature too much in literal patterns, since
you can always just write /ring/i
. But for
patterns passed in on the command line, in web search forms, or
embedded in configuration files, it can be a lifesaver. (Speaking
of rings.)
Variables that interpolate into patterns
necessarily do so at run time, not compile time. This slows down
execution because Perl has to check whether you've changed the
contents of the variable; if so, it would have to recompile the
regular expression. As mentioned in "Pattern-Matching Operators",
if you promise never to change the pattern, you can use the
/o
option to interpolate and compile only
once:
print if /$pattern/o;
Although that works fine in our pgrep program, in the general case, it doesn't. Imagine you have a slew of patterns, and you want to match each of them in a loop, perhaps like this:
foreach $item (@data) { foreach $pat (@patterns) { if ($item =~ /$pat/) { … } } }
You couldn't write /$pat/o
because the
meaning of $pat
varies each time through the
inner loop.
The solution to this is the
qr/
PATTERN
/imosx
operator. This operator quotes--and compiles--its
PATTERN
as a regular expression.
PATTERN
is interpolated the same way as
in
m/
PATTERN
/
.
If ' is used as the delimiter, no interpolation of variables (or
the six translation escapes) is done. The operator returns a Perl
value that may be used instead of the equivalent literal in a
corresponding pattern match or substitute. For example:
$regex = qr/my.STRING/is; s/$regex/something else/;
is equivalent to:
s/my.STRING/something else/is;
So for our nested loop problem above, preprocess your pattern first using a separate loop:
@regexes = (); foreach $pat (@patterns) { push @regexes, qr/$pat/; }
Or all at once using Perl's map
operator:
@regexes = map { qr/$_/ } @patterns;
And then change the loop to use those precompiled regexes:
foreach $item (@data) { foreach $re (@regexes) { if ($item =~ /$re/) { … } } }
Now when you run the match, Perl doesn't have to create a
compiled regular expression on each if
test,
because it sees that it already has one.
The result of a qr//
may even be
interpolated into a larger match, as though it were a simple
string:
$regex = qr/$pattern/; $string =~ /foo${regex}bar/; # interpolate into larger patterns
This time, Perl does recompile the pattern, but you could
always chain several qr//
operators together
into one.
The reason this works is because the qr//
operator returns a special kind of object that has a
stringification overload as described in Chapter 13. If you print out the
return value, you'll see the equivalent string:
$re = qr/my.STRING/is; print $re; # prints (?si-xm:my.STRING)
The /s
and /i
modifiers were enabled in the pattern because they were supplied
to qr//
. The /x
and
/m
, however, are disabled because they were
not.
Any time you interpolate strings of unknown provenance into a pattern, you should be prepared to handle any exceptions thrown by the regex compiler, in case someone fed you a string containing untamable beasties:
$re = qr/$pat/is; # might escape and eat you $re = eval { qr/$pat/is } || warn … # caught it in an outer cage
For more on the eval
operator, see Chapter 29.
After the variable interpolation pass has had its way with the string, the regex parser finally gets a shot at trying to understand your regular expression. There's not actually a great deal that can go wrong at this point, apart from messing up the parentheses, or using a sequence of metacharacters that doesn't mean anything. The parser does a recursive-descent analysis of your regular expression and, if it parses, turns it into a form suitable for interpretation by the Engine (see the next section). Most of the interesting stuff that goes on in the parser involves optimizing your regular expression to run as fast as possible. We're not going to explain that part. It's a trade secret. (Rumors that looking at the regular expression code will drive you insane are greatly exaggerated. We hope.)
But you might like to know what the parser actually thought of
your regular expression, and if you ask it politely, it will tell
you. By saying use re "debug
", you can examine
how the regex parser processes your pattern. (You can also see the
same information by using the -Dr
command-line switch, which is available to you if your Perl was
compiled with the -DDEBUGGING
flag during
installation.)
#!/usr/bin/perl use re "debug"; "Smeagol" =~ /^Sm(.*)g[aeiou]l$/;
The output is below. You can see that prior to execution Perl
compiles the regex and assigns meaning to the components of the
pattern: BOL
for the beginning of line
(^
), REG_ANY
for the dot, and
so on:
Compiling REx `^Sm(.*)g[aeiou]l$' size 24 first at 2 rarest char l at 0 rarest char S at 0 1: BOL(2) 2: EXACT <Sm>(4) 4: OPEN1(6) 6: STAR(8) 7: REG_ANY(0) 8: CLOSE1(10) 10: EXACT <g>(12) 12: ANYOF[aeiou](21) 21: EXACT <l>(23) 23: EOL(24) 24: END(0) anchored `Sm' at 0 floating `l'$ at 4..2147483647 (checking anchored) anchored(BOL) minlen 5 Omitting $` $& $' support.
Some of the lines summarize the conclusions of the regex
optimizer. It knows that the string must start with
"Sm
", and that therefore there's no reason to do
the ordinary left-to-right scan. It knows that the string must end
with an "l
", so it can reject out of hand any
string that doesn't. It knows that the string must be at least five
characters long, so it can ignore any string shorter than that right
off the bat. It also knows what the rarest character in each
constant string is, which can help in searching "studied" strings.
(See study
in Chapter 29.)
It then goes on to trace how it executes the pattern:
EXECUTING… Guessing start of match, REx `^Sm(.*)g[aeiou]l$' against `Smeagol'… Guessed: match at offset 0 Matching REx `^Sm(.*)g[aeiou]l$' against `Smeagol' Setting an EVAL scope, savestack=3 0 <> <Smeagol> | 1: BOL 0 <> <Smeagol> | 2: EXACT <Sm> 2 <Sm> <eagol> | 4: OPEN1 2 <Sm> <eagol> | 6: STAR REG_ANY can match 5 times out of 32767… Setting an EVAL scope, savestack=3 7 <Smeagol> <> | 8: CLOSE1 7 <Smeagol> <> | 10: EXACT <g> failed… 6 <Smeago> <l> | 8: CLOSE1 6 <Smeago> <l> | 10: EXACT <g> failed… 5 <Smeag> <ol> | 8: CLOSE1 5 <Smeag> <ol> | 10: EXACT <g> failed… 4 <Smea> <gol> | 8: CLOSE1 4 <Smea> <gol> | 10: EXACT <g> 5 <Smeag> <ol> | 12: ANYOF[aeiou] 6 <Smeago> <l> | 21: EXACT <l> 7 <Smeagol> <> | 23: EOL 7 <Smeagol> <> | 24: END Match successful! Freeing REx: `^Sm(.*)g[aeiou]l$'
If you follow the stream of whitespace down
the middle of Smeagol
, you can actually see how
the Engine overshoots to let the .*
be as greedy
as possible, then backtracks on that until it finds a way for the
rest of the pattern to match. But that's what the next section is
about.
And now we'd like to tell you the story of the Little Regex Engine that says, "I think I can. I think I can. I think I can."
In this section, we lay out the rules used by Perl's regular expression engine to match your pattern against a string. The Engine is extremely persistent and hardworking. It's quite capable of working even after you think it should quit. The Engine doesn't give up until it's certain there's no way to match the pattern against the string. The Rules below explain how the Engine "thinks it can" for as long as possible, until it knows it can or can't. The problem for our Engine is that its task is not merely to pull a train over a hill. It has to search a (potentially) very complicated space of possibilities, keeping track of where it has been and where it hasn't.
The Engine uses a nondeterministic finite-state automaton (NFA, not to be confused with NFL, a nondeterministic football league) to find a match. That just means that it keeps track of what it has tried and what it hasn't, and when something doesn't pan out, it backs up and tries something else. This is known as backtracking. (Er, sorry, we didn't invent these terms. Really.) The Engine is capable of trying a million subpatterns at one spot, then giving up on all those, backing up to within one choice of the beginning, and trying the million subpatterns again at a different spot. The Engine is not terribly intelligent; just persistent, and thorough. If you're cagey, you can give the Engine an efficient pattern that doesn't let it do a lot of silly backtracking.
When someone trots out a phrase like "Regexes choose the leftmost, longest match", that means that Perl generally prefers the leftmost match over longest match. But the Engine doesn't realize it's "preferring" anything, and it's not really thinking at all, just gutting it out. The overall preferences are an emergent behavior resulting from many individual and unrelated choices. Here are those choices:[11]
The Engine tries to match as far left in the string as it can, such that the entire regular expression matches under Rule 2.
The Engine starts just before the first character and tries to match the entire pattern starting there. The entire pattern matches if and only if the Engine reaches the end of the pattern before it runs off the end of the string. If it matches, it quits immediately--it doesn't keep looking for a "better" match, even though the pattern might match in many different ways.
If it is unable to match the pattern at the first position in the string, it admits temporary defeat and moves to the next position in the string, between the first and second characters, and tries all the possibilities again. If it succeeds, it stops. If it fails, it continues on down the string. The pattern match as a whole doesn't fail until it has tried to match the entire regular expression at every position in the string, including after the last character.
A string of n characters actually
provides n + 1
positions to match at. That's because the beginnings and the
ends of matches are between the
characters of the string. This rule sometimes surprises people
when they write a pattern like /x*/
that
can match zero or more "x
" characters. If
you try that pattern on a string like
"fox
", it won't find the
"x
". Instead, it will immediately match the
null string before the "f
" and never look
further. If you want it to match one or more
x
characters, you need to use
/x+/
instead. See the quantifiers under
Rule 5.
A corollary to this rule is that any pattern matching the null string is guaranteed to match at the leftmost position in the string (in the absence of any zero-width assertions to the contrary).
When the Engine encounters a set of alternatives
(separated by |
symbols), either at the top
level or at the current "cluster" level, it tries them
left-to-right, stopping on the first successful match that
allows successful completion of the entire pattern.
A set of alternatives matches a string if any of the alternatives match under Rule 3. If none of the alternatives matches, it backtracks to the Rule that invoked this Rule, which is usually Rule 1, but could be Rule 4 or 6, if we're within a cluster. That rule will then look for a new position at which to apply Rule 2.
If there's only one alternative, then either it matches or it doesn't, and Rule 2 still applies. (There's no such thing as zero alternatives, because a null string always matches.)
Any particular alternative matches if every item listed in the alternative matches sequentially according to Rules 4 and 5 (such that the entire regular expression can be satisfied).
An item consists of either an assertion, which is covered in Rule 4, or a quantified atom, covered by Rule 5. Items that have choices on how to match are given a "pecking order" from left to right. If the items cannot be matched in order, the Engine backtracks to the next alternative under Rule 2.
Items that must be matched sequentially aren't
separated in the regular expression by anything
syntactic--they're merely juxtaposed in the order they must
match. When you ask to match /^foo/
, you're
actually asking for four items to be matched one after the
other. The first is a zero-width assertion, matched under
Rule 4, and the other three are ordinary characters that
must match themselves, one after the other, under
Rule 5.
The left-to-right pecking order means that in a pattern like:
/x*y*/
x*
gets to pick one way to match, and
then y*
tries all its ways. If that fails,
then x*
gets to pick its second choice, and
make y*
try all of its ways again. And so
on. The items to the right "vary faster", to borrow a phrase
from multi-dimensional arrays.
If an assertion does not match at the current position, the Engine backtracks to Rule 3 and retries higher-pecking-order items with different choices.
Some assertions are fancier than others. Perl
supports many regex extensions, some of which are zero-width
assertions. For example, the positive lookahead
(?=…)
and the negative lookahead
(?!…)
don't actually match any characters,
but merely assert that the regular expression represented by
…
would (or would not) match at this point,
were we to attempt it, hypothetically speaking.[12]
A quantified atom matches only if the atom itself matches some number of times that is allowed by the quantifier. (The atom itself is matched according to Rule 6.)
Different quantifiers require different numbers of
matches, and most of them allow a range of numbers of matches.
Multiple matches must all match in a row; that is, they must
be adjacent within the string. An unquantified atom is assumed
to have a quantifier requiring exactly one match (that is,
/x/
is the same as
/x{1}/
). If no match can be found at the
current position for any allowed quantity of the atom in
question, the Engine backtracks to Rule 3 and retries
higher-pecking-order items with different choices.
The quantifiers are *
,
+
, ?
,
*?
, +?
,
??
, and the various brace forms. If you use
the
{
COUNT
}
form, then there is no choice, and the atom must match exactly
that number of times or not at all. Otherwise, the atom can
match over a range of quantities, and the Engine keeps track
of all the choices so that it can backtrack if necessary. But
then the question arises as to which of these choices to try
first. One could start with the maximal number of matches and
work down, or the minimal number of matches and work
up.
The traditional quantifiers (without a trailing question mark) specify greedy matching; that is, they attempt to match as many characters as possible. To find the greediest match, the Engine has to be a little bit careful. Bad guesses are potentially rather expensive, so the Engine doesn't actually count down from the maximum value, which after all could be Very Large and cause millions of bad guesses. What the Engine actually does is a little bit smarter: it first counts up to find out how many matching atoms (in a row) are really there in the string, and then it uses that actual maximum as its first choice. (It also remembers all the shorter choices in case the longest one doesn't pan out.) It then (at long last) tries to match the rest of the pattern, assuming the longest choice to be the best. If the longest choice fails to produce a match for the rest of the pattern, it backtracks and tries the next longest.
If you say /.*foo/
, for example, it
will try to match the maximal number of "any" characters
(represented by the dot) clear out to the end of the line
before it ever tries looking for "foo
"; and
then when the "foo
" doesn't match there
(and it can't, because there's not enough room for it at the
end of the string), the Engine will back off one character at
a time until it finds a "foo
". If there is
more than one "foo
" in the line, it'll stop
on the last one, since that will really be the
first one it encounters as it backtracks.
When the entire pattern succeeds using some particular length
of .*
, the Engine knows it can throw away
all the other shorter choices for .*
(the
ones it would have used had the current
"foo
" not panned out).
By placing a question mark after any greedy quantifier,
you turn it into a frugal quantifier that chooses the smallest
quantity for the first try. So if you say
/.*?foo/
, the .*?
first
tries to match 0 characters, then 1 character, then 2, and so
on until it can match the "foo
". Instead of
backtracking backward, it backtracks forward, so to speak, and
ends up finding the first "foo
" on the line
instead of the last.
Each atom matches according to the designated semantics of its type. If the atom doesn't match (or does match, but doesn't allow a match of the rest of the pattern), the Engine backtracks to Rule 5 and tries the next choice for the atom's quantity.
Atoms match according to the following types:
A regular expression in parentheses,
(…)
, matches whatever the regular
expression (represented by …
) matches
according to Rule 2. Parentheses therefore serve as a
clustering operator for quantification. Bare parentheses
also have the side effect of capturing the matched
substring for later use in a
backreference. This side effect can
be suppressed by using (?:…)
instead,
which has only the clustering semantics--it doesn't store
anything in $1
, $2
,
and so on. Other forms of parenthetical atoms (and
assertions) are possible--see the rest of this
chapter.
A dot matches any character, except maybe newline.
A list of characters in square brackets (a character class)matches any one of the characters specified by the list.
A backslashed letter matches either a particular character or a character from a set of characters, as listed in Table 5.7.
Any other backslashed character matches that character.
That all sounds rather complicated, but the upshot of it is that, for each set of choices given by a quantifier or alternation, the Engine has a knob it can twiddle. It will twiddle those knobs until the entire pattern matches. The Rules just say in which order the Engine is allowed to twiddle those knobs. Saying the Engine prefers the leftmost match merely means it twiddles the start position knob the slowest. And backtracking is just the process of untwiddling the knob you just twiddled in order to try twiddling a knob higher in the pecking order, that is, one that varies slower.
Here's a more concrete example, a program that detects when two consecutive words share a common ending and beginning:
$a = 'nobody'; $b = 'bodysnatcher'; if ("$a $b" =~ /^(w+)(w+) 2(w+)$/) { print "$2 overlaps in $1-$2-$3 "; }
This prints:
body overlaps in no-body-snatcher
You might think that $1
would first grab up
all of "nobody
" due to greediness. And in fact,
it does--at first. But once it's done so, there aren't any further
characters to put in $2
, which needs characters
put into it because of the +
quantifier. So the
Engine backs up and $1
begrudgingly gives up one
character to $2
. This time the space character
matches successfully, but then it sees 2
, which
represents a measly "y
". The next character in
the string is not a "y
", but a
"b
". This makes the Engine back up all the way
and try several more times, eventually forcing $1
to surrender the body to $2
. Habeas corpus, as it
were.
Actually, that won't quite work out if the overlap is itself
the product of a doubling, as in the two words
"rococo
" and "cocoon
". The
algorithm above would have decided that the overlapping string,
$2
, must be just "co
" rather
than "coco
". But we don't want a
"rocococoon
"; we want a
"rococoon
". Here's one of those places you can
outsmart the Engine. Adding a minimal matching quantifier to the
$1
part gives the much better pattern
/^(w+?)(w+) 2(w+)$/
, which does exactly what
we want.
For a much more detailed discussion of the pros and cons of various kinds of regular expression engines, see Jeffrey Friedl's book, Mastering Regular Expressions. Perl's regular expression Engine works very well for many of the everyday problems you want to solve with Perl, and it even works okay for those not-so-everyday problems, if you give it a little respect and understanding.
[10] If you didn't know what a grep program was before, you will now. No system should be without grep--we believe grep is the most useful small program ever invented. (It logically follows that we don't believe Perl is a small program.)
[11] Some of these choices may be skipped if the regex optimizer has any say, which is equivalent to the Little Engine simply jumping through the hill via quantum tunneling. But for this discussion we're pretending the optimizer doesn't exist.
[12] In actual fact, the Engine does attempt it. The Engine goes back to Rule 2 to test the subpattern, and then wipes out any record of how much string was eaten, returning only the success or failure of the subpattern as the value of the assertion. (It does, however, remember any captured substrings.)