Sometimes you just need to sneak a peek. There are four regex extensions that help you do just that, and we call them lookaround assertions because they let you scout around in a hypothetical sort of way, without committing to matching any characters. What these assertions assert is that some pattern would (or would not) match if we were to try it. The Engine works it all out for us by actually trying to match the hypothetical pattern, and then pretending that it didn't match (if it did).
When the Engine peeks ahead from its current position in the string, we call it a lookahead assertion. If it peeks backward, we call it a lookbehind assertion. The lookahead patterns can be any regular expression, but the lookbehind patterns may only be fixed width, since they have to know where to start the hypothetical match from.
While these four extensions are all zero-width assertions, and hence do not consume characters (at least, not officially), you can in fact capture substrings within them if you supply extra levels of capturing parentheses.
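As a small illustration of capturing inside a zero-width assertion, here is a minimal sketch (the variable names $text, $matched, and $peeked are ours, chosen only for this example). The lookahead contributes nothing to the overall match, yet its inner parentheses still fill $1:

```perl
use strict;
use warnings;

my $text = "foobar";
my ($matched, $peeked);
if ($text =~ /foo(?=(bar))/) {
    $matched = $&;    # what the whole pattern consumed
    $peeked  = $1;    # what the parentheses inside the lookahead saw
}
print "matched '$matched', peeked at '$peeked'\n";
```

The overall match covers only the text before the assertion, while the capture holds the text the assertion peeked at.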
(?=PATTERN) (positive lookahead)
When the Engine encounters (?=PATTERN), it looks ahead in the string to ensure that PATTERN occurs. If you'll recall, in our earlier duplicate word remover, we had to write a loop because the pattern ate too much each time through:
$_ = "Paris in THE THE THE THE spring.";
# remove duplicate words (and triplicate (and quadruplicate…))
1 while s/\b(\w+) \1\b/$1/gi;
Whenever you hear the phrase "ate too much", you should always think "lookahead assertion". (Well, almost always.) By peeking ahead instead of gobbling up the second word, you can write a one-pass duplicate word remover like this:
s/ \b(\w+) \s (?= \1\b ) //gxi;
Of course, this isn't quite right, since it will mess up valid phrases like "The clothes you DON DON't fit."
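To see the difference side by side, here is a sketch that runs the looping remover and the one-pass lookahead remover on the same sentence, and then shows the contraction problem. The variable names ($loop, $onepass, $broken) are ours:

```perl
use strict;
use warnings;

# Looping version: each pass eats both words, so it needs several passes.
my $loop = "Paris in THE THE THE THE spring.";
1 while $loop =~ s/\b(\w+) \1\b/$1/gi;

# One-pass version: only peek at the second word instead of eating it.
my $onepass = "Paris in THE THE THE THE spring.";
$onepass =~ s/ \b(\w+) \s (?= \1\b ) //gxi;

# The same one-pass pattern mangles a contraction.
my $broken = "The clothes you DON DON't fit.";
$broken =~ s/ \b(\w+) \s (?= \1\b ) //gxi;

print "$loop\n$onepass\n$broken\n";
```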
(?!PATTERN) (negative lookahead)
When the Engine encounters (?!PATTERN), it looks ahead in the string to ensure that PATTERN does not occur. To fix our previous example, we can add a negative lookahead assertion after the positive assertion to weed out the case of contractions:
s/ \b(\w+) \s (?= \1\b (?! '\w))//xgi;
That final \w is necessary to avoid confusing contractions with words at the ends of single-quoted strings. We can take this one step further, since earlier in this chapter we intentionally used "that that particular", and we'd like our program to not "fix" that for us. So we can add an alternative to the negative lookahead in order to pre-unfix that "that" (thereby demonstrating that any pair of parentheses can be used to cluster alternatives):
s/ \b(\w+) \s (?= \1\b (?! '\w | \s particular))//gix;
Now we know that that particular phrase is safe. Unfortunately, the Gettysburg Address is still broken. So we add another exception:
s/ \b(\w+) \s (?= \1\b (?! '\w | \s particular | \s nation))//igx;
This is just starting to get out of hand. So let's do an Official List of Exceptions, using a cute interpolation trick with the $" variable to separate the alternatives with the | character:
@thatthat = qw(particular nation);
local $" = '|';
s/ \b(\w+) \s (?= \1\b (?! '\w | \s (?: @thatthat )))//xig;
(?<=PATTERN) (positive lookbehind)
When the Engine encounters (?<=PATTERN), it looks backward in the string to ensure that PATTERN already occurred.
Our example still has a problem. Although it now lets Honest Abe say things like "that that nation", it also allows "Paris, in the the nation of France". We can add a positive lookbehind assertion in front of our exception list to make sure that we apply our @thatthat exceptions only to a real "that that".
s/ \b(\w+) \s (?= \1\b (?! '\w | (?<= that) \s (?: @thatthat )))//ixg;
Yes, it's getting terribly complicated, but that's why this section is called "Fancy Patterns", after all. If you need to complicate the pattern any more than we've done so far, judicious use of comments and qr// will help keep you sane. Or at least saner.
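Pulling the pieces together, the following sketch wraps the final pattern in a helper so its behavior can be checked on each of the troublesome phrases. The function name dedupe is hypothetical, and we use a lexical @thatthat rather than the global one shown in the text:

```perl
use strict;
use warnings;

my @thatthat = qw(particular nation);   # the Official List of Exceptions

# Hypothetical helper: remove a word when the same word follows it,
# unless the repeat is a contraction or a sanctioned "that that".
sub dedupe {
    my ($s) = @_;
    local $" = '|';                     # join @thatthat with | when interpolated
    $s =~ s/ \b(\w+) \s (?= \1\b (?! '\w | (?<= that) \s (?: @thatthat )))//ixg;
    return $s;
}

print dedupe("Paris, in the the nation of France"), "\n";
print dedupe("testing whether that that nation can long endure"), "\n";
```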
(?<!PATTERN) (negative lookbehind)
When the Engine encounters (?<!PATTERN), it looks backward in the string to ensure that PATTERN did not occur.
Let's go for a really simple example this time. How about the easy version of that old spelling rule, "I before E except after C"? In Perl, you spell it:
s/(?<!c)ei/ie/g
You'll have to weigh for yourself whether you want to handle any of the exceptions. (For example, "weird" is spelled weird, especially when you spell it "wierd".)
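Here is a short sketch of the rule at work on a handful of words (the word list is ours). It shows a misspelling being fixed, two words protected by the lookbehind, and the "weird" exception going wrong:

```perl
use strict;
use warnings;

my @words = qw(beleive receive ceiling weird);
s/(?<!c)ei/ie/g for @words;     # rewrite "ei" unless a "c" comes right before it
print "@words\n";
```

The last word is the price of using the easy version of the rule.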
As described in "The Little Engine That /Could(n't)?/", the Engine often backtracks as it proceeds through the pattern. You can block the Engine from backtracking back through a particular set of choices by creating a nonbacktracking subpattern. A nonbacktracking subpattern looks like (?>PATTERN), and it works exactly like a simple (?:PATTERN), except that once PATTERN has found a match, it suppresses backtracking on any of the quantifiers or alternatives inside the subpattern. (Hence, it is meaningless to use this on a PATTERN that doesn't contain quantifiers or alternatives.) The only way to get it to change its mind is to backtrack to something before the subpattern and reenter the subpattern from the left.
It's like going into a car dealership. After a certain amount of haggling over the price, you deliver an ultimatum: "Here's my best offer; take it or leave it." If they don't take it, you don't go back to haggling again. Instead, you backtrack clear out the door. Maybe you go to another dealership, and start haggling again. You're allowed to haggle again, but only because you reentered the nonbacktracking pattern again in a different context.
For devotees of Prolog or SNOBOL, you can think of this as a scoped cut or fence operator.
Consider how in "aaab" =~ /(?:a*)ab/, the a* first matches three a's, but then gives up one of them because the last a is needed later. The subgroup sacrifices some of what it wants in order for the whole match to succeed. (Which is like letting the car salesman talk you into giving him more of your money because you're afraid to walk away from the deal.) In contrast, the subpattern in "aaab" =~ /(?>a*)ab/ will never give up what it grabs, even though this behavior causes the whole match to fail. (As the song says, you have to know when to hold 'em, when to fold 'em, and when to walk away.)
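The two matches can be compared directly. In this sketch (variable names ours), the ordinary group lets the match succeed and the nonbacktracking group makes it fail:

```perl
use strict;
use warnings;

my $plain  = ("aaab" =~ /(?:a*)ab/) ? 1 : 0;   # a* gives one "a" back
my $atomic = ("aaab" =~ /(?>a*)ab/) ? 1 : 0;   # a* keeps every "a" it grabbed

print "plain group:  ", ($plain  ? "matches" : "fails"), "\n";
print "atomic group: ", ($atomic ? "matches" : "fails"), "\n";
```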
Although (?>PATTERN) is useful for changing the behavior of a pattern, it's mostly used for speeding up the failure of certain matches that you know will fail anyway (unless they succeed outright). The Engine can take a spectacularly long time to fail, particularly with nested quantifiers. The following pattern will succeed almost instantly:
$_ = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"; /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*[b]/;
But success is not the problem. Failure is. If you remove that final "b" from the string, the pattern will probably run for many, many years before failing. Many, many millennia. Actually, billions and billions of years.[13] You can see by inspection that the pattern can't succeed if there's no "b" on the end of the string, but the regex optimizer is not smart enough (as of this writing) to figure out that /[b]/ is equivalent to /b/. But if you give it a hint, you can get it to fail quickly while still letting it succeed where it can:
/(?>a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*)[b]/;
For a (hopefully) more realistic example, imagine a program that's supposed to read in a paragraph at a time and show just the lines that are continued, where continuation lines are specified with trailing backslashes. Here's a sample from Perl's Makefile that uses this line-continuation convention:
# Files to be built with variable substitution before miniperl
# is available.
sh = Makefile.SH cflags.SH config_h.SH makeaperl.SH makedepend.SH \
        makedir.SH myconfig.SH writemain.SH
You could write your simple program this way:
#!/usr/bin/perl -00p
while ( /( (.+) ( (?<=\\) \n .* )+ ) /gx) {
    print "GOT $.: $1\n\n";
}
That works, but it's really quite slow. That's because the Engine backtracks a character at a time from the end of the line, shrinking what's in $1. This is pointless. And writing it without the extraneous captures doesn't help much. Using:
(.+(?:(?<=\\)\n.*)+)
for a pattern is somewhat faster, but not much. This is where a nonbacktracking subpattern helps a lot. The pattern:
((?>.+)(?:(?<=\\)\n.*)+)
does the same thing, but more than an order of magnitude faster because it doesn't waste time backtracking in search of something that isn't there.
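Here is a minimal sketch of the nonbacktracking version applied to an in-memory paragraph instead of a file. The sample text and the names $para and @continued are ours:

```perl
use strict;
use warnings;

# One continued logical line, followed by an ordinary line.
my $para = "sh = Makefile.SH cflags.SH \\\n"
         . "        makedir.SH writemain.SH\n"
         . "all: perl\n";

my @continued;
while ($para =~ /((?>.+)(?:(?<=\\)\n.*)+)/g) {
    push @continued, $1;        # a line plus all of its continuation lines
}
print "GOT: $_\n\n" for @continued;
```

Only the continued line is reported. The ordinary line fails right away because (?>.+) refuses to give characters back in search of a backslash that isn't there.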
You'll never get a success with (?>…) that you wouldn't get with (?:…) or even a simple (…). But if you're going to fail, it's best to fail quickly and get on with your life.
Most Perl programs tend to follow an imperative (also called procedural) programming style, like a series of discrete commands laid out in a readily observable order: "Preheat oven, mix, glaze, heat, cool, serve to aliens." Sometimes into this mix you toss a few dollops of functional programming ("Use a little more glaze than you think you need, even after taking this into account, recursively"), or sprinkle it with bits of object-oriented techniques ("but please hold the anchovy objects"). Often it's a combination of all of these.
But the regular expression Engine takes a completely different approach to problem solving, more of a declarative approach. You describe goals in the language of regular expressions, and the Engine implements whatever logic is needed to solve your goals. Logic programming languages (such as Prolog) don't always get as much exposure as the other three styles, but they're more common than you'd think. Perl couldn't even be built without make(1) or yacc(1), both of which could be considered, if not purely declarative languages, at least hybrids that blend imperative and logic programming together.
You can do this sort of thing in Perl, too, by blending goal declarations and imperative code together more miscibly than we've done so far, drawing upon the strengths of both. You can programmatically build up the string you'll eventually present to the regex Engine, in a sense creating a program that writes a new program on the fly.
You can also supply ordinary Perl expressions as the replacement part of s/// via the /e modifier. This allows you to dynamically generate the replacement string by executing a bit of code every time the pattern matches.
Even more elaborately, you can interject bits of code wherever you'd like in the middle of a pattern using the (?{ CODE }) extension, and that code will be executed every time the Engine encounters that code as it advances and recedes in its intricate backtracking dance.
Finally, you can use s///ee or (??{ CODE }) to add another level of indirection: the results of executing those code snippets will themselves be re-evaluated for further use, creating bits of program and pattern on the fly, just in time.
It has been said[14] that programs that write programs are the happiest programs in the world. In Jeffrey Friedl's book, Mastering Regular Expressions, the final tour de force demonstrates how to write a program that produces a regular expression to determine whether a string conforms to the RFC 822 standard; that is, whether it contains a standards-compliant, valid mail header. The pattern produced is several thousand characters long, and about as easy to read as a crash dump in pure binary. But Perl's pattern matcher doesn't care about that; it just compiles up the pattern without a hitch and, even more interestingly, executes the match very quickly--much more quickly, in fact, than many short patterns with complex backtracking requirements.
That's a very complicated example. Earlier we showed you a very simple example of the same technique when we built up a $number pattern out of its components (see Section 5.9.2). But to show you the power of this programmatic approach to producing a pattern, let's work out a problem of medium complexity.
Suppose you wanted to pull out all the words with a certain vowel-consonant sequence; for example, "audio" and "eerie" both follow a VVCVV pattern. Although describing what counts as a consonant or a vowel is easy, you wouldn't ever want to type that in more than once. Even for our simple VVCVV case, you'd need to type in a pattern that looked something like this:
^[aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy]$
A more general-purpose program would accept a string like "VVCVV" and programmatically generate that pattern for you. For even more flexibility, it could accept a word like "audio" as input and use that as a template to infer "VVCVV", and from that, the long pattern above. It sounds complicated, but really isn't, because we'll let the program generate the pattern for us. Here's a simple cvmap program that does all of that:
#!/usr/bin/perl
$vowels = 'aeiouy';
$cons   = 'cbdfghjklmnpqrstvwxzy';
%map = (C => $cons, V => $vowels);  # init map for C and V
for $class ($vowels, $cons) {       # now for each type
    for (split //, $class) {        # get each letter of that type
        $map{$_} .= $class;         # and map the letter back to the type
    }
}
for $char (split //, shift) {       # for each letter in template word
    $pat .= "[$map{$char}]";        # add appropriate character class
}
$re = qr/^${pat}$/i;                # compile the pattern
print "REGEX is $re\n";             # debugging output
@ARGV = ('/usr/dict/words')         # pick a default dictionary
    if -t && !@ARGV;
while (<>) {                        # and now blaze through the input
    print if /$re/;                 # printing any line that matches
}
The %map variable holds all the interesting bits. Its keys are each letter of the alphabet, and the corresponding value is all the letters of its type. We throw in C and V, too, so you can specify either "VVCVV" or "audio", and still get out "eerie". Each character in the argument supplied to the program is used to pull out the right character class to add to the pattern. Once the pattern is created and compiled up with qr//, the match (even a very long one) will run quickly. Here's what you might get if you run this program on "fortuitously":
% cvmap fortuitously /usr/dict/words
REGEX is (?i-xsm:^[cbdfghjklmnpqrstvwxzy][aeiouy][cbdfghjklmnpqrstvwxzy][cbd
fghjklmnpqrstvwxzy][aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy][c
bdfghjklmnpqrstvwxzy][cbdfghjklmnpqrstvwxzy][aeiouycbdfghjklmnpqrstvwxzy]$)
carriageable
circuitously
fortuitously
languorously
marriageable
milquetoasts
sesquiquarta
sesquiquinta
villainously
Looking at that REGEX, you can see just how much villainous typing you saved by programming languorously, albeit circuitously.
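If you'd rather test the pattern builder without a dictionary file, the same idea can be packaged as a function. This is only a sketch: the name cv_regex is hypothetical, and it assumes the template contains nothing but lowercase letters or the capitals C and V:

```perl
use strict;
use warnings;

my $vowels = 'aeiouy';
my $cons   = 'cbdfghjklmnpqrstvwxzy';

# Hypothetical helper: turn "VVCVV" or a template word like "audio"
# into a compiled pattern, the same way cvmap does.
sub cv_regex {
    my ($template) = @_;
    my %map = (C => $cons, V => $vowels);
    for my $class ($vowels, $cons) {
        $map{$_} .= $class for split //, $class;   # letter => its class(es)
    }
    my $pat = join '', map { "[$map{$_}]" } split //, $template;
    return qr/^${pat}$/i;
}

my $re = cv_regex('audio');
print "eerie fits the audio template\n" if 'eerie' =~ $re;
```

Because "y" is listed as both a vowel and a consonant, its map entry holds both classes, just as in the last bracket of the REGEX output above.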
When the /e modifier ("e" is for expression evaluation) is used on an s/PATTERN/CODE/e expression, the replacement portion is interpreted as a Perl expression, not just as a double-quoted string. It's like an embedded do { CODE }. Even though it looks like a string, it's really just a code block that gets compiled up at the same time as the rest of your program, long before the substitution actually happens.
You can use the /e modifier to build replacement strings with fancier logic than double-quote interpolation allows. This shows the difference:
s/(\d+)/$1 * 2/;     # Replaces "42" with "42 * 2"
s/(\d+)/$1 * 2/e;    # Replaces "42" with "84"
And this converts Celsius temperatures into Fahrenheit:
$_ = "Preheat oven to 233C.\n";
s/\b(\d+\.?\d*)C\b/int($1 * 1.8 + 32) . "F"/e;   # convert to 451F
Applications of this technique are limitless. Here's a filter that modifies its files in place (like an editor) by adding 100 to every number that starts a line (and that is followed by a colon, which we only peek at, but don't actually match, or replace):
% perl -pi -e 's/^(\d+)(?=:)/100 + $1/e' filename
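The same substitution can be tried on a few in-memory lines before you let it loose on a real file. In this sketch the sample lines and the name @lines are ours:

```perl
use strict;
use warnings;

my @lines = ("7: first", "42: second", "9 no colon");
s/^(\d+)(?=:)/100 + $1/e for @lines;   # the colon is peeked at, never replaced
print "$_\n" for @lines;
```

The third line is left alone because its number isn't followed by a colon.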
Now and then, you want to do more than just use the string you matched in another computation. Sometimes you want that string to be a computation, whose own evaluation you'll use for the replacement value. Each additional /e modifier after the first wraps an eval around the code to execute. The following two lines do the same thing, but the first one is easier to read:
s/PATTERN/CODE/ee
s/PATTERN/eval(CODE)/e
You could use this technique to replace mentions of simple scalar variables with their values:
s/(\$\w+)/$1/eeg;       # Interpolate most scalars' values
Because it's really an eval, the /ee even finds lexical variables. A slightly more elaborate example calculates a replacement for simple arithmetical expressions on (nonnegative) integers:
$_ = "I have 4 + 19 dollars and 8/2 cents.\n";
s{ (
      \d+ \s*    # find an integer
      [+*/-]     # and an arithmetical operator
      \s* \d+    # and another integer
   )
}{ $1 }eegx;     # then expand $1 and run that code
print;           # "I have 23 dollars and 4 cents."
Like any other eval STRING, compile-time errors (like syntax problems) and run-time exceptions (like dividing by zero) are trapped. When that happens, the $@ ($EVAL_ERROR) variable says what went wrong.
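To watch an error being trapped, this sketch uses the spelled-out s/PATTERN/eval(CODE)/e form so that it can inspect $@ after each evaluation. The "[error]" placeholder and the names $text and @errors are our own choices, not part of any standard idiom:

```perl
use strict;
use warnings;

my $text = "I have 4 + 19 dollars, 8/2 cents, and 1/0 clues.";
my @errors;

$text =~ s{ ( \d+ \s* [+*/-] \s* \d+ ) }{
    my $value = eval $1;                    # run the matched text as code
    push @errors, $@ unless defined $value; # remember what went wrong
    defined $value ? $value : "[error]";
}egx;

print "$text\n";
print "trapped: $_" for @errors;
```

The division by zero doesn't kill the program. The eval returns undef, and the message lands in $@.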
In most programs that use regular expressions, the surrounding program's run-time control structure drives the logical execution flow. You write if or while loops, or make function or method calls, that wind up calling a pattern-matching operation now and then. Even with s///e, it's the substitution operator that is in control, executing the replacement code only after a successful match.
With code subpatterns, the normal relationship between regular expression and program code is inverted. As the Engine is applying its Rules to your pattern at match time, it may come across a regex extension of the form (?{ CODE }). When triggered, this subpattern doesn't do any matching or any looking about. It's a zero-width assertion that always "succeeds", evaluated only for its side effects. Whenever the Engine needs to progress over the code subpattern as it executes the pattern, it runs that code.
"glyph" =~ /.+ (?{ print "hi" }) ./x; # Prints "hi" twice.
As the Engine tries to match glyph against this pattern, it first lets the .+ eat up all five letters. Then it prints "hi". When it finds that final dot, all five letters have been eaten, so it needs to backtrack back to the .+ and make it give up one of the letters. Then it moves forward through the pattern again, stopping to print "hi" again before assigning h to the final dot and completing the match successfully.
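If you'd rather count than print, the same pattern can bump a counter each time the Engine passes over the code subpattern. This sketch uses a package variable named $count (our name) to keep the example independent of how a given Perl version treats lexicals inside code subpatterns:

```perl
use strict;
use warnings;

our $count = 0;
"glyph" =~ /.+ (?{ $count++ }) ./x;   # same pattern, counting instead of printing
print "code subpattern ran $count times\n";
```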
The braces around the CODE fragment are intended to remind you that it is a block of Perl code, and it certainly behaves like a block in the lexical sense. That is, if you use my to declare a lexically scoped variable in it, it is private to the block. But if you use local to localize a dynamically scoped variable, it may not do what you expect. A (?{ CODE }) subpattern creates an implicit dynamic scope that is valid throughout the rest of the pattern, until it either succeeds or backtracks through the code subpattern. One way to think of it is that the block doesn't actually return when it gets to the end. Instead, it makes an invisible recursive call to the Engine to try to match the rest of the pattern. Only when that recursive call is finished does it return from the block, delocalizing the localized variables.[15]
In the next example, we initialize $i to 0 by including a code subpattern at the beginning of the pattern. Then we match any number of characters with .*--but we place another code subpattern in between the . and the * so we can count how many times . matches.
$_ = 'lothlorien';
m/  (?{ $i = 0 })                   # Set $i to 0
    (.    (?{ $i++ })    )*         # Update $i, even after backtracking
    lori                            # Forces a backtrack
 /x;
The Engine merrily goes along, setting $i to 0 and letting the .* gobble up all 10 characters in the string. When it encounters the literal lori in the pattern, it backtracks, making the .* give characters back until lori can match. After the match, $i will still be 10.
If you wanted $i to reflect how many characters the .* actually ended up with, you could make use of the dynamic scope within the pattern:
$_ = 'lothlorien';
m/  (?{ $i = 0 })
    (. (?{ local $i = $i + 1; }) )*  # Update $i, backtracking-safe.
    lori
    (?{ $result = $i })              # Copy to non-localized location.
 /x;
Here, we use local to ensure that $i contains the number of characters matched by .*, regardless of backtracking. $i will be forgotten after the regular expression ends, so the code subpattern, (?{ $result = $i }), ensures that the count will live on in $result.
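Here is the backtracking-safe counter as a complete script, so you can check what lands in $result. Both variables are declared with our because local needs a dynamically scoped (package) variable:

```perl
use strict;
use warnings;

our ($i, $result);
$_ = 'lothlorien';
m/  (?{ $i = 0 })
    (. (?{ local $i = $i + 1; }) )*  # each increment is undone on backtracking
    lori
    (?{ $result = $i })              # copy the surviving count somewhere safe
 /x;
print "the .* kept $result characters\n";
```

The .* ends up holding only loth, the part of the string before lori.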
The special variable $^R (described in Chapter 28) holds the result of the last (?{ CODE }) that was executed as part of a successful match.
You can use a (?{ CODE }) extension as the COND of a (?(COND)IFTRUE|IFFALSE). If you do this, $^R will not be set, and you may omit the parentheses around the conditional:
"glyph" =~ /.+(?(?{ $foo{bar} gt "symbol" }).|signet)./;
Here, we test whether $foo{bar} is greater than symbol. If so, we include . in the pattern, and if not, we include signet in the pattern. Stretched out a bit, it might be construed as more readable:
"glyph" =~ m{
    .+                              # some anythings
    (?(?{                           # if
            $foo{bar} gt "symbol"   # this is true
        })
            .                       # match another anything
        |                           # else
            signet                  # match signet
    )
    .                               # and one more anything
}x;
When use re 'eval' is in effect, a regex is allowed to contain (?{ CODE }) subpatterns even if the regular expression interpolates variables:
/(.*?) (?{length($1) < 3 && warn}) $suffix/; # Error without use re 'eval'
This is normally disallowed since it is a potential security risk. Even though the pattern above may be innocuous because $suffix is innocuous, the regex parser can't tell which parts of the string were interpolated and which ones weren't, so it just disallows code subpatterns entirely if there were any interpolations.
If the pattern is obtained from tainted data, even use re 'eval' won't allow the pattern match to proceed.
When use re 'taint' is in effect and a tainted string is the target of a regex, the captured subpatterns (either in the numbered variables or in the list of values returned by m// in list context) are tainted. This is useful when regex operations on tainted data are meant not to extract safe substrings, but merely to perform other transformations. See Chapter 23 for more on tainting.
For the purposes of use re 'eval', precompiled regular expressions (usually obtained from qr//) are not considered to be interpolated:
/foo${pat}bar/
This is allowed if $pat is a precompiled regular expression, even if $pat contains (?{ CODE }) subpatterns.
Earlier we showed you a bit of what use re 'debug' prints out. A more primitive debugging solution is to use (?{ CODE }) subpatterns to print out what's been matched so far during the match:
"abcdef" =~ / .+ (?{print "Matched so far: $&\n"}) bcdef $/x;
This prints:
Matched so far: abcdef
Matched so far: abcde
Matched so far: abcd
Matched so far: abc
Matched so far: ab
Matched so far: a
showing the .+ grabbing all the letters and giving them up one by one as the Engine backtracks.
You can build parts of your pattern from within the pattern itself. The (??{ CODE }) extension allows you to insert code that evaluates to a valid pattern. It's like saying /$pattern/, except that you can generate $pattern at run time--more specifically, at match time. For instance:
/\w (??{ if ($threshold > 1) { "red" } else { "blue" } }) \d/x;
This is equivalent to /\wred\d/ if $threshold is greater than 1, and /\wblue\d/ otherwise.
You can include backreferences inside the evaluated code to derive patterns from just-matched substrings (even if they will later become unmatched through backtracking). For instance, this matches all strings that read the same backward as forward (known as palindromedaries, phrases with a hump in the middle):
/^ (.+) .? (??{quotemeta reverse $1}) $/xi;
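For experimenting, the palindrome pattern can be wrapped in a small predicate. The function name is_palindrome is hypothetical, and the sketch is only exercised on all-lowercase words:

```perl
use strict;
use warnings;

# Hypothetical helper around the pattern above: the second half of the
# pattern is generated at match time from whatever $1 currently holds.
sub is_palindrome {
    my ($s) = @_;
    return $s =~ /^ (.+) .? (??{ quotemeta reverse $1 }) $/xi ? 1 : 0;
}

print "$_: ", (is_palindrome($_) ? "yes" : "no"), "\n"
    for qw(racecar abba camel);
```

The optional .? in the middle is the hump: it lets odd-length strings such as racecar keep an unpaired center character.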
You can balance parentheses like so:
$text =~ /( \(+ ) (.*?) (??{ '\)' x length $1 })/x;
This matches strings of the form (shazam!) and (((shazam!))), sticking shazam! into $2. Unfortunately, it doesn't notice whether the parentheses in the middle are balanced. For that we need recursion.
Fortunately, you can do recursive patterns too. You can have a compiled pattern that uses (??{ CODE }) to refer to itself. Recursive matching is pretty irregular, as regular expressions go. Any text on regular expressions will tell you that a standard regex can't match nested parentheses correctly. And that's correct. It's also correct that Perl's regexes aren't standard. The following pattern[16] matches a set of nested parentheses, however deep they go:
$np = qr{
           \(
           (?:
              (?> [^()]+ )    # Non-parens without backtracking
            |
              (??{ $np })     # Group with matching parens
           )*
           \)
        }x;
You could use it like this to match a function call:
$funpat = qr/\w+$np/;
'myfunfun(1,(2*(3+4)),5)' =~ /^$funpat$/;   # Matches!
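As a self-contained check, the sketch below declares $np ahead of its definition (as the footnote requires) and tries the function-call pattern on one balanced and one unbalanced string. The names $ok and $bad are ours:

```perl
use strict;
use warnings;

our $np;                        # declared first so the pattern can refer to it
$np = qr{
           \(
           (?:
              (?> [^()]+ )      # Non-parens without backtracking
            |
              (??{ $np })       # Group with matching parens
           )*
           \)
        }x;

my $funpat = qr/\w+$np/;
my $ok  = ('myfunfun(1,(2*(3+4)),5)' =~ /^$funpat$/) ? 1 : 0;   # balanced
my $bad = ('myfunfun(1,(2*(3+4),5)'  =~ /^$funpat$/) ? 1 : 0;   # one ")" short

print "balanced call:   ", ($ok  ? "matches" : "fails"), "\n";
print "unbalanced call: ", ($bad ? "matches" : "fails"), "\n";
```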
The (?(COND)IFTRUE|IFFALSE) regex extension is similar to Perl's ?: operator. If COND is true, the IFTRUE pattern is used; otherwise, the IFFALSE pattern is used. The COND can be a backreference (expressed as a bare integer, without the \ or $), a lookaround assertion, or a code subpattern. (See Section 5.10.1 and Section 5.10.3.3 earlier in this chapter.)
If the COND is an integer, it is treated as a backreference. For instance, consider:
#!/usr/bin/perl
$x = 'Perl is free.';
$y = 'ManagerWare costs $99.95.';
foreach ($x, $y) {
    /^(\w+) (?:is|(costs)) (?(2)(\$\d+)|\w+)/;  # Either (\$\d+) or \w+
    if ($3) {
        print "$1 costs money.\n";         # ManagerWare costs money.
    } else {
        print "$1 doesn't cost money.\n";  # Perl doesn't cost money.
    }
}
Here, the COND is (2), which is true if a second backreference exists. If that's the case, (\$\d+) is included in the pattern at that point (creating the $3 backreference); otherwise, \w+ is used.
If the COND is a lookaround or code subpattern, the truth of the assertion is used to determine whether to include IFTRUE or IFFALSE:
/[ATGC]+(?(?<=AA)G|C)$/;
This uses a lookbehind assertion as the COND to match a DNA sequence that ends in either AAG, or some other base combination and C.
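To get a feel for which sequences pass, this sketch runs the pattern over a few made-up strings (the sequences and the names $re and %matches are ours):

```perl
use strict;
use warnings;

my $re = qr/[ATGC]+(?(?<=AA)G|C)$/;

my %matches;
$matches{$_} = ($_ =~ $re) ? 1 : 0 for qw(TTAAG TTAC TTAAC GATTG);

printf "%-6s %s\n", $_, ($matches{$_} ? "matches" : "fails")
    for sort keys %matches;
```

Note that a sequence ending in AAC is rejected: once the lookbehind sees AA, only a G will do.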
You can omit the |IFFALSE alternative. If you do, the IFTRUE pattern will be included in the pattern as usual if the COND is true, but if the condition isn't true, the Engine will move on to the next portion of the pattern.
You can't change how Perl's Engine works, but if you're sufficiently warped, you can change how it sees your pattern. Since Perl interprets your pattern similarly to double-quoted strings, you can use the wonder of overloaded string constants to see to it that text sequences of your choosing are automatically translated into other text sequences.
In the example below, we specify two transformations to occur when Perl encounters a pattern. First, we define \tag so that when it appears in a pattern, it's automatically translated to (?:<.*?>), which matches most HTML and XML tags. Second, we "redefine" the \w metasymbol so that it handles only English letters.
We'll define a package called Tagger that hides the overloading from our main program. Once we do that, we'll be able to say:
use Tagger;
$_ = '<I>camel</I>';
print "Tagged camel found" if /\tag\w+\tag/;
Here's Tagger.pm, couched in the form of a Perl module (see Chapter 11):
package Tagger;
use overload;
sub import { overload::constant 'qr' => \&convert }
sub convert {
    my $re = shift;
    $re =~ s/ \\tag  /<.*?>/xg;
    $re =~ s/ \\w    /[A-Za-z]/xg;
    return $re;
}
1;
The Tagger module is handed the pattern immediately before interpolation, so you can bypass the overloading by bypassing interpolation, as follows:
$re = '\tag\w+\tag';   # This string begins with \t, a tab
print if /$re/;        # Matches a tab, followed by an "a"…
If you wanted the interpolated variable to be customized, call the convert function directly:
$re = '\tag\w+\tag';         # This string begins with \t, a tab
$re = Tagger::convert $re;   # expand \tag and \w
print if /$re/;              # $re becomes <.*?>[A-Za-z]+<.*?>
Now if you're still wondering what those sub thingies are there in the Tagger module, you'll find out soon enough because that's what our next chapter is all about.
[13] Actually, it's more on the order of septillions and septillions. We don't know exactly how long it would take. We didn't care to wait around watching it not fail. In any event, your computer is likely to crash before the heat death of the universe, and this regular expression takes longer than either of those.
[14] By Andrew Hume, the famous Unix philosopher.
[15] People who are familiar with recursive descent parsers may find this behavior confusing because such compilers return from a recursive function call whenever they figure something out. The Engine doesn't do that--when it figures something out, it goes deeper into recursion (even when exiting a parenthetical group!). A recursive descent parser is at a minimum of recursion when it succeeds at the end, but the Engine is at a local maximum of recursion when it succeeds at the end of the pattern. You might find it helpful to dangle the pattern from its left end and think of it as a skinny representation of a call graph tree. If you can get that picture into your head, the dynamic scoping of local variables will make more sense. (And if you can't, you're no worse off than before.)
[16] Note that you can't declare the variable in the same statement in which you're going to use it. You can always declare it earlier, of course.