Unless you say otherwise, each item in a regular
expression matches just once. With a pattern like /nop/, each of
those characters must match, each right after the other. Words like
"panoply" or "xenophobia" are fine, because where the match occurs
doesn't matter.
If you wanted to match both "xenophobia" and "Snoopy", you
couldn't use the /nop/ pattern, since that requires
just one "o" between the "n" and the "p", and Snoopy has two. This is
where quantifiers come in handy: they say how
many times something may match, instead of the default of matching
just once. Quantifiers in a regular expression are like loops in a
program; in fact, if you think of a regex as a program, then they
are loops. Some loops are exact, like "repeat
this match five times only" ({5}). Others give both
lower and upper bounds on the match count, like "repeat this match at
least twice but no more than four times" ({2,4}).
Others have no closed upper bound at all, like "match this at least
twice, but as many times as you'd like" ({2,}).
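As a rough sketch of those repeat counts, the following tries a few quantified variations of /nop/ against some sample words. The word list and the variable names are made up for illustration only.

```perl
use strict;
use warnings;

# Sample words chosen only for illustration.
my @words = ("xenophobia", "Snoopy", "snooooop");

# Each pattern allows a different number of "o"s between "n" and "p".
my %patterns = (
    'nop'      => qr/nop/,        # exactly one "o" (the default)
    'no{2}p'   => qr/no{2}p/,     # exactly two
    'no{1,2}p' => qr/no{1,2}p/,   # at least one, at most two
    'no{2,}p'  => qr/no{2,}p/,    # two or more, no upper bound
);

my %hits;   # $hits{$label} holds the words that pattern matched
for my $label (sort keys %patterns) {
    $hits{$label} = [ grep { $_ =~ $patterns{$label} } @words ];
    print "$label matches: @{ $hits{$label} }\n";
}
```

Only the last two patterns accept "Snoopy" along with something else: /no{1,2}p/ also takes "xenophobia", while /no{2,}p/ also takes the word with the long run of "o"s.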
Table 5.12 shows the quantifiers that Perl recognizes in a pattern.
Something with a * or a ? doesn't actually have to match. That's
because they can match 0 times and still be considered a success.
A + may often be a better fit, since it has to be there at least once.
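Here is a small sketch of that difference, using a string that contains no digits at all. The input string and variable names are only illustrative.

```perl
use strict;
use warnings;

my $text = "no digits here";   # illustrative input

# \d* and \d? are satisfied by zero digits, so both succeed (by
# matching the empty string). \d+ needs at least one digit, so it fails.
my $star  = $text =~ /\d*/ ? 1 : 0;
my $quest = $text =~ /\d?/ ? 1 : 0;
my $plus  = $text =~ /\d+/ ? 1 : 0;

print "star=$star quest=$quest plus=$plus\n";   # star=1 quest=1 plus=0
```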
Don't be confused by the use of "exactly" in the previous table.
It refers only to the repeat count, not the overall string. For
example, $n =~ /\d{3}/ doesn't say "is this string
exactly three digits long?" It asks whether there's any point within
$n at which three digits occur in a row. Strings
like "101 Morris Street" test true, but so do strings like "95472" or
"1-800-555-1212". All contain three digits at one
or more points, which is all you asked about. See Section 5.6 for how to use
positional assertions (as in /^\d{3}$/) to nail
this down.
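The following sketch contrasts the two patterns on the strings mentioned above, plus one extra sample ("212") added here purely for illustration.

```perl
use strict;
use warnings;

# The first three strings come from the text; "212" is an added sample.
my @samples = ("101 Morris Street", "95472", "1-800-555-1212", "212");

# Three digits in a row anywhere in the string.
my @anywhere = grep { /\d{3}/ } @samples;

# The whole string must be three digits and nothing else.
my @exact = grep { /^\d{3}$/ } @samples;

print "anywhere: ", join(", ", @anywhere), "\n";
print "exact:    ", join(", ", @exact), "\n";
```

The unanchored pattern accepts all four strings, while the anchored one accepts only "212".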
Given the opportunity to match something a variable number of times, maximal (or greedy) quantifiers will elect to maximize the repeat count. So when we say "as many times as you'd like", the greedy quantifier interprets this to mean "as many times as you can possibly get away with", constrained only by the requirement that this not cause specifications later in the match to fail. If a pattern contains two open-ended quantifiers, then obviously both cannot consume the entire string: characters used by one part of the match are no longer available to a later part. Each quantifier is greedy at the expense of those that follow it, reading the pattern left to right.
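To illustrate that left-to-right greed, here is a minimal sketch with two open-ended quantifiers competing for the same digits. The sample string and variable names are hypothetical.

```perl
use strict;
use warnings;

my $digits = "12345";   # illustrative input

# Both groups want as many digits as possible. The first one wins,
# leaving the second only the single digit it needs to succeed.
my ($first, $second) = $digits =~ /^(\d+)(\d+)$/;
print "first=$first second=$second\n";   # first=1234 second=5

# With *, the second group can succeed on nothing at all, so the
# first group keeps everything.
my ($all, $rest) = $digits =~ /^(\d*)(\d*)$/;
print "all=$all rest='$rest'\n";         # all=12345 rest=''
```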
That's the traditional behavior of quantifiers in
regular expressions. However, Perl permits you to reform the behavior
of its quantifiers: by placing a ? after that
quantifier, you change it from maximal to minimal. That doesn't mean
that a minimal quantifier will always match the smallest number of
repetitions allowed by its range, any more than a maximal quantifier
must always match the greatest number allowed in its range. The
overall match must still succeed, and the minimal match will take as
much as it needs to succeed, and no more. (Minimal quantifiers value
contentment over greed.)
For example, in the match:
"exasperate" =~ /e(.*)e/ # $1 now "xasperat"
the .* matches "xasperat",
the longest possible string for it to match. (It also stores that
value in $1, as described in Section 5.7 later in the
chapter.) Although a shorter match was available, a greedy match
doesn't care. Given two choices at the same starting point, it always
returns the longer of the two.
Contrast that with this:
"exasperate" =~ /e(.*?)e/ # $1 now "xasp"
Here, the minimal matching version, .*?, is
used. Adding the ? to * makes *? take on the opposite behavior:
now given two choices at the same starting point, it always returns
the shorter of the two.
Although you could read *? as saying to match
zero or more of something but preferring zero, that doesn't mean it
will always match zero characters. If it did so here, for example, and
left $1 set to "", then the second "e" wouldn't be found, since it
doesn't immediately follow the first one.
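The same holds for bounded ranges. In this sketch (the sample string and names are invented for illustration), a minimal {2,4}? settles for two repetitions when nothing follows it, but is pushed all the way to four when the rest of the pattern demands it.

```perl
use strict;
use warnings;

my $s = "aXXXXb";   # illustrative input

# Nothing after the group: the minimal quantifier stops at its lower bound.
my ($lazy_alone) = $s =~ /a(X{2,4}?)/;

# A "b" must follow: the minimal quantifier has to take all four X's.
my ($lazy_forced) = $s =~ /a(X{2,4}?)b/;

# For comparison, the maximal version takes four even when unforced.
my ($greedy_alone) = $s =~ /a(X{2,4})/;

print "lazy alone:   $lazy_alone\n";    # XX
print "lazy forced:  $lazy_forced\n";   # XXXX
print "greedy alone: $greedy_alone\n";  # XXXX
```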
You might also wonder why, in minimally matching
/e(.*?)e/, Perl didn't stick "rat" into $1. After all,
"rat" also falls between two e's, and is shorter than "xasp".
In Perl, the minimal/maximal choice applies only when selecting the
shortest or longest from among several matches that all have the same
starting point. If two possible matches exist, but these start at
different offsets in the string, then their lengths don't matter--nor
does it matter whether you've used a minimal quantifier or a maximal
one. The earliest of several valid matches always wins out over all
latecomers. It's only when multiple possible matches start at the same
point that you use minimal or maximal matching to break the tie. If
the starting points differ, there's no tie to break. Perl's matching
is normally leftmost longest; with minimal
matching, it becomes leftmost shortest. But the
"leftmost" part never varies and is the dominant criterion.[7]
There are two ways to defeat the leftward leanings of
the pattern matcher. First, you can use an earlier greedy quantifier
(typically .*) to try to slurp earlier parts of the
string. In searching for a match for a greedy quantifier, Perl tries for
the longest match first, which effectively searches the rest of the
string right-to-left:
"exasperate" =~ /.*e(.*?)e/ # $1 now "rat"
But be careful with that, since the overall match now includes the entire string up to that point.
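To make that caution concrete, the sketch below records where each match starts (via $-[0]) and what the overall match covers (via $&), with and without the leading .* in the pattern. The variable names are illustrative.

```perl
use strict;
use warnings;

my $word = "exasperate";

# Plain minimal match: begins at the leftmost "e", so $1 is "xasp".
my ($plain_start, $plain_whole, $plain_inner);
if ($word =~ /e(.*?)e/) {
    ($plain_start, $plain_whole, $plain_inner) = ($-[0], $&, $1);
}

# With a leading greedy .*, $1 becomes "rat", but the overall match
# still begins at offset 0 and covers everything the .* swallowed.
my ($slurp_start, $slurp_whole, $slurp_inner);
if ($word =~ /.*e(.*?)e/) {
    ($slurp_start, $slurp_whole, $slurp_inner) = ($-[0], $&, $1);
}

print "plain: start=$plain_start whole=$plain_whole inner=$plain_inner\n";
print "slurp: start=$slurp_start whole=$slurp_whole inner=$slurp_inner\n";
```

Both matches start at offset 0. The leading .* changes only which text lands in $1, and it does so at the cost of making the overall match span the whole word.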
The second way to defeat leftmostness is to use positional assertions, discussed in the next section.
[7] Not all regex engines work this way. Some believe in overall greed, in which the longest match always wins, even if it shows up later. Perl isn't that way. You might say that eagerness holds priority over greed (or thrift). For a more formal discussion of this principle and many others, see Section 5.9.4.