Some regex constructs represent positions in the string to be matched, which is a location just to the left or right of a real character. These metasymbols are examples of zero-width assertions because they do not correspond to actual characters in the string. We often just call them "assertions". (They're also known as "anchors" because they tie some part of the pattern to a particular position.)
You can always manipulate positions in a string without
using patterns. The built-in substr
function lets
you extract and assign to substrings, measured from the beginning of
the string, the end of the string, or from a particular numeric
offset. This might be all you need if you were working with
fixed-length records, for instance. Patterns are only necessary when a
numeric offset isn't sufficient. But most of the time, offsets aren't
sufficient--at least, not sufficiently convenient, compared to
patterns.
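As a small illustration of the offset-based approach, here is a sketch that pulls fields out of a fixed-length record with substr. The record layout (a 10-character name, a 5-character quantity, and an 8-character date) is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical fixed-length record: 10-character name, 5-character
# quantity, 8-character date (this layout is an assumption for the demo).
my $record = "widget    " . "00042" . "20240131";

my $name = substr($record, 0, 10);   # measured from the beginning
my $qty  = substr($record, 10, 5);   # from a particular numeric offset
my $date = substr($record, -8);      # measured from the end

# substr is an lvalue, so you can assign to the substring in place.
substr($record, 10, 5) = "00043";

print "name=[$name] qty=[$qty] date=[$date]\n";
print "record=[$record]\n";
```

This works only because every field sits at a known offset. Once field widths vary, a pattern is usually the more convenient tool.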
The \A assertion matches only at the beginning of the string, no matter what. However, the ^ assertion is the traditional beginning-of-line assertion as well as a beginning-of-string assertion. Therefore, if the pattern uses the /m modifier[8] and the string has embedded newlines, ^ also matches anywhere inside the string immediately following a newline character:
    /\Abar/      # Matches "bar" and "barstool"
    /^bar/       # Matches "bar" and "barstool"
    /^bar/m      # Matches "bar" and "barstool" and "sand\nbar"
Used in conjunction with /g, the /m modifier lets ^ match many times in the same string:
    s/^\s+//gm;               # Trim leading whitespace on each line
    $total++ while /^./mg;    # Count nonblank lines
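To check the beginning-of-string and beginning-of-line behavior for yourself, a short script along these lines may help. The helper name matching and the sample strings are our own choices for the sketch, not anything built into Perl.

```perl
use strict;
use warnings;

my @strings = ("bar", "barstool", "sand\nbar");

# Hypothetical helper: return the sample strings that a pattern matches.
sub matching {
    my ($re) = @_;
    return grep { $_ =~ $re } @strings;
}

my @with_A       = matching(qr/\Abar/);   # beginning of string only
my @with_caret   = matching(qr/^bar/);    # same as \A without /m
my @with_caret_m = matching(qr/^bar/m);   # also after an embedded newline

# Count nonblank lines: ^ under /m matches at each line start, and
# the . refuses to match a newline, so empty lines are skipped.
my $text  = "one\n\ntwo\nthree\n";
my $total = 0;
$total++ while $text =~ /^./mg;

printf "%d %d %d nonblank=%d\n",
    scalar @with_A, scalar @with_caret, scalar @with_caret_m, $total;
```

Only the /m version accepts "sand\nbar", because that is the only one in which ^ may match just after the newline.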
The \z metasymbol matches at the end of the string, no matter what's inside. \Z matches right before the newline at the end of the string if there is a newline, or at the end if there isn't. The $ metacharacter usually means the same as \Z. However, if the /m modifier was specified and the string has embedded newlines, then $ can also match anywhere inside the string right in front of a newline:
    /bot\z/      # Matches "robot"
    /bot\Z/      # Matches "robot" and "abbot\n"
    /bot$/       # Matches "robot" and "abbot\n"
    /bot$/m      # Matches "robot" and "abbot\n" and "robot\nrules"

    /^robot$/    # Matches "robot" and "robot\n"
    /^robot$/m   # Matches "robot" and "robot\n" and "this\nrobot\n"

    /\Arobot\Z/  # Matches "robot" and "robot\n"
    /\Arobot\z/  # Matches only "robot" -- but why didn't you use eq?
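The differences among the end-of-string assertions are easy to mix up, so here is a sketch that reports which of them accept a given string. The function name end_anchors and its labels are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list which end assertions let "bot" match in $s.
sub end_anchors {
    my ($s) = @_;
    my @hit;
    push @hit, 'z'  if $s =~ /bot\z/;    # very end of string only
    push @hit, 'Z'  if $s =~ /bot\Z/;    # end, or before a final newline
    push @hit, '$'  if $s =~ /bot$/;     # same as \Z without /m
    push @hit, '$m' if $s =~ /bot$/m;    # also before any embedded newline
    return join ' ', @hit;
}

for my $s ("robot", "abbot\n", "robot\nrules") {
    (my $shown = $s) =~ s/\n/\\n/g;      # make newlines visible in the output
    printf "%-14s %s\n", qq("$shown"), end_anchors($s);
}
```

Each string in the list passes fewer of the tests than the one before it: "robot" satisfies all four, "abbot\n" fails only \z, and "robot\nrules" is accepted only by $ under /m.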
As with ^, the /m modifier lets $ match many times in the same string when used with /g. (These examples assume that you've read a multiline record into $_, perhaps by setting $/ to "" before reading.)
    s/\s*$//gm;   # Trim trailing whitespace on each line in paragraph

    while (/^([^:]+):\s*(.*)/gm ) {  # get mail header
        $headers{$1} = $2;
    }
In "Variable Interpolation" later in this chapter, we'll discuss how you can interpolate variables into patterns: if $foo is "bc", then /a$foo/ is equivalent to /abc/. Here, the $ does not match the end of the string. For a $ to match the end of the string, it must be at the end of the pattern or immediately be followed by a vertical bar or closing parenthesis.
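A quick way to see the two roles of $ side by side is a sketch like the following. The variable names and test strings are ours, chosen only for the demonstration.

```perl
use strict;
use warnings;

my $foo = "bc";

# Here $foo is interpolated, so the pattern is really /abc/.
my $interpolated = ("abc" =~ /a$foo/) ? 1 : 0;
my $no_anchor    = ("a"   =~ /a$foo/) ? 1 : 0;   # "a" alone is not "abc"

# Here $ is an end-of-string assertion, because it is followed
# by a vertical bar or a closing parenthesis.
my $before_bar   = ("xa" =~ /a$|zzz/) ? 1 : 0;
my $mid_string   = ("ax" =~ /a$|zzz/) ? 1 : 0;   # "a" is not at the end
my $before_paren = ("xa" =~ /(a$)/)   ? 1 : 0;

print "$interpolated $no_anchor $before_bar $mid_string $before_paren\n";
```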
The \b assertion matches at any word boundary, defined as the position between a \w character and a \W character, in either order. If the order is \W\w, it's a beginning-of-word boundary, and if the order is \w\W, it's an end-of-word boundary. (The ends of the string count as \W characters here.) The \B assertion matches any position that is not a word boundary, that is, the middle of either \w\w or \W\W.
    /\bis\b/   # matches "what it is" and "that is it"
    /\Bis\B/   # matches "thistle" and "artist"
    /\bis\B/   # matches "istanbul" and "so--isn't that butter?"
    /\Bis\b/   # matches "confutatis" and "metropolis near you"
Because \W includes all punctuation characters (except the underscore), there are \b boundaries in the middle of strings like "isn't", "[email protected]", "M.I.T.", and "key/value".
Inside a character class ([]), a \b represents a backspace rather than a word boundary.
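The following sketch shows where \b finds boundaries in punctuated strings, and how the meaning of \b changes inside a character class. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Punctuation is \W, so it splits these strings into separate "words".
my @pieces = "isn't"     =~ /\b(\w+)\b/g;
my @keyval = "key/value" =~ /\b(\w+)\b/g;

# \b demands a boundary on each side; \B demands the opposite.
my $whole_word  = ("that is it" =~ /\bis\b/) ? 1 : 0;
my $inside_word = ("thistle"    =~ /\bis\b/) ? 1 : 0;
my $nonboundary = ("thistle"    =~ /\Bis\B/) ? 1 : 0;

# Inside [], \b is a literal backspace character (as it is in "a\bz").
my $backspace = ("a\bz" =~ /a[\b]z/) ? 1 : 0;

print join(",", @pieces), " ", join(",", @keyval), "\n";
print "$whole_word $inside_word $nonboundary $backspace\n";
```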
When used with the /g
modifier,
the pos
function allows you to read or set the
offset where the next progressive match will start:
    $burglar = "Bilbo Baggins";
    while ($burglar =~ /b/gi) {
        printf "Found a B at %d\n", pos($burglar)-1;
    }
(We subtract one from the position because that was the length
of the string we were looking for, and pos
is
always the position just past the match.)
The code above prints:
    Found a B at 0
    Found a B at 3
    Found a B at 6
After a failure, the match position normally resets
back to the start. If you also apply the /c
(for
"continue") modifier, then when the /g
runs out,
the failed match doesn't reset the position pointer. This lets you
continue your search past that point without starting over at the
very beginning.
    $burglar = "Bilbo Baggins";
    while ($burglar =~ /b/gci) {        # ADD /c
        printf "Found a B at %d\n", pos($burglar)-1;
    }
    while ($burglar =~ /i/gi) {
        printf "Found an I at %d\n", pos($burglar)-1;
    }
Besides the three B's it found earlier, Perl now reports finding an i at position 10. Without the /c, the second loop's match would have restarted from the beginning and found another i at position 1 first.
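Since pos can be set as well as read, you can also choose where a progressive match begins. The sketch below, with variable names of our own choosing, starts the search partway into the string and then checks what happens to the position after a failed match, with and without /c.

```perl
use strict;
use warnings;

my $burglar = "Bilbo Baggins";

# Set the starting offset by hand; the next /g match begins there.
pos($burglar) = 5;
my @found;
while ($burglar =~ /b/gi) {
    push @found, pos($burglar) - 1;
}

# Without /c, the final failed match reset the position.
my $after_fail = defined(pos($burglar)) ? "defined" : "undef";

# With /c, the position survives the failed match.
pos($burglar) = 5;
1 while $burglar =~ /b/gci;
my $kept = pos($burglar);

print "found at: @found; after failure: $after_fail; kept: $kept\n";
```

Starting at offset 5 skips the first two B's, so only the one at offset 6 is found. With /c, the position stays just past that last successful match.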
Whenever you start thinking in terms of the pos function, it's tempting to start carving your string up with substr, but this is rarely the right thing to do. More often, if you started with pattern matching, you should continue with pattern matching. However, if you're looking for a positional assertion, you're probably looking for \G.
The \G assertion represents within the pattern the same point that pos represents outside of it. When you're progressively matching a string with the /g modifier (or you've used the pos function to directly select the starting point), you can use \G to specify the position just after the previous match. That is, it matches the location immediately before whatever character would be identified by pos. This allows you to remember where you left off:
    ($recipe = <<'DISH') =~ s/^\s+//gm;
        Preheat oven to 451 deg. fahrenheit.
        Mix 1 ml. dilithium with 3 oz. NaCl and
        stir in 4 anchovies.  Glaze with 1 g.
        mercury.  Heat for 4 hours and let cool
        for 3 seconds.  Serves 10 aliens.
    DISH

    $recipe =~ /\d+ /g;
    $recipe =~ /\G(\w+)/;           # $1 is now "deg"
    $recipe =~ /\d+ /g;
    $recipe =~ /\G(\w+)/;           # $1 is now "ml"
    $recipe =~ /\d+ /g;
    $recipe =~ /\G(\w+)/;           # $1 is now "oz"
The \G metasymbol is often used in a loop, as we demonstrate in our next example. We "pause" after every digit sequence, and at that position, we test whether there's an abbreviation. If so, we grab the next two words. Otherwise, we just grab the next word:
    pos($recipe) = 0;                 # Just to be safe, reset \G to 0
    while ( $recipe =~ /(\d+) /g ) {
        my $amount = $1;
        if ($recipe =~ / \G (\w{0,3}) \. \s+ (\w+) /x) {  # abbrev. + word
            print "$amount $1 of $2\n";
        }
        else {
            $recipe =~ / \G (\w+) /x;                      # just a word
            print "$amount $1\n";
        }
    }
That produces:
    451 deg of fahrenheit
    1 ml of dilithium
    3 oz of NaCl
    4 anchovies
    1 g of mercury
    4 hours
    3 seconds
    10 aliens
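Combining \G with the /g and /c modifiers gives you a common way to walk through a string one piece at a time. The sketch below is one possible arrangement, not the only one. The tokenize function and its NUM, WORD, and PUNCT labels are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: each pattern is anchored with \G so it can
# match only where the previous one stopped, and /c keeps a failed
# alternative from resetting the position.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    while (pos($src) < length($src)) {
        if    ($src =~ /\G\s+/gc)   { next }                       # skip whitespace
        elsif ($src =~ /\G(\d+)/gc) { push @tokens, "NUM:$1" }
        elsif ($src =~ /\G(\w+)/gc) { push @tokens, "WORD:$1" }
        elsif ($src =~ /\G(.)/gcs)  { push @tokens, "PUNCT:$1" }   # anything else
    }
    return @tokens;
}

print join(" ", tokenize("Mix 1 ml. dilithium with 3 oz. NaCl")), "\n";
```

Because the last alternative accepts any single character, every pass through the loop advances the position, so the loop always terminates.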
[8] Or you've set the deprecated $* variable to 1 and you're not overriding $* with the /s modifier.