Now that we’ve seen what goes inside a regular expression, let’s take what we’ve learned back into Perl.
We’ve been writing patterns in pairs of forward slashes, like /fred/. But this is actually a shortcut for the m// (pattern match) operator. As we saw with the qw// operator, you may choose any pair of delimiters to quote the contents. So, we could write that same expression as m(fred), m<fred>, m{fred}, or m[fred] using those paired delimiters, or as m,fred,, m!fred!, m^fred^, or many other ways using nonpaired delimiters.[1]
The shortcut is that if you choose the forward slash as the delimiter, you may omit the initial m. Since Perl folks love to avoid typing extra characters, you’ll see most pattern matches written using slashes, as in /fred/.
Of course, you should wisely choose a delimiter that doesn’t appear in your pattern.[2] If you wanted to make a pattern to match the beginning of an ordinary web URL, you might start to write /^http:\/\// to match the initial "http://". But that’s easier to read, write, maintain, and debug if you use a better choice of delimiter: m%^http://%.[3]
It’s common to use curly braces as the delimiter. If you use a programmers’ text editor, it probably has the ability to jump from an opening curly brace to the corresponding closing one, which can be handy in maintaining code.
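As a minimal sketch of the point about delimiters (the sample string and the variable names are assumptions made up for illustration), the following tries the same URL pattern with three different delimiters. All three tests should agree; only the amount of backslashing differs:

```perl
use strict;
use warnings;

# Hypothetical sample string, chosen only for illustration.
$_ = "http://www.example.com/index.html";

# The same pattern, quoted three ways. Each test checks $_.
my $with_slashes = /^http:\/\// ? "yes" : "no";  # slashes in the pattern need backslashes
my $with_percent = m%^http://%  ? "yes" : "no";  # nonpaired delimiter, nothing to escape
my $with_braces  = m{^http://}  ? "yes" : "no";  # paired delimiter, nothing to escape

print "slashes: $with_slashes, percent: $with_percent, braces: $with_braces\n";
```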
There are several option modifier letters, sometimes called flags, which may be appended as a group right after the ending delimiter of a regular expression to change its behavior from the default.
To make a case-insensitive pattern match, so that you can match FRED as easily as fred or Fred, use the /i modifier:
print "Would you like to play a game? ";
chomp($_ = <STDIN>);
if (/yes/i) {  # case-insensitive match
  print "In that case, I recommend that you go bowling.\n";
}
Do you ever feel frustrated that the dot (.) won’t match newline? If you might have newlines in your strings, and you want the dot to be able to match them, the /s modifier will do the job. It changes every dot[4] in the pattern to act like the character class [\d\D] does, which is to match any character, even if it is a newline. Of course, you have to have a string with newlines for this to make a difference:
$_ = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";
if (/Barney.*Fred/s) {
  print "That string mentions Fred after Barney!\n";
}
Without the /s modifier, that match would fail, since the two names aren’t on the same line.
If you have more than one option modifier to use on the same pattern, they may be used one after the other; their order isn’t significant:
if (/barney.*fred/si) {  # both /s and /i
  print "That string mentions Fred after Barney!\n";
}
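To see what each modifier contributes, here is a small sketch that runs the same lowercase pattern four ways against one string. The sample string and the variable names are assumptions for illustration only; the names in the string are capitalized and sit on different lines, so the match needs both modifiers:

```perl
use strict;
use warnings;

# Hypothetical sample: the two names are capitalized and on different lines.
$_ = "I saw Barney\nwith Fred.\n";

my $neither = /barney.*fred/   ? "match" : "no match";  # wrong case, and dot stops at newline
my $only_i  = /barney.*fred/i  ? "match" : "no match";  # case is fine, dot still stops at newline
my $only_s  = /barney.*fred/s  ? "match" : "no match";  # dot crosses newline, case still wrong
my $both    = /barney.*fred/si ? "match" : "no match";  # both obstacles removed

print "no modifiers: $neither\n";
print "i only: $only_i\n";
print "s only: $only_s\n";
print "s and i: $both\n";
```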
There are many other option modifiers available. We’ll cover those as we get to them, or you can read about them in the perlop manpage and in the descriptions of m// and the other regular expression operators that we’ll see later in this chapter.
Matching against $_ is merely the default; the binding operator (=~) tells Perl to match the pattern on the right against the string on the left, instead of matching against $_.[5] For example:
my $some_other = "I dream of betty rubble.";
if ($some_other =~ /rub/) {
  print "Aye, there's the rub.\n";
}
The first time you see it, the binding operator looks like some kind of assignment operator. But it’s no such thing! It is simply saying, “this pattern match, which would attach to $_ by default, should work with this string on the left instead.” If there’s no binding operator, the expression is using $_ by default.
In the (somewhat unusual) example below, $likes_perl is set to a Boolean value according to what the user typed at the prompt. This is a little on the quick-and-dirty side, because the line of input itself is discarded. This code reads the line of input, tests that string against the pattern, then discards the line of input.[6] It doesn’t use or change $_ at all.
print "Do you like Perl? ";
my $likes_perl = (<STDIN> =~ /yes/i);
...  # Time passes...
if ($likes_perl) {
  print "You said earlier that you like Perl, so...\n";
  ...
}
The parentheses around the pattern-test expression aren’t required, so the following line does the same thing as the one above—it stores the result of the test (and not the line of input) into the variable:
my $likes_perl = <STDIN> =~ /yes/i;
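The following sketch (with made-up values and variable names) puts a bound match and a default match side by side. It is meant only to illustrate that the binding operator picks the target for one match and leaves $_ alone:

```perl
use strict;
use warnings;

$_ = "fred";                                   # the default target
my $some_other = "I dream of betty rubble.";

my $bound   = ($some_other =~ /rub/) ? 1 : 0;  # tests $some_other
my $default = /rub/                  ? 1 : 0;  # no binding operator, so this tests $_

print "bound: $bound, default: $default, \$_ is still '$_'\n";
```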
The regular expression is double-quote interpolated, just as if it were a double-quoted string. This allows us to write a quick grep-like program like this:
#!/usr/bin/perl -w
my $what = "larry";
while (<>) {
  if (/^($what)/) {  # pattern is anchored at beginning of string
    print "We saw $what in beginning of $_";
  }
}
The pattern will be built up out of whatever’s in $what when we run the pattern match. In this case, it’s the same as if we had written /^(larry)/, looking for larry at the start of each line.
But we didn’t have to get the value of $what from a literal string; we could have gotten it instead from the command-line arguments in @ARGV:
my $what = shift @ARGV;
Now, if the first command-line argument were fred|barney, the pattern becomes /^(fred|barney)/, looking for fred or barney at the start of each line.[7] The parentheses (which weren’t really necessary when searching for larry) are important now, because without them we’d be matching fred at the start or barney anywhere in the string.
With that line changed to get the pattern from @ARGV, this program resembles the Unix grep command. But we have to watch out for metacharacters in the string. If $what contains 'fred(barney', the pattern would look like /^(fred(barney)/, and you know that can’t work right; it’ll crash your program with an invalid regular expression error. With some advanced techniques,[8] you can trap this kind of error (or prevent the magic of the metacharacters in the first place) so that it won’t crash your program. But for now, just know that if you give your users the power of regular expressions, they’ll also need the responsibility to use them correctly.
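For the curious, here is a rough sketch of the two advanced techniques that footnote [8] names: an eval block to trap the error, and \Q (the inline form of quotemeta) to switch off the metacharacters. The input values and the variable names are assumptions for illustration, and this is only one way of arranging the code:

```perl
use strict;
use warnings;

my $what = 'fred(barney';              # hypothetical user input with an unbalanced parenthesis
my $line = "fred(barney was here\n";   # hypothetical line of input

# Technique 1: trap the invalid-pattern error with an eval block.
# If the pattern can't be compiled, the eval gives up early and returns undef,
# leaving the error message in $@ instead of killing the program.
my $ok      = eval { $line =~ /^($what)/; 1 };
my $trapped = $ok ? 0 : 1;

# Technique 2: \Q...\E backslashes the metacharacters in $what,
# so its contents are matched as literal text.
my $literal;
if ($line =~ /^(\Q$what\E)/) {
  $literal = $1;
}

print "error trapped: $trapped\n";
print "literal match: $literal\n" if defined $literal;
```

Which one you want depends on the program: eval keeps the full power of regular expressions for the user, while \Q turns the program into a plain fixed-string search.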
Do you remember the regular expression memories, which we used with backreferences in the previous chapter? Those memories are also available after the pattern match is done, after we return to Perl. They’re strings, so they are kept in scalar variables with names like $1 and $2. There are as many of these variables as there are pairs of memory parentheses in the pattern. As you’d expect, $4 means the string matched by the fourth set of parentheses. This is the same string that \4 referred to inside the pattern match.
Why are there two different ways to refer to that same string? They’re not really referring to the same string at the same time; $4 means the fourth memory of an already completed pattern match, while \4 is a backreference referring back to the fourth memory of the currently matching regular expression. Besides, backreferences work inside regular expressions only; once we’re back in the world of Perl, we’ll use $4.
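A short sketch may make the distinction concrete. Here the backreference \1 does its work while the pattern is still matching, and $1 reads the same memory once the match is over. The sample string and the variable name $doubled are assumptions for illustration:

```perl
use strict;
use warnings;

$_ = "yabba dabba doo";

my $doubled;
if (/(.)\1/) {      # \1: inside the pattern, "the same character again"
  $doubled = $1;    # $1: back in Perl, after the match has completed
  print "The first doubled character is '$doubled'\n";
}
```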
These match variables are a big part of the power of regular expressions, because they let us pull out the parts of a string:
$_ = "Hello there, neighbor";
if (/\s(\w+),/) {             # memorize the word between space and comma
  print "the word was $1\n";  # the word was there
}
Or you could use more than one memory at once:
$_ = "Hello there, neighbor";
if (/(\S+) (\S+), (\S+)/) {
  print "words were $1 $2 $3\n";
}
That tells us that the words were Hello there neighbor. Notice that there’s no comma in the output. Because the comma is outside of the memory parentheses in the pattern, there is no comma in memory two. Using this technique, we can choose exactly what we want in the memories, as well as what we want to leave out.
You could even have an empty match variable,[9] if that part of the pattern might be empty. That is, a match variable may contain the empty string:
my $dino = "I fear that I'll be extinct after 1000 years.";
if ($dino =~ /(\d*) years/) {
  print "That said '$1' years.\n";  # 1000
}
$dino = "I fear that I'll be extinct after a few million years.";
if ($dino =~ /(\d*) years/) {
  print "That said '$1' years.\n";  # empty string
}
These match variables generally stay around until the next successful pattern match.[10] That is, an unsuccessful match leaves the previous memories intact, but a successful one resets them all. This correctly implies that you shouldn’t use these match variables unless the match succeeded; otherwise, you could be seeing a memory from some previous pattern. The following (bad) example is supposed to print a word matched from $wilma. But if the match fails, it’s using whatever leftover string happens to be found in $1:
$wilma =~ /(\w+)/;  # BAD! Untested match result
print "Wilma's word was $1... or was it?\n";
This is another reason that a pattern match is almost always found in the conditional expression of an if or while:
if ($wilma =~ /(\w+)/) {
  print "Wilma's word was $1.\n";
} else {
  print "Wilma doesn't have a word.\n";
}
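To watch a leftover memory in action, consider this sketch. The variable names and values are invented for illustration; the second match is deliberately set up to fail, so $1 still holds the word from the first one:

```perl
use strict;
use warnings;

my $betty = "rubble";
my $wilma = "!!!";      # hypothetical value that contains no word characters

$betty =~ /(\w+)/;      # succeeds, so $1 is now "rubble"

my $matched  = ($wilma =~ /(\w+)/) ? 1 : 0;  # fails, so $1 is left alone
my $leftover = $1;      # still "rubble", which has nothing to do with $wilma

print "match on \$wilma: $matched, but \$1 is '$leftover'\n";
```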
Since these memories don’t stay around forever, you shouldn’t use a match variable like $1 more than a few lines after its pattern match. If your maintenance programmer adds a new regular expression between your regular expression and your use of $1, you’ll be getting the value of $1 for the second match, rather than the first. For this reason, if you need a memory for more than a few lines, it’s generally best to copy it into an ordinary variable. Doing this helps make the code more readable at the same time:
if ($wilma =~ /(\w+)/) {
  my $wilma_word = $1;
  ...
}
Later, in Chapter 14, we’ll see how to get the memory value directly into the variable at the same time as the pattern match happens, without having to use $1 explicitly.
There are three more match variables that you get for free,[11] whether the pattern has memory parentheses or not. That’s the good news; the bad news is that these variables have weird names.
Now, Larry probably would have been happy enough to call these by slightly-less-weird names, like perhaps $gazoo or $ozmodiar. But those are names that you just might want to use in your own code. To keep ordinary Perl programmers from having to memorize the names of all of Perl’s special variables before choosing their first variable names in their first programs,[12] Larry has given strange names to many of Perl’s builtin variables, names that “break the rules.” In this case, the names are punctuation marks: $&, $`, and $'. They’re strange, ugly, and weird, but those are their names.[13]
The part of the string that actually matched the pattern is automatically stored in $&:
if ("Hello there, neighbor" =~ /\s(\w+),/) {
  print "That actually matched '$&'.\n";
}
That tells us that the part that matched was " there," (with a space, a word, and a comma). Memory one, in $1, has just the five-letter word there, but $& has the entire matched section.
Whatever came before the matched section is in $`, and whatever was after it is in $'. Another way to say that is that $` holds whatever the regular expression engine had to skip over before it found the match, and $' has the remainder of the string that the pattern never got to. If you glue these three strings together in order, you’ll always get back the original string:
if ("Hello there, neighbor" =~ /\s(\w+),/) {
  print "That was ($`)($&)($').\n";
}
The message shows the string as (Hello)( there,)( neighbor), showing the three automatic match variables in action. This may seem familiar, and for good reason: these automatic memory variables are what the pattern test program (from Chapter 7) was using in its line of “mystery” code, to show what part of the string was being matched by the pattern:
print "Matched: |$`<$&>$'|\n";  # The three automatic match variables
Any or all of these three automatic match variables may be empty, of course, just like the numbered match variables. And they have the same scope as the numbered match variables. Generally, that means that they’ll stay around until the next successful pattern match.
Now, we said earlier that these three are “free.” Well, freedom has its price. In this case, the price is that once you use any one of these automatic match variables anywhere in your entire program, other regular expressions will run a little more slowly. Now, this isn’t a giant slowdown, but it’s enough of a worry that many Perl programmers will simply never use these automatic match variables.[14] Instead, they’ll use a workaround. For example, if the only one you need is $&, just put parentheses around the whole pattern and use $1 instead (you may need to renumber the pattern’s memories, of course).
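Here is what that workaround might look like for the pattern used earlier in this section, as a sketch only. The outer parentheses take over the job of $&, and the word that used to be in $1 moves to $2; the variable names $whole and $word are made up for the example:

```perl
use strict;
use warnings;

my $text = "Hello there, neighbor";
my ($whole, $word);

# Outer parentheses stand in for $&; the original memory is renumbered from $1 to $2.
if ($text =~ /(\s(\w+),)/) {
  ($whole, $word) = ($1, $2);
  print "matched '$whole', word was '$word'\n";
}
```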
Match variables (both the automatic ones and the numbered ones) are most often used in substitutions, which are the topic of the next section.
If you think of the m// pattern match as being like your word processor’s “search” feature, the “search and replace” feature would have to be Perl’s s/// substitution operator. This simply replaces whatever part of a variable[15] matches a pattern with a replacement string:
$_ = "He's out bowling with Barney tonight.";
s/Barney/Fred/;  # Replace Barney with Fred
print "$_\n";
If the match fails, nothing happens, and the variable is untouched:
# Continuing from above; $_ has "He's out bowling with Fred tonight."
s/Wilma/Betty/;  # Replace Wilma with Betty (fails)
Of course, both the pattern and the replacement string could be more complex. Here, the replacement string uses the first memory variable, which is set by the pattern match:
s/with (\w+)/against $1/;
print "$_\n";  # says "He's out bowling against Fred tonight."
Here are some other possible substitutions. (These are here only as samples; in the real world, it would not be typical to do so many unrelated substitutions in a row.)
$_ = "green scaly dinosaur";
s/(\w+) (\w+)/$2, $1/;  # Now it's "scaly, green dinosaur"
s/^/huge, /;            # Now it's "huge, scaly, green dinosaur"
s/,.*een//;             # Empty replacement: Now it's "huge dinosaur"
s/green/red/;           # Failed match: still "huge dinosaur"
s/\w+$/($`!)$&/;        # Now it's "huge (huge !)dinosaur"
s/\s+(!\W+)/$1 /;       # Now it's "huge (huge!) dinosaur"
s/huge/gigantic/;       # Now it's "gigantic (huge!) dinosaur"
There’s a return value from s///; it’s true if a substitution was successful; otherwise it’s false:
$_ = "fred flintstone";
if (s/fred/wilma/) {
  print "Successfully replaced fred with wilma!\n";
}
As you may have noticed in a previous example, s/// will make just one replacement, even if others are possible. Of course, that’s just the default. The /g modifier tells s/// to make all possible nonoverlapping[16] replacements:
$_ = "home, sweet home!";
s/home/cave/g;
print "$_\n";  # "cave, sweet cave!"
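The sketch below compares the default with /g on the same string. It also relies on a detail that goes slightly beyond the true-or-false description above: in standard Perl, the true value that s/// returns is the number of replacements it made. The variable names are assumptions for illustration:

```perl
use strict;
use warnings;

$_ = "home, sweet home!";
my $once       = s/home/cave/;    # default: only the leftmost match is replaced
my $after_once = $_;

$_ = "home, sweet home!";
my $count     = s/home/cave/g;    # /g: the return value counts the replacements
my $after_all = $_;

print "without /g: $once replacement, giving '$after_once'\n";
print "with /g: $count replacements, giving '$after_all'\n";
```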
A fairly common use of a global replacement is to collapse whitespace; that is, to turn any arbitrary whitespace into a single space:
$_ = "Input  data\t may have    extra whitespace.";
s/\s+/ /g;  # Now it says "Input data may have extra whitespace."
Once we show collapsing whitespace, everyone wants to know about stripping leading and trailing whitespace. That’s easy enough, in two steps:[17]
s/^\s+//;  # Replace leading whitespace with nothing
s/\s+$//;  # Replace trailing whitespace with nothing
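Putting the collapsing and the stripping together, a cleanup of one messy line might look like the following sketch. The input string is a made-up example, and the brackets in the output are there only to make the edges of the string visible:

```perl
use strict;
use warnings;

# Hypothetical messy input: leading tab, runs of spaces, trailing newline.
$_ = "\t  Input  data\t may have    extra whitespace.  \n";

s/\s+/ /g;    # collapse every run of whitespace to one space
s/^\s+//;     # strip the single leading space that is left
s/\s+$//;     # strip the single trailing space that is left

print "[$_]\n";
```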
Just as we did with m// and qw//, we can change the delimiters for s///. But the substitution uses three delimiter characters, so things are a little different.
With ordinary (non-paired) characters, which don’t have a left and right variety, just use three of them, as we did with the forward slash. Here, we’ve chosen the pound sign[18] as the delimiter:
s#^https://#http://#;
But if you use paired characters, which have a left and right variety, you have to use two pairs: one to hold the pattern and one to hold the replacement string. In this case, the delimiters don’t have to be the same kind around the string as they are around the pattern. In fact, the delimiters of the string could even be non-paired. These are all the same:
s{fred}{barney};
s[fred](barney);
s<fred>#barney#;
In addition to the /g modifier,[19] substitutions may use the /i and /s modifiers that we saw in ordinary pattern matching. The order of modifiers isn’t significant.
s#wilma#Wilma#gi;  # replace every WiLmA or WILMA with Wilma
s{__END__.*}{}s;   # chop off the end marker and all following lines
Just as we saw with m//, we can choose a different target for s/// by using the binding operator:
$file_name =~ s#^.*/##s; # In $file_name, remove any Unix-style path
It often happens in a substitution that you’ll want to make sure that a replacement word is properly capitalized (or not, as the case may be). That’s easy to accomplish with Perl, by using some backslash escapes. The \U escape forces what follows to all uppercase:
$_ = "I saw Barney with Fred.";
s/(fred|barney)/\U$1/gi;  # $_ is now "I saw BARNEY with FRED."
Similarly, the \L escape forces lowercase. Continuing from the previous code:
s/(fred|barney)/\L$1/gi;  # $_ is now "I saw barney with fred."
By default, these affect the rest of the (replacement) string; or you can turn off case shifting with \E:
s/(\w+) with (\w+)/\U$2\E with $1/i;  # $_ is now "I saw FRED with barney."
When written in lowercase (\l and \u), they affect only the next character:
s/(fred|barney)/\u$1/ig;  # $_ is now "I saw FRED with Barney."
You can even stack them up. Using \u with \L means “all lower case, but capitalize the first letter”:[20]
s/(fred|barney)/\u\L$1/ig;  # $_ is now "I saw Fred with Barney."
As it happens, although we’re covering case shifting in relation to substitutions, it’s available in any double-quotish string:
print "Hello, \L\u$name\E, would you like to play a game?\n";
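As a quick sketch of these escapes at work in ordinary double-quoted strings, with no substitution involved, consider the following. The value of $name is deliberately mis-capitalized, and the variable names are assumptions for the example:

```perl
use strict;
use warnings;

my $name = "fRED";    # hypothetical, deliberately mis-capitalized

my $upper  = "\U$name\E rocks";      # uppercase until \E
my $lower  = "\L$name\E rocks";      # lowercase until \E
my $first  = "\u$name";              # only the next character is affected
my $proper = "\u\L$name\E rocks";    # all lowercase, then capitalize the first letter

print "$upper\n$lower\n$first\n$proper\n";
```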
Another operator that uses regular expressions is split, which breaks up a string according to a separator. This is useful for tab-separated data, or colon-separated, whitespace-separated, or anything-separated data, really.[21] So long as you can specify the separator with a regular expression (and generally, it’s a simple regular expression), you can use split. It looks like this:
@fields = split /separator/, $string;
The split operator[22] drags the pattern through a string and returns a list of fields (substrings) that were separated by the separators. Whenever the pattern matches, that’s the end of one field and the start of the next. So, anything that matches the pattern will never show up in the returned fields. Here’s a typical split pattern, splitting on colons:
@fields = split /:/, "abc:def:g:h"; # gives ("abc", "def", "g", "h")
You could even have an empty field, if there were two delimiters together:
@fields = split /:/, "abc:def::g:h"; # gives ("abc", "def", "", "g", "h")
Here’s a rule that seems odd at first, but it rarely causes problems: Leading empty fields are always returned, but trailing empty fields are discarded:[23]
@fields = split /:/, ":::a:b:c:::"; # gives ("", "", "", "a", "b", "c")
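Footnote [23] mentions that a third argument of -1 keeps the trailing empty fields. A minimal sketch of the difference, using the same string as above (the variable names are made up for the example):

```perl
use strict;
use warnings;

my $record = ":::a:b:c:::";

my @default = split /:/, $record;        # trailing empty fields are discarded
my @all     = split /:/, $record, -1;    # a negative limit keeps them

print scalar(@default), " fields by default, ", scalar(@all), " with -1\n";
```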
It’s also common to split on whitespace, using /\s+/ as the pattern. Under that pattern, all whitespace runs are equivalent to a single space:
my $some_input = "This  is a \t        test.\n";
my @args = split /\s+/, $some_input;  # ("This", "is", "a", "test.")
The default for split is to break up $_ on whitespace:
my @fields = split;  # like split /\s+/, $_;
This is almost the same as using /\s+/ as the pattern, except that a leading empty field is suppressed; so, if the line starts with whitespace, you won’t see an empty field at the start of the list. (If you’d like to get the same behavior when splitting another string on whitespace, just use a single space in place of the pattern: split ' ', $other_string. Using a space instead of the pattern is a special kind of split.)
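The difference shows up only when the string starts with whitespace. In this sketch (the sample line and the variable names are invented for illustration), the /\s+/ pattern produces a leading empty field and the single-space form does not:

```perl
use strict;
use warnings;

my $line = "  fred barney\tbetty\n";    # hypothetical line with leading whitespace

my @by_pattern = split /\s+/, $line;    # leading empty field is kept
my @by_space   = split ' ', $line;      # special case: leading empty field is suppressed

print scalar(@by_pattern), " fields with /\\s+/, ", scalar(@by_space), " fields with ' '\n";
```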
Generally, the patterns used for split are as simple as the ones you see here. But if the pattern becomes more complex, be sure to avoid using memory parentheses in the pattern; see the perlfunc manpage for more information.[24]
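The reason for that warning, as documented in perlfunc, is that whatever the memory parentheses match gets returned as extra fields. Here is a brief sketch with a made-up input string and made-up variable names; the third split uses the grouping-only parentheses that footnote [24] points to:

```perl
use strict;
use warnings;

my $list = "fred, barney;wilma";    # hypothetical data with two kinds of separator

my @plain    = split /\s*[,;]\s*/, $list;      # just the three names
my @captured = split /\s*([,;])\s*/, $list;    # separators come back as fields too
my @grouped  = split /\s*(?:,|;)\s*/, $list;   # (?:...) groups without remembering

print "plain:    @plain\n";
print "captured: @captured\n";
print "grouped:  @grouped\n";
```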
The join function doesn’t use patterns. So why is it in this chapter? It’s here because, in a sense, join performs the opposite function of split: split breaks up a string into a number of pieces, and join glues together a bunch of pieces to make a single string. The join function looks like this:
my $result = join $glue, @pieces;
The first argument to join is the glue, which may be any string. The remaining arguments are a list of pieces. join puts the glue string between the pieces and returns the resulting string:
my $x = join ":", 4, 6, 8, 10, 12; # $x is "4:6:8:10:12"
In that example, we had five items, so there are only four colons. That is, there are four pieces of glue. The glue shows up only between the pieces, never before or after them. So, there will be one fewer piece of glue than the number of items in the list.
This means that there may be no glue at all, if the list doesn’t have at least two elements:
my $y = join "foo", "bar";       # gives just "bar", since no fooglue is needed
my @empty;                       # empty array
my $empty = join "baz", @empty;  # no items, so it's an empty string
Using $x from above, we can break up a string and put it back together with a different delimiter:
my @values = split /:/, $x;  # @values is (4, 6, 8, 10, 12)
my $z = join "-", @values;   # $z is "4-6-8-10-12"
Although split and join work well together, don’t forget that the first argument to join is always a string, not a pattern.
See Section A.8 for answers to the following exercises:
[7] Make a pattern that will match three consecutive copies of whatever is currently contained in $what. That is, if $what is fred, your pattern should match fredfredfred. If $what is fred|barney, your pattern should match fredfredbarney or barneyfredfred or barneybarneybarney or many other variations. (Hint: You should set $what at the top of the pattern test program with a statement like my $what = 'fred|barney';.)
[15] Write a program that looks through the perlfunc.pod file for lines that start with =item and some whitespace, followed by a Perl identifier name (made of letters, digits, and underscores, but never starting with a digit), like the lines below. (There may be more text on the line after the identifier name; just ignore it.) You can locate the perlfunc.pod file on your system with the command perldoc -l perlfunc, or ask your local expert. (Hint: You’ll need the diamond operator to open this file. How will it get the filename?) Have the program print each identifier name as it finds it; there will be hundreds of them, and many will appear more than once in the file.
As an example, the following lines of input resemble what you’ll find in perlfunc.pod. For the first line, the program should print wilma. For the second, it should print fred (ignoring the word flintstone, since we’re interested only in the identifier name):
=item wilma
=item fred flintstone
[10] Modify the previous program to list only the identifier names that appear more than twice on those =item lines, and tell how many times each one appeared. (That is, we want to know which identifier names appear on at least three separate =item lines in the file.) There should be a couple of dozen, depending upon your version of Perl.
[1] Nonpaired delimiters are the ones that don’t have a different “left” and “right” variety; the same punctuation mark is used for both ends.
[2] If you’re using paired delimiters, you shouldn’t generally have to worry about using the delimiter inside the pattern, since that delimiter will generally be paired inside your pattern. That is, m(fred(.*)barney) and m{\w{2,}} and m[wilma[\n \t]+betty] are all fine, even though the pattern contains the quoting character, since each “left” has a corresponding “right”. But the angle brackets (“<” and “>”) aren’t regular expression metacharacters, so they may not be paired; if the pattern were m{(\d+)\s*>=?\s*(\d+)}, quoting it with angle brackets would mean having to backslash the greater-than sign so that it wouldn’t prematurely end the pattern.
[3] Remember, the forward slash is not a metacharacter, so it doesn’t need to be backslashed when it’s not the delimiter.
[4] If you wish to change just some of them, and not all, you’ll probably want to replace just those few with [\d\D].
[5] The binding operator is also used with some other operations besides the pattern match, as we’ll see later.
[6] Remember, the line of input is not automatically stored into $_ unless the line-input operator (<STDIN>) is all alone in the conditional expression of a while loop.
[7] The astute reader will know that you can’t generally type fred|barney as an argument at the command line because the vertical bar is a shell metacharacter. See the documentation to your shell to learn about how to quote command-line arguments.
[8] In this case, you would use an eval block to trap the error, or you would quote the interpolated text using quotemeta (or its \Q equivalent form) so that it’s no longer treated as a regular expression.
[9] As opposed to an undefined one. If you have three or fewer sets of parentheses in the pattern, $4 will be undef.
[10] The actual scoping rule is much more complex (see the documentation if you need it), but as long as you don’t expect the match variables to be untouched many lines after a pattern match, you shouldn’t have problems.
[11] Yeah, right. There’s no such thing as a free match. These are “free” only in the sense that they don’t require match parentheses. Don’t worry; we’ll mention their real cost a little later, though.
[12] You should still avoid a few classical variable names like $ARGV, but these few are all in all-caps. All of Perl’s builtin variables are documented in the perlvar manpage.
[13] If you really can’t stand these names, check out the English module, which attempts to give all of Perl’s strangest variables nearly normal names. But the use of this module has never really caught on; instead, Perl programmers have grown to love the punctuation-mark variable names, strange as they are.
[14] Most of these folks haven’t actually benchmarked their programs to see whether their workarounds actually save time, though; it’s as though these variables were poisonous or something. But we can’t blame them for not benchmarking—many programs that could benefit from these three variables take up only a few minutes of CPU time in a week, so benchmarking and optimizing would be a waste of time. But in that case, why fear a possible extra millisecond? By the way, the Perl developers are working on this problem, but there will probably be no solution before Perl 6.
[15] Unlike m//, which can match against any string expression, s/// is modifying data that must therefore be contained in what’s known as an lvalue. This is nearly always a variable, although it could actually be anything that could be used on the left side of an assignment operator.
[16] It’s nonoverlapping because each new match starts looking just beyond the latest replacement.
[17] It could be done in one step, but this way is better.
[18] With apologies to our British friends, to whom the pound sign is something else! Although the pound sign is generally the start of a comment in Perl, it won’t start a comment when the parser knows to expect a delimiter (in this case, immediately after the s that starts the substitution).
[19] We speak of the modifiers with names like “/i”, even if the delimiter is something different than a slash.
[20] The \L and \u may appear together in either order. Larry realized that people would sometimes get those two backwards, so he made Perl figure out that you want just the first letter capitalized and the rest lowercase. Larry is a pretty nice guy.
[21] Except “comma-separated values,” normally called CSV files. Those are a pain to do with split; you’re better off getting the Text::CSV module from CPAN.
[22] It’s an operator, even though it acts a lot like a function, and everyone generally calls it a function. But the technical details of the difference are beyond the scope of this book.
[23] This is merely the default. It’s this way for efficiency. If you worry about losing trailing empty fields, use -1 as a third argument to split and they’ll be kept; see the perlfunc manpage.
[24] And you might want to check out the nonmemory grouping-only parenthesis notation as well, in the perlre manpage.