Perl has many features that set it apart from other languages. Of all those features, one of the most important is its strong support for regular expressions. These allow fast, flexible, and reliable string handling.
But that power comes at a price. Regular expressions are actually tiny programs in their own special language, built inside Perl. (Yes, you’re about to learn another programming language![1] Fortunately it’s a simple one.) So for the next two chapters, we’ll be learning that language; then we’ll take what we’ve learned back to the world of Perl in Chapter 9.
Regular expressions aren’t merely part of Perl; they’re also found in sed and awk, procmail, grep, most programmers’ text editors like vi and emacs, and even in more esoteric places. If you’ve seen some of these already, you’re ahead of the game. Keep watching, and you’ll see many more tools that use or support regular expressions, such as search engines on the Web (often written in Perl), email clients, and others.
A regular expression, often called a pattern in Perl, is a template that either matches or doesn’t match a given string.[2] That is, there are an infinite number of possible text strings; a given pattern divides that infinite set into two groups: the ones that match, and the ones that don’t. There’s never any kinda-sorta-almost-up-to-here wishy-washy matching: either it matches or it doesn’t. A pattern may match just one possible string, or just two or three, or a dozen, or a hundred, or an infinite number. Or it may match all strings except for one, or except for some, or except for an infinite number.[3]
We already referred to regular expressions as being little programs in their own simple programming language. It’s a simple language because the programs have just one task: to look at a string and say “it matches” or “it doesn’t match”.[4] That’s all they do.
One of the places you’re likely to have seen regular expressions is in the Unix grep command, which prints out text lines matching a given pattern. For example, if you wanted to see which lines in a given file mention flint and, somewhere later on the same line, stone, you might do something like this, with the Unix grep command:
$ grep 'flint.*stone' some_file
a piece of flint, a stone which may be used to start a fire by striking
found obsidian, flint, granite, and small stones of basaltic rock, which
a flintlock rifle in poor condition. The sandstone mantle held several
Now, if you’ve used regular expressions somewhere else, that’s good, because you have a head start on these three chapters. But Perl’s regular expressions have somewhat different syntax than most other implementations; in fact, everybody’s regular expressions are a little different. So, if you needed to use a backslash to do something in another implementation, maybe you’ll need to leave it off in Perl, or maybe vice versa.
Don’t confuse regular expressions with shell filename-matching patterns, called globs. A typical glob is what you use when you type *.pm to the Unix shell to match all filenames that end in .pm. Globs use a lot of the same characters that we use in regular expressions, but those characters are used in totally different ways.[5] We’ll visit globs later, in Chapter 12, but for now try to put them totally out of your mind.
To compare a pattern (regular expression) to the contents of $_, simply put the pattern between a pair of forward slashes (/), like we do here:
    $_ = "yabba dabba doo";
    if (/abba/) {
        print "It matched!\n";
    }
The expression /abba/ looks for that four-letter string in $_; if it finds it, it returns a true value. In this case, it’s found more than once, but that doesn’t make any difference. If it’s found at all, it’s a match; if it’s not in there at all, it fails.
Because the pattern match is generally being used to return a true or false value, it is almost always found in the conditional expression of if or while.
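As a small sketch of a match used as a condition, the loop below tests each string against /abba/ by way of $_. The sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines, made up for this sketch.
my @lines = ("yabba dabba doo", "wilma", "abba gold");

my @matched;
for (@lines) {              # each element becomes $_ in turn
    if (/abba/) {           # the pattern is tested against $_
        push @matched, $_;
        print "It matched: $_\n";
    }
}
```

Only the strings that contain abba somewhere are printed; how many times it appears in a string makes no difference.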
All of the usual backslash escapes that you can put into double-quoted strings are available in patterns, so you could use the pattern /coke\tsprite/ to match the eleven characters of coke, a tab, and sprite.
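Here is a quick sketch of the tab escape at work. The two sample strings are our own; one has a real tab between the words and the other has an ordinary space.

```perl
use strict;
use warnings;

# "\t" inside double quotes is a real tab character.
my @samples = ("coke\tsprite", "coke sprite");

my @results;
for (@samples) {
    if (/coke\tsprite/) {       # \t in the pattern also means a tab
        push @results, "match";
    } else {
        push @results, "no match";
    }
}
print "@results\n";
```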
Of course, if patterns matched only simple literal strings, they wouldn’t be very useful. That’s why there are a number of special characters, called metacharacters, that have special meanings in regular expressions.
For example, the dot (.) is a wildcard character: it matches any single character except a newline (which is represented by "\n"). So, the pattern /bet.y/ would match betty. Or it would match betsy, or bet=y, or bet.y, or any other string that has bet, followed by any one character (except a newline), followed by y. It wouldn’t match bety or betsey, though, since those don’t have exactly one character between the t and the y. The dot always matches exactly one character.
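To check those claims, a short sketch can run /bet.y/ against each of the strings just mentioned and collect the ones that match.

```perl
use strict;
use warnings;

my @candidates = ("betty", "betsy", "bet=y", "bet.y", "bety", "betsey");

my @hits;
for (@candidates) {
    if (/bet.y/) {              # the dot stands for exactly one character
        push @hits, $_;
        print "$_: matches\n";
    } else {
        print "$_: no match\n";
    }
}
```

The first four strings match; bety and betsey do not, because neither has exactly one character between the t and the y.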
So, if you wanted to match a period in the string, you could use the dot. But that would match any possible character (except a newline), which might be more than you wanted. If you wanted the dot to match just a period, you can simply backslash it. In fact, that rule goes for all of Perl’s regular expression metacharacters: a backslash in front of any metacharacter makes it nonspecial. So, the pattern /3\.14159/ doesn’t have a wildcard character.
So the backslash is our second metacharacter. If you mean a real backslash, just use a pair of them—a rule that applies just as well everywhere else in Perl.
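The following sketch compares an escaped dot with an unescaped one, and then uses a pair of backslashes to look for a real backslash. The sample strings, including the Windows-style path, are invented for the example.

```perl
use strict;
use warnings;

my @samples = ("3.14159", "3x14159");

my (@escaped, @unescaped);
for (@samples) {
    # With the backslash, only a real period will do.
    if (/3\.14159/) { push @escaped, "yes" } else { push @escaped, "no" }
    # Without it, the dot is a wildcard and accepts the x as well.
    if (/3.14159/)  { push @unescaped, "yes" } else { push @unescaped, "no" }
}
print "escaped:   @escaped\n";
print "unescaped: @unescaped\n";

# Single quotes keep this backslash as a literal character in the string.
$_ = 'C:\temp';
my $has_backslash = "no";
if (/\\/) {                     # a pair of backslashes matches one real one
    $has_backslash = "yes";
}
print "backslash found: $has_backslash\n";
```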
It often happens that you’ll need to repeat something in a pattern. The star (*) means to match the preceding item zero or more times. So, /fred\t*barney/ matches any number of tab characters between fred and barney. That is, it matches "fred\tbarney" with one tab, or "fred\t\tbarney" with two tabs, or "fred\t\t\tbarney" with three tabs, or even "fredbarney" with nothing in between at all. That’s because the star means “zero or more,” so you could even have hundreds of tab characters in between, but nothing other than tabs. You may find it helpful to think of star as saying, “that previous thing, any number of times, even zero times” (because * is the “times” operator in multiplication).
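A brief sketch of the star, using sample strings of our own: three of them have only tabs (or nothing) between the names, and the last has a space.

```perl
use strict;
use warnings;

my @samples = ("fred\tbarney", "fred\t\t\tbarney", "fredbarney", "fred barney");

my @verdicts;
for (@samples) {
    if (/fred\t*barney/) {      # zero or more tabs between the names
        push @verdicts, "yes";
    } else {
        push @verdicts, "no";
    }
}
print "@verdicts\n";
```

The last string fails because a space is not a tab, and the star allows nothing but tabs in that position.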
What if you wanted to allow something besides tab characters? The dot matches any character[6], so .* will match any character, any number of times. That means that the pattern /fred.*barney/ matches “any old junk” between fred and barney. Any line that mentions fred and (somewhere later) barney will match that pattern. We often call .* the “any old junk” pattern, because it can match any old junk in your strings.
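As an illustration, this sketch tries /fred.*barney/ on a few invented lines. Note the last one, where barney comes before fred.

```perl
use strict;
use warnings;

my @samples = (
    "fred and barney",
    "fredbarney",
    "fred met wilma before barney",
    "barney and fred",
);

my @verdicts;
for (@samples) {
    if (/fred.*barney/) {       # fred, any old junk, then barney
        push @verdicts, "yes";
    } else {
        push @verdicts, "no";
    }
}
print "@verdicts\n";
```

The pattern insists on the order: fred first, barney somewhere later. The junk in between may be empty, which is why fredbarney still matches.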
The star is formally called a quantifier, meaning that it specifies a quantity of the preceding item. But it’s not the only quantifier; the plus (+) is another. The plus means to match the preceding item one or more times: /fred +barney/ matches if fred and barney are separated by spaces and only spaces. (The space is not a metacharacter.) This won’t match fredbarney, since the plus means that there must be one or more spaces between the two names, so at least one space is required. It may be helpful to think of the plus as saying, “that last thing, plus any number more of the same thing.”
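A sketch of the plus, again with made-up strings: one space, several spaces, no separator at all, and a tab.

```perl
use strict;
use warnings;

my @samples = ("fred barney", "fred    barney", "fredbarney", "fred\tbarney");

my @verdicts;
for (@samples) {
    if (/fred +barney/) {       # one or more spaces, and only spaces
        push @verdicts, "yes";
    } else {
        push @verdicts, "no";
    }
}
print "@verdicts\n";
```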
There’s a third quantifier like the star and plus, but more limited. It’s the question mark (?), which means that the preceding item is optional. That is, the preceding item may occur once or not at all. Like the other two quantifiers, the question mark means that the preceding item appears a certain number of times. It’s just that in this case the item may match one time (if it’s there) or zero times (if it’s not). There aren’t any other possibilities. So, /bamm-?bamm/ matches either spelling: bamm-bamm or bammbamm. This is easy to remember, since it’s saying “that last thing, maybe? Or maybe not?”
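The question mark can be sketched the same way. The third sample string, with two hyphens, is our own addition to show that “once or not at all” really does rule out twice.

```perl
use strict;
use warnings;

my @samples = ("bamm-bamm", "bammbamm", "bamm--bamm");

my @verdicts;
for (@samples) {
    if (/bamm-?bamm/) {         # the hyphen may appear once or not at all
        push @verdicts, "yes";
    } else {
        push @verdicts, "no";
    }
}
print "@verdicts\n";
```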
All three of these quantifiers must follow something, since they tell how many times the previous item may repeat.
As in mathematics, parentheses (“( )”) may be used for grouping. So, parentheses are also metacharacters. As an example, the pattern /fred+/ matches strings like freddddddddd, but strings like that don’t show up often in real life. But the pattern /(fred)+/ matches strings like fredfredfred, which is more likely to be what you wanted. And what about the pattern /(fred)*/? That matches strings like hello, world.[7]
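Since any string containing fred matches both /fred+/ and /(fred)+/, a yes-or-no test can’t show the difference between them. This sketch borrows the $& variable, which the pattern test program later in this chapter prints between angle brackets, to show which part of each string was matched. The sample strings are our own.

```perl
use strict;
use warnings;

my @report;
for ("fredddd", "fredfredfred") {
    # Without parentheses, the plus repeats only the letter d.
    if (/fred+/)   { push @report, "fred+ in $_: <$&>"; }
    # With parentheses, the plus repeats the whole group.
    if (/(fred)+/) { push @report, "(fred)+ in $_: <$&>"; }
}

# Zero repetitions is a successful match, so this matches the empty string.
$_ = "hello, world";
if (/(fred)*/) { push @report, "(fred)* in $_: <$&>"; }

for my $line (@report) {
    print "$line\n";
}
```

The last line of output shows empty angle brackets: the pattern matched, but it matched nothing at all.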
The vertical bar (|), often pronounced “or” in this usage, means that either the left side may match, or the right side. That is, if the part of the pattern on the left of the bar fails, the part on the right gets a chance to match. So, /fred|barney|betty/ will match any string that mentions fred, or barney, or betty.
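A short sketch of alternation, using invented strings; only the last one mentions none of the three names.

```perl
use strict;
use warnings;

my @samples = ("fred flintstone", "barney rubble", "betty", "wilma");

my @verdicts;
for (@samples) {
    if (/fred|barney|betty/) {  # any one of the three names will do
        push @verdicts, "yes";
    } else {
        push @verdicts, "no";
    }
}
print "@verdicts\n";
```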
Now we can make patterns like /fred( |\t)+barney/, which matches if fred and barney are separated by spaces, tabs, or a mixture of the two. The plus means to repeat one or more times; each time it repeats, the ( |\t) has the chance to match either a space or a tab.[8] There must be at least one character between the two names.
If you wanted the characters between fred and barney to all be the same, you could rewrite that pattern as /fred( +|\t+)barney/. In this case, the separators must be all spaces, or all tabs.
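The difference between those two patterns shows up only when the separators are mixed. In this sketch (sample strings are our own), the first string has a space, a tab, and another space between the names.

```perl
use strict;
use warnings;

my @samples = ("fred \t barney", "fred   barney", "fred\t\tbarney", "fredbarney");

my (@mixed, @uniform);
for (@samples) {
    # Each repetition may independently be a space or a tab.
    if (/fred( |\t)+barney/)   { push @mixed, "yes" }   else { push @mixed, "no" }
    # The whole run must be spaces, or the whole run must be tabs.
    if (/fred( +|\t+)barney/)  { push @uniform, "yes" } else { push @uniform, "no" }
}
print "mixed allowed: @mixed\n";
print "all the same:  @uniform\n";
```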
The pattern /fred (and|or) barney/ matches any string containing either of the two possible strings: fred and barney, or fred or barney.[9] We could match the same two strings with the pattern /fred and barney|fred or barney/, but that would be too much typing. It would probably also be less efficient, depending upon what optimizations are built into the regular expression engine.
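As a rough check that the short and long forms agree, this sketch runs both patterns over the same invented strings and compares the answers. It says nothing about which form is faster.

```perl
use strict;
use warnings;

my @samples = ("fred and barney", "fred or barney", "fred nor barney", "fred and wilma");

my (@short_form, @long_form);
for (@samples) {
    if (/fred (and|or) barney/) {
        push @short_form, "yes";
    } else {
        push @short_form, "no";
    }
    if (/fred and barney|fred or barney/) {
        push @long_form, "yes";
    } else {
        push @long_form, "no";
    }
}
print "grouped:     @short_form\n";
print "spelled out: @long_form\n";
```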
When in the course of Perl events it becomes necessary for a programmer to write a regular expression, it may be difficult to tell just what the pattern will do. It’s normal to find that a pattern matches more than you expected, or less. Or it may match earlier in the string than you expected, or later, or not at all.
This program is useful to test out a pattern on some strings and see just what it matches, and where:
    #!/usr/bin/perl
    while (<>) {                    # take one input line at a time
        chomp;
        if (/YOUR_PATTERN_GOES_HERE/) {
            print "Matched: |$`<$&>$'|\n";  # Mystery code! See the text.
        } else {
            print "No match.\n";
        }
    }
This pattern test program is written for programmers to use, not end users; you can tell because it doesn’t have any prompts or usage information. It will take any number of input lines and check each one against the pattern that you’ll put in place of the string saying YOUR_PATTERN_GOES_HERE. For each line that matches, the line with “mystery code” will be run. We’ll learn about what that line is really doing in Chapter 9. But what you’ll see is this: if the pattern is /match/ and the input is beforematchafter, the output will say “|before<match>after|”, using angle brackets to show you just what part of the string was matched by your pattern. Try it and see! If your pattern matches something you didn’t expect, you’ll be able to see that right away.
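If you’d like to see that output without typing input by hand, here is a small sketch that applies the same mystery line to two fixed strings. The second string and pattern are borrowed from earlier in the chapter; setting $_ directly instead of reading input is our own shortcut.

```perl
use strict;
use warnings;

my @reports;

$_ = "beforematchafter";
if (/match/) {
    push @reports, "|$`<$&>$'|";    # before, <matched part>, after
}

$_ = "yabba dabba doo";
if (/abba/) {
    push @reports, "|$`<$&>$'|";    # shows which abba was matched
}

for my $report (@reports) {
    print "Matched: $report\n";
}
```

The second report puts the angle brackets around the first abba in the string, not the second one.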
See Section A.6 for answers to the following exercises:
Remember, it’s normal to be surprised by some of the things that regular expressions do; that’s one reason that the exercises in this chapter are even more important than the others. Expect the unexpected.
Several of these exercises ask you to use the test program from this chapter. You could manually type up this program, taking great care to get all of the odd punctuation marks correct.[10] But you’ll probably find it faster and easier to simply download the program and some other goodies from the O’Reilly website, as we mentioned in the Preface. You’ll find this program under the name pattern_test.[11]
[6] Use the test program to make and test a pattern that matches any string containing fred. Does it match if your string is Fred, frederick, or Alfred?
[6] Use the test program to make and test a pattern that matches any string containing at least one a followed by any number of b’s. Remember that “any number” might be zero. Does it match if your string is barney, fred, abba, or dinosaur?
[5] Use the test program to make and test a pattern that matches any string containing any number of backslashes followed by any number of asterisks. Does it match if your string is \\**, fred, barney \\\***, or *wilma? (Note the typography; those are four separate test strings.)
[6] Write a new program (not the test program) that prints out any input line that mentions wilma. (Any other lines should simply be skipped.) For extra credit, allow it to match Wilma with a capital W as well.
[8] Extra credit exercise: write a program that prints out any input line that mentions both wilma and fred.
[1] Some might argue that regular expressions are not a complete programming language. We won’t argue.
[2] Purists would ask for a more rigorous definition. But then again, purists say that Perl’s patterns aren’t really regular expressions. If you’re serious about regular expressions, we highly recommend the book Mastering Regular Expressions by Jeffrey Friedl (O’Reilly & Associates, Inc.).
[3] And as we’ll see, you could have a pattern that always matches or that never does. In rare cases, even these may be useful. Generally, though, they’re mistakes.
[4] The programs also pass back some information that Perl can use later. One such piece of information is the “regular expressions memories” that we’ll learn about a little later.
[5] Globs are also (alas) sometimes called patterns. What’s worse, though, is that some bad Unix books for beginners (and possibly written by beginners) have taken to calling globs “regular expressions”, which they certainly are not. This confuses many folks at the start of their work with Unix.
[6] Except newline. But we’re going to stop reminding you of that so often, because you know it by now. Most of the time it doesn’t matter, anyway, because your strings will most-often not have newlines. But don’t forget this detail, because someday a newline will sneak into your string and you’ll need to remember that the dot doesn’t match newline.
[7] The star means to match zero or more repetitions of fred. When you’re willing to settle for zero, it’s hard to be disappointed! That pattern will match any string, even the empty string.
[8] This particular match would normally be done more efficiently with a character class, as we’ll see in the next chapter.
[9] Note that the words “and” and “or” are not operators in regular expressions! They are shown here in a fixed-width typeface because they’re part of the strings.
[10] If you do type it up on your own, remember that the backtick character (`) is not the same as the apostrophe ('). On most full-sized computer keyboards these days (in the U.S., at least), the backtick is found on a key immediately to the left of the 1 key. Try out the program with the pattern /match/ and the string beforematchafter, as the text describes, and see that it works correctly before you do the exercises.
[11] Don’t be surprised if the program you download is a little fancier than what we have in the book. The commented-out extra features you see will come in handy in later exercises.