In the previous chapter, we saw the beginnings of what regular expressions can do. Here we’ll see some of their other common features.
A character class, a list of possible characters inside square brackets ([]), matches any single character from within the class. It matches exactly one character, but that character may be any of the ones listed.
For example, the character class [abcwxyz] may match any one of those seven characters. For convenience, you may specify a range of characters with a hyphen (-), so that class may also be written as [a-cw-z]. That didn’t save much typing, but it’s more usual to make a character class like [a-zA-Z], to match any one letter out of that set of 52.[1] You may use the same character shortcuts as in any double-quotish string to define a character, so the class [\000-\177] matches any seven-bit ASCII character.[2]
Of course, a character class will be just part of a full pattern; it will never stand on its own in Perl. For example, you might see code that says something like this:
    $_ = "The HAL-9000 requires authorization to continue.";
    if (/HAL-[0-9]+/) {
      print "The string mentions some model of HAL computer.\n";
    }
Sometimes, it’s easier to specify the characters left out, rather than the ones within the character class. A caret (^) at the start of the character class negates it. That is, [^def] will match any single character except one of those three. And [^n\-z] matches any character except for n, hyphen, or z. (Note that the hyphen is backslashed, because it’s special inside a character class. But the first hyphen in /HAL-[0-9]+/ doesn’t need a backslash, because hyphens aren’t special outside a character class.)
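Here is a minimal sketch of the classes described so far. The sample strings and the helper name first_match are made up for illustration, and the helper leans on memory parentheses and $1, which come up later.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): return the first substring
# that the pattern matches in the text, or undef when nothing matches.
sub first_match {
    my ($text, $pattern) = @_;
    return $text =~ /($pattern)/ ? $1 : undef;
}

# A class with two ranges: any one of a, b, c, w, x, y, z.
my $class_hit = first_match("quick", qr/[a-cw-z]/);

# A negated class: the first character that is not d, e, or f.
my $negated_hit = first_match("definite", qr/[^def]/);

# A backslashed hyphen is a literal hyphen, so this class
# rejects n, hyphen, and z.
my $escaped_hit = first_match("n-z!", qr/[^n\-z]/);

print "class: $class_hit\n";
print "negated: $negated_hit\n";
print "escaped: $escaped_hit\n";
```

Each call finds just one character, since a character class on its own never matches more than that.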
Some character classes appear so frequently that they have shortcuts. For example, the character class for any digit, [0-9], may be abbreviated as \d. Thus, the pattern from the example about HAL could be written /HAL-\d+/ instead.
The shortcut \w is a so-called “word” character: [A-Za-z0-9_]. If your “words” are made up of ordinary letters, digits, and underscores, you’ll be happy with this. Most of the rest of us have words made up of ordinary letters, hyphens, and apostrophes,[3] and we’d like to change this. As of this writing, the Perl developers are working on it, but it’s not available yet.[4] So use this one only when you want ordinary letters, digits, and underscores.
Of course, \w doesn’t match a “word”; it merely matches a single “word” character. To match an entire word, though, the plus modifier is handy. A pattern like /fred \w+ barney/ will match fred and a space, then a “word”, then a space and barney. That is, it’ll match if there’s one word[5] between fred and barney, set off by single spaces.
As you may have noticed in that previous example, it might be handy to be able to match spaces more flexibly. The \s shortcut is good for whitespace; it’s the same as [\f\t\n\r ]. That is, it’s the same as a class containing the five whitespace characters form-feed, tab, newline, carriage return, and the space character itself. These are the characters that merely move the printing position around; they don’t use any ink. Still, like the other shortcuts we’ve just seen, \s matches just a single character from the class, so it’s usual to use either \s* for any amount of whitespace (including none at all), or \s+ for one or more whitespace characters. (In fact, it’s rare to see \s without one of those quantifiers.) Since all of those whitespace characters look about the same to us humans, we can treat them all in the same way with this shortcut.
Sometimes you may want the opposite of one of these three shortcuts. That is, you may want [^\d], [^\w], or [^\s], meaning a nondigit character, a nonword character, or a nonwhitespace character. That’s easy enough to accomplish by using their uppercase counterparts: \D, \W, or \S. These match any character that their counterpart would not match.
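To see the shortcuts and their uppercase opposites side by side, here is a small sketch. The sample line is invented, and the list assignments that pull out what the parentheses matched are a convenience that isn't covered until later.

```perl
use strict;
use warnings;

# Sample text invented for this sketch; \t inside double quotes is a tab.
my $line = "HAL-9000 \t says hello_world!";

my $has_model   = $line =~ /HAL-\d+/ ? 1 : 0;   # \d+ is one or more of [0-9]
my ($word)      = $line =~ /(\w+)/;             # first run of "word" characters
my ($gap)       = $line =~ /(\s+)/;             # first run of whitespace
my ($non_word)  = $line =~ /(\W)/;              # first character that \w rejects
my ($non_space) = $line =~ /(\S+)/;             # first run that \s rejects

print "model found: $has_model\n";
print "first word: $word\n";
print "whitespace run length: ", length($gap), "\n";
print "first nonword: $non_word\n";
print "first nonspace run: $non_space\n";
```

Notice that the first word is only HAL, because the hyphen is not a \w character, while the first run of nonwhitespace is all of HAL-9000.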
Any of these shortcuts will work either in place of a character class (standing on their own in a pattern), or inside the square brackets of a larger character class. That means that you could now use /[\dA-Fa-f]+/ to match hexadecimal (base 16) numbers, which use letters ABCDEF (or the same letters in lowercase) as additional digits.
Another compound character class is [\d\D], which means any digit, or any non-digit. That is to say, any character at all! This is a common way to match any character, even a newline. (As opposed to ., which matches any character except a newline.) And then there’s the totally useless [^\d\D], which matches anything that’s not either a digit or a non-digit. Right: nothing!
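A short sketch of these compound classes follows; the sample strings are assumptions chosen so that the results are easy to check by eye.

```perl
use strict;
use warnings;

# A shortcut inside a larger class: a run of hexadecimal digits.
# (The letters t, o, and p are not hex digits, so the run starts at 7.)
my ($hex) = "top 7fFF" =~ /([\dA-Fa-f]+)/;

# A two-line string: a, newline, b.
my $two_lines = "a\nb";

my $dot_crosses   = $two_lines =~ /a.b/      ? 1 : 0;   # dot won't match newline
my $class_crosses = $two_lines =~ /a[\d\D]b/ ? 1 : 0;   # [\d\D] matches anything
my $useless       = $two_lines =~ /[^\d\D]/  ? 1 : 0;   # nothing is neither

print "hex digits: $hex\n";
print "dot crosses newline: $dot_crosses\n";
print "[\\d\\D] crosses newline: $class_crosses\n";
print "[^\\d\\D] matched: $useless\n";
```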
A quantifier in a pattern means to repeat the preceding item a certain number of times. We’ve already seen three quantifiers: *, +, and ?. But if none of those three suits your needs, just use a comma-separated pair of numbers inside curly braces ({}) to specify exactly how few and how many repetitions are allowed.
So the pattern /a{5,15}/ will match from five to fifteen repetitions of the letter a. If the a appears three times, that’s too few, so it won’t match. If it appears five times, it’s a match. If it appears ten times, that’s still a match. If it appears twenty times, just the first fifteen will match, since that’s the upper limit.
If you omit the second number (but include the comma), there’s no upper limit to the number of times the item will match. So, /(fred){3,}/ will match if there are three or more instances of fred in a row (with no extra characters, like spaces, allowed between each fred and the next). There’s no upper limit, so that would match eighty-eight instances of fred, if you had a string with that many.
If you omit the comma as well as the upper bound, the number given is an exact count: /\w{8}/ will match exactly eight word characters (occurring as part of a larger string, perhaps).
In fact, the three quantifier characters that we saw earlier are just common shortcuts. The star is the same as the quantifier {0,}, meaning zero or more. The plus is the same as {1,}, meaning one or more. And the question mark could be written as {0,1}. In practice, it’s unusual to need any curly-brace quantifiers, since the three shortcut characters are nearly always the only ones needed.
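The curly-brace quantifiers can be tried out with a sketch like this one. The test strings are invented, and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Twenty a's in a row: {5,15} takes as many as it can, up to fifteen.
my $twenty = "a" x 20;
my ($run)  = $twenty =~ /(a{5,15})/;

my $too_few  = "aaa"            =~ /a{5,15}/    ? 1 : 0;   # three is under the minimum
my $in_a_row = "fredfredfred"   =~ /(fred){3,}/ ? 1 : 0;   # three with nothing between
my $spaced   = "fred fred fred" =~ /(fred){3,}/ ? 1 : 0;   # spaces break up the run
my ($eight)  = "abcdefghij"     =~ /(\w{8})/;              # exactly eight word characters

# {0,} behaves like the star: zero digits are acceptable here.
my $star_as_braces = "HAL-" =~ /^HAL-\d{0,}$/ ? 1 : 0;

print "length of run: ", length($run), "\n";
print "aaa matches: $too_few\n";
print "fredfredfred matches: $in_a_row\n";
print "fred fred fred matches: $spaced\n";
print "eight characters: $eight\n";
print "{0,} allows none: $star_as_braces\n";
```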
By default, if a pattern doesn’t match at the start of the string, it can “float” on down the string, trying to match somewhere else. But there are a number of anchors that may be used to hold the pattern at a particular point in a string.
The caret[6] anchor (^) marks the beginning of the string, while the dollar sign ($) marks the end.[7] So the pattern /^fred/ will match fred only at the start of the string; it wouldn’t match manfred mann. And /rock$/ will match rock only at the end of the string; it wouldn’t match knute rockne.
Sometimes, you’ll want to use both of these anchors, to ensure that the pattern matches an entire string. A common example is /^\s*$/, which matches a blank line. But this “blank” line may include some whitespace characters, like tabs and spaces, which are invisible to you and me. Any line that matches that pattern looks just like any other one on paper, so this pattern treats all blank lines as equivalent. Without the anchors, it would match nonblank lines as well.
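Here is a brief sketch of the two string anchors at work. The strings are samples (plymouth rock is an invented companion to knute rockne), and the list of lines is made up to include several kinds of blank line.

```perl
use strict;
use warnings;

my $at_start  = "fred flintstone" =~ /^fred/ ? 1 : 0;   # fred begins the string
my $not_start = "manfred mann"    =~ /^fred/ ? 1 : 0;   # fred is there, but not first
my $at_end    = "plymouth rock"   =~ /rock$/ ? 1 : 0;   # rock ends the string
my $not_end   = "knute rockne"    =~ /rock$/ ? 1 : 0;   # rock is followed by ne

# Count the "blank" lines: empty, whitespace only, or just a newline.
my @lines = ("", " \t ", "not blank", "\n");
my $blank_count = 0;
foreach (@lines) {
    $blank_count++ if /^\s*$/;
}

print "start: $at_start $not_start\n";
print "end: $at_end $not_end\n";
print "blank lines: $blank_count\n";
```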
Anchors aren’t just at the ends of the string. The word-boundary anchor, \b, matches at either end of a word.[8] So we can use /\bfred\b/ to match the word fred but not frederick or alfred or manfred mann. This is similar to the feature often called something like “match whole words only” in a word processor’s search command.
Alas, these aren’t words as you and I are likely to think of them; they’re those \w-type words made up of ordinary letters, digits, and underscores. The \b anchor matches at the start or end of a group of \w characters.
In Figure 8-1, there’s a grey underline under each “word,” and the arrows show the corresponding places where \b could match. There are always an even number of word boundaries in a given string, since there’s an end-of-word for every start-of-word.
The “words” are sequences of letters, digits, and underscores; that is, a word in this sense is what’s matched by /\w+/. There are five words in that sentence: That, s, a, word, and boundary.[9] Notice that the quote marks around word don’t change the word boundaries; these words are made of \w characters.
Each arrow points to the beginning or the end of one of the grey underlines, since the word boundary anchor \b matches only at the beginning or the end of a group of word characters.
The word-boundary anchor is useful to ensure that we don’t accidentally find cat in delicatessen, dog in boondoggle, or fish in selfishness. Sometimes you’ll want just one word-boundary anchor, as when using /\bhunt/ to match words like hunt or hunting or hunter, but not shunt, or when using /stone\b/ to match words like sandstone or flintstone but not capstones.
The nonword-boundary anchor is \B; it matches at any point where \b would not match. So the pattern /\bsearch\B/ will match searches, searching, and searched, but not search or researching.
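The word-boundary examples above can be checked with a sketch like the following. The helper name matching is hypothetical; it simply returns whichever candidate strings the pattern matches.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): return the candidates
# that match the given pattern, in their original order.
sub matching {
    my ($pattern, @candidates) = @_;
    my @hits;
    foreach my $candidate (@candidates) {
        push @hits, $candidate if $candidate =~ $pattern;
    }
    return @hits;
}

# Both ends anchored: only the whole word fred.
my @whole = matching(qr/\bfred\b/, "fred", "frederick", "alfred", "manfred mann");

# Anchored at the start of a word only.
my @starts = matching(qr/\bhunt/, qw(hunt hunting hunter shunt));

# Anchored at the end of a word only.
my @ends = matching(qr/stone\b/, qw(sandstone flintstone capstones));

# Start of a word, but more word characters must follow.
my @inside = matching(qr/\bsearch\B/, qw(search searches searching researching));

print "whole word: @whole\n";
print "starts with hunt: @starts\n";
print "ends with stone: @ends\n";
print "search plus more: @inside\n";
```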
You remember that parentheses (“( )”) may be used for grouping together parts of a pattern. They also have a second function: they tell the regular expression engine to remember what was in the substring matched by the pattern in the parentheses. That is to say, it doesn’t remember what was in the pattern itself; it remembers what was in the corresponding part of the string. Whenever you use parentheses for grouping, they automatically work as memory parentheses as well.
So, if you use /./, you’ll match any single character (except newline); if you use /(.)/, you’ll still match any single character, but now it will be kept in a regular expression memory. For each pair of parentheses in the pattern, you’ll have one regular expression memory.
A backreference refers back to a memory that was saved earlier in the current pattern’s processing. Backreferences are made with a backslash, which is easy to remember. For example, \1 contains the first regular expression memory (that is, the part of the string matched by the first pair of parentheses).
Backreferences are used to go back and match the exact same[10] string that was matched earlier in the pattern. So, /(.)\1/ means to match any one character, remember it as memory one, then match memory one again. In other words, match any character, followed by the same character.
So, this pattern will match strings with doubled letters, as in bamm-bamm and betty. Of course, the dot will match characters other than letters, so if a string has two spaces in a row, two tabs in a row, or two asterisks in a row, it will match.
That’s not the same as the pattern /../, which will match any character followed by any character; those two could be the same, or they could be different. /(.)\1/ means to match any character followed by the same character.
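A quick sketch makes the difference between the doubled-character pattern and two plain dots concrete. The strings other than bamm-bamm and betty are invented for the comparison.

```perl
use strict;
use warnings;

# Which character is doubled first in bamm-bamm?
my ($doubled) = "bamm-bamm" =~ /(.)\1/;

my $betty    = "betty"       =~ /(.)\1/ ? 1 : 0;   # tt is a doubled letter
my $barney   = "barney"      =~ /(.)\1/ ? 1 : 0;   # no character repeats at once
my $spaces   = "two  spaces" =~ /(.)\1/ ? 1 : 0;   # two spaces in a row count too
my $any_pair = "barney"      =~ /../    ? 1 : 0;   # any two characters will do

print "doubled character: $doubled\n";
print "betty: $betty, barney: $barney, two spaces: $spaces\n";
print "barney with two dots: $any_pair\n";
```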
A typical usage of these memories might be if you have some HTML-like[11] text to process. For example, maybe you want to match a tag like these two, which may use either single quotes or double quotes:
    <image source='fred.png'>
    <image source="fred's-birthday.png">
The tag may have either single quotes or double quotes, since the quoted data may include the other kind of mark (as with the apostrophe in the second example tag). So the pattern might look like this: /<image source=(['"]).*\1>/. That says that the opening quote mark may be of either type, but there must be a matching mark at the end of the quote.[12]
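As a sketch, here is that pattern tried against the two sample tags, plus a third tag (invented for this test) whose quote marks don't agree. As the footnotes warn, this only illustrates the backreference; it is not a way to parse real markup.

```perl
use strict;
use warnings;

# The pattern from the text: memory one holds the opening quote mark,
# and \1 insists on the same kind of mark before the closing bracket.
my $tag_pattern = qr/<image source=(['"]).*\1>/;

my $single   = q{<image source='fred.png'>}            =~ $tag_pattern ? 1 : 0;
my $double   = q{<image source="fred's-birthday.png">} =~ $tag_pattern ? 1 : 0;
my $mismatch = q{<image source='fred.png">}            =~ $tag_pattern ? 1 : 0;

print "single quotes: $single\n";
print "double quotes: $double\n";
print "mismatched quotes: $mismatch\n";
```

The apostrophe inside the second tag does no harm, because only a double quote mark followed by the closing angle bracket can end that match.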
If you have more sets of parentheses, you can have more backreferences. As you might guess, \17 is the contents of the seventeenth regular expression memory, if you have at least that many sets of parentheses.[13]
In numbering backreferences, you can just count the left (opening) parentheses. The pattern /((fred|wilma) (flintstone)) \1/ says to match strings like fred flintstone fred flintstone, since the first opening parenthesis and its corresponding closing parenthesis hold a pattern that matches fred flintstone.[14]
If we wrote /((fred|wilma) (flintstone)) \2/ instead, we would match strings like fred flintstone fred; memory two is the choice of fred or wilma. (Notice that it wouldn’t match fred flintstone wilma, since the backreference can match only the same name that was matched earlier: either fred or wilma. But it could match wilma flintstone wilma, since that one uses the same name.) And the pattern /((fred|wilma) (flintstone)) \3/ would match strings like fred flintstone flintstone. It’s uncommon to have a literal string like flintstone in memory parentheses, though; we did that one just to have a third example.
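The numbering can be verified with a sketch that runs the three patterns against the strings mentioned above. The variable names are just labels for this illustration.

```perl
use strict;
use warnings;

# \1 is the outermost pair: the whole name must appear again.
my $whole_name = "fred flintstone fred flintstone"
    =~ /((fred|wilma) (flintstone)) \1/ ? 1 : 0;

# \2 is the second opening parenthesis: the first name only.
my $same_first  = "wilma flintstone wilma" =~ /((fred|wilma) (flintstone)) \2/ ? 1 : 0;
my $other_first = "fred flintstone wilma"  =~ /((fred|wilma) (flintstone)) \2/ ? 1 : 0;

# \3 is the third opening parenthesis: the literal flintstone.
my $surname = "fred flintstone flintstone" =~ /((fred|wilma) (flintstone)) \3/ ? 1 : 0;

print "whole name repeated: $whole_name\n";
print "same first name: $same_first, different first name: $other_first\n";
print "surname repeated: $surname\n";
```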
When we get to the next chapter and back into the world of Perl, we’ll see that the contents of these regular expression memories are available to us in special variables like $1 after the pattern match is done. We mention this here just so you’ll know that the memories aren’t merely used for backreferences; if you see what seem to be unnecessary parentheses in a pattern, they may actually be setting up those memories.
With all of these metacharacters in regular expressions, you may feel that you can’t keep track of the players without a scorecard. That’s the precedence chart, which shows us which parts of the pattern “stick together” the most tightly. Unlike the precedence chart for operators, the regular expression precedence chart is simple, with only four levels. As a bonus, this section will review all of the metacharacters that Perl uses in patterns.
At the top of the precedence chart are the parentheses (“( )”), used for grouping and memory. Anything in parentheses will “stick together” more tightly than anything else.
The second level is the quantifiers. These are the repeat operators, namely star (*), plus (+), and question mark (?), as well as the quantifiers made with curly braces, like {5,15}, {3,}, and {5}. These always stick to the item they’re following.
The third level of the precedence chart holds anchors and sequence. The anchors are the caret (^) start-of-string anchor, the dollar-sign ($) end-of-string anchor, the \b word-boundary anchor, and the \B nonword-boundary anchor. Sequence (putting one item after another) is actually an operator, even though it doesn’t use a metacharacter. That means that letters in a word will stick together just as tightly as the anchors stick to the letters.
The lowest level of precedence is the vertical bar (|) of alternation. Since this is at the bottom of the chart, it effectively cuts the pattern into pieces. It’s at the bottom of the chart because we want the letters in the words in /fred|barney/ to stick together more tightly than the alternation. If alternation were higher priority than sequence, that pattern would mean to match fre, followed by a choice of d or b, followed by arney. So, alternation is at the bottom of the chart, and the letters within the names stick together.
Besides the precedence chart, there are the so-called atoms that make up the most basic pieces of the pattern. These are the individual characters, character classes, and backreferences.
When you need to decipher a complex regular expression, you’ll need to do as Perl does, and use the precedence chart to see what’s really going on.
For example, /^fred|barney$/ is probably not what the programmer intended. That’s because the vertical bar of alternation is very low precedence; it cuts the pattern in two. That pattern matches either fred at the beginning of the string or barney at the end. It’s much more likely that the programmer wanted /^(fred|barney)$/, which will match if the whole line has nothing but fred, or nothing but barney.[15]
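To watch precedence change the outcome, here is a sketch that runs both patterns over a few sample strings. The names $loose and $tight, and the strings frederick and hello barney, are inventions for this comparison.

```perl
use strict;
use warnings;

my $loose = qr/^fred|barney$/;     # fred at the start, OR barney at the end
my $tight = qr/^(fred|barney)$/;   # the whole string is fred or is barney

my @results;
foreach my $text ("fred", "barney", "frederick", "hello barney") {
    my $loose_hit = $text =~ $loose ? "yes" : "no";
    my $tight_hit = $text =~ $tight ? "yes" : "no";
    push @results, "$text: loose=$loose_hit tight=$tight_hit";
}

print "$_\n" foreach @results;
```

The unparenthesized pattern accepts frederick and hello barney, because each half of the alternation carries only one of the anchors.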
And what will /(wilma|pebbles?)/ match? The question mark applies to the previous character,[16] so that will match either wilma or pebbles or pebble, perhaps as part of a larger string (since there are no anchors).
The pattern /^(\w+)\s+(\w+)$/ matches lines that have a “word,” some required whitespace, and another “word,” with nothing else before or after. That might be used to match lines like fred flintstone, for example. The parentheses around the words aren’t needed for grouping, so they may be intended to save those substrings into the regular expression memories, which we’ll see more about in the next chapter.
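As a preview, and only as a sketch, this is one way those two memories might be collected. Assigning the match to a list is an assumption about how you'd want to use them, and the sample strings are invented.

```perl
use strict;
use warnings;

my $two_words = qr/^(\w+)\s+(\w+)$/;

# In a list assignment, a successful match hands back its memories.
my ($first, $second) = "fred \t flintstone" =~ $two_words;

my $three_words = "fred j flintstone" =~ $two_words ? 1 : 0;   # one word too many
my $one_word    = "fred"              =~ $two_words ? 1 : 0;   # no whitespace, no second word

print "first: $first, second: $second\n";
print "three words match: $three_words\n";
print "one word matches: $one_word\n";
```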
When you’re trying to understand a complex pattern, it may be helpful to add parentheses to clarify the precedence. That’s okay, but remember that grouping parentheses are also automatically memory parentheses; you may need to change the numbering of other memories when you add the parentheses.[17]
Although we’ve covered all of the regular expression features that most people are likely to need for everyday programming, there are more features being added all the time. Check the perlre, perlrequick, and perlretut manpages for the latest news about what patterns in Perl can do.[18]
See Section A.7 for answers to the following exercises. These exercises are among the most challenging in the entire book. But don’t get too discouraged! The following chapters will actually be easier, partly because you’ll have the power of regular expressions to help you.
[4] Using the test program from the previous chapter, make a pattern that matches only lines containing either the word fred or wilma, followed by some whitespace, and then the word flintstone. So it should match the string I am fred flintstone (with one or more spaces or tabs between the names).
[10] Here, we give you the answer; you decide what problem it’s trying to solve. What do these real-world patterns match? What might they be used for?
    /"([^"]*)"/
    /^0?[0-3]?[0-7]{1,2}$/
    /^[\w.]{1,12}$/
Try each of them in the test program. It may help to find some strings that match (and that fail to match) each one.
[8] Make a pattern that will match a string containing nothing but a scalar variable’s name (not its value!), like $fred, $barney, or $_ (but you shouldn’t match special variables like $0). That is, if the line of input has the six characters $wilma, the pattern should match. If the input says wilma, it should not match.
[12] Make a pattern that matches any line of input that has the same word repeated two or more times in a row. Words in this problem can be considered to be sequences of letters a to z or A to Z, digits, and underscores. Whitespace between words may differ. For example, the classic observation-test string Paris in the the spring should match, since it has a doubled word. Also, I think that that is the problem should match, even though that may be a correct use of a doubled word. Does your pattern match all three words in I think that that that is the problem (with extra spaces between only some of the words)? Does it match This is a test? How about This shouldn't match, according to the theory of regular expressions?
[1] Notice that those 52 don’t include letters like Å and É and Î and Ø and Ü. But when Unicode processing is available, that particular character range is noticed and enhanced to automatically do the right thing.
[2] At least, if you use ASCII and not EBCDIC.
[3] At least, in usual English we do. In other languages, you may have different components of words. And when looking at ASCII-encoded English text, we have the problem that the single quote and the apostrophe are the same character, so it’s not possible in isolation to tell whether cats' is a word with an apostrophe or a word at the end of a quotation. This is probably one reason that computers haven’t been able to take over the world yet.
[4] Except to a limited (but nevertheless useful) extent in connection with locales; see the perllocale manpage.
[5] We’re going to stop saying “word” in quotes so much; you know by now that these letter-digit-underscore words are the ones we mean.
[6] Yes, you’ve seen that caret is already used in another way in patterns. As the first character of a character class, it negates the class. But outside of a character class, it’s a metacharacter in a different way, being the start-of-string anchor. There are only so many characters, so we have to use some of them twice.
[7] Actually, it matches either at the end of the string, or just before a newline at the end of the string. That’s so that you can match the end of the string whether it has a trailing newline or not. Most folks don’t worry about this distinction much, but once in a long while it’s important to remember that /^fred$/ will match either "fred" or "fred\n" with equal ease.
[8] Some regular expression implementations have one anchor for start-of-word and another for end-of-word, but Perl uses \b for both.
[9] You can see why we wish that we could change the definition of “word”; That's should be one word, not two words with an apostrophe in-between. And even in text that may be mostly ordinary English, it’s normal to find a soupçon of other characters spicing things up.
[10] Well, if the pattern is case-insensitive, as we’ll learn in the next chapter, the capitalization doesn’t have to match. Other than that, though, the string must be the same.
[11] These examples are intentionally not HTML, because there are too many tricky things that crop up in real HTML, or any similar markup language like XML or SGML. If you need to work with HTML, don’t use simple patterns like these. Get a robust module from CPAN, so that you can start with code that’s already written and debugged. If you don’t, we promise that you’ll be sorry. Don’t say we didn’t warn you.
[12] If you realize that there may be problems with using this pattern on a markup language like HTML, that’s okay. There are lots of problems with that! This is just an example to illustrate a use of a backreference. You shouldn’t use simple patterns to parse anything as complex as HTML anyway.
[13] If you don’t have that many sets of parentheses before that point in the pattern, backreferences \10 and beyond will be treated as octal character escapes. To keep an octal character escape like \12 from accidentally meaning a backreference, just use a leading zero: