Regular expressions (a.k.a. regexes, regexps, or REs) are used by many search programs such as grep and findstr, text-munging programs like sed and awk, and editors like vi and emacs. A regular expression is a way of describing a set of strings without having to list all the strings in your set.[22]
Many other computer languages incorporate regular
expressions (some of them even advertise "Perl5 regular
expressions"!), but none of these languages integrates regular
expressions into the language the way Perl does. Regular expressions
are used several ways in Perl. First and foremost, they're used in
conditionals to determine whether a string matches a particular
pattern, because in a Boolean context they return true or false. So
when you see something that looks like /foo/ in a conditional, you
know you're looking at an ordinary pattern-matching operator:
if (/Windows 95/) { print "Time to upgrade? " }
Second, if you can locate patterns within a string, you
can replace them with something else. So when you see something that
looks like s/foo/bar/, you know it's asking Perl to
substitute "bar" for "foo", if possible. We call that the
substitution operator. It also happens to return
true or false depending on whether it succeeded, but usually it's
evaluated for its side effect:
s/Windows/Linux/;
Finally, patterns can specify not only where something
is, but also where it isn't. So the split operator uses a regular
expression to specify
where the data isn't. That is, the regular expression defines the
separators that delimit the fields of data. Our
Average Example has a couple of trivial examples of this. Lines 5 and
12 each split strings on the space character in order to return a list
of words. But you can split on any separator you can specify with a
regular expression:
($good, $bad, $ugly) = split(/,/, "vi,emacs,teco");
(There are various modifiers you can use in each of these situations to do exotic things like ignore case when matching alphabetic characters, but these are the sorts of gory details that we'll cover later when we get to the gory details.)
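The three uses described above can be seen side by side in a small sketch. The variable names ($os, $matched, $changed) are ours, chosen only for illustration; the patterns are the ones from the text.

```perl
use strict;
use warnings;

my $os = "Windows 95";

# 1. Matching: the operator yields true or false in a Boolean context.
my $matched = ($os =~ /Windows 95/) ? 1 : 0;

# 2. Substitution: changes the string and reports whether it succeeded.
my $changed = ($os =~ s/Windows/Linux/) ? 1 : 0;

# 3. Splitting: the pattern describes the separators, not the data.
my ($good, $bad, $ugly) = split(/,/, "vi,emacs,teco");

print "matched=$matched changed=$changed os=$os\n";
print "good=$good bad=$bad ugly=$ugly\n";
```

Here the match and the substitution are bound to $os with =~ rather than run against the default string, so that the example stands on its own.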
The simplest use of regular expressions is to match a
literal expression. In the case of the split above, we matched on a
single comma character. But if you match on several
characters in a row, they all have to match sequentially. That is, the
pattern looks for a substring, much as you'd expect. Let's say we want
to show all the lines of an HTML file that contain HTTP links (as
opposed to FTP links). Let's imagine we're working with HTML for the
first time, and we're being a little naïve. We know that these links
will always have "http:" in them somewhere. We
could loop through our file with this:
while ($line = <FILE>) {
    if ($line =~ /http:/) {
        print $line;
    }
}
Here, the =~ (pattern-binding operator) is telling Perl to look for a
match of the regular expression "http:" in the variable $line. If it
finds the expression, the operator returns a true value and the block
(a print statement) is executed.[23]
By the way, if you don't use the =~ binding operator, Perl will
search a default string instead of $line. It's like when you say,
"Eek! Help me find my contact lens!" People automatically know to look
around near you without your actually having to tell them that.
Likewise, Perl knows that there is a default place to search for
things when you don't say where to search for them. This default
string is actually a special scalar variable that goes by the odd name
of $_. In fact, it's not the default just for pattern matching; many
operators in Perl default to using the $_ variable, so a veteran Perl
programmer would likely write the last example as:
while (<FILE>) { print if /http:/; }
(Hmm, another one of those statement modifiers seems to have snuck in there. Insidious little beasties.)
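To see that the explicit and the implicit forms select the same lines, here is a minimal sketch that loops over an in-memory list rather than a filehandle. The sample lines and the array names are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines read from FILE.
my @lines = (
    qq{<a href="http://www.perl.com/">Perl</a>\n},
    qq{<a href="ftp://ftp.example.org/pub/">Files</a>\n},
);

# Explicit form: bind the pattern to a named variable with =~.
my @explicit;
for my $line (@lines) {
    push @explicit, $line if $line =~ /http:/;
}

# Implicit form: the loop aliases each element to $_,
# and a bare /http:/ searches $_.
my @implicit;
for (@lines) {
    push @implicit, $_ if /http:/;
}

print "explicit: @explicit";
print "implicit: @implicit";
```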
This stuff is pretty handy, but what if we wanted to find all of
the link types, not just the HTTP links? We could give a list of link
types, like "http:", "ftp:", "mailto:", and so on. But that list could
get long, and what would we do when a new kind of link was added?
while (<FILE>) {
    print if /http:/;
    print if /ftp:/;
    print if /mailto:/;
    # What next?
}
Since regular expressions are descriptive of a set of strings, we can
just describe what we are looking for: a number of alphabetic
characters followed by a colon. In regular expression talk
(Regexese?), that would be /[a-zA-Z]+:/, where the brackets define a
character class. The a-z and A-Z represent all alphabetic characters
(the dash means the range of all characters between the starting and
ending character, inclusive). And the + is a special character that
says "one or more of whatever was before me". It's what we call a
quantifier, meaning a gizmo that says how many times something is
allowed to repeat. (The slashes aren't really part of the regular
expression, but rather part of the pattern-match operator. The slashes
are acting like quotes that just happen to contain a regular
expression.)
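As a rough sketch of the character-class approach, the following filters a handful of sample lines with /[a-zA-Z]+:/. The lines themselves are invented for the example, and grep here is Perl's built-in list filter, not the Unix command.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the contents of an HTML file.
my @lines = (
    '<a href="http://www.perl.com/">Perl</a>',
    '<a href="ftp://ftp.example.org/pub/">Files</a>',
    '<a href="mailto:camel@example.org">Mail</a>',
    '<p>No links on this line</p>',
);

# Keep the lines containing one or more ASCII letters followed by a colon.
my @with_links = grep { /[a-zA-Z]+:/ } @lines;

print scalar(@with_links), " of ", scalar(@lines), " lines matched\n";
```

The pattern is still naïve in the same spirit as the original: any word followed by a colon, such as "Note:", would match too.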
Because certain classes like the alphabetics are so commonly used, Perl defines shortcuts for them:
Name             ASCII Definition   Code
Whitespace       [ \t\n\r\f]        \s
Word character   [a-zA-Z_0-9]       \w
Digit            [0-9]              \d
Note that these match single characters. A \w will match any single
word character, not an entire word. (Remember that + quantifier? You
can say \w+ to match a word.) Perl also provides the negation of
these classes by using the uppercased character, such as \D for a
nondigit character.
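A short sketch of the shortcuts in action follows. It uses capturing parentheses in list context simply as a way to show what each pattern matched (parentheses are discussed later in this section); the sample string and variable names are ours.

```perl
use strict;
use warnings;

my $text = "camel_42 hump";

# In list context, a match returns whatever its parentheses captured.
my ($word)      = $text =~ /(\w+)/;   # a run of word characters
my ($digits)    = $text =~ /(\d+)/;   # a run of digits
my ($nondigits) = $text =~ /(\D+)/;   # a run of nondigit characters

print "word=$word digits=$digits nondigits=$nondigits\n";
```

Note that the underscore and the digits count as word characters, so \w+ runs through "camel_42" and stops only at the space.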
We should note that \w is not always equivalent to [a-zA-Z_0-9] (and
\d is not always [0-9]). Some locales define additional alphabetic
characters outside the ASCII sequence, and \w respects them. Newer
versions of Perl also know about Unicode letter and digit properties
and treat Unicode characters with those properties accordingly. (Perl
also considers ideographs to be \w characters.)
There is one other very special character class, written with a ".",
that will match any character whatsoever.[24] For example, /a./ will
match any string containing an "a" that is not the last character in
the string. Thus it will match "at" or "am" or even "a!", but not "a",
since there's nothing after the "a" for the dot to match. Since it's
searching for the pattern anywhere in the string, it'll match "oasis"
and "camel", but not "sheba". It matches "caravan" on the first "a".
It could match on the second "a", but it stops after it finds the
first suitable match, searching from left to right.
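The claims about /a./ are easy to check. This sketch runs the pattern over the words mentioned above and records which ones match; the hash and variable names are illustrative only.

```perl
use strict;
use warnings;

my @candidates = qw(at am a! a oasis camel sheba caravan);

# Record, for each word, whether /a./ matches anywhere in it.
my %matched;
for my $word (@candidates) {
    $matched{$word} = ($word =~ /a./) ? 1 : 0;
}

# Capture what the pattern actually matched in "caravan".
my ($first) = "caravan" =~ /(a.)/;

for my $word (@candidates) {
    printf "%-8s %s\n", $word, $matched{$word} ? "matches" : "does not match";
}
print "caravan matched on: $first\n";
```

The capture from "caravan" is "ar", the first "a" and the character after it, which shows the leftmost match winning.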
The characters and character classes we've talked about all match
single characters. We mentioned that you could match multiple "word"
characters with \w+. The + is one kind of quantifier, but there are
others. All of them are placed after the item being quantified.
The most general form of quantifier specifies both the minimum and
maximum number of times an item can match. You put the two numbers in
braces, separated by a comma. For example, if you were trying to match
North American phone numbers, the sequence \d{7,11} would match at
least seven digits, but no more than eleven digits. If you put a
single number in the braces, the number specifies both the minimum and
the maximum; that is, the number specifies the exact number of times
the item can match. (All unquantified items have an implicit {1}
quantifier.)
If you put the minimum and the comma but omit the maximum, then the
maximum is taken to be infinity. In other words, it will match at
least the minimum number of times, plus as many as it can get after
that. For example, \d{7} will match only the first seven digits (a
local North American phone number, for instance, or the first seven
digits of a longer number), while \d{7,} will match any phone number,
even an international one (unless it happens to be shorter than seven
digits). There is no special way of saying "at most" a certain number
of times. Just say .{0,5}, for example, to find at most five arbitrary
characters.
Certain combinations of minimum and maximum occur frequently, so Perl
defines special quantifiers for them. We've already seen +, which is
the same as {1,}, or "at least one of the preceding item". There is
also *, which is the same as {0,}, or "zero or more of the preceding
item", and ?, which is the same as {0,1}, or "zero or one of the
preceding item" (that is, the preceding item is optional).
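The following sketch exercises each quantifier on a tiny string. The test strings and the keys of %result are invented for the illustration; each value is 1 if the pattern matched and 0 if it did not.

```perl
use strict;
use warnings;

my %result = (
    plus_needs_one   => ("ac"     =~ /ab+c/)     ? 1 : 0,  # + wants at least one b
    star_allows_zero => ("ac"     =~ /ab*c/)     ? 1 : 0,  # * is happy with no b
    braces_like_plus => ("abbbc"  =~ /ab{1,}c/)  ? 1 : 0,  # {1,} behaves like +
    optional_u       => ("color"  =~ /colou?r/)  ? 1 : 0,  # ? makes the u optional
    exact_count      => ("abbc"   =~ /ab{2}c/)   ? 1 : 0,  # exactly two b's
    over_the_maximum => ("abbbbc" =~ /ab{1,3}c/) ? 1 : 0,  # four b's exceed {1,3}
);

for my $name (sort keys %result) {
    printf "%-17s %d\n", $name, $result{$name};
}
```

The last case fails only because the "c" must come directly after the quantified "b"; the quantifier by itself does not forbid extra characters elsewhere in the string.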
You need to be careful of a couple of things about quantification.
First of all, Perl quantifiers are by default greedy. This means that
they will attempt to match as much as they can as long as the whole
pattern still matches. For example, if you are matching /\d+/ against
"1234567890", it will match the entire string. This is something to
watch out for especially when you are using ".", any character. Often,
someone will have a string like:
larry:JYHtPh0./NJTU:100:10:Larry Wall:/home/larry:/bin/tcsh
and will try to match "larry:" with /.+:/. However, since the +
quantifier is greedy, this pattern will match everything up to and
including "/home/larry:", because it matches as much as possible
before the last colon, including all the other colons. Sometimes you
can avoid this by using a negated character class, that is, by saying
/[^:]+:/, which says to match one or more noncolon characters (as many
as possible), up to the first colon. It's that little caret in there
that negates the Boolean sense of the character class.[25]
The other point to be careful about is that regular expressions will
try to match as early as possible. This even takes precedence over
being greedy. Since scanning happens left-to-right, this means that
the pattern will match as far left as possible, even if there is some
other place where it could match longer. (Regular expressions may be
greedy, but they aren't into delayed gratification.) For example,
suppose you're using the substitution command (s///) on the default
string (variable $_, that is), and you want to remove a string of x's
from the middle of the string. If you say:
$_ = "fred xxxxxxx barney";
s/x*//;
it will have absolutely no effect! This is because the x* (meaning
zero or more "x" characters) will be able to match the "nothing" at
the beginning of the string, since the null string happens to be zero
characters wide and there's a null string just sitting there plain as
day before the "f" of "fred".[26]
There's one other thing you need to know. By default, quantifiers
apply to a single preceding character, so /bam{2}/ will match "bamm"
but not "bambam". To apply a quantifier to more than one character,
use parentheses. So to match "bambam", use the pattern /(bam){2}/.
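Both cautions can be checked with a few lines of code. In this sketch the variable names are ours, and the s/x+// variant is just one possible way to make the substitution bite; it is not the only fix.

```perl
use strict;
use warnings;

my $original = "fred xxxxxxx barney";

# x* can match zero characters, so it succeeds at the very start of
# the string and removes nothing.
(my $unchanged = $original) =~ s/x*//;

# x+ needs at least one x, so the first place it can match is the run of x's.
(my $trimmed = $original) =~ s/x+//;

# A quantifier applies to the single preceding item unless you group.
my $bamm_ungrouped   = ("bamm"   =~ /bam{2}/)   ? 1 : 0;
my $bambam_ungrouped = ("bambam" =~ /bam{2}/)   ? 1 : 0;
my $bambam_grouped   = ("bambam" =~ /(bam){2}/) ? 1 : 0;

print "after s/x*//: $unchanged\n";
print "after s/x+//: $trimmed\n";
print "bamm=$bamm_ungrouped bambam=$bambam_ungrouped grouped=$bambam_grouped\n";
```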
If you were using an ancient version of Perl and you didn't want greedy matching, you had to use a negated character class. (And really, you were still getting greedy matching of a constrained variety.)
In modern versions of Perl, you can force nongreedy, minimal
matching by placing a question mark after any quantifier. Our same
username match would now be /.*?:/. That .*? will now try to match as
few characters as possible, rather than as many as possible, so it
stops at the first colon rather than at the last.
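Here is a small comparison of the greedy pattern, the negated character class, and the minimal quantifier on the password-file line shown earlier. Capturing parentheses are added only so that we can see what each pattern matched; the variable names are illustrative.

```perl
use strict;
use warnings;

my $passwd = "larry:JYHtPh0./NJTU:100:10:Larry Wall:/home/larry:/bin/tcsh";

my ($greedy)  = $passwd =~ /(.+:)/;     # as much as possible, up to the last colon
my ($negated) = $passwd =~ /([^:]+:)/;  # noncolon characters, up to the first colon
my ($minimal) = $passwd =~ /(.*?:)/;    # as little as possible, up to the first colon

print "greedy:  $greedy\n";
print "negated: $negated\n";
print "minimal: $minimal\n";
```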
Whenever you try to match a pattern, it's going to try to match in every location till it finds a match. An anchor allows you to restrict where the pattern can match. Essentially, an anchor is something that matches a "nothing", but a special kind of nothing that depends on its surroundings. You could also call it a rule, or a constraint, or an assertion. Whatever you care to call it, it tries to match something of zero width, and either succeeds or fails. (Failure merely means that the pattern can't match that particular way. The pattern will go on trying to match some other way, if there are any other ways left to try.)
The special symbol \b matches at a word boundary, which is defined as
the "nothing" between a word character (\w) and a nonword character
(\W), in either order. (The characters that don't exist off the
beginning and end of your string are considered to be nonword
characters.) For example, /\bFred\b/ would match "Fred" in both "The
Great Fred" and "Fred the Great", but not in "Frederick the Great"
because the "d" in "Frederick" is not followed by a nonword character.
In a similar vein, there are also anchors for the beginning of the
string and the end of the string. If it is the first character of a
pattern, the caret (^) matches the "nothing" at the beginning of the
string. Therefore, the pattern /^Fred/ would match "Fred" in
"Frederick the Great" but not in "The Great Fred", whereas /Fred^/
wouldn't match either. (In fact, it doesn't even make much sense.) The
dollar sign ($) works like the caret, except that it matches the
"nothing" at the end of the string instead of the beginning.[27]
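The anchors can be tried out on the same three strings. This is only a sketch; the hash keys are ours, and each value is 1 for a match and 0 for no match.

```perl
use strict;
use warnings;

my %result = (
    # \b demands a word boundary on each side of "Fred".
    boundary_end   => ("The Great Fred"      =~ /\bFred\b/) ? 1 : 0,
    boundary_start => ("Fred the Great"      =~ /\bFred\b/) ? 1 : 0,
    boundary_inner => ("Frederick the Great" =~ /\bFred\b/) ? 1 : 0,

    # ^ pins the match to the beginning of the string.
    caret_start    => ("Frederick the Great" =~ /^Fred/)    ? 1 : 0,
    caret_end      => ("The Great Fred"      =~ /^Fred/)    ? 1 : 0,

    # $ pins the match to the end of the string.
    dollar_end     => ("The Great Fred"      =~ /Fred$/)    ? 1 : 0,
);

for my $name (sort keys %result) {
    printf "%-15s %d\n", $name, $result{$name};
}
```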
So now you can probably figure out that when we said:
next LINE if $line =~ /^#/;
we meant "Go to the next iteration of the LINE loop if this line
happens to begin with a # character."
Earlier we said that the sequence \d{7,11} would match a number from
seven to eleven digits long. While strictly true, the statement is
misleading: when you use that sequence within a real pattern match
operator such as /\d{7,11}/, it does not preclude there being extra
unmatched digits after the 11 matched digits! You often need to anchor
quantified patterns on either or both ends to get what you expect.
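A quick sketch makes the difference visible. The fifteen-digit string is made up for the example, and anchoring with ^ and $ is one way to do it (a pair of \b anchors would serve when the number sits inside a longer line).

```perl
use strict;
use warnings;

my $too_long = "123456789012345";   # fifteen digits
my $local    = "5551234";           # seven digits

# Unanchored: succeeds, because eleven of the fifteen digits satisfy it.
my $unanchored = ($too_long =~ /\d{7,11}/)   ? 1 : 0;

# Anchored at both ends: the whole string must be seven to eleven digits.
my $anchored   = ($too_long =~ /^\d{7,11}$/) ? 1 : 0;
my $local_ok   = ($local    =~ /^\d{7,11}$/) ? 1 : 0;

print "unanchored=$unanchored anchored=$anchored local_ok=$local_ok\n";
```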
We mentioned earlier that you can use parentheses to group things for
quantifiers, but you can also use parentheses to remember bits and
pieces of what you matched. A pair of parentheses around a part of a
regular expression causes whatever was matched by that part to be
remembered for later use. It doesn't change what the part matches, so
/\d+/ and /(\d+)/ will still match as many digits as possible, but in
the latter case they will be remembered in a special variable to be
backreferenced later.
How you refer back to the remembered part of the string depends on
where you want to do it from. Within the same regular expression, you
use a backslash followed by an integer. The integer corresponding to a
given pair of parentheses is determined by counting left parentheses
from the beginning of the pattern, starting with one. So for example,
to match something similar to an HTML tag like "<B>Bold</B>", you
might use /<(.*?)>.*?<\/\1>/. This forces the two parts of the pattern
to match the exact same string, such as the "B" in this example.
Outside the regular expression itself, such as in the replacement
part of a substitution, you use a $ followed by an integer, that is, a
normal scalar variable named by the integer. So, if you wanted to swap
the first two words of a string, for example, you could use:
s/(\S+)\s+(\S+)/$2 $1/
The right side of the substitution (between the second and third slashes) is mostly just a funny kind of double-quoted string, which is why you can interpolate variables there, including backreference variables. This is a powerful concept: interpolation (under controlled circumstances) is one of the reasons Perl is a good text-processing language. The other reason is the pattern matching, of course. Regular expressions are good for picking things apart, and interpolation is good for putting things back together again. Perhaps there's hope for Humpty Dumpty after all.
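Both kinds of backreference can be tried in a few lines. In this sketch the sample strings and variable names are ours; \1 is used inside the pattern, and $1 and $2 in the replacement.

```perl
use strict;
use warnings;

# Inside the pattern: \1 must match whatever the first parentheses matched.
my ($tag)      = "<B>Bold</B>" =~ /<(.*?)>.*?<\/\1>/;
my $mismatched = ("<B>Bold</I>" =~ /<(.*?)>.*?<\/\1>/) ? 1 : 0;

# In the replacement: $1 and $2 hold what the two groups matched.
my $phrase = "hello world again";
$phrase =~ s/(\S+)\s+(\S+)/$2 $1/;

print "tag=$tag mismatched=$mismatched\n";
print "swapped: $phrase\n";
```

Only the first two words trade places; the rest of the string is left alone because the pattern never reaches it.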
[22] A good source of information on regular expression concepts is Jeffrey Friedl's book, Mastering Regular Expressions (O'Reilly & Associates).
[23] This is very similar to what the Unix command grep 'http:' file would do. On MS-DOS you could use the find command, but it doesn't know how to do more complicated regular expressions. (However, the misnamed findstr program of Windows NT does know about regular expressions.)
[24] Except that it won't normally match a newline. When you think about it, a "." doesn't normally match a newline in grep(1) either.
[25] Sorry, we didn't pick that notation, so don't blame us. That's just how negated character classes are customarily written in Unix culture.
[26] Don't feel bad. Even the authors get caught by this from time to time.