Chapter 9. Using Regular Expressions

Now that we’ve seen what goes inside a regular expression, let’s take what we’ve learned back into Perl.

Matches with m//

We’ve been writing patterns in pairs of forward slashes, like /fred/. But this is actually a shortcut for the m// (pattern match) operator. As we saw with the qw// operator, you may choose any pair of delimiters to quote the contents. So, we could write that same expression as m(fred), m<fred>, m{fred}, or m[fred] using those paired delimiters, or as m,fred,, m!fred!, m^fred^, or many other ways using nonpaired delimiters.[1]

The shortcut is that if you choose the forward slash as the delimiter, you may omit the initial m. Since Perl folks love to avoid typing extra characters, you’ll see most pattern matches written using slashes, as in /fred/.

Of course, you should wisely choose a delimiter that doesn’t appear in your pattern.[2] If you wanted to make a pattern to match the beginning of an ordinary web URL, you might start to write /^http:\/\// to match the initial "http://". But that’s easier to read, write, maintain, and debug if you use a better choice of delimiter: m%^http://%.[3]

It’s common to use curly braces as the delimiter. If you use a programmers’ text editor, it probably has the ability to jump from an opening curly brace to the corresponding closing one, which can be handy in maintaining code.
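As a small sketch of these delimiter choices, the three matches below test the same pattern and differ only in how it is quoted. The variable names and the sample URL are ours, made up for illustration:

```perl
use strict;
use warnings;

my $url = "http://www.example.com/index.html";  # sample data, not from the text

# Slash delimiters force us to backslash every slash in the pattern.
my $with_slashes = ($url =~ /^http:\/\//) ? 1 : 0;

# A nonpaired delimiter that doesn't appear in the pattern avoids that.
my $with_percent = ($url =~ m%^http://%) ? 1 : 0;

# Paired curly braces work just as well.
my $with_braces  = ($url =~ m{^http://}) ? 1 : 0;

print "slashes: $with_slashes, percent: $with_percent, braces: $with_braces\n";
```

All three tests succeed on this string; the second and third are simply easier on the eyes.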

Option Modifiers

There are several option modifier letters, sometimes called flags, which may be appended as a group right after the ending delimiter of a regular expression to change its behavior from the default.

Case-insensitive Matching with /i

To make a case-insensitive pattern match, so that you can match FRED as easily as fred or Fred, use the /i modifier:

print "Would you like to play a game? ";
chomp($_ = <STDIN>);
if (/yes/i) {  # case-insensitive match
  print "In that case, I recommend that you go bowling.\n";
}

Matching Any Character with /s

Do you ever feel frustrated that the dot (.) won’t match a newline? If you might have newlines in your strings, and you want the dot to be able to match them, the /s modifier will do the job. It changes every dot[4] in the pattern to act like the character class [\d\D] does, which is to match any character, even if it is a newline. Of course, you have to have a string with newlines for this to make a difference:

$_ = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";
if (/Barney.*Fred/s) {
  print "That string mentions Fred after Barney!\n";
}

Without the /s modifier, that match would fail, since the two names aren’t on the same line.
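To see the difference for yourself, here is a minimal sketch that runs the same pattern with and without /s against the same multiline string. The variable names are ours, chosen only for this illustration:

```perl
use strict;
use warnings;

my $text = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";

# Without /s, the dot stops at each newline, so the match fails.
my $without_s = ($text =~ /Barney.*Fred/)  ? "matched" : "no match";

# With /s, the dot matches newlines too, so the match succeeds.
my $with_s    = ($text =~ /Barney.*Fred/s) ? "matched" : "no match";

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```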

Combining Option Modifiers

If you have more than one option modifier to use on the same pattern, they may be used one after the other; their order isn’t significant:

if (/barney.*fred/si) {  # both /s and /i
  print "That string mentions Fred after Barney!\n";
}

Other Options

There are many other option modifiers available. We’ll cover those as we get to them, or you can read about them in the perlop manpage and in the descriptions of m// and the other regular expression operators that we’ll see later in this chapter.

The Binding Operator, =~

Matching against $_ is merely the default; the binding operator (=~) tells Perl to match the pattern on the right against the string on the left, instead of matching against $_.[5] For example:

my $some_other = "I dream of betty rubble.";
if ($some_other =~ /rub/) {
  print "Aye, there's the rub.\n";
}

The first time you see it, the binding operator looks like some kind of assignment operator. But it’s no such thing! It is simply saying, “this pattern match which would attach to $_ by default—make it work with this string on the left instead.” If there’s no binding operator, the expression is using $_ by default.

In the (somewhat unusual) example below, $likes_perl is set to a Boolean value according to what the user typed at the prompt. This is a little on the quick-and-dirty side, because the line of input itself is discarded. This code reads the line of input, tests that string against the pattern, then discards the line of input.[6] It doesn’t use or change $_ at all.

print "Do you like Perl? ";
my $likes_perl = (<STDIN> =~ /yes/i);
...  # Time passes...
if ($likes_perl) {
  print "You said earlier that you like Perl, so...\n";
  ...
}

The parentheses around the pattern-test expression aren’t required, so the following line does the same thing as the one above—it stores the result of the test (and not the line of input) into the variable:

my $likes_perl = <STDIN> =~ /yes/i;

Interpolating into Patterns

The regular expression is double-quote interpolated, just as if it were a double-quoted string. This allows us to write a quick grep-like program like this:

#!/usr/bin/perl -w
my $what = "larry";

while (<>) {
  if (/^($what)/) {  # pattern is anchored at beginning of string
    print "We saw $what in beginning of $_";
  }
}

The pattern will be built up out of whatever’s in $what when we run the pattern match. In this case, it’s the same as if we had written /^(larry)/, looking for larry at the start of each line.

But we didn’t have to get the value of $what from a literal string; we could have gotten it instead from the command-line arguments in @ARGV:

my $what = shift @ARGV;

Now, if the first command-line argument were fred|barney, the pattern becomes /^(fred|barney)/, looking for fred or barney at the start of each line.[7] The parentheses (which weren’t really necessary when searching for larry) are important, now, because without them we’d be matching fred at the start or barney anywhere in the string.
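The effect of those parentheses can be checked with a short sketch. The sample line here is invented for the illustration; it has barney in it, but not at the start:

```perl
use strict;
use warnings;

my $what = "fred|barney";
my $line = "I saw barney yesterday";   # barney is not at the start

# With parentheses: the anchor applies to both alternatives.
my $grouped   = ($line =~ /^($what)/) ? 1 : 0;

# Without parentheses: this is (^fred) or (barney anywhere).
my $ungrouped = ($line =~ /^$what/)   ? 1 : 0;

print "grouped: $grouped, ungrouped: $ungrouped\n";
```

The grouped pattern correctly rejects the line, while the ungrouped one accepts it because barney appears somewhere in the string.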

With that line changed to get the pattern from @ARGV, this program resembles the Unix grep command. But we have to watch out for metacharacters in the string. If $what contains 'fred(barney', the pattern would look like /^(fred(barney)/, and you know that can’t work right—it’ll crash your program with an invalid regular expression error. With some advanced techniques,[8] you can trap this kind of error (or prevent the magic of the metacharacters in the first place) so that it won’t crash your program. But for now, just know that if you give your users the power of regular expressions, they’ll also need the responsibility to use them correctly.
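The two approaches mentioned in the footnote can be sketched roughly as follows. This is only an outline under our own assumptions (the sample strings and variable names are made up), not a full treatment of either technique:

```perl
use strict;
use warnings;

my $what = 'fred(barney';             # unbalanced parenthesis on purpose
my $line = "fred(barney) went bowling";

# Option 1: \Q...\E (the in-pattern form of quotemeta) makes every
# metacharacter in $what match itself literally.
my $literal_match = ($line =~ /^(\Q$what\E)/) ? 1 : 0;

# Option 2: let the text be a pattern, but trap a compilation failure.
# The eval block returns undef if the match dies.
my $result = eval { $line =~ /^($what)/ ? 1 : 0 };
my $error  = defined $result ? "" : $@;

print "literal match: $literal_match\n";
print "pattern was rejected: $error" if $error;
```

Which one you want depends on whether your users are supposed to be typing patterns or plain text. The first treats their input as plain text; the second keeps the full power of patterns and merely survives the mistakes.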

The Match Variables

Do you remember the regular expression memories, which we used with backreferences in the previous chapter? Those memories are also available after the pattern match is done, after we return to Perl. They’re strings, so they are kept in scalar variables with names like $1 and $2. There are as many of these variables as there are pairs of memory parentheses in the pattern. As you’d expect, $4 means the string matched by the fourth set of parentheses. This is the same string that \4 referred to inside the pattern match.

Why are there two different ways to refer to that same string? They’re not really referring to the same string at the same time; $4 means the fourth memory of an already completed pattern match, while \4 is a backreference referring back to the fourth memory of the currently matching regular expression. Besides, backreferences work inside regular expressions only; once we’re back in the world of Perl, we’ll use $4.

These match variables are a big part of the power of regular expressions, because they let us pull out the parts of a string:

$_ = "Hello there, neighbor";
if (/\s(\w+),/) {             # memorize the word between space and comma
  print "the word was $1\n";  # the word was there
}

Or you could use more than one memory at once:

$_ = "Hello there, neighbor";
if (/(\S+) (\S+), (\S+)/) {
  print "words were $1 $2 $3\n";
}

That tells us that the words were Hello there neighbor. Notice that there’s no comma in the output. Because the comma is outside of the memory parentheses in the pattern, there is no comma in memory two. Using this technique, we can choose exactly what we want in the memories, as well as what we want to leave out.

You could even have an empty match variable,[9] if that part of the pattern might be empty. That is, a match variable may contain the empty string:

my $dino = "I fear that I'll be extinct after 1000 years.";
if ($dino =~ /(\d*) years/) {
  print "That said '$1' years.\n";  # 1000
}

$dino = "I fear that I'll be extinct after a few million years.";
if ($dino =~ /(\d*) years/) {
  print "That said '$1' years.\n";  # empty string
}

The Persistence of Memory

These match variables generally stay around until the next successful pattern match.[10] That is, an unsuccessful match leaves the previous memories intact, but a successful one resets them all. But this correctly implies that you shouldn’t use these match variables unless the match succeeded; otherwise, you could be seeing a memory from some previous pattern. The following (bad) example is supposed to print a word matched from $wilma. But if the match fails, it’s using whatever leftover string happens to be found in $1:

$wilma =~ /(\w+)/;  # BAD! Untested match result
print "Wilma's word was $1... or was it?\n";

This is another reason that a pattern match is almost always found in the conditional expression of an if or while:

if ($wilma =~ /(\w+)/) {
  print "Wilma's word was $1.\n";
} else {
  print "Wilma doesn't have a word.\n";
}
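To make the danger concrete, here is a small sketch of a leftover memory. The strings are invented for the illustration: $fred has a word in it, and $wilma deliberately has none:

```perl
use strict;
use warnings;

my $fred  = "fred flintstone";
my $wilma = "!!!";                 # contains no word characters at all

$fred =~ /(\w+)/;                  # succeeds; $1 is now "fred"
my $matched  = ($wilma =~ /(\w+)/) ? 1 : 0;  # fails; $1 is left alone
my $leftover = $1;                 # still the word from $fred's match

print "match on \$wilma succeeded? $matched\n";
print "but \$1 still holds '$leftover'\n";
```

Anyone printing $1 after the second match without testing the result would report Fred's word as Wilma's.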

Since these memories don’t stay around forever, you shouldn’t use a match variable like $1 more than a few lines after its pattern match. If your maintenance programmer adds a new regular expression between your regular expression and your use of $1, you’ll be getting the value of $1 for the second match, rather than the first. For this reason, if you need a memory for more than a few lines, it’s generally best to copy it into an ordinary variable. Doing this helps make the code more readable at the same time:

if ($wilma =~ /(\w+)/) {
  my $wilma_word = $1;
  ...
}

Later, in Chapter 14, we’ll see how to get the memory value directly into the variable at the same time as the pattern match happens, without having to use $1 explicitly.

The Automatic Match Variables

There are three more match variables that you get for free,[11] whether the pattern has memory parentheses or not. That’s the good news; the bad news is that these variables have weird names.

Now, Larry probably would have been happy enough to call these by slightly-less-weird names, like perhaps $gazoo or $ozmodiar. But those are names that you just might want to use in your own code. To keep ordinary Perl programmers from having to memorize the names of all of Perl’s special variables before choosing their first variable names in their first programs,[12] Larry has given strange names to many of Perl’s builtin variables, names that “break the rules.” In this case, the names are punctuation marks: $&, $`, and $'. They’re strange, ugly, and weird, but those are their names.[13]

The part of the string that actually matched the pattern is automatically stored in $&:

if ("Hello there, neighbor" =~ /\s(\w+),/) {
  print "That actually matched '$&'.\n";
}

That tells us that the part that matched was " there," (with a space, a word, and a comma). Memory one, in $1, has just the five-letter word there, but $& has the entire matched section.

Whatever came before the matched section is in $`, and whatever was after it is in $'. Another way to say that is that $` holds whatever the regular expression engine had to skip over before it found the match, and $' has the remainder of the string that the pattern never got to. If you glue these three strings together in order, you’ll always get back the original string:

if ("Hello there, neighbor" =~ /\s(\w+),/) {
  print "That was ($`)($&)($').\n";
}

The message shows the string as (Hello)( there,)( neighbor), showing the three automatic match variables in action. This may seem familiar, and for good reason: These automatic memory variables are what the pattern test program (from Chapter 7) was using in its line of “mystery” code, to show what part of the string was being matched by the pattern:

print "Matched: |$`<$&>$'|\n";  # The three automatic match variables

Any or all of these three automatic match variables may be empty, of course, just like the numbered match variables. And they have the same scope as the numbered match variables. Generally, that means that they’ll stay around until the next successful pattern match.

Now, we said earlier that these three are “free.” Well, freedom has its price. In this case, the price is that once you use any one of these automatic match variables anywhere in your entire program, other regular expressions will run a little more slowly. Now, this isn’t a giant slowdown, but it’s enough of a worry that many Perl programmers will simply never use these automatic match variables.[14] Instead, they’ll use a workaround. For example, if the only one you need is $&, just put parentheses around the whole pattern and use $1 instead (you may need to renumber the pattern’s memories, of course).
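That workaround might look like the following sketch, which reuses the pattern from the earlier examples. The variable names are ours; note how the original memory moves from $1 to $2 once the outer parentheses are added:

```perl
use strict;
use warnings;

my $string = "Hello there, neighbor";
my ($whole, $word) = ("", "");

# The outer parentheses become memory one (standing in for $&);
# the original memory is renumbered to two.
if ($string =~ /(\s(\w+),)/) {
  ($whole, $word) = ($1, $2);
}

print "whole match: '$whole', word: '$word'\n";
```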

Match variables (both the automatic ones and the numbered ones) are most often used in substitutions, which are the topic of the next section.

Substitutions with s///

If you think of the m// pattern match as being like your word processor’s “search” feature, the “search and replace” feature would have to be Perl’s s/// substitution operator. This simply replaces whatever part of a variable[15] matches a pattern with a replacement string:

$_ = "He's out bowling with Barney tonight.";
s/Barney/Fred/;  # Replace Barney with Fred
print "$_\n";

If the match fails, nothing happens, and the variable is untouched:

# Continuing from above; $_ has "He's out bowling with Fred tonight."
s/Wilma/Betty/;  # Replace Wilma with Betty (fails)

Of course, both the pattern and the replacement string could be more complex. Here, the replacement string uses the first memory variable, which is set by the pattern match:

s/with (\w+)/against $1/;
print "$_\n";  # says "He's out bowling against Fred tonight."

Here are some other possible substitutions. (These are here only as samples; in the real world, it would not be typical to do so many unrelated substitutions in a row.)

$_ = "green scaly dinosaur";
s/(\w+) (\w+)/$2, $1/;  # Now it's "scaly, green dinosaur"
s/^/huge, /;            # Now it's "huge, scaly, green dinosaur"
s/,.*een//;             # Empty replacement: Now it's "huge dinosaur"
s/green/red/;           # Failed match: still "huge dinosaur"
s/\w+$/($`!)$&/;        # Now it's "huge (huge !)dinosaur"
s/\s+(!\W+)/$1 /;       # Now it's "huge (huge!) dinosaur"
s/huge/gigantic/;       # Now it's "gigantic (huge!) dinosaur"

There’s a return value from s///; it’s true if a substitution was successful; otherwise it’s false:

$_ = "fred flintstone";
if (s/fred/wilma/) {
  print "Successfully replaced fred with wilma!\n";
}

Global Replacements with /g

As you may have noticed in a previous example, s/// will make just one replacement, even if others are possible. Of course, that’s just the default. The /g modifier tells s/// to make all possible nonoverlapping[16] replacements:

$_ = "home, sweet home!";
s/home/cave/g;
print "$_\n";  # "cave, sweet cave!"

A fairly common use of a global replacement is to collapse whitespace; that is, to turn any arbitrary whitespace into a single space:

$_ = "Input  data\t may have    extra whitespace.";
s/\s+/ /g;  # Now it says "Input data may have extra whitespace."

Once we show collapsing whitespace, everyone wants to know about stripping leading and trailing whitespace. That’s easy enough, in two steps:[17]

s/^\s+//;  # Replace leading whitespace with nothing
s/\s+$//;  # Replace trailing whitespace with nothing
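For comparison only, here is a sketch of what a one-step version could look like, next to the two-step version. The sample string and variable names are ours, and the code uses the binding operator, which works with s/// just as it does with m//:

```perl
use strict;
use warnings;

my $two_step = "   some padded text \t ";
my $one_step = $two_step;          # same starting data for both versions

# Two steps, as in the text.
$two_step =~ s/^\s+//;
$two_step =~ s/\s+$//;

# One step: an alternation, plus /g so that both ends are handled.
$one_step =~ s/^\s+|\s+$//g;

print "two steps: '$two_step'\n";
print "one step:  '$one_step'\n";
```

Both give the same result here. The two-step form is the one recommended above, and it is also the easier of the two to read at a glance.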

Different Delimiters

Just as we did with m// and qw//, we can change the delimiters for s///. But the substitution uses three delimiter characters, so things are a little different.

With ordinary (non-paired) characters, which don’t have a left and right variety, just use three of them, as we did with the forward slash. Here, we’ve chosen the pound sign[18] as the delimiter:

s#^https://#http://#;

But if you use paired characters, which have a left and right variety, you have to use two pairs: one to hold the pattern and one to hold the replacement string. In this case, the delimiters don’t have to be the same kind around the string as they are around the pattern. In fact, the delimiters of the string could even be non-paired. These are all the same:

s{fred}{barney};
s[fred](barney);
s<fred>#barney#;

Option Modifiers

In addition to the /g modifier,[19] substitutions may use the /i and /s modifiers that we saw in ordinary pattern matching. The order of modifiers isn’t significant.

s#wilma#Wilma#gi;   # replace every WiLmA or WILMA with Wilma
s{__END__.*}{}s;    # chop off the end marker and all following lines

The Binding Operator

Just as we saw with m//, we can choose a different target for s/// by using the binding operator:

$file_name =~ s#^.*/##s;  # In $file_name, remove any Unix-style path

Case Shifting

It often happens in a substitution that you’ll want to make sure that a replacement word is properly capitalized (or not, as the case may be). That’s easy to accomplish with Perl, by using some backslash escapes. The \U escape forces what follows to all uppercase:

$_ = "I saw Barney with Fred.";
s/(fred|barney)/\U$1/gi;  # $_ is now "I saw BARNEY with FRED."

Similarly, the \L escape forces lowercase. Continuing from the previous code:

s/(fred|barney)/\L$1/gi;  # $_ is now "I saw barney with fred."

By default, these affect the rest of the (replacement) string; or you can turn off case shifting with \E:

s/(\w+) with (\w+)/\U$2\E with $1/i;  # $_ is now "I saw FRED with barney."

When written in lowercase (\l and \u), they affect only the next character:

s/(fred|barney)/\u$1/ig;  # $_ is now "I saw FRED with Barney."

You can even stack them up. Using \u with \L means “all lower case, but capitalize the first letter”:[20]

s/(fred|barney)/\u\L$1/ig;  # $_ is now "I saw Fred with Barney."
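Since each of those substitutions works on the result of the one before it, it can be hard to see what each escape does by itself. The sketch below applies three of them independently to copies of one messy string. The variable names and the sample string are ours, and the (my $copy = $original) =~ s/// form is just a compact way to copy a string and then change the copy:

```perl
use strict;
use warnings;

my $text = "i saw BARNEY with fred.";   # sample input with messy case

# Each line copies $text, then runs the substitution on the copy.
(my $upper  = $text) =~ s/(fred|barney)/\U$1/gi;    # all uppercase
(my $lower  = $text) =~ s/(fred|barney)/\L$1/gi;    # all lowercase
(my $proper = $text) =~ s/(fred|barney)/\u\L$1/gi;  # first letter only

print "$upper\n$lower\n$proper\n";
```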

As it happens, although we’re covering case shifting in relation to substitutions, it’s available in any double-quotish string:

print "Hello, \L\u$name\E, would you like to play a game?\n";

The split Operator

Another operator that uses regular expressions is split, which breaks up a string according to a separator. This is useful for tab-separated data, or colon-separated, whitespace-separated, or anything-separated data, really.[21] So long as you can specify the separator with a regular expression (and generally, it’s a simple regular expression), you can use split. It looks like this:

@fields = split /separator/, $string;

The split operator[22] drags the pattern through a string and returns a list of fields (substrings) that were separated by the separators. Whenever the pattern matches, that’s the end of one field and the start of the next. So, anything that matches the pattern will never show up in the returned fields. Here’s a typical split pattern, splitting on colons:

@fields = split /:/, "abc:def:g:h";  # gives ("abc", "def", "g", "h")

You could even have an empty field, if there were two delimiters together:

@fields = split /:/, "abc:def::g:h";  # gives ("abc", "def", "", "g", "h")

Here’s a rule that seems odd at first, but it rarely causes problems: Leading empty fields are always returned, but trailing empty fields are discarded:[23]

@fields = split /:/, ":::a:b:c:::";  # gives ("", "", "", "a", "b", "c")

It’s also common to split on whitespace, using /\s+/ as the pattern. Under that pattern, all whitespace runs are equivalent to a single space:

my $some_input = "This  is a \t        test.\n";
my @args = split /\s+/, $some_input;  # ("This", "is", "a", "test.")

The default for split is to break up $_ on whitespace:

my @fields = split;  # like split /\s+/, $_;

This is almost the same as using /\s+/ as the pattern, except that a leading empty field is suppressed; so, if the line starts with whitespace, you won’t see an empty field at the start of the list. (If you’d like to get the same behavior when splitting another string on whitespace, just use a single space in place of the pattern: split ' ', $other_string. Using a space instead of the pattern is a special kind of split.)
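The rules about leading and trailing empty fields are easiest to trust once you’ve counted the fields yourself. Here is a short sketch that does so, including the -1 third argument mentioned in the footnote. The variable names and the padded sample string are ours:

```perl
use strict;
use warnings;

my $line = ":::a:b:c:::";

my @default = split /:/, $line;        # trailing empty fields dropped
my @all     = split /:/, $line, -1;    # -1 keeps trailing empty fields

my $padded  = "   fred barney  ";
my @pattern = split /\s+/, $padded;    # leading empty field is kept
my @space   = split ' ', $padded;      # leading empty field is suppressed

print scalar(@default), " fields by default, ", scalar(@all), " with -1\n";
print scalar(@pattern), " fields with /\\s+/, ", scalar(@space), " with ' '\n";
```

The first pair gives six and nine fields; the second pair gives three and two.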

Generally, the patterns used for split are as simple as the ones you see here. But if the pattern becomes more complex, be sure to avoid using memory parentheses in the pattern; see the perlfunc manpage for more information.[24]

The join Function

The join function doesn’t use patterns. So why is it in this chapter? It’s here because, in a sense, join performs the opposite function of split: split breaks up a string into a number of pieces, and join glues together a bunch of pieces to make a single string. The join function looks like this:

my $result = join $glue, @pieces;

The first argument to join is the glue, which may be any string. The remaining arguments are a list of pieces. join puts the glue string between the pieces and returns the resulting string:

my $x = join ":", 4, 6, 8, 10, 12;  # $x is "4:6:8:10:12"

In that example, we had five items, so there are only four colons. That is, there are four pieces of glue. The glue shows up only between the pieces, never before or after them. So, there will be one fewer piece of glue than the number of items in the list.

This means that there may be no glue at all, if the list doesn’t have at least two elements:

my $y = join "foo", "bar";       # gives just "bar", since no fooglue is needed
my @empty;                       # empty array
my $empty = join "baz", @empty;  # no items, so it's an empty string

Using $x from above, we can break up a string and put it back together with a different delimiter:

my @values = split /:/, $x;  # @values is (4, 6, 8, 10, 12)
my $z = join "-", @values;   # $z is "4-6-8-10-12"

Although split and join work well together, don’t forget that the first argument to join is always a string, not a pattern.

Exercises

See Section A.8 for answers to the following exercises:

  1. [7] Make a pattern that will match three consecutive copies of whatever is currently contained in $what. That is, if $what is fred, your pattern should match fredfredfred. If $what is fred|barney, your pattern should match fredfredbarney or barneyfredfred or barneybarneybarney or many other variations. (Hint: You should set $what at the top of the pattern test program with a statement like my $what = 'fred|barney';.)

  2. [15] Write a program that looks through the perlfunc.pod file for lines that start with =item and some whitespace, followed by a Perl identifier name (made of letters, digits, and underscores, but never starting with a digit), like the lines below. (There may be more text on the line after the identifier name; just ignore it.) You can locate the perlfunc.pod file on your system with the command perldoc -l perlfunc, or ask your local expert. (Hint: You’ll need the diamond operator to open this file. How will it get the filename?) Have the program print each identifier name as it finds it; there will be hundreds of them, and many will appear more than once in the file.

    As an example, the following lines of input resemble what you’ll find in perlfunc.pod. For the first line, the program should print wilma. For the second, it should print fred (ignoring the word flintstone, since we’re interested only in the identifier name):

    =item wilma
    =item fred flintstone

  3. [10] Modify the previous program to list only the identifier names that appear more than twice on those =item lines, and tell how many times each one appeared. (That is, we want to know which identifier names appear on at least three separate =item lines in the file.) There should be a couple of dozen, depending upon your version of Perl.



[1] Nonpaired delimiters are the ones that don’t have a different “left” and “right” variety; the same punctuation mark is used for both ends.

[2] If you’re using paired delimiters, you shouldn’t generally have to worry about using the delimiter inside the pattern, since that delimiter will generally be paired inside your pattern. That is, m(fred(.*)barney) and m{\w{2,}} and m[wilma[ ]+betty] are all fine, even though the pattern contains the quoting character, since each “left” has a corresponding “right”. But the angle brackets (“<” and “>”) aren’t regular expression metacharacters, so they may not be paired; if the pattern were m{(\d+)\s*>=?\s*(\d+)}, quoting it with angle brackets would mean having to backslash the greater-than sign so that it wouldn’t prematurely end the pattern.

[3] Remember, the forward slash is not a metacharacter, so it doesn’t need to be backslashed when it’s not the delimiter.

[4] If you wish to change just some of them, and not all, you’ll probably want to replace just those few with [\d\D].

[5] The binding operator is also used with some other operations besides the pattern match, as we’ll see later.

[6] Remember, the line of input is not automatically stored into $_ unless the line-input operator (<STDIN>) is all alone in the conditional expression of a while loop.

[7] The astute reader will know that you can’t generally type fred|barney as an argument at the command line because the vertical bar is a shell metacharacter. See the documentation to your shell to learn about how to quote command-line arguments.

[8] In this case, you would use an eval block to trap the error, or you would quote the interpolated text using quotemeta (or its \Q equivalent form) so that it’s no longer treated as a regular expression.

[9] As opposed to an undefined one. If you have three or fewer sets of parentheses in the pattern, $4 will be undef.

[10] The actual scoping rule is much more complex (see the documentation if you need it), but as long as you don’t expect the match variables to be untouched many lines after a pattern match, you shouldn’t have problems.

[11] Yeah, right. There’s no such thing as a free match. These are “free” only in the sense that they don’t require match parentheses. Don’t worry; we’ll mention their real cost a little later, though.

[12] You should still avoid a few classical variable names like $ARGV, but these few are all in all-caps. All of Perl’s builtin variables are documented in the perlvar manpage.

[13] If you really can’t stand these names, check out the English module, which attempts to give all of Perl’s strangest variables nearly normal names. But the use of this module has never really caught on; instead, Perl programmers have grown to love the punctuation-mark variable names, strange as they are.

[14] Most of these folks haven’t actually benchmarked their programs to see whether their workarounds actually save time, though; it’s as though these variables were poisonous or something. But we can’t blame them for not benchmarking—many programs that could benefit from these three variables take up only a few minutes of CPU time in a week, so benchmarking and optimizing would be a waste of time. But in that case, why fear a possible extra millisecond? By the way, the Perl developers are working on this problem, but there will probably be no solution before Perl 6.

[15] Unlike m//, which can match against any string expression, s/// is modifying data that must therefore be contained in what’s known as an lvalue. This is nearly always a variable, although it could actually be anything that could be used on the left side of an assignment operator.

[16] It’s nonoverlapping because each new match starts looking just beyond the latest replacement.

[17] It could be done in one step, but this way is better.

[18] With apologies to our British friends, to whom the pound sign is something else! Although the pound sign is generally the start of a comment in Perl, it won’t start a comment when the parser knows to expect a delimiter—in this case, immediately after the s that starts the substitution.

[19] We speak of the modifiers with names like "/i", even if the delimiter is something different than a slash.

[20] The \L and \u may appear together in either order. Larry realized that people would sometimes get those two backwards, so he made Perl figure out that you want just the first letter capitalized and the rest lowercase. Larry is a pretty nice guy.

[21] Except “comma-separated values,” normally called CSV files. Those are a pain to do with split; you’re better off getting the Text::CSV module from CPAN.

[22] It’s an operator, even though it acts a lot like a function, and everyone generally calls it a function. But the technical details of the difference are beyond the scope of this book.

[23] This is merely the default. It’s this way for efficiency. If you worry about losing trailing empty fields, use -1 as a third argument to split and they’ll be kept; see the perlfunc manpage.

[24] And you might want to check out the nonmemory grouping-only parenthesis notation as well, in the perlre manpage.
