Chapter 8. More About Regular Expressions

In the previous chapter, we saw the beginnings of what regular expressions can do. Here we’ll see some of their other common features.

Character Classes

A character class, a list of possible characters inside square brackets ([]), matches any single character from within the class. It matches just one single character, but that one character may be any of the ones listed.

For example, the character class [abcwxyz] may match any one of those seven characters. For convenience, you may specify a range of characters with a hyphen (-), so that class may also be written as [a-cw-z]. That didn’t save much typing, but it’s more usual to make a character class like [a-zA-Z], to match any one letter out of that set of 52.[1] You may use the same character shortcuts as in any double-quotish string to define a character, so the class [\000-\177] matches any seven-bit ASCII character.[2]

Of course, a character class will be just part of a full pattern; it will never stand on its own in Perl. For example, you might see code that says something like this:

$_ = "The HAL-9000 requires authorization to continue.";
if (/HAL-[0-9]+/) {
  print "The string mentions some model of HAL computer.\n";
}

Sometimes, it’s easier to specify the characters left out, rather than the ones within the character class. A caret (^) at the start of the character class negates it. That is, [^def] will match any single character except one of those three. And [^n\-z] matches any character except for n, hyphen, or z. (Note that the hyphen is backslashed, because it’s special inside a character class. But the first hyphen in /HAL-[0-9]+/ doesn’t need a backslash, because hyphens aren’t special outside a character class.)
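As a small sketch of the two ideas above, the following program tries a plain character class and a negated one against a few strings. The sample strings and variable names are made up for illustration, and grep (which keeps the list items for which the block is true) is used only to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @samples = ("HAL-9000", "HAL-x", "n-z", "abc");

# Keep strings with at least one character that is NOT n, hyphen, or z.
my @has_other = grep { /[^n\-z]/ } @samples;

# Keep strings that mention some model of HAL computer.
my @has_hal = grep { /HAL-[0-9]+/ } @samples;

print "Has a character other than n, -, or z: @has_other\n";
print "Mentions a HAL model: @has_hal\n";
```

The string n-z is the only one rejected by the negated class, because every one of its characters is in the excluded set.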

Character Class Shortcuts

Some character classes appear so frequently that they have shortcuts. For example, the character class for any digit, [0-9], may be abbreviated as \d. Thus, the pattern from the example about HAL could be written /HAL-\d+/ instead.

The shortcut \w is a so-called “word” character: [A-Za-z0-9_]. If your “words” are made up of ordinary letters, digits, and underscores, you’ll be happy with this. Most of the rest of us have words made up of ordinary letters, hyphens, and apostrophes,[3] and we’d like to change this. As of this writing, the Perl developers are working on it, but it’s not available yet.[4] So use this one only when you want ordinary letters, digits, and underscores.

Of course, \w doesn’t match a “word”; it merely matches a single “word” character. To match an entire word, though, the plus modifier is handy. A pattern like /fred \w+ barney/ will match fred and a space, then a “word”, then a space and barney. That is, it’ll match if there’s one word[5] between fred and barney, set off by single spaces.

As you may have noticed in that previous example, it might be handy to be able to match spaces more flexibly. The \s shortcut is good for whitespace; it’s the same as [\f\t\n\r ]. That is, it’s the same as a class containing the five whitespace characters form-feed, tab, newline, carriage return, and the space character itself. These are the characters that merely move the printing position around; they don’t use any ink. Still, like the other shortcuts we’ve just seen, \s matches just a single character from the class, so it’s usual to use either \s* for any amount of whitespace (including none at all), or \s+ for one or more whitespace characters. (In fact, it’s rare to see \s without one of those quantifiers.) Since all of those whitespace characters look about the same to us humans, we can treat them all in the same way with this shortcut.
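To see the difference between literal spaces and \s+, here is a minimal sketch that runs both styles of pattern over three made-up lines. The test strings and variable names are assumptions for the sake of the example.

```perl
use strict;
use warnings;

# Hypothetical test strings with different kinds of whitespace.
my @lines = (
  "fred wilma barney",        # single spaces
  "fred  wilma barney",       # two spaces after fred
  "fred\twilma \n barney",    # tab, then space-newline-space
);

# Exactly one space on each side of the word.
my @strict = grep { /fred \w+ barney/ } @lines;

# Any run of whitespace on each side of the word.
my @loose = grep { /fred\s+\w+\s+barney/ } @lines;

print scalar(@strict), " line(s) match with single spaces\n";
print scalar(@loose), " line(s) match with \\s+\n";
```

Only the first line satisfies the single-space pattern, while all three satisfy the \s+ version.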

Negating the Shortcuts

Sometimes you may want the opposite of one of these three shortcuts. That is, you may want [^\d], [^\w], or [^\s], meaning a nondigit character, a nonword character, or a nonwhitespace character. That’s easy enough to accomplish by using their uppercase counterparts: \D, \W, or \S. These match any character that their counterpart would not match.

Any of these shortcuts will work either in place of a character class (standing on their own in a pattern), or inside the square brackets of a larger character class. That means that you could now use /[\dA-Fa-f]+/ to match hexadecimal (base 16) numbers, which use letters ABCDEF (or the same letters in lowercase) as additional digits.

Another compound character class is [\d\D], which means any digit, or any non-digit. That is to say, any character at all! This is a common way to match any character, even a newline. (As opposed to ., which matches any character except a newline.) And then there’s the totally useless [^\d\D], which matches anything that’s not either a digit or a non-digit. Right: nothing!
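The following sketch checks those claims on a couple of made-up strings. Each test stores 1 or 0 so the results are easy to print; the variable names are ours, not anything special to Perl.

```perl
use strict;
use warnings;

# A shortcut inside a larger class: hexadecimal digits.
my $hex_yes = ("ff8000" =~ /[\dA-Fa-f]+/) ? 1 : 0;   # has hex digits
my $hex_no  = ("xyz"    =~ /[\dA-Fa-f]+/) ? 1 : 0;   # no hex digits at all

# A string with a newline between a and b.
my $text = "a\nb";

my $dot_matches   = ($text =~ /a.b/)      ? 1 : 0;   # dot won't match newline
my $class_matches = ($text =~ /a[\d\D]b/) ? 1 : 0;   # [\d\D] matches anything
my $never_matches = ($text =~ /[^\d\D]/)  ? 1 : 0;   # no character qualifies

print "hex: $hex_yes $hex_no\n";
print "dot: $dot_matches, [\\d\\D]: $class_matches, [^\\d\\D]: $never_matches\n";
```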

General Quantifiers

A quantifier in a pattern means to repeat the preceding item a certain number of times. We’ve already seen three quantifiers: *, +, and ?. But if none of those three suits your needs, just use a comma-separated pair of numbers inside curly braces ({}) to specify exactly how few and how many repetitions are allowed.

So the pattern /a{5,15}/ will match from five to fifteen repetitions of the letter a. If the a appears three times, that’s too few, so it won’t match. If it appears five times, it’s a match. If it appears ten times, that’s still a match. If it appears twenty times, just the first fifteen will match, since that’s the upper limit.

If you omit the second number (but include the comma), there’s no upper limit to the number of times the item will match. So, /(fred){3,}/ will match if there are three or more instances of fred in a row (with no extra characters, like spaces, allowed between each fred and the next). There’s no upper limit, so that would match eighty-eight instances of fred, if you had a string with that many.

If you omit the comma as well as the upper bound, the number given is an exact count: /\w{8}/ will match exactly eight word characters (occurring as part of a larger string, perhaps).

In fact, the three quantifier characters that we saw earlier are just common shortcuts. The star is the same as the quantifier {0,}, meaning zero or more. The plus is the same as {1,}, meaning one or more. And the question mark could be written as {0,1}. In practice, it’s unusual to need any curly-brace quantifiers, since the three shortcut characters are nearly always the only ones needed.
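Here is a short sketch of the curly-brace quantifiers at work. It peeks ahead a little by using parentheses and $1 to see how much of the string was matched; that technique is covered properly later, and the test strings here are invented for the example.

```perl
use strict;
use warnings;

my $three_as  = "a" x 3;     # "aaa"
my $twenty_as = "a" x 20;    # twenty a's in a row

# Three repetitions are too few for {5,15}.
my $too_few = ($three_as =~ /a{5,15}/) ? 1 : 0;

# Twenty repetitions match, but only fifteen are consumed.
my $matched_length = 0;
if ($twenty_as =~ /(a{5,15})/) {
  $matched_length = length $1;
}

# {3,} needs three or more freds with nothing between them.
my $three_freds  = ("fredfredfred"   =~ /(fred){3,}/) ? 1 : 0;
my $spaced_freds = ("fred fred fred" =~ /(fred){3,}/) ? 1 : 0;

# {8} is an exact count, possibly inside a longer string.
my $eight_in_long = ("flintstone" =~ /\w{8}/) ? 1 : 0;
my $eight_in_four = ("fred"       =~ /\w{8}/) ? 1 : 0;

print "too few: $too_few, matched length: $matched_length\n";
print "freds: $three_freds $spaced_freds, eight: $eight_in_long $eight_in_four\n";
```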

Anchors

By default, if a pattern doesn’t match at the start of the string, it can “float” on down the string, trying to match somewhere else. But there are a number of anchors that may be used to hold the pattern at a particular point in a string.

The caret[6] anchor (^) marks the beginning of the string, while the dollar sign ($) marks the end.[7] So the pattern /^fred/ will match fred only at the start of the string; it wouldn’t match manfred mann. And /rock$/ will match rock only at the end of the string; it wouldn’t match knute rockne.

Sometimes, you’ll want to use both of these anchors, to ensure that the pattern matches an entire string. A common example is /^\s*$/, which matches a blank line. But this “blank” line may include some whitespace characters, like tabs and spaces, which are invisible to you and me. Any line that matches that pattern looks just like any other one on paper, so this pattern treats all blank lines as equivalent. Without the anchors, it would match nonblank lines as well.
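A quick sketch of the anchors in action follows. The strings come from the examples above, plus two invented lines (one blank, one not) to show what happens when the anchors are left off.

```perl
use strict;
use warnings;

my $start_yes = ("fred flintstone" =~ /^fred/) ? 1 : 0;   # fred at the start
my $start_no  = ("manfred mann"    =~ /^fred/) ? 1 : 0;   # fred, but later on
my $end_yes   = ("hard rock"       =~ /rock$/) ? 1 : 0;   # rock at the end
my $end_no    = ("knute rockne"    =~ /rock$/) ? 1 : 0;   # rock, but not last

my $blank_line    = " \t \n";     # only whitespace
my $nonblank_line = "  x  \n";    # whitespace around an x

my $blank_matches      = ($blank_line    =~ /^\s*$/) ? 1 : 0;
my $nonblank_matches   = ($nonblank_line =~ /^\s*$/) ? 1 : 0;

# Without anchors, \s* is happy to match zero characters anywhere.
my $unanchored_matches = ($nonblank_line =~ /\s*/)   ? 1 : 0;

print "start: $start_yes $start_no, end: $end_yes $end_no\n";
print "blank: $blank_matches $nonblank_matches, unanchored: $unanchored_matches\n";
```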

Word Anchors

Anchors aren’t just at the ends of the string. The word-boundary anchor, \b, matches at either end of a word.[8] So we can use /\bfred\b/ to match the word fred but not frederick or alfred or manfred mann. This is similar to the feature often called something like “match whole words only” in a word processor’s search command.

Alas, these aren’t words as you and I are likely to think of them; they’re those \w-type words made up of ordinary letters, digits, and underscores. The \b anchor matches at the start or end of a group of \w characters.

In Figure 8-1, there’s a grey underline under each “word,” and the arrows show the corresponding places where \b could match. There are always an even number of word boundaries in a given string, since there’s an end-of-word for every start-of-word.

The “words” are sequences of letters, digits, and underscores; that is, a word in this sense is what’s matched by /\w+/. There are five words in that sentence: That, s, a, word, and boundary.[9] Notice that the quote marks around word don’t change the word boundaries; these words are made of \w characters.

Each arrow points to the beginning or the end of one of the grey underlines, since the word boundary anchor \b matches only at the beginning or the end of a group of word characters.

Figure 8-1. Word-boundary matches with \b

The word-boundary anchor is useful to ensure that we don’t accidentally find cat in delicatessen, dog in boondoggle, or fish in selfishness. Sometimes you’ll want just one word-boundary anchor, as when using /\bhunt/ to match words like hunt or hunting or hunter, but not shunt, or when using /stone\b/ to match words like sandstone or flintstone but not capstones.

The nonword-boundary anchor is \B; it matches at any point where \b would not match. So the pattern /\bsearch\B/ will match searches, searching, and searched, but not search or researching.
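The word lists from this section make a handy test bed. This sketch filters each list with the matching pattern; the array names are ours, and grep is used just to collect the strings that match.

```perl
use strict;
use warnings;

# Whole word only: both ends anchored.
my @fred = grep { /\bfred\b/ } ("fred", "frederick", "alfred", "manfred mann");

# Start of word only.
my @hunt = grep { /\bhunt/ } qw(hunt hunting hunter shunt);

# End of word only.
my @stone = grep { /stone\b/ } qw(sandstone flintstone capstones);

# Word starts with search, and something wordlike must follow it.
my @search = grep { /\bsearch\B/ }
  qw(searches searching searched search researching);

print "fred:   @fred\n";
print "hunt:   @hunt\n";
print "stone:  @stone\n";
print "search: @search\n";
```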

Memory Parentheses

You remember that parentheses (“( )”) may be used for grouping together parts of a pattern. They also have a second function: they tell the regular expression engine to remember what was in the substring matched by the pattern in the parentheses. That is to say, it doesn’t remember what was in the pattern itself; it remembers what was in the corresponding part of the string. Whenever you use parentheses for grouping, they automatically work as memory parentheses as well.

So, if you use /./, you’ll match any single character (except newline); if you use /(.)/, you’ll still match any single character, but now it will be kept in a regular expression memory. For each pair of parentheses in the pattern, you’ll have one regular expression memory.

Backreferences

A backreference refers back to a memory that was saved earlier in the current pattern’s processing. Backreferences are made with a backslash, which is easy to remember. For example, \1 contains the first regular expression memory (that is, the part of the string matched by the first pair of parentheses).

Backreferences are used to go back and match the exact same[10] string that was matched earlier in the pattern. So, /(.)\1/ means to match any one character, remember it as memory one, then match memory one again. In other words, match any character, followed by the same character. So, this pattern will match strings with doubled letters, as in bamm-bamm and betty. Of course, the dot will match characters other than letters, so if a string has two spaces in a row, two tabs in a row, or two asterisks in a row, it will match.

That’s not the same as the pattern /../, which will match any character followed by any character; those two could be the same, or they could be different. /(.)\1/ means to match any character followed by the same character.
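To make the contrast concrete, this sketch runs both patterns over a few strings. The list is made up, and includes one entry with two spaces in a row to show that the dot isn’t limited to letters.

```perl
use strict;
use warnings;

# Hypothetical strings; the last one has two spaces in a row.
my @strings = ("bamm-bamm", "betty", "fred", "two  spaces");

# Any character followed by that very same character.
my @doubled = grep { /(.)\1/ } @strings;

# Any character followed by any character at all.
my @any_two = grep { /../ } @strings;

print scalar(@doubled), " string(s) have a doubled character\n";
print scalar(@any_two), " string(s) have at least two characters\n";
```

The string fred matches /../ but has no doubled character, so it is the only one left out of the first list.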

A typical usage of these memories might be if you have some HTML-like[11] text to process. For example, maybe you want to match a tag like these two, which may use either single quotes or double quotes:

<image source='fred.png'>
<image source="fred's-birthday.png">

The tag may have either single quotes or double quotes, since the quoted data may include the other kind of mark (as with the apostrophe in the second example tag). So the pattern might look like this: /<image source=(['"]).*\1>/. That says that the opening quote mark may be of either type, but there must be a matching mark at the end of the quote.[12]
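Here is that pattern tried against the two tags above, plus a third, deliberately broken tag of our own with mismatched quotes. The q{...} form is just a way to write a string that contains both kinds of quote mark without backslashes.

```perl
use strict;
use warnings;

my @tags = (
  q{<image source='fred.png'>},
  q{<image source="fred's-birthday.png">},
  q{<image source='fred.png">},    # made-up bad tag: quotes don't match
);

# The closing quote must be the same mark that opened the value.
my @good = grep { /<image source=(['"]).*\1>/ } @tags;

print scalar(@good), " of ", scalar(@tags), " tags have matching quotes\n";
```

As the footnotes warn, this is only an illustration of a backreference, not a way to process real markup.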

If you have more sets of parentheses, you can have more backreferences. As you might guess, \17 is the contents of the seventeenth regular expression memory, if you have at least that many sets of parentheses.[13]

In numbering backreferences, you can just count the left (opening) parentheses. The pattern /((fred|wilma) (flintstone)) \1/ says to match strings like fred flintstone fred flintstone, since the first opening parenthesis and its corresponding closing parenthesis hold a pattern that matches fred flintstone.[14]

If we wrote /((fred|wilma) (flintstone)) \2/ instead, we would match strings like fred flintstone fred; memory two is the choice of fred or wilma. (Notice that it wouldn’t match fred flintstone wilma, since the backreference can match only the same name that was matched earlier: either fred or wilma. But it could match wilma flintstone wilma, since that one uses the same name.) And the pattern /((fred|wilma) (flintstone)) \3/ would match strings like fred flintstone flintstone. It’s uncommon to have a literal string like flintstone in memory parentheses, though; we did that one just to have a third example.
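The numbering is easy to check by experiment. This sketch tries each of the three backreferences against the strings mentioned above; the variable names are ours.

```perl
use strict;
use warnings;

# \1 is the whole "name flintstone" group.
my $whole_again = ("fred flintstone fred flintstone"
                   =~ /((fred|wilma) (flintstone)) \1/) ? 1 : 0;
my $whole_short = ("fred flintstone fred"
                   =~ /((fred|wilma) (flintstone)) \1/) ? 1 : 0;

# \2 is whichever first name was matched.
my $same_name   = ("wilma flintstone wilma"
                   =~ /((fred|wilma) (flintstone)) \2/) ? 1 : 0;
my $other_name  = ("fred flintstone wilma"
                   =~ /((fred|wilma) (flintstone)) \2/) ? 1 : 0;

# \3 is the literal flintstone.
my $last_again  = ("fred flintstone flintstone"
                   =~ /((fred|wilma) (flintstone)) \3/) ? 1 : 0;

print "\\1: $whole_again $whole_short\n";
print "\\2: $same_name $other_name\n";
print "\\3: $last_again\n";
```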

Memory Variables

When we get to the next chapter and back into the world of Perl, we’ll see that the contents of these regular expression memories are available to us in special variables like $1 after the pattern match is done. We mention this here just so you’ll know that the memories aren’t merely used for backreferences; if you see what seem to be unnecessary parentheses in a pattern, they may actually be setting up those memories.

Precedence

With all of these metacharacters in regular expressions, you may feel that you can’t keep track of the players without a scorecard. That’s the precedence chart, which shows us which parts of the pattern “stick together” the most tightly. Unlike the precedence chart for operators, the regular expression precedence chart is simple, with only four levels. As a bonus, this section will review all of the metacharacters that Perl uses in patterns.

  1. At the top of the precedence chart are the parentheses (“( )”), used for grouping and memory. Anything in parentheses will “stick together” more tightly than anything else.

  2. The second level is the quantifiers. These are the repeat operators—star (*), plus (+), and question mark (?)—as well as the quantifiers made with curly braces, like {5,15}, {3,}, and {5}. These always stick to the item they’re following.

  3. The third level of the precedence chart holds anchors and sequence. The anchors are the caret (^) start-of-string anchor, the dollar-sign ($) end-of-string anchor, the \b word-boundary anchor, and the \B nonword-boundary anchor. Sequence (putting one item after another) is actually an operator, even though it doesn’t use a metacharacter. That means that letters in a word will stick together just as tightly as the anchors stick to the letters.

  4. The lowest level of precedence is the vertical bar (|) of alternation. Since this is at the bottom of the chart, it effectively cuts the pattern into pieces. It’s at the bottom of the chart because we want the letters in the words in /fred|barney/ to stick together more tightly than the alternation. If alternation were higher priority than sequence, that pattern would mean to match fre, followed by a choice of d or b, followed by arney. So, alternation is at the bottom of the chart, and the letters within the names stick together.

Besides the precedence chart, there are the so-called atoms that make up the most basic pieces of the pattern. These are the individual characters, character classes, and backreferences.

Examples of Precedence

When you need to decipher a complex regular expression, you’ll need to do as Perl does, and use the precedence chart to see what’s really going on.

For example, /^fred|barney$/ is probably not what the programmer intended. That’s because the vertical bar of alternation is very low precedence; it cuts the pattern in two. That pattern matches either fred at the beginning of the string or barney at the end. It’s much more likely that the programmer wanted /^(fred|barney)$/, which will match if the whole line has nothing but fred, or nothing but barney.[15]

And what will /(wilma|pebbles?)/ match? The question mark applies to the previous character,[16] so that will match either wilma or pebbles or pebble, perhaps as part of a larger string (since there are no anchors).

The pattern /^(\w+)\s+(\w+)$/ matches lines that have a “word,” some required whitespace, and another “word,” with nothing else before or after. That might be used to match lines like fred flintstone, for example. The parentheses around the words aren’t needed for grouping, so they may be intended to save those substrings into the regular expression memories, which we’ll see more about in the next chapter.
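The three examples above can be tried out with a sketch like this one. The test strings are invented, and the last part peeks ahead at the memory variables $1 and $2, which are covered in the next chapter.

```perl
use strict;
use warnings;

my @names = ("fred", "frederick", "big barney", "barney rubble");

# Alternation cuts the pattern in two: ^fred OR barney$.
my @loose = grep { /^fred|barney$/ } @names;

# Parentheses make the anchors apply to both alternatives.
my @tight = grep { /^(fred|barney)$/ } @names;

# The question mark sticks to the s alone, not to all of pebbles.
my @pebbles = grep { /(wilma|pebbles?)/ }
  ("pebble beach", "wilma", "pebbles", "betty");

# Two words with whitespace between, and nothing else.
my ($first, $second) = ("", "");
if ("fred flintstone" =~ /^(\w+)\s+(\w+)$/) {
  ($first, $second) = ($1, $2);
}

print "loose: @loose\n";
print "tight: @tight\n";
print "pebbles count: ", scalar(@pebbles), "\n";
print "memories: $first, $second\n";
```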

When you’re trying to understand a complex pattern, it may be helpful to add parentheses to clarify the precedence. That’s okay, but remember that grouping parentheses are also automatically memory parentheses; you may need to change the numbering of other memories when you add the parentheses.[17]

And There’s More

Although we’ve covered all of the regular expression features that most people are likely to need for everyday programming, there are more features being added all the time. Check the perlre, perlrequick, and perlretut manpages for the latest news about what patterns in Perl can do.[18]

Exercises

See Section A.7 for answers to the following exercises. These exercises are among the most challenging in the entire book. But don’t get too discouraged! The following chapters will actually be easier, partly because you’ll have the power of regular expressions to help you.

  1. [4] Using the test program from the previous chapter, make a pattern that matches only lines containing either the word fred or wilma, followed by some whitespace, and then the word flintstone. So it should match the string I am fred flintstone (with one or more spaces or tabs between the names).

  2. [10] Here, we give you the answer; you decide what problem it’s trying to solve. What do these real-world patterns match? What might they be used for?

    /"([^"]*)"/
    /^0?[0-3]?[0-7]{1,2}$/
    /^[\w.]{1,12}$/

    Try each of them in the test program. It may help to find some strings that match (and that fail to match) each one.

  3. [8] Make a pattern that will match a string containing nothing but a scalar variable’s name (not its value!), like $fred, $barney, or $_ (but you shouldn’t match special variables like $0). That is, if the line of input has the six characters $wilma, the pattern should match. If the input says wilma, it should not match.

  4. [12] Make a pattern that matches any line of input that has the same word repeated two or more times in a row. Words in this problem can be considered to be sequences of letters a to z or A to Z, digits, and underscores. Whitespace between words may differ. For example, the classic observation-test string Paris in the the spring should match, since it has a doubled word. Also, I think that that is the problem should match, even though that may be a correct use of a doubled word. Does your pattern match all three words in I think that that that is the problem (with extra spaces between only some of the words)? Does it match This is a test? How about This shouldn't match, according to the theory of regular expressions?



[1] Notice that those 52 don’t include letters like Å and É and Î and Ø and Ü. But when Unicode processing is available, that particular character range is noticed and enhanced to automatically do the right thing.

[2] At least, if you use ASCII and not EBCDIC.

[3] At least, in usual English we do. In other languages, you may have different components of words. And when looking at ASCII-encoded English text, we have the problem that the single quote and the apostrophe are the same character, so it’s not possible in isolation to tell whether cats' is a word with an apostrophe or a word at the end of a quotation. This is probably one reason that computers haven’t been able to take over the world yet.

[4] Except to a limited (but nevertheless useful) extent in connection with locales; see the perllocale manpage.

[5] We’re going to stop saying “word” in quotes so much; you know by now that these letter-digit-underscore words are the ones we mean.

[6] Yes, you’ve seen that caret is already used in another way in patterns. As the first character of a character class, it negates the class. But outside of a character class, it’s a metacharacter in a different way, being the start-of-string anchor. There are only so many characters, so we have to use some of them twice.

[7] Actually, it matches either the end of the string, or at a newline at the end of the string. That’s so that you can match the end of the string whether it has a trailing newline or not. Most folks don’t worry about this distinction much, but once in a long while it’s important to remember that /^fred$/ will match either "fred" or "fred\n" with equal ease.

[8] Some regular expression implementations have one anchor for start-of-word and another for end-of-word, but Perl uses \b for both.

[9] You can see why we wish that we could change the definition of “word”; That's should be one word, not two words with an apostrophe in-between. And even in text that may be mostly ordinary English, it’s normal to find a soupçon of other characters spicing things up.

[10] Well, if the pattern is case-insensitive, as we’ll learn in the next chapter, the capitalization doesn’t have to match. Other than that, though, the string must be the same.

[11] These examples are intentionally not HTML, because there are too many tricky things that crop up in real HTML, or any similar markup language like XML or SGML. If you need to work with HTML, don’t use simple patterns like these. Get a robust module from CPAN, so that you can start with code that’s already written and debugged. If you don’t, we promise that you’ll be sorry. Don’t say we didn’t warn you.

[12] If you realize that there may be problems with using this pattern on a markup language like HTML, that’s okay. There are lots of problems with that! This is just an example to illustrate a use of a backreference. You shouldn’t use simple patterns to parse anything as complex as HTML anyway.

[13] If you don’t have that many sets of parentheses before that point in the pattern, backreferences \10 and beyond will be treated as octal character escapes. To keep an octal character escape like \12 from accidentally meaning a backreference, just use a leading zero: \012 is always a character, never a backreference.

[14] This pattern would also match wilma flintstone wilma flintstone.

[15] And, perhaps, a newline at the end of the string, as we mentioned earlier in connection with the $ anchor.

[16] Because a quantifier sticks to the letter s more tightly than the s sticks to the other letters in pebbles.

[17] But look in the perlre manpage for information about nonmemory parentheses, which are used for grouping without memory.

[18] And check out YAPE::Regexp::Explain in CPAN as a regular-expression-to-English translator.
