Basic Regular Expressions

Suppose that you have a configuration file in which any line beginning with # is a comment. You want to read the file and throw away comments. How do you tell which lines are comments?

The answer is to use Perl’s regular expression operators. A regular expression is a powerful pattern matching tool. It enables you to match, and replace if needed, virtually any text you can imagine. Regular expressions are enclosed in slashes (//). The slashes are similar to the double quotes ("") used for strings. This chapter starts with a simple regular expression and builds up to more complex comment matching ones. The first regular expression is

/#/

This matches any string with a hash character (#) anywhere within the string. For example, it matches

# Starts with a hash 
There is # in the middle 
One at the end # 
Two # in the # line

but it does not match the string
No hash today

The operator =~ checks to see whether a string (on the left-hand side) matches a regular expression on the right-hand side. For example, the expression

"A # line" =~ /#/

evaluates to true. Putting it all together, to check a line to see whether it has a hash mark in it, you need the following expression:
if ($line =~ /#/)

Listing 4.1 contains a program that checks a set of strings to see whether they are comments.

Listing 4.1. comment.pl
use strict; 
use warnings; 

my @strings = (
    "# Starts with a hash", 
    "This one's got nothing", 
    "There is # in the middle", 
    "One at the end #", 
    "Two # in the # line", 
    "No hash"); 
my $i; 

for ($i = 0; $i <= $#strings; $i++) {
    if ($strings[$i] =~ /#/) {
        print "COMMENT: $strings[$i]\n"; 
    } else {
        print "CODE:    $strings[$i]\n"; 
    } 
}

When run, this program prints

COMMENT: # Starts with a hash 
CODE:    This one's got nothing 
COMMENT: There is # in the middle 
COMMENT: One at the end # 
COMMENT: Two # in the # line 
CODE:    No hash

In the regular expression /#/, the # character is called an atom. It matches the character “#”. Letters and digits are atoms that match themselves. For example, if you want to see whether a line contains the word “Error,” you can use the statement

if ($line =~ /Error/)

Back to the comment eliminator—there’s a problem with the code. It matches any line that contains a # anywhere in the line. You want to match only lines that begin with the # character.

For that, you need to tell Perl that the # must appear at the beginning of the line. The ^ character is used to anchor a pattern at the beginning of a line. So /#/ means match # anywhere, and /^#/ means match # only if it occurs at the beginning of a line.

So to check for a comment, you need the statement

if ($line =~ /^#/)

So this expression will match

# This is a test

but will not match
SRC=ccm.c 
     # Indented comment (not caught yet) 
A line with a # in the middle

But what if you want to check for empty lines as well? The $ character matches the end of a line. So if you want to check for an empty line, use the following statement:

if ($line =~ /^$/)

This matches any line that has a beginning (^) followed immediately by an ending ($). Because the only way you can have the beginning of the line in the same place as the end is when the line is empty, this matches empty lines.
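The two anchored patterns can be combined into a small filter. The following is only a sketch: the helper name classify_line and the sample lines are made up for illustration, and the trailing newline is removed with chomp before the tests are applied.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the text): classify one line of a
# configuration file using the two anchored patterns described above.
sub classify_line {
    my ($line) = @_;
    chomp $line;                          # drop the trailing newline, if any
    return 'comment' if $line =~ /^#/;    # a # at the very beginning
    return 'empty'   if $line =~ /^$/;    # beginning immediately followed by end
    return 'data';                        # anything else
}

for my $line ("# This is a test\n", "\n", "SRC=ccm.c\n",
              "A line with a # in the middle\n") {
    printf "%-8s %s", classify_line($line), $line;
}
```

Testing for the comment first and the empty line second is an arbitrary choice; the two patterns cannot both match the same line.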

Vim (Vi Improved) and Regular Expressions

The Vim editor (a Vi clone available from http://www.vim.org) is extremely useful when it comes to learning regular expressions. When you search in Vim (using the /, ?, or related commands), the search string you specify is a regular expression very similar to the regular expression used by Perl.

Vim has an option called highlight search. When you do a search with this option enabled, all matching strings are highlighted. Thus if you don’t know exactly what the regular expression /the+/ will match, you edit the file with Vim, search for “the+”, and see what’s highlighted.

To turn on the highlight search option, execute the Vim command:

:set hlsearch


Modifiers

Suppose that you want to make your comment structure a little more flexible. You want to let the user type a few spaces before the # in a comment. The following lines are all considered comments:

#  This is a comment 
<space># This is too 
<space><space><space># And this is also

Note that I’ve spelled out a space as <space> in this example because real spaces are difficult to see on the page. In the actual file, each <space> is a real space character.

To match this, you need to tell Perl that a comment has the following characteristics:

  • Beginning of line

  • Zero or more spaces

  • A #

Now to translate these rules into Perl: Rule one is “Beginning of a line.” That’s “^” in Perl speak.

Zero or more spaces is trickier. A single space is easy to do—that’s “ ”. But you need a modifier to tell Perl that the space can be repeated zero or more times. For that, use the star (*) modifier. So zero or more spaces translates to “ *”.

Finally, there’s the # itself.

So the full regular expression is

if ($line =~ /^ *#/)

This successfully matches the three comment lines.

Character Sets

One problem with the current approach is that it only takes into account leading spaces. But what happens if someone decides to start a comment line with a <tab>? The regular expression won’t work.

You need to tell Perl that a line may start with a combination of spaces or tabs. By enclosing a set of characters in square brackets ([]), you tell Perl to match any one of the enclosed characters. For example, if you want to match a space or tab, use the expression [ \t] (\t is the escape sequence for tab).

So the new regular expression is

if ($line =~ /^[ \t]*#/)

This matches

# Hash at the beginning of the line 
    # 4 spaces before the hash 
        # A tab starts this line

This expression is starting to get a little complex. It would be a good idea to comment it. So the expression, including comments looks like the following:

#             +–––––––––– Beginning of line (^) 
#             |+++++––––– Space or tab ( ) and (\t) 
#             ||||||+–––– Repeated 0 or more times (*) 
#             |||||||+––– The character # 
if ($line =~ /^[ \t]*#/)

Commenting regular expressions like this serves several purposes. First it helps you organize your thoughts. Thinking about what you are doing can be a big help when creating regular expressions. (It’s not a bad lesson for life either!)

But also, it helps you make sure that you include all the elements you expected to. Because of the compact nature of regular expressions, it’s easy to leave out something.

Finally, it leaves notes around for the people who come after you and maintain the code. Because regular expressions are both dense and cryptic, these notes will be extremely useful to the poor slob who inherits your code.

More on the Character Set Operator ([])

The character set operator matches any character enclosed in the square brackets ([]). For example, to match the lowercase vowels, you can use the expression /[aeiou]/. Suppose that you want to match the word “the”. The regular expression would be /the/. But the word can be capitalized. That means that you want to look for “The” or “the”. The regular expression for this is /[tT]he/.

Although this matches “the” or “The”, it does not match “THE”. For that you need /[tT][hH][eE]/. (Yuck!)

You can also specify a range of characters. For example, [a-z] matches any single lowercase letter.

Now you’ll design a regular expression to see whether a variable contains a number. A number consists of one or more digits. So breaking things down into components, your regular expression is

  • ^: Match the start of the line (that way the regular expression won’t skip junk at the beginning of the line).

  • [0-9]: Match a single digit.

  • [0-9]*: Zero or more additional digits.

  • $: Match the end of the string. (The number can’t have any trailing garbage.)

The regular expression is /^[0-9][0-9]*$/.

So this matches things like

0 
14 
568 
333 
000 
0524

but does not match
x45        # Does not begin with a digit 
45x        # Last digit is not at end of line. 
45x45      # Letter in the middle breaks the string of digits.

The modifier + acts like the *, only it tells Perl to repeat the previous atom one or more times. Thus with a +, your new regular expression is /^[0-9]+$/.
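As a quick check of the + version, the following sketch runs the expression against the sample values listed earlier. The helper name is_number is an assumption made for this example, and the empty string is added to show that + requires at least one digit.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the whole string is one or more digits.
sub is_number {
    my ($text) = @_;
    return $text =~ /^[0-9]+$/ ? 1 : 0;
}

for my $candidate ('0', '568', '0524', 'x45', '45x', '45x45', '') {
    printf "%-8s %s\n", "'$candidate'",
        is_number($candidate) ? 'number' : 'not a number';
}
```

One detail to keep in mind: $ also matches just before a newline at the very end of the string, so a value read from a file still matches even if it has not been chomped.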

Whitespace (\s)

Back to the comment matching statement: you need to make one final change. Currently, the expression [ \t] matches whitespace. It’s simpler and easier to use the escape code \s instead. This is a special regular expression atom that matches one whitespace character. So your final comment matching code looks like the following:

#             +––––––– Beginning of line (^) 
#             |++––––– Match whitespace (\s) 
#             |||+–––– Repeated 0 or more times (*) 
#             ||||+––– The character # 
if ($line =~ /^\s*#/)
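A short sketch of the final comment test follows. The helper name is_comment and the sample lines are assumptions made for illustration; the point is that leading spaces and leading tabs are both accepted, while a # after other text is not.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line is a comment, that is,
# optional whitespace followed by a #.
sub is_comment {
    my ($line) = @_;
    return $line =~ /^\s*#/ ? 1 : 0;
}

my @lines = (
    "# Hash at the beginning of the line",
    "    # 4 spaces before the hash",
    "\t# A tab starts this line",
    "SRC=ccm.c   # trailing comment, not a comment line",
);
for my $line (@lines) {
    print is_comment($line) ? "COMMENT: " : "CODE:    ", $line, "\n";
}
```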

Using Grouping to Split a Line

So far, you’ve created regular expressions that can determine whether a line is a blank line or a comment line. But what about a line that’s half data and half comment? For example:

CC = gcc        # Use the gnu compiler

Suppose that you want to separate this line into its two components, executable code and a comment. For this you use the grouping operator (parentheses).

First, you want to match the noncomment part of the line. That’s everything, up to but not including the #. When a character range begins with a caret (^), it tells Perl to match all the characters except for those in the range. So if you want to match anything except a #, you use the expression [^#].

Similarly, you can use the caret (^) and a range to exclude a range of characters. For example, the expression [^0-9] matches everything except a digit.

The Two Meanings of ^

Note that /^[a-z]/ is different from /[^a-z]/. In fact, the character caret (^) means different things in the two expressions. In the first one, /^[a-z]/, the caret (^) is outside any range specification ([]), so it means “start of line.” So /^[a-z]/ means “start of line” (^) followed by a lowercase letter ([a-z]).

In the second case, /[^a-z]/, the caret (^) is inside the range specification ([]), so it means “anything except the following.” So /[^a-z]/ means match a single character that’s not a lowercase letter.

The dollar sign ($) has a similar double meaning. Outside a range specification ([]), it means “match end of line.” Inside it means the literal character dollar sign.


The comment then starts with # and continues to the end of the line.

To capture the two parts of the line, put parentheses around the first part and another set of parentheses around the second. The data matched by the regular expression contained in the first parentheses will be placed in the special Perl variable $1. The result of the second match will be put in $2.

The following shows how this works:

#              ++++–––––––––– Any character except # 
#              ||||+––––––––– Repeated 0 or more times (*) 
#             +|||||+–––––––– Put the enclosed in $1 
#             ||||||| 
#             ||||||| #–––––– The character hash (#) 
#             ||||||| |+––––– Any character (.) 
#             ||||||| ||+–––– Repeated 0 or more times (*) 
#             |||||||+|||+––– Put the enclosed in $2 
if ($line =~ /([^#]*)(#.*)/) {
    $code = $1; 
    $comment = $2;
}

Now see how this operates on your test statement:

CC = gcc        # Use the gnu compiler

The first part of the regular expression says match everything except a #:

[^#]*

This matches the first part of the line, marked here with carets:

CC = gcc        # Use the gnu compiler
^^^^^^^^^^^^^^^^

Because this portion of the regular expression is enclosed in the first set of parentheses, the matching string is put in the variable $1. So $1 is now “CC = gcc        ” (including the spaces before the #).

The next part of the regular expression tells Perl to match a # (#) followed by any character (.) repeated zero or more times (*). Because * matches as much as it can, this runs to the end of the line:

#.*

This part is enclosed in the second set of parentheses, so the matching string gets placed inside the variable $2.

CC = gcc        # Use the gnu compiler
                ^^^^^^^^^^^^^^^^^^^^^^

In this case, $2 becomes “# Use the gnu compiler”.
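The walk-through above can be tried directly. This is a minimal sketch: the helper name split_comment and the choice to return the whole line as code when no # is present are assumptions, not part of the original listing.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into its code part and comment part
# using the two capture groups described above.
sub split_comment {
    my ($line) = @_;
    if ($line =~ /([^#]*)(#.*)/) {
        return ($1, $2);          # $1 = before the #, $2 = from the # onward
    }
    return ($line, '');           # no # found: treat the whole line as code
}

# The test line from the text: "CC = gcc", eight spaces, then the comment.
my $line = "CC = gcc" . (" " x 8) . "# Use the gnu compiler";
my ($code, $comment) = split_comment($line);
print "code:    [$code]\n";
print "comment: [$comment]\n";
```

The square brackets in the output are there only to make the trailing spaces in the code part visible.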

Dealing with Alternatives (|) and Limited Matches

This regular expression works in almost all cases. There is a problem, however, if you have a line like

MACRO = "String with # in it"   # Comment

In this case, you don’t want to break the line at the first # since it’s in the middle of a quoted string. So how do you write a regular expression that excludes it?

The solution is to refine the regular expression so that the command part of the line is made up of the following elements:

  • Any character other than #

  • A quoted string

  • Either one, or both repeated zero or more times

This introduces a new operator: or (|). It causes Perl to match either one of two different values. The two values in this case are any character other than a # or " ([^#"]) or a quoted string (".*"). Note that you exclude the quotation mark from the first element because it’s taken care of in the second.

The new regular expression looks like

#             +–––––––––––––––––––––––– Match start of line 
#             |  +++++––––––––––––––––– Match all except # or " 
#             |  |||||+–––––––––––––––– Repeat 0 or more times (*) 
#             |  |||||| 
#             |  ||||||+––––––––––––––– Match either pattern (|) 
#             |  ||||||| 
#             |  |||||||+–––––––––––––– Match " 
#             |  ||||||||+––––––––––––– Any character (.) 
#             |  |||||||||+–––––––––––– Repeat 0 or more times (*) 
#             |  ||||||||||+––––––––––– Match " 
#             |  ||||||||||| 
#             | +|||||||||||+–––––––––– Group the following and 
#             | |||||||||||||           put into $2 
#             | |||||||||||||+––––––––– Repeated 0 or more times (*) 
#             |+||||||||||||||+–––––––– Put the enclosed in $1 
#             ||||||||||||||||| 
#             ||||||||||||||||| #–––––– The character hash (#) 
#             ||||||||||||||||| |+––––– Any character (.) 
#             ||||||||||||||||| ||+–––– Repeated 0 or more times (*) 
#             |||||||||||||||||+|||+––– Put the enclosed in $3 
if ($line =~ /^(([^#"]*|".*")*)(#.*)/)

You have to use the grouping operators (()) to get the expression to work. But the extra set of parentheses means the statement ends up in $1 and the comment part in $3. You want the comment in $2, so you have to tell Perl “group these, but don’t put the result in a $<number> variable.” You do that by changing the grouping operator (...) to (?:...). The result is

#               +++|||||||||||+–––––––– Group (but no $n placement) 
if ($line =~ /^((?:[^#"]*|".*")*)(#.*)/)

But there’s still a problem. What happens when you see a line like

MACRO = "String"    #  "String" used

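The regular expression ".*" tries to match as much as possible. The result is that it matches everything from the first double quote to the last one, marked here with carets:

MACRO = "String"    #  "String" used
        ^^^^^^^^^^^^^^^^^^^^^^^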

You want it to match only the first part. A solution to this is to follow the star (*) with a question mark (?). The question mark (?) tells Perl to match as little as possible. So the revised expression is

#                          +––––––––––––– Match any character (.) 
#                          |+–––––––––––– repeat 0 or more times (*) 
#                          ||+––––––––––– as little as possible (?) 
#                          |||+–––––––––– Match quote (") 
if ($line =~ /^((?:[^#"]*|".*?")*)(#.*)/)

This change tells Perl that the string stops when the first double quote (") is seen.
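The difference between the two forms is easy to see in isolation. The sketch below applies only the quoted-string part of the expression, once greedy and once with the question mark, to the sample line; the variable names are made up for this example.

```perl
use strict;
use warnings;

my $line = 'MACRO = "String"    #  "String" used';

# Greedy: .* runs as far as it can, so the match ends at the LAST quote.
my ($greedy) = $line =~ /(".*")/;

# Minimal: .*? stops as soon as it can, so the match ends at the NEXT quote.
my ($minimal) = $line =~ /(".*?")/;

print "greedy:  [$greedy]\n";
print "minimal: [$minimal]\n";
```

The greedy form captures everything from the first quote through the second "String", while the minimal form captures only the first "String".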

Conditionals

Now add a final refinement to the code. You want to match strings that contain an escaped double quote (\") in them, such as

MACRO = "string with \" in it"  # Comment

Now when you look for a string, you want to match “all the text from one double quote to another, but don’t match double quotes with an escape in front of them.”

Fortunately, Perl has a “don’t match if ” operator. Actually, the Perl term for this is “negative lookbehind assertion.”

So your string consists of

  • A starting quote (")

  • Zero or more inside characters (.*)

  • An ending quote (") that works only if

  • It is not preceded by a backslash: (?<!\\)

The updated expression is

#                                 ++––––––––– Character (\) 
#                                 ||          (The thing not to match) 
#                             ++++||+–––––––– Match next (") if 
#                             |||||||         enclosed (\) does not 
#                             |||||||         occur here. 
#                             |||||||+––––––– Match double quote (") 
if( $line =~ /^((?:[^#"]*|".*?(?<!\\)")*)(#.*)/)
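To see the lookbehind at work, the expression can be wrapped in a small helper and run against a couple of lines. This is a sketch only: the helper name split_command_comment is hypothetical, the first sample is the line from the text, and the second sample is a made-up line in which a # follows an escaped quote inside the string.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into (command, comment) using the
# expression developed above; returns an empty list if there is no comment.
sub split_command_comment {
    my ($line) = @_;
    if ($line =~ /^((?:[^#"]*|".*?(?<!\\)")*)(#.*)/) {
        return ($1, $2);
    }
    return;
}

my @samples = (
    'MACRO = "string with \" in it"  # Comment',   # line from the text
    'MACRO = "a \" # b"  # Comment',               # made-up: # after an escaped quote
);
for my $sample (@samples) {
    my ($command, $comment) = split_command_comment($sample);
    print "Command: [$command]\n";
    print "Comment: [$comment]\n";
}
```

In both cases the comment part should come out as "# Comment", because the escaped quote no longer ends the string early.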

Now take the revised expression and use it in a program that reads Makefiles and splits the lines into their various parts. Listing 4.2 contains the results.

Listing 4.2. make.pl
use strict; 
use warnings; 

my $line; 
while ($line = <STDIN>) {
    # No \n because it's already in $line 
    print "Line: $line"; 

    if ($line =~ /^\s*#/) {
        next; 
    } 
    #                +––––––––––––––––––––––––––––––––– Match start of line 
    #                |    +++++–––––––––––––––––––––––– Match all except # or " 
    #                |    |||||+––––––––––––––––––––––– Repeat 0 or more times (*) 
    #                |    |||||| 
    #                |    ||||||+–––––––––––––––––––––– Match either pattern (|) 
    #                |    ||||||# 
    #                |    ||||||#+––––––––––––––––––––– Match " 
    #                |    ||||||#|+–––––––––––––––––––– Any character (.) 
    #                |    ||||||#||+––––––––––––––––––– Repeat 0 or more times (*) 
    #                |    ||||||#|||+–––––––––––––––––– As little as possible (?) 
    #                |    ||||||#||||++++–––––––––––––– Lookbehind conditional 
    #                |    ||||||#||||^|||               Must not be preceded by: 
    #                |    ||||||#||||^|||++–––––––––––– The character \ 
    #                |    ||||||#||||^|||||^+–––––––––– Match " 
    #                |    ||||||#||||^|||||^| 
    #                | +++||||||#||||^|||||^|+––––––––– Group the following but 
    #                | |||||||||#||||^|||||^||          do not put into $n (?:) 
    #                | |||||||||#||||^|||||^||+–––––––– Repeated 0 or more times (*) 
    #                |+|||||||||#||||^|||||^|||+––––––– Put the enclosed in $1 
    #                |^|||||||||#||||^|||||^|||^ 
    #                |^|||||||||#||||^|||||^|||^ +––––– The character hash (#) 
    #                |^|||||||||#||||^|||||^|||^ |+–––– Any character (.) 
    #                |^|||||||||#||||^|||||^|||^ ||+––– Repeated 0 or more times (*) 
    #                |^|||||||||#||||^|||||^|||^+|||+–– Put the enclosed in $2 
    elsif ($line =~ /^((?:[^#"]*|".*?(?<!\\)")*)(#.*)/) {
        print "Type: Command with Comment\n"; 
        print "Command: $1\n"; 
        print "Comment: $2\n"; 
    } else {
        print "Type: Command only\n"; 
    } 
    print "\n"; 
}

It’s interesting to note that this example has 24 lines of comments for one line of code. When you deal with something as compact and as powerful as regular expressions, comments like this occur.

The designers of Perl recognized that complex regular expressions can be difficult to explain, so they added a feature that allows you to comment them. The x modifier tells Perl to ignore spaces (escaped spaces are still valid) and to allow comments beginning with #.

To turn an ordinary regular expression into an extended one, you need to escape any space or # characters. The first # occurs in a character range ([^#"]), so it does not have to be escaped. The second one, in the #.* subexpression, does. So the new regular expression is

if( $line =~ /^((?:[^#"]*|".*?(?<!\\)")*)(\#.*)/x )

Now you can add comments and whitespace to make the thing clearer:

if( $line =~ 
     /^                   # Match beginning of the line 
        (                 # Put enclosed in $1 
            (?:           # Group the enclosed expression 
                    [^#"] # A series of characters excluding # and " 
                          # (i.e. everything except a comment or 
                          # string) 
                    *     # Repeated zero or more times 
                |         # command *OR* string 
                    "     # String, starts with " 
                    .     # Match any character (inside string) 
                    *?    # Zero or more times, but as 
                          # little as possible. 
                    (?<!  # Don't match if the next quote 
                          # is preceded by a 
                       \\ # Backslash 
                    )     # End negative assertion 
                    "     # Quote (string end) 
                          # Must not have \ in front of it 
            )             # End of command or string 
            *             # Repeat expression 0 or more times 
        )                 # End of $1 (command part) 

        (                 # Put enclosed in $2 
            \#            # Begin with # 
            .*            # Any character zero or more times 
        )                 # End of $2 
    /x )                  # Allow comments in the regular expression

Now, are these 29 lines of commented regular expression easier to read than the one-line regular expression with its 24 lines of comments? Not that much really. It seems that no matter how you arrange a mess, it still looks like a mess.

Probably the only major advantage of the extended version of the regular expression is the fact that to comment it, you didn’t have to draw all those vertical bars (|).

Common Mistake: Putting / in a Regular Expression Comment

Take a look at the following extended regular expression:

if ($line =~ / 
    \/\/            # C / C++ Single line comment Start 
    .*              # Goes to the end of line 
    /x )            # End of extended regular expression

At first glance, everything looks okay. The regular expression begins with / and ends with /x, so it is an extended regular expression. But there’s a problem.

When Perl sees this, it doesn’t know it’s an extended regular expression, so it parses things until it finds the end of the expression. The first unescaped slash is the slash in the comment:

\/\/            # C / C++ Single line comment Start

Perl sees this and, because it doesn’t know that this is an extended regular expression yet, concludes that the / in “C / C++” is the end of an ordinary regular expression.

For this reason, you can’t use / in a regular expression comment.
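One way around this restriction, if you are free to choose the delimiters, is to use the m operator with a delimiter other than the slash. The sketch below uses m{...}x, so a slash is just an ordinary character both in the pattern and in the comments. The helper name has_cpp_comment is made up for this example, and the pattern is deliberately naive (it also fires on // inside a string).

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line contains a C++ style comment.
# With braces as delimiters, / needs no escaping and may appear in comments.
sub has_cpp_comment {
    my ($line) = @_;
    return $line =~ m{
        //      # C / C++ single line comment start
        .*      # Goes to the end of line
    }x ? 1 : 0;
}

for my $line ('int count;  // loop counter', 'int count;') {
    print has_cpp_comment($line) ? "COMMENT: " : "NONE:    ", $line, "\n";
}
```

The same caution now applies to the new delimiter: an unbalanced { or } in a comment would confuse Perl in the same way the slash did.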


Using the Regular Expression Debugging Package

Regular expressions use a code all their own. This makes them tricky to write and debug. However, you can use a few tricks to make your life easier.

First, the best way of debugging a regular expression is to avoid making a mistake in the first place. This is difficult, especially for a novice, but if you design and comment your regular expressions, you’ll reduce errors before you even start.

But in spite of how clever or careful you are, some errors will get through. In that case, you can use Perl’s regular expression debugging module. Regular expression debugging is turned on with the line

use re 'debug';

It is turned off with the statement

no re 'debug';

When debug is turned on, Perl outputs some debugging messages whenever a regular expression is compiled. Take a look at how this works with a simple regular expression /#/. The result of the compilation is

Compiling REx `#' 
size 3 first at 1 
   1: EXACT <#>(3) 
   3: END(0) 
anchored `#' at 0 (checking anchored isall) minlen 1

The first line tells you what is being compiled. The second tells you how large the compiled result is and which node starts the expression (in this case #1).

The next two lines are the parsed version of the regular expression. Each line consists of a single element of the expression called a node. In this case, the expression consists of two nodes, an exact match for the character # and the end of the expression:

1: EXACT <#>(3) 
3: END(0)

The number at the beginning of each line is an identification number. The number at the end of each line indicates the next node to be evaluated. This is usually the next node for simple regular expressions, but things can jump around if you have moderately complex code.

Finally, one final line displays optimizer information:

anchored `#' at 0 (checking anchored isall) minlen 1

This line can safely be ignored.

But the compilation of a regular expression is only part of the job. Perl also must perform matches. With debugging enabled, you can watch the progress of matching the string # Starts with a hash. For example, the first comparison results in

Guessing start of match, REx `#' against `# Starts with a hash'... 
Found anchored substr `#' at offset 0... 
Guessed: match at offset 0

In this example, the optimizer guessed that the string matches the regular expression at offset 0. (In other words, the thing figured out that the first character of # Starts with a hash is #.) Now you go from the simple to the complex. Take a look at what happens when you turn on debugging for the make regular expression in Listing 4.2. The result of the compilation can be seen in the following debug output:

Compiling REx `^((?:[^#"]*|".*?(?<!\\)")*)(#.*)' 
size 44 first at 2 
   1: BOL(2) 
   2: OPEN1(4) 
   4:   CURLYX {0,32767}(33) 
   6:     BRANCH(17) 
   7:       STAR(32) 
   8:         ANYOF[\0-!$-\377](0) 
  17:     BRANCH(31) 
  18:       EXACT <">(20) 
  20:       MINMOD(21) 
  21:       STAR(23) 
  22:         REG_ANY(0) 
  23:       UNLESSM[-1](29) 
  25:         EXACT <\>(27) 
  27:         SUCCEED(0) 
  28:         TAIL(29) 
  29:       EXACT <">(32) 
  31:     TAIL(32) 
  32:     WHILEM[1/1](0) 
  33:   NOTHING(34) 
  34: CLOSE1(36) 
  36: OPEN2(38) 
  38:   EXACT <#>(40) 
  40:   STAR(42) 
  41:     REG_ANY(0) 
  42: CLOSE2(44) 
  44: END(0) 
floating `#' at 0..2147483647 (checking floating) anchored(BOL) minlen 1

Take a look at what each of these nodes means.

This line

1: BOL(2)

matches the beginning of the line. (This is generated by the ^ character.)

The next line

2: OPEN1(4)

opens (OPEN) group 1.

The following line repeats what follows 0 to 32767 times:

4:   CURLYX {0,32767}(33)

Note that 33 is the next token. That means that the CURLYX matches up to the NOTHING node at 33. So what the CURLYX applies to is the (?:[^#"]*|".*?(?<!\\)") part of the regular expression. (Note: The CURLYX is generated because of the * at the end of this expression. Although * means 0 to infinity, Perl translates this into 0 to 32767.)

This line indicates that you can make a choice:

6:     BRANCH(17)

You can choose this branch or the selection that starts at node #17.

These lines

7:       STAR(32) 
8:         ANYOF[\0-!$-\377](0)

are Perl’s way of saying [^#"]*. The ANYOF indicates a match of any of the selected characters. Perl takes the exclude specification ([^#"]) and turns it into an include specification ([\0-!$-\377]) that includes everything but # and ". This is modified by the STAR (*) specification.

You may wonder why the next node after the STAR is #32. That’s because it’s the end of the alternate branch starting at node #17.

In line 17

17:     BRANCH(31)

again you have a new branch that might be taken. This one includes all the nodes from 17 to 29 (the node before 31).

In this branch

18:       EXACT <">(20)

you are looking for a double quote (").

This line

20:       MINMOD(21)

indicates that the next set of operators is to match as little as possible (triggered by the *? operator).

The following two lines

21:       STAR(23) 
22:         REG_ANY(0)

show the parsed version of .*—anything (REG_ANY), repeated zero or more times (STAR).

These lines

23:       UNLESSM[-1](29) 
25:         EXACT <\>(27)

are the negative assertion (?<!\\), which tells Perl to fail if the following (EXACT <\>) matches.

If it doesn’t fail, this portion of the expression succeeds:

27:         SUCCEED(0)

This is a nop:

28:         TAIL(29)

It’s inserted to give nodes elsewhere in the program a place to branch to.

This is an exact character match:

29:       EXACT <">(32)

This line is another node used as a branch destination:

31:     TAIL(32)

These lines

32:     WHILEM[1/1](0) 
33:   NOTHING(34)

perform curly brace matching (WHILEM) if the rest matches (NOTHING). This defines an anchor point for the expression.

This line

34: CLOSE1(36)

closes the $1 processing.

This line starts the $2 processing:

36: OPEN2(38)

The expression for $2 starts with a #:

38:   EXACT <#>(40)

Then a series of any characters (REG_ANY) zero or more times (STAR):

40:   STAR(42) 
41:     REG_ANY(0)

This line closes $2:

42: CLOSE2(44)

This line is the end:

44: END(0)

Tracking the Execution of the Match

Now look at what happens when you try the following string against the big regular expression in Listing 4.2:

CC = gcc   # Code and comment

First, the optimizer tries a few guesses. Because # is key to this regular expression, it first sees whether it can find it, and it does. It then sees whether the first part (CC = gcc) matches the first part of the regular expression. For example:

Guessing start of match, REx `^((?:[^#"]*|".*?(?<!\\)")*)(#.*)' against `CC = gcc   # Code and comment
'... 
Found floating substr `#' at offset 11... 
Guessed: match at offset 0 
Matching REx `^((?:[^#"]*|".*?(?<!\\)")*)(#.*)' against `CC = gcc   # Code and comment
'

The system now parses the first part of the expression:

Setting an EVAL scope, savestack=15 
0 <> <CC = gcc   #>    |  1:  BOL 
0 <> <CC = gcc   #>    |  2:  OPEN1 
0 <> <CC = gcc   #>    |  4:  CURLYX {0,32767} 
0 <> <CC = gcc   #>    | 32:    WHILEM[1/1] 
                           0 out of 0..32767  cc=7b0419c0

At this point, the system sees the | symbol and realizes that it has two different possible options, so it saves its current location and tries the first branch:

Setting an EVAL scope, savestack=24 
 0 <> <CC = gcc   #>    |  6:      BRANCH 
Setting an EVAL scope, savestack=24

Now it tries to match using the regular expression [^#"]*:

0 <> <CC = gcc   #>    |  7:        STAR 
ANYOF[\0-!$-\377] can match 11 times out of 32767...

It turns out that this matches 11 characters. But you’re not finished. The group is followed by a *, so Perl goes back to the WHILEM node to see whether the group can match again:

Setting an EVAL scope, savestack=24 
11 <cc   > <# Code >    | 32:          WHILEM[1/1] 
                                  1 out of 0..32767  cc=7b0419c0

The second pass through the group starts here, again with the first branch:

Setting an EVAL scope, savestack=33 
11 <cc   > <# Code >    |  6:            BRANCH 
Setting an EVAL scope, savestack=33 
11 <cc   > <# Code >    |  7:              STAR 
ANYOF[\0-!$-\377] can match 0 times out of 32767...

This time the STAR doesn’t match anything, because the next character is a #. Perl then proceeds:

Setting an EVAL scope, savestack=33 
11 <cc   > <# Code >    | 32:                WHILEM[1/1] 
2 out of 0..32767  cc=7b0419c0 
empty match detected,try continuation...

The second pass matched nothing, so Perl stops repeating the group. The first pass succeeded, so Perl cleans up and uses it to determine what portion goes into $1:

11 <cc   > <# Code >    | 33:                  NOTHING 
11 <cc   > <# Code >    | 34:                  CLOSE1

Now take care of the second part of the regular expression, which matches from the # to the end:

11 <cc   > <# Code >    | 36:                  OPEN2 
11 <cc   > <# Code >    | 38:                  EXACT <#> 
12 <c   #> < Code a>    | 40:                  STAR 
                         REG_ANY can match 17 times out of 32767...

Finally, with a little more cleanup, you’re finished:

  Setting an EVAL scope, savestack=33 
  29 <and comment> <
>    | 42:                    CLOSE2 
  29 <and comment> <
>    | 44:                    END 
Match successful!

So the string matches.

Debug Summary

Regular expressions are complex and tricky. The debug output is supposed to make things better, but it can itself sometimes be just as difficult to read as the expression it’s trying to clarify.

A full discussion of the debug output for regular expressions can be found in the online document perldoc perldebguts.
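
If you want to produce a trace like this yourself, the re pragma can turn it on. The sketch below is only an illustration: it uses a simplified pattern, not the full comment-matching expression, and the exact layout of the trace differs from one Perl version to another.

```perl
use strict;
use warnings;

my $line = "CC = gcc   # Code and comment\n";
my $matched;
{
    # Ask the regular expression engine to print its compile and
    # match trace. The trace is written to STDERR.
    use re 'debug';
    $matched = ($line =~ /^([^#]*)(#.*)/) ? 1 : 0;
}
print "matched: $matched\n";
```

Redirecting STDERR to a file makes a long trace easier to read.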

For those of you who want to see the regular expression matching system working really hard, the following listing contains the output of the program dissecting a line that contains not only a quoted string but also one with an escaped quote (\") in it.

Those of you who really want to get into the true guts of the regular expression parser can follow this listing as it works through many different branches and backtracks while parsing this string.

QSTR = "Str with \" quote" # Quoted string

Guessing start of match, REx `^((?:[^#"]*|".*?(?<!\\)")*)(#.*)' against `QSTR = "Str with \" quote" # Quoted string
'...
Found floating substr `#' at offset 27...
Guessed: match at offset 0
Matching REx `^((?:[^#"]*|".*?(?<!\\)")*)(#.*)' against `QSTR = "Str with \" quote" # Quoted string
'
  Setting an EVAL scope, savestack=15 
   0 <> <QSTR = "Str >    |  1:  BOL 
   0 <> <QSTR = "Str >    |  2:  OPEN1 
   0 <> <QSTR = "Str >    |  4:  CURLYX {0,32767} 
   0 <> <QSTR = "Str >    | 32:    WHILEM[1/1] 
                              0 out of 0..32767  cc=7b0419c0 
  Setting an EVAL scope, savestack=24 
   0 <> <QSTR = "Str >    |  6:      BRANCH 
  Setting an EVAL scope, savestack=24 
   0 <> <QSTR = "Str >    |  7:        STAR 
   ANYOF[\0-!$-\377] can match 7 times out of 32767...
  Setting an EVAL scope, savestack=24
   7 <TR = > <"Str wi>    | 32:          WHILEM[1/1]
                                    1 out of 0..32767  cc=7b0419c0 
  Setting an EVAL scope, savestack=33 
   7 <TR = > <"Str wi>    |  6:            BRANCH 
  Setting an EVAL scope, savestack=33 
   7 <TR = > <"Str wi>    |  7:              STAR 
   ANYOF[\0-!$-\377] can match 0 times out of 32767...
  Setting an EVAL scope, savestack=33 
   7 <TR = > <"Str wi>    | 32:                WHILEM[1/1] 
   2 out of 0..32767  cc=7b0419c0 
   empty match detected, try continuation... 
   7 <TR = > <"Str wi>    | 33:                  NOTHING 
   7 <TR = > <"Str wi>    | 34:                  CLOSE1 
   7 <TR = > <"Str wi>    | 36:                  OPEN2 
   7 <TR = > <"Str wi>    | 38:                  EXACT <#> 
                                            failed... 
                                          failed... 
                                        failed... 
   7 <TR = > <"Str wi>    | 18:              EXACT <"> 
   8 <R = "> <Str wit>    | 20:              MINMOD 
   8 <R = "> <Str wit>    | 21:              STAR 
  Setting an EVAL scope, savestack=33 
   8 <R = "> <Str wit>    | 23:                UNLESSM[-1]
   7 <TR = > <"Str wi>    | 25:                  EXACT <\>
                                            failed...
   8 <R = "> <Str wit>    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
   9 < = "S> <tr with>    | 23:                UNLESSM[-1]
   8 <R = "> <Str wit>    | 25:                  EXACT <\>
                                            failed...
   9 < = "S> <tr with>    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  10 <= "St> <r with >    | 23:                UNLESSM[-1]
   9 < = "S> <tr with>    | 25:                  EXACT <\>
                                            failed...
  10 <= "St> <r with >    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  11 < "Str> < with \>    | 23:                UNLESSM[-1]
  10 <= "St> <r with >    | 25:                  EXACT <\>
                                            failed...
  11 < "Str> < with \>    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  12 <"Str > <with \">    | 23:                UNLESSM[-1]
  11 < "Str> < with \>    | 25:                  EXACT <\>
                                            failed...
  12 <"Str > <with \">    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  13 <Str w> <ith \" >    | 23:                UNLESSM[-1]
  12 <"Str > <with \">    | 25:                  EXACT <\>
                                            failed...
  13 <Str w> <ith \" >    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  14 <tr wi> <th \" q>    | 23:                UNLESSM[-1]
  13 <Str w> <ith \" >    | 25:                  EXACT <\>
                                            failed...
  14 <tr wi> <th \" q>    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  15 <r wit> <h \" qu>    | 23:                UNLESSM[-1]
  14 <tr wi> <th \" q>    | 25:                  EXACT <\>
                                            failed...
  15 <r wit> <h \" qu>    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  16 < with> < \" quo>    | 23:                UNLESSM[-1]
  15 <r wit> <h \" qu>    | 25:                  EXACT <\>
                                            failed...
  16 < with> < \" quo>    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  17 <with > <\" quot>    | 23:                UNLESSM[-1]
  16 < with> < \" quo>    | 25:                  EXACT <\>
                                            failed...
  17 <with > <\" quot>    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  18 <ith \> <" quote>    | 23:                UNLESSM[-1]
  17 <with > <\" quot>    | 25:                  EXACT <\>
  18 <ith \> <" quote>    | 27:                  SUCCEED
                                            could match...
                                          failed...
                           REG_ANY can match 1 times out of 1...
  19 <th \"> < quote">    | 23:                UNLESSM[-1]
  18 <ith \> <" quote>    | 25:                  EXACT <\>
                                            failed...
  19 <th \"> < quote">    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  20 <h \" > <quote" >    | 23:                UNLESSM[-1]
  19 <th \"> < quote">    | 25:                  EXACT <\>
                                            failed...
  20 <h \" > <quote" >    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  21 < \" q> <uote" #>    | 23:                UNLESSM[-1]
  20 <h \" > <quote" >    | 25:                  EXACT <\>
                                            failed...
  21 < \" q> <uote" #>    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  22 <\" qu> <ote" # >    | 23:                UNLESSM[-1]
  21 < \" q> <uote" #>    | 25:                  EXACT <\>
                                            failed...
  22 <\" qu> <ote" # >    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  23 <" quo> <te" # Q>    | 23:                UNLESSM[-1]
  22 <\" qu> <ote" # >    | 25:                  EXACT <\>
                                            failed...
  23 <" quo> <te" # Q>    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  24 < quot> <e" # Qu>    | 23:                UNLESSM[-1]
  23 <" quo> <te" # Q>    | 25:                  EXACT <\>
                                            failed...
  24 < quot> <e" # Qu>    | 29:                EXACT <">
                                          failed...
                           REG_ANY can match 1 times out of 1...
  25 <quote> <" # Quo>    | 23:                UNLESSM[-1]
  24 < quot> <e" # Qu>    | 25:                  EXACT <\>
                                            failed... 
  25 <quote> <" # Quo>    | 29:                EXACT <"> 
  26 <uote"> < # Quot>    | 32:                WHILEM[1/1] 
                                          2 out of 0..32767  cc=7b0419c0 
  Setting an EVAL scope, savestack=46 
  26 <uote"> < # Quot>    |  6:                  BRANCH 
  Setting an EVAL scope, savestack=46 
  26 <uote"> < # Quot>    |  7:                    STAR 
                           ANYOF[\0-!$-\377] can match 1 times out of 32767...
  Setting an EVAL scope, savestack=46 
  27 <ote" > <# Quote>    | 32:                      WHILEM[1/1] 
                                                3 out of 0..32767 
  Setting an EVAL scope, savestack=59 
  27 <ote" > <# Quote>    |  6:                        BRANCH 
  Setting an EVAL scope, savestack=59 
  27 <ote" > <# Quote>    |  7:                          STAR 
                           ANYOF[\0-!$-\377] can match 0 times out of 32767...
  Setting an EVAL scope, savestack=59 
  27 <ote" > <# Quote>    | 32:                            WHILEM[1/1] 
                                                      4 out of 0..32767 cc=7b0419c0 
                                                      empty match detected, try continuation...
  27 <ote" > <# Quote>    | 33:                              NOTHING 
  27 <ote" > <# Quote>    | 34:                              CLOSE1 
  27 <ote" > <# Quote>    | 36:                              OPEN2 
  27 <ote" > <# Quote>    | 38:                              EXACT <#> 
  28 <te" #> < Quoted>    | 40:                              STAR 
                           REG_ANY can match 14 times out of 32767... 
  Setting an EVAL scope, savestack=59 
  42 <uoted string> <
>    | 42:                                CLOSE2 
  42 <uoted string> <
>    | 44:                                END 
Match successful!
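
After a trace that long, it helps to see what the match actually captured. The following minimal sketch runs the same regular expression against the same line and prints the two groups. The variable names are only for illustration.

```perl
use strict;
use warnings;

# q{} keeps the backslash in front of the inner quote as a literal character.
my $line = q{QSTR = "Str with \" quote" # Quoted string} . "\n";

my ($code, $comment);
if ($line =~ /^((?:[^#"]*|".*?(?<!\\)")*)(#.*)/) {
    # $1 is the code, including the whole quoted string; $2 is the comment.
    ($code, $comment) = ($1, $2);
    print "CODE:    '$code'\n";
    print "COMMENT: '$comment'\n";
}
```

The first group is 27 characters long, which agrees with the trace: CLOSE1 is reached at offset 27, the same offset at which Perl found the # before it started matching.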

Regular Expression Element Summary

In this section, you saw how to use all the different elements of a regular expression, including the following:

  • Atoms— These can be simple things such as letters, digits, and other single characters, or they can be a character range such as [A-Z] or [AEIOU]. They can also be any other regular expression that has been turned into a group by () or related operators.

  • Ranges— A range is a set of characters enclosed in [].

  • Anchors— Anchors serve to position a pattern on the line. You’ve used the beginning of line (^) anchor to make sure that your regular expression is at the right place. Other anchors include things such as the end of line ($) and word boundaries (\b).

  • Modifiers— Modifiers tell Perl how many of an item to expect. You used the * (zero or more times) modifier extensively in this example. Another common modifier is +, which stands for “one or more times.”

  • Grouping— Grouping occurs when you want to split a line and get the result in the $n Perl variables.

  • Conditionals— Conditionals give Perl a “match if” capability. You used one to “match # if it’s not escaped” ((?<!\\)#).

By putting together all these elements, you can easily build up a very powerful string-matching capability.
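
As a quick refresher, the sketch below uses each kind of element once. The sample strings and variable names are made up for this illustration and do not come from the earlier listings.

```perl
use strict;
use warnings;

my $line = 'CC = gcc # the compiler';

# Atom and range: is there an uppercase letter anywhere in the line?
my $has_upper = $line =~ /[A-Z]/ ? 1 : 0;

# Anchors: ^ ties the pattern to the start of the line,
# \b to the edges of a word.
my $starts_cc = $line =~ /^CC/ ? 1 : 0;
my $has_word  = $line =~ /\bgcc\b/ ? 1 : 0;

# Modifier and grouping: + asks for one or more lowercase letters,
# and the parentheses return the matched text.
my ($value) = $line =~ /= ([a-z]+)/;

# Conditional: match a # only if it is not preceded by a backslash.
my $escaped  = 'echo \# not a comment';
my $has_hash = $escaped =~ /(?<!\\)#/ ? 1 : 0;

print "upper=$has_upper start=$starts_cc word=$has_word ",
      "value=$value hash=$has_hash\n";
```

The last test comes out false because the only # in that string is escaped.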

Regular Expression Construction

You should remember a few things when you create regular expressions:

  • Start slowly— Regular expressions are complex and tricky. Start with something simple and build on it later.

  • One step at a time— Because one character out of place in a regular expression can change the whole meaning of the operation, you should build slowly on what works to make sure that what you are doing is right.

  • Test often— What you think you’ve written and what you really wrote can easily be different. Make sure that you check things out carefully at each stage of the construction. (Vim is extremely helpful when it comes to verifying regular expressions.)

  • Comment verbosely— Every character in a regular expression has its own meaning. It was put there for a purpose. When you construct it, write down the purpose by sticking it in a comment. It’s easy for even the programmer who wrote it to forget why a regular expression was constructed the way it was, so write down everything and stick that information in the comments. The programmer’s life you save may be your own.

  • Check the documentation regularly— A ton of things can go into a regular expression. To make sure that you are using the right ones, and to make sure that you aren’t missing things, check out the online documentation (perlre) regularly. There may be something inside that will surprise you.
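
One way to follow the "comment verbosely" and "test often" advice is sketched below. It assumes that you are willing to use the x flag, which tells Perl to ignore whitespace in the pattern and to treat # as the start of a comment, so a literal # has to be written as \#. The third test line is a made-up example. Keep $ and @ out of the comments, because Perl still interpolates variables inside them.

```perl
use strict;
use warnings;

my $comment_re = qr{
    ^                   # start of the line
    (                   # first group: everything before the comment
      (?:
          [^\#"]*       # a run of characters that are not # or "
        |               # or
          ".*?(?<!\\)"  # a quoted string ending at an unescaped "
      )*
    )
    ( \# .* )           # second group: the comment, from # to end of line
}x;

# Each test case is a line and the comment we expect to find in it.
my @cases = (
    [ 'CC = gcc   # Code and comment',              '# Code and comment' ],
    [ 'QSTR = "Str with \" quote" # Quoted string', '# Quoted string'    ],
    [ 'NAME = "a#b" # real comment',                '# real comment'     ],
);

my $failures = 0;
for my $case (@cases) {
    my ($line, $want) = @$case;
    my (undef, $comment) = $line =~ $comment_re;
    my $ok = defined $comment && $comment eq $want;
    $failures++ unless $ok;
    print(($ok ? 'ok' : 'NOT OK'), ": $line\n");
}
print "$failures failure(s)\n";
```

Adding a new line to @cases every time you change the expression gives you a quick check that the change did what you intended.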

Finally, now is a good time to take a break and depressurize. You’ve probably never had such a concentrated session of powerful and cryptic syntax thrown at you at one time. Take a break, let the concepts settle in a little, and continue after a cup of coffee.
