Regular Expressions

The comparison and manipulation features provided by the String and StringBuilder classes are adequate for simple operations. However, when complex string manipulations are required or large amounts of text need to be processed, regular expressions can provide a more efficient solution. Although regular expressions are new to Java version 1.4, many third-party Java regular expression libraries have been available for some time. .NET includes regular expression functionality as part of the standard class libraries in the System.Text.RegularExpressions namespace.

The .NET regular expression implementation is refreshingly straightforward and contains some functionality not available in the Java implementation. In the following sections, we’ll cover the use of the .NET regular expression classes; where appropriate, we’ll contrast them with those provided in Java version 1.4.

Compiling Regular Expressions

The java.util.regex.Pattern class represents a compiled regular expression; the .NET equivalent is System.Text.RegularExpressions.Regex. Whereas Java provides a static factory method for creating Pattern instances, Regex uses a constructor; we contrast these approaches in Table 7-12. Both implementations offer two overload versions with similar signatures and return an immutable representation of a compiled regular expression. The first version takes a string containing a regular expression. The second takes both a regular expression and a bit mask containing compilation flags; we discuss these flags later in this section.

Table 7-12. Regular Expression Creation in Java and .NET

java.util.regex.Pattern | System.Text.RegularExpressions.Regex
Pattern.compile(String) | Regex(string)
Pattern.compile(String, flags) | Regex(string, RegexOptions)

The following example shows the Java and C# statements required to compile a regular expression that matches name/value pairs of the form author = Adam Freeman:

// Java
Pattern p = Pattern.compile("\\b\\w+\\s*=\\s*.*");

// C#
Regex r = new Regex(@"\b\w+\s*=\s*.*");

Note the following:

  • The C# example uses the @ symbol to indicate a verbatim string. This permits the inclusion of backslash characters without the need to escape them, as is required in the Java equivalent.

  • The regular expression \b\w+\s*=\s*.* used in the preceding example can be broken down as follows:

    • \b matches a word boundary.

    • \w+ matches one or more word characters.

    • \s* matches zero or more white-space characters.

    • = matches the equal sign.

    • .* matches zero or more of any character except the newline character.
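To make the effect of the verbatim string concrete, the following minimal sketch writes the same pattern both ways and checks it against sample input. The class name VerbatimPatternDemo and the sample strings are ours, chosen purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public class VerbatimPatternDemo {

    // Verbatim string: backslashes are taken literally
    public const string Verbatim = @"\b\w+\s*=\s*.*";

    // Regular string: every backslash must be escaped
    public const string Escaped = "\\b\\w+\\s*=\\s*.*";

    public static void Main() {
        // Both literals produce exactly the same pattern text
        Console.WriteLine(Verbatim == Escaped);

        Regex r = new Regex(Verbatim);

        // A word, an equal sign, and a value: the pattern matches
        Console.WriteLine(r.IsMatch("author = Adam Freeman"));

        // No equal sign, so the pattern cannot match
        Console.WriteLine(r.IsMatch("author Adam Freeman"));
    }
}
```

The two literals differ only in how the compiler reads them; the regular expression engine receives identical text in both cases.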

More Info

For a complete description of the syntax supported by .NET regular expressions, refer to the .NET documentation.

Regular Expression Compilation Flags

When a regular expression is constructed, compilation flags can be provided to modify its behavior. In .NET, these flags are specified as a bit mask using members of the System.Text.RegularExpressions.RegexOptions enumeration. Table 7-13 summarizes the more useful flags alongside their Java equivalents.

Table 7-13. Regular Expression Compilation Flags in Java and .NET

Java | .NET | Description
COMMENTS | IgnorePatternWhitespace | Ignores unescaped white space in the pattern and permits comments introduced with #.
N/A | Compiled | Compiles the regular expression to MSIL code. See the next section for more details.
N/A | ExplicitCapture | Only explicitly named or numbered groups are valid captures.
CASE_INSENSITIVE, UNICODE_CASE | IgnoreCase | Specifies case-insensitive matching. The Java CASE_INSENSITIVE flag works only for ASCII characters. It must be used in conjunction with UNICODE_CASE to support case-insensitive Unicode.
MULTILINE | Multiline | Specifies multiline mode, in which ^ and $ match at the beginning and end of each line rather than only at the beginning and end of the input.
N/A | RightToLeft | Specifies that searches should go from right to left.
DOTALL | Singleline | Specifies single-line mode, in which the period matches every character, including the newline character.
N/A | ECMAScript | Enables ECMAScript-compliant behavior.
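Because the flags form a bit mask, they are combined with the bitwise OR operator. The following sketch, which uses a hypothetical OptionsDemo class and sample input of our own, combines IgnoreCase and Multiline to count the lines that begin with the word author in any case.

```csharp
using System;
using System.Text.RegularExpressions;

public class OptionsDemo {

    // Counts the lines that start with "author", ignoring case
    public static int CountAuthorLines(string input) {
        // Multiline lets ^ match at the start of every line;
        // IgnoreCase treats "AUTHOR" and "author" as equivalent
        Regex r = new Regex(@"^author\b",
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
        return r.Matches(input).Count;
    }

    public static void Main() {
        string input =
            "author = Allen Jones\nAUTHOR = Adam Freeman\ntitle = none";
        Console.WriteLine(CountAuthorLines(input));
    }
}
```

Without the Multiline flag, ^ would match only at the start of the input, and only the first line would be counted.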

The RegexOptions.Compiled Flag

Given that all regular expressions must be compiled, the name of this flag is a little confusing. However, a regular expression is normally compiled to an intermediate form that is interpreted by the regular expressions engine at run time. The RegexOptions.Compiled flag forces compilation of the regular expression down to MSIL code. This results in faster execution but slower loading.

Use the Compiled flag sparingly. The common language runtime (CLR) cannot unload generated MSIL code without unloading the entire application domain. Using the Compiled flag frequently will result in compiled regular expressions consuming system resources.
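One way to follow this advice is to reserve the Compiled flag for expressions that are created once and reused many times. The sketch below assumes a hypothetical CompiledDemo class that holds such an expression in a static read-only field; it is an illustration of the pattern rather than a recommendation for every application.

```csharp
using System;
using System.Text.RegularExpressions;

public class CompiledDemo {

    // Created once and shared by all callers, so the cost of
    // generating MSIL is paid a single time
    private static readonly Regex PairRegex =
        new Regex(@"\b\w+\s*=\s*.*", RegexOptions.Compiled);

    // Tests whether a line contains a "name = value" pair
    public static bool IsPair(string line) {
        return PairRegex.IsMatch(line);
    }

    public static void Main() {
        string[] lines = {
            "author = Allen Jones", "no pair here", "year=2002"
        };
        foreach (string line in lines) {
            Console.WriteLine(line + " -> " + IsPair(line));
        }
    }
}
```

Creating a new compiled Regex inside IsPair on every call would be the pattern to avoid, because each instance could generate code that stays loaded for the life of the application domain.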

Manipulating Text

Once a compiled regular expression instance has been created, it can be used against input text to

  • Locate matching substrings.

  • Perform substring replacement.

  • Split the input text into component parts.

In .NET, these actions are all initiated through methods of the Regex instance. In Java, splitting the input text is initiated from the Pattern instance, but matching and replacing require the instantiation of a java.util.regex.Matcher object using the Pattern.matcher factory method. This is where the .NET and Java models diverge significantly.

We’ll cover matching, replacing, and splitting of input text in the following sections.

Matching Regular Expressions

If we are concerned only with determining whether an input text contains an occurrence of the regular expression, we use the Regex.IsMatch method. This returns a bool indicating whether a match was found but does not give access to any further match details. For example:

// Create an input text string
string input = "author = Allen Jones, author = Adam Freeman";

// Compile regular expression to find "name = value" pairs
Regex r = new Regex(@"\b\w+\s*=\s*.*");

// Test for a match
bool b = r.IsMatch(input);        // b = true;

If we need access to the number and location of matches found, two approaches are available. First, the Regex.Match method returns an instance of Match, which represents the result of a single match operation. The Match.Success property signals whether the match was successful. Match.NextMatch returns a new Match instance representing the next match in the input text. The use of Regex.Match, Match.Success, and Match.NextMatch enables the programmer to sequentially step through the matches in an input text. This is similar to the Java model of using repeated calls to Matcher.find.

Alternatively, the Regex.Matches method returns an instance of MatchCollection containing an enumerable set of Match instances representing all matches in the input text. Either the MatchCollection indexer or MatchCollection.GetEnumerator can be used to iterate across the set of Match instances. We demonstrate both the use of Match.NextMatch and MatchCollection.GetEnumerator in the following example:

// Create an input text string
string input = "author = Allen Jones \n author = Adam Freeman";

// Compile regular expression to find "name = value" pairs
Regex r = new Regex(@"\b\w+\s*=\s*.*");

// Using Match.NextMatch() to process all matches
Match m = r.Match(input);
while (m.Success) {
    System.Console.WriteLine(m.Value);
    m = m.NextMatch();
}

// Using MatchCollection to process all matches
MatchCollection mc = r.Matches(input);
foreach (Match x in mc) {
    System.Console.WriteLine(x.Value);
}

Both loops in this example produce the same output, resulting in the following display:

author = Allen Jones
author = Adam Freeman
author = Allen Jones
author = Adam Freeman

The Match instance provides access to details of the match, including the results of each capture group and subexpression capture. The members of the Match and MatchCollection classes are summarized in Table 7-14 and Table 7-15.

Table 7-14. System.Text.RegularExpressions.Match Member Summary

Member | Description
Properties |
Captures | Returns a CaptureCollection containing a set of all subexpression captures represented by Capture objects.
Groups | Returns a GroupCollection containing a set of Group objects representing the groups matched by the regular expression.
Index | The position in the input string where the first character of the match was located.
Length | The length of the captured substring.
Success | Indicates whether the match was a success.
Value | The captured substring.
Methods |
NextMatch() | Returns a Match instance that represents the next match in the input text.
Result() | Returns the expansion of a specified replacement pattern.
Synchronized() | Returns a thread-safe instance of the Match object.

Table 7-15. System.Text.RegularExpressions.MatchCollection Member Summary

Member | Description
Indexers |
<MatchCollection>[key] | Gets the Match object at the specified index.
Properties |
Count | Gets the number of Match instances contained.
Methods |
GetEnumerator() | Gets an IEnumerator that is used to iterate over the collection of Match objects.
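As a brief illustration of these members, the following sketch uses the MatchCollection indexer and Count property together with the Index, Length, and Value properties of Match. The MatchDetailsDemo class and its FindPairs helper are hypothetical names introduced for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public class MatchDetailsDemo {

    // Finds all "name = value" pairs in the input
    public static MatchCollection FindPairs(string input) {
        Regex r = new Regex(@"\b\w+\s*=\s*.*");
        return r.Matches(input);
    }

    public static void Main() {
        string input = "author = Allen Jones\nauthor = Adam Freeman";
        MatchCollection mc = FindPairs(input);

        Console.WriteLine("Matches found: " + mc.Count);

        // Use the indexer instead of an enumerator
        for (int i = 0; i < mc.Count; i++) {
            Match m = mc[i];
            Console.WriteLine("{0} (index {1}, length {2})",
                m.Value, m.Index, m.Length);
        }
    }
}
```

Here the first match starts at index 0 and the second at index 21, immediately after the newline character that ends the first line.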

The Match class is derived from Group, which in turn is derived from Capture. The Capture and Group instances retrievable through the Match class represent the specific group and subexpression matches that constitute a successful match. The CaptureCollection and GroupCollection classes provide the same functionality for the Capture and Group objects that the MatchCollection provides for the Match object. The members of Capture, Group, CaptureCollection, and GroupCollection are similar to those of Match and MatchCollection, discussed previously, and will not be covered in detail. Refer to the .NET documentation for complete details.
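Although we don't cover these classes in detail, a short sketch may help show how they relate. The example below assumes two patterns of our own: one with named groups (name and value) and one in which a group sits inside a repeated construct, so that it captures more than once within a single match.

```csharp
using System;
using System.Text.RegularExpressions;

public class GroupsDemo {

    // Named groups; the group names are chosen for illustration
    public static readonly Regex PairRegex =
        new Regex(@"\b(?<name>\w+)\s*=\s*(?<value>.*)");

    // The "word" group is repeated, so it captures once per word
    public static readonly Regex WordsRegex =
        new Regex(@"^(?:(?<word>\w+)\s*)+$");

    public static void Main() {
        Match m = PairRegex.Match("author = Adam Freeman");
        Console.WriteLine(m.Groups["name"].Value);
        Console.WriteLine(m.Groups["value"].Value);

        Match w = WordsRegex.Match("Adam Freeman");

        // Group.Value holds only the last capture made by the group
        Console.WriteLine(w.Groups["word"].Value);

        // The Captures collection holds every capture in order
        foreach (Capture c in w.Groups["word"].Captures) {
            Console.WriteLine(c.Value + " at " + c.Index);
        }
    }
}
```

The Groups indexer also accepts an integer; index 0 always refers to the entire match.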

Replacing Substrings

There are two approaches to substring replacement. First, the Regex.Replace method replaces any matches in an input text with a specified substitution string. Overloaded versions of Replace allow the specification of a maximum number of replacements to make and a search starting position in the input text.
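The overloads that limit the number of replacements and set the starting position can be sketched as follows. The ReplaceOverloadsDemo class and its helper methods are hypothetical; the calls themselves use the instance Replace overloads that take a count and a start position.

```csharp
using System;
using System.Text.RegularExpressions;

public class ReplaceOverloadsDemo {

    private static readonly Regex TheRegex = new Regex("the");

    // Replaces only the first match in the input
    public static string ReplaceFirst(string input) {
        return TheRegex.Replace(input, "a", 1);
    }

    // Replaces only the first match found at or after position start
    public static string ReplaceFirstFrom(string input, int start) {
        return TheRegex.Replace(input, "a", 1, start);
    }

    public static void Main() {
        string text =
            "the quick red fox jumped over the lazy brown dog.";
        Console.WriteLine(ReplaceFirst(text));
        Console.WriteLine(ReplaceFirstFrom(text, 4));
    }
}
```

In the second call the search begins at position 4, so the leading the is skipped and only the second occurrence is replaced.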

Alternatively, an overloaded version of the Replace method takes a MatchEvaluator delegate as an argument. For each match that occurs, the delegate is invoked. The delegate is passed a Match instance that represents the current match. The delegate implements any decision-making logic required and returns a string that will be used as the substitution string.

The MatchEvaluator delegate has the following signature:

public delegate string MatchEvaluator(Match match);

The following example demonstrates both of these approaches:

using System;
using System.Text.RegularExpressions;

public class REReplace {

    // Declare MatchEvaluator delegate target method
    public static string MyEval(Match match) {
        switch (match.Value) {
            case "fox" : return "cow";
            case "dog" : return "pig";
            default : return match.Value;
        }
    }

    public static void Main() {

        // Create an input text
        string text =
            "the quick red fox jumped over the lazy brown dog.";
        // Perform a complete replacement of "the" with "a"
        Regex r = new Regex("the");
        System.Console.WriteLine(r.Replace(text, "a"));

        // Perform evaluated replacement of any word that
        // has the lower case letter "o" in, but not at
        // the beginning or end.
        r = new Regex(@"\w+o\w+");
        System.Console.WriteLine(r.Replace(text,
            new MatchEvaluator(REReplace.MyEval)));
    }
}

The output from the example is

a quick red fox jumped over a lazy brown dog.
the quick red cow jumped over the lazy brown pig.

Note that in the second line of output, although brown matches our regular expression, it is not replaced based on the logic in the MyEval delegate.

Splitting Strings

The splitting of an input text around a regular expression is handled using the Regex.Split method. Split takes an input string and an optional integer that sets the maximum number of substrings to return, and it returns a string array containing the extracted substrings. This is demonstrated in the following code fragment:

// Split the String at the first two occurrences of the regex " and "
string input = "bill and bob and betty and dave";
Regex r = new Regex(" and ");
string[] result = r.Split(input, 3);
foreach (string s in result) {
    System.Console.WriteLine(s);
}

The code produces the following output:

bill
bob
betty and dave

Note that we specified a maximum of three substrings in Regex.Split, so only the first two occurrences of the expression are used and betty and dave remains unsplit.
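Split becomes more useful when the separator is a pattern rather than a fixed string. The following sketch assumes a list whose items are separated by commas or semicolons with inconsistent white space; the SplitDemo class and the sample input are ours.

```csharp
using System;
using System.Text.RegularExpressions;

public class SplitDemo {

    // A comma or semicolon, plus any white space on either side
    public static readonly Regex Separator =
        new Regex(@"\s*[,;]\s*");

    public static void Main() {
        string input = "bill, bob;betty ,dave";

        // Split at every separator
        foreach (string s in Separator.Split(input)) {
            Console.WriteLine(s);
        }

        // Return at most two substrings; the second holds the
        // unsplit remainder of the input
        foreach (string s in Separator.Split(input, 2)) {
            Console.WriteLine(s);
        }
    }
}
```

Because the white space is part of the separator pattern, the extracted names need no further trimming.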

Ad Hoc Regular Expressions

Both Java and .NET provide support for ad hoc regular expression usage without the need to explicitly instantiate any regular expression objects. Java exposes these capabilities predominantly through the String class, whereas .NET provides static methods in the Regex class. If access to the match data isn’t required, both platforms offer the same capabilities. However, the .NET static methods provide a better solution if access to the match results is required. Both approaches are equivalent to compiling a regular expression, using it, and discarding it, so if the regular expression is to be used more than once, explicit instantiation and reuse is more efficient. These methods are contrasted in Table 7-16.

Table 7-16. Ad Hoc Regular Expression Functionality in Java and .NET

Java | .NET | Description
Pattern.matches(), String.matches() | Regex.IsMatch() | Searches an input string for an occurrence of a regular expression and returns a bool value indicating whether a match was found. Access to the match results is not possible. Note that the Java methods succeed only if the entire input matches the expression.
N/A | Regex.Match() | Returns the first regular expression match in an input string.
N/A | Regex.Matches() | Returns a MatchCollection of all regular expression matches in an input string.
String.replaceAll() | Regex.Replace() | Replaces all occurrences of a matched expression with a provided string. An overloaded version also supports the use of a delegate as a callback mechanism to provide per-match decision-making capabilities.
String.replaceFirst() | N/A | Not directly supported but can be achieved using the correct regular expression syntax.
String.split() | Regex.Split() | Splits an input string into a string array around matches.
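The static methods can be sketched as follows; each takes the input text first and the regular expression second. The AdHocDemo class and the sample input are ours. The final helper shows one possible way to approximate String.replaceFirst, using the instance Replace overload that limits the number of replacements; it requires an instance, so it is not strictly ad hoc.

```csharp
using System;
using System.Text.RegularExpressions;

public class AdHocDemo {

    // Approximates Java's String.replaceFirst by limiting an
    // instance Replace call to a single replacement
    public static string ReplaceFirst(string input, string pattern,
                                      string replacement) {
        return new Regex(pattern).Replace(input, replacement, 1);
    }

    public static void Main() {
        string input = "bill and bob and betty";

        // Test for a match without creating a Regex instance
        Console.WriteLine(Regex.IsMatch(input, @"\bbob\b"));

        // Retrieve the first match, then count all matches
        Console.WriteLine(Regex.Match(input, @"b\w+").Value);
        Console.WriteLine(Regex.Matches(input, @"b\w+").Count);

        // Replace every occurrence, then split around the expression
        Console.WriteLine(Regex.Replace(input, " and ", ", "));
        Console.WriteLine(Regex.Split(input, " and ").Length);

        Console.WriteLine(ReplaceFirst(input, " and ", ", "));
    }
}
```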
