The .NET Framework Class Library (FCL) includes the System.Text.RegularExpressions
namespace, which is devoted to creating, executing, and obtaining results from regular expressions executed against a string.
Regular expressions take the form of a pattern that matches zero or more characters within a string. The simplest of these patterns, such as .*
(which matches anything except newline characters) and [A-Za-z]
(which matches any letter) are easy to learn, but more advanced patterns can be difficult to learn and even more difficult to implement correctly. Learning and understanding regular expressions can take considerable time and effort, but the work will pay off.
Two books that will help you learn and expand your understanding of regular expressions are Michael Fitzgerald’s Introducing Regular Expressions and Jan Goyvaerts and Steven Levithan’s Regular Expressions Cookbook, both from O’Reilly.
Regular expression patterns can take a simple form—such as a single word or character—or a much more complex pattern. The more complex patterns can recognize and match such items as the year portion of a date, all of the <SCRIPT>
tags in an ASP page, or a phrase in a sentence that varies with each use. The .NET regular expression classes provide a very flexible and powerful way to perform tasks such as recognizing text, replacing text within a string, and splitting up text into individual sections based on one or more complex delimiters.
Despite the complexity of regular expression patterns, the regular expression classes in the FCL are easy to use in your applications. Executing a regular expression consists of the following steps:
1. Create an instance of a Regex object that contains the regular expression pattern along with any options for executing that pattern.
2. Retrieve a reference to an instance of a Match object by calling the Match instance method if you want only the first match found. Or, retrieve a reference to an instance of the MatchCollection object by calling the Matches instance method if you want more than just the first match found. If, however, you want to know only whether the input string was a match and do not need the extra details on the nature of the match, you can use the Regex.IsMatch method.
3. If you've called the Matches method to retrieve a MatchCollection object, iterate over the MatchCollection using a foreach loop. Each iteration gives you access to one of the Match objects that the regular expression produced.
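The steps above can be sketched as follows. This is a minimal illustration, not one of the chapter's recipes: the class name RegexStepsSketch, its method names, and the \d+ pattern are assumptions chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names and the pattern are illustrative only.
public static class RegexStepsSketch
{
    // Step 1: create the Regex with its pattern and options.
    private static readonly Regex Digits = new Regex(@"\d+", RegexOptions.None);

    // Step 2 (first match only): Match returns a single Match object.
    public static string FirstNumber(string input)
    {
        Match first = Digits.Match(input);
        return first.Success ? first.Value : null;
    }

    // Steps 2 and 3 (all matches): Matches returns a MatchCollection,
    // which is walked with foreach.
    public static List<string> AllNumbers(string input)
    {
        List<string> found = new List<string>();
        foreach (Match m in Digits.Matches(input))
        {
            found.Add(m.Value);
        }
        return found;
    }

    // When only a yes/no answer is needed, IsMatch is enough.
    public static bool HasNumber(string input)
    {
        return Digits.IsMatch(input);
    }
}
```

Checking the Success property before reading Value matters because Match returns an unsuccessful Match object, not null, when nothing matches.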
You have a regular expression that contains one or more named groups (also known as named capture groups), such as the following:
\\\\(?<TheServer>\w*)\\(?<TheService>\w*)\\
where the named group TheServer
will match any server name within a UNC string, and TheService
will match any service name within a UNC string.
This pattern does not match the UNCW format.
You need to store the groups that are returned by this regular expression in a keyed collection (such as a Dictionary<string, Group>
) in which the key is the group name.
The ExtractGroupings
method shown in Example 7-1 obtains a set of Group
objects keyed by their matching group name.
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static List<Dictionary<string, Group>> ExtractGroupings(string source,
    string matchPattern, bool wantInitialMatch)
{
    List<Dictionary<string, Group>> keyedMatches =
        new List<Dictionary<string, Group>>();

    // Group 0 is the complete match; skip it unless the caller wants it.
    int startingElement = 1;
    if (wantInitialMatch)
    {
        startingElement = 0;
    }

    Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
    MatchCollection theMatches = RE.Matches(source);

    foreach (Match m in theMatches)
    {
        Dictionary<string, Group> groupings = new Dictionary<string, Group>();
        for (int counter = startingElement; counter < m.Groups.Count; counter++)
        {
            // If we had just returned the MatchCollection directly, the
            // GroupNameFromNumber method would not be available to use.
            groupings.Add(RE.GroupNameFromNumber(counter), m.Groups[counter]);
        }
        keyedMatches.Add(groupings);
    }

    return (keyedMatches);
}
The ExtractGroupings
method can be used in the following manner to extract named groups and organize them by name:
public static void TestExtractGroupings()
{
    string source = @"Path = ""\\MyServer\MyService\MyPath; \\MyServer2\MyService2\MyPath2\""";
    string matchPattern = @"\\\\(?<TheServer>\w*)\\(?<TheService>\w*)\\";

    foreach (Dictionary<string, Group> grouping in
             ExtractGroupings(source, matchPattern, true))
    {
        foreach (KeyValuePair<string, Group> kvp in grouping)
            Console.WriteLine($"Key/Value = {kvp.Key} / {kvp.Value}");
        Console.WriteLine("");
    }
}
This test method creates a source
string and a regular expression pattern in the matchPattern
variable. The two groupings in this regular expression are shown here:
string matchPattern = @"\\\\(?<TheServer>\w*)\\(?<TheService>\w*)\\";
The names for these two groups are TheServer
and TheService
. Text that matches either of these groupings can be accessed through these group names.
The source
and matchPattern
variables are passed in to the ExtractGroupings
method, along with a Boolean value, which is discussed shortly. This method returns a List<T>
containing Dictionary<string,Group>
objects. These Dictionary<string,Group>
objects contain the matches for each of the named groups in the regular expression, keyed by their group name.
This test method, TestExtractGroupings
, returns the following:
Key/Value = 0 / \\MyServer\MyService\
Key/Value = TheServer / MyServer
Key/Value = TheService / MyService

Key/Value = 0 / \\MyServer2\MyService2\
Key/Value = TheServer / MyServer2
Key/Value = TheService / MyService2
If the last parameter to the ExtractGroupings
method were to be changed to false
, the following output would result:
Key/Value = TheServer / MyServer
Key/Value = TheService / MyService

Key/Value = TheServer / MyServer2
Key/Value = TheService / MyService2
The only difference between these two outputs is that the first grouping is not displayed when the last parameter to ExtractGroupings
is changed to false
. The first grouping is always the complete match of the regular expression.
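A related way to key groups by name is to ask the Regex object for its group names up front. The following is a minimal sketch under stated assumptions: GroupNamesSketch and ExtractGroupValues are hypothetical names, and the dictionary stores the captured text instead of the Group objects used by ExtractGroupings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical variation on ExtractGroupings; names are illustrative only.
public static class GroupNamesSketch
{
    // Returns one dictionary per match, mapping group name to captured text.
    public static List<Dictionary<string, string>> ExtractGroupValues(
        string source, string matchPattern)
    {
        var results = new List<Dictionary<string, string>>();
        Regex re = new Regex(matchPattern);

        foreach (Match m in re.Matches(source))
        {
            var values = new Dictionary<string, string>();

            // GetGroupNames includes "0", the name of the complete match.
            foreach (string name in re.GetGroupNames())
            {
                values[name] = m.Groups[name].Value;
            }
            results.Add(values);
        }
        return results;
    }
}
```

Because the entry keyed "0" holds the complete match, a caller that wants only the named groups can skip that key, which mirrors passing false as the last argument to ExtractGroupings.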
Groups within a regular expression can be defined in one of two ways. The first way is to add parentheses around the subpattern that you wish to define as a grouping. This type of grouping is sometimes labeled as unnamed. Later you can easily extract this grouping from the final text in each returned Match
object by running the regular expression. The regular expression for this recipe could be modified, as follows, to use a simple unnamed group:
string matchPattern = @"\\\\(\w*)\\(\w*)\\";
After running the regular expression, you can access these groups using a numeric integer value starting with 1.
The second way to define a group within a regular expression is to use one or more named groups. You define a named group by adding parentheses around the subpattern that you wish to define as a grouping and adding a name to each grouping, using the following syntax:
(?<Name>\w*)
The Name
portion of this syntax is the name you specify for this group. After executing this regular expression, you can access this group by the name Name
.
To access each group, you must first use a loop to iterate each Match
object in the MatchCollection
. For each Match
object, you access the GroupCollection
’s indexer, using the following unnamed syntax:
string group1 = m.Groups[1].Value; string group2 = m.Groups[2].Value;
or the following named syntax, where m
is the Match
object:
string group1 = m.Groups["Group1_Name"].Value; string group2 = m.Groups["Group2_Name"].Value;
If the Match
method was used to return a single Match
object instead of the MatchCollection
, use the following syntax to access each group:
// Unnamed syntax
string group1 = theMatch.Groups[1].Value;
string group2 = theMatch.Groups[2].Value;

// Named syntax
string group1 = theMatch.Groups["Group1_Name"].Value;
string group2 = theMatch.Groups["Group2_Name"].Value;
where theMatch
is the Match
object returned by the Match
method.
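To make the two access styles concrete, here is a small sketch that reads the same capture once by number and once by name. The class and method names (GroupAccessSketch, ServiceByNumber, ServiceByName) are assumptions made for illustration; the patterns are the unnamed and named UNC patterns from this recipe.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class GroupAccessSketch
{
    // Unnamed groups are read by position; index 0 is the complete match.
    public static string ServiceByNumber(string uncPath)
    {
        Match theMatch = Regex.Match(uncPath, @"\\\\(\w*)\\(\w*)\\");
        return theMatch.Success ? theMatch.Groups[2].Value : null;
    }

    // Named groups are read by the name given in the pattern.
    public static string ServiceByName(string uncPath)
    {
        Match theMatch = Regex.Match(uncPath,
            @"\\\\(?<TheServer>\w*)\\(?<TheService>\w*)\\");
        return theMatch.Success ? theMatch.Groups["TheService"].Value : null;
    }
}
```

Named access tends to survive later edits to the pattern better, since adding another pair of parentheses shifts group numbers but not group names.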
The “.NET Framework Regular Expressions” and “Dictionary Class” topics in the MSDN documentation.
Use the VerifyRegEx
method shown in Example 7-2 to test the validity of a regular expression’s syntax.
using System;
using System.Text.RegularExpressions;

public static bool VerifyRegEx(string testPattern)
{
    bool isValid = true;

    if ((testPattern?.Length ?? 0) > 0)
    {
        try
        {
            Regex.Match("", testPattern);
        }
        catch (ArgumentException)
        {
            // BAD PATTERN: syntax error
            isValid = false;
        }
    }
    else
    {
        // BAD PATTERN: pattern is null or empty
        isValid = false;
    }

    return (isValid);
}
To use this method, pass it the regular expression that you wish to verify:
public static void TestUserInputRegEx(string regEx)
{
    if (VerifyRegEx(regEx))
        Console.WriteLine("This is a valid regular expression.");
    else
        Console.WriteLine("This is not a valid regular expression.");
}
The VerifyRegEx
method calls the static Regex.Match
method, which is useful for running regular expressions on the fly against a string. The static Regex.Match
method returns a single Match
object. By using this static method to run a regular expression against a string (in this case, an empty string), you can determine whether the regular expression is invalid by watching for a thrown exception. The Regex.Match
method will throw an ArgumentException
if the regular expression is not syntactically correct. The Message
property of this exception contains the reason the regular expression failed to run, and the ParamName
property contains the regular expression passed to the Match
method. Both of these properties are read-only.
Before testing the regular expression with the static Match
method, VerifyRegEx
tests the regular expression to see if it is null
or blank. A null
regular expression string results in an ArgumentNullException
when passed in to the Match
method. On the other hand, if a blank regular expression is passed in to the Match
method, no exception is thrown (as long as a valid string is also passed to the first parameter of the Match
method).
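The three cases just described (a valid or blank pattern, a null pattern, and a syntactically invalid pattern) can be told apart with separate catch blocks. The sketch below is illustrative only; PatternCheckSketch and Classify are hypothetical names and are not part of the VerifyRegEx recipe.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class PatternCheckSketch
{
    public static string Classify(string pattern)
    {
        try
        {
            // A blank pattern is accepted; it matches the empty string.
            Regex.Match("", pattern);
            return "ok";
        }
        catch (ArgumentNullException)
        {
            // Must be caught first: it derives from ArgumentException.
            return "null pattern";
        }
        catch (ArgumentException ex)
        {
            // Message describes why the pattern could not be parsed.
            return "syntax error: " + ex.Message;
        }
    }
}
```

VerifyRegEx does not need this distinction, because it rejects null and blank patterns before calling Match, but the split can be useful when the reason for the failure has to be reported to a user.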
While this recipe validates whether or not the regular expression syntax is correct, it does not look for poorly written expressions. One common case of poorly written regular expressions is when the expressions rely on excessive backtracking. Excessive backtracking can cause the regular expression to take an exponentially long time to complete, making it appear as if the code executing the regular expression has frozen.
For a thorough explanation of backtracking in regular expressions, read the MSDN topic “Backtracking” under the “.NET Framework Regular Expressions” parent topic.
In cases where regular expressions use backtracking, it is recommended that you use a timeout value to limit the time a regular expression has to complete. Use the following Regex
constructor:
Regex (String, RegexOptions, TimeSpan)
where TimeSpan
is the length of time within which the regular expression is allowed to execute:
Regex regex = new Regex(bkTrkPattern, RegexOptions.None, TimeSpan.FromMilliseconds(1000));
You can then execute the regular expression within a try
-catch
block, using the RegexMatchTimeoutException
to catch a poorly written regular expression that takes an unusually long time to execute.
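A try-catch wrapper around a timed match might look like the following sketch. RegexTimeoutSketch and TryIsMatch are hypothetical names, and the choice to report a timeout through the return value is an assumption made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class RegexTimeoutSketch
{
    // Returns true if the match attempt finished within the timeout, in
    // which case isMatch holds the result. Returns false if it timed out.
    public static bool TryIsMatch(string input, string pattern,
        TimeSpan timeout, out bool isMatch)
    {
        isMatch = false;
        try
        {
            Regex regex = new Regex(pattern, RegexOptions.None, timeout);
            isMatch = regex.IsMatch(input);
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            // The pattern took longer than the timeout on this input.
            return false;
        }
    }
}
```

A nested quantifier such as (a+)+$ run against a long string of a characters that ends in a different character is a commonly cited trigger for heavy backtracking, so it is a reasonable input to try against a short timeout.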
Use the overloaded instance Replace
method shown in Example 7-3, which accepts a MatchEvaluator
delegate along with its other parameters. The MatchEvaluator
delegate is a callback method that overrides the default behavior of the Replace
method.
using System;
using System.Text.RegularExpressions;

public static string MatchHandler(Match theMatch)
{
    // Handle all ControlID_ entries.
    if (theMatch.Value.StartsWith("ControlID_", StringComparison.Ordinal))
    {
        long controlValue = 0;

        // Obtain the numeric value of the Top attribute.
        Match topAttributeMatch = Regex.Match(theMatch.Value, @"Top=([-]*\d*)");
        if (topAttributeMatch.Success)
        {
            if (topAttributeMatch.Groups[1].Value.Trim().Equals(""))
            {
                // If blank, set to zero.
                return (theMatch.Value.Replace(
                    topAttributeMatch.Groups[0].Value.Trim(), "Top=0"));
            }
            else if (topAttributeMatch.Groups[1].Value.Trim().StartsWith("-",
                     StringComparison.Ordinal))
            {
                // If negative (or only a negative sign), set to zero.
                return (theMatch.Value.Replace(
                    topAttributeMatch.Groups[0].Value.Trim(), "Top=0"));
            }
            else
            {
                // We have a valid number.
                // Convert the matched string to a numeric value.
                controlValue = long.Parse(topAttributeMatch.Groups[1].Value,
                    System.Globalization.NumberStyles.Any);

                // If the Top attribute is out of the specified range,
                // set it to zero.
                if (controlValue < 0 || controlValue > 5000)
                {
                    return (theMatch.Value.Replace(
                        topAttributeMatch.Groups[0].Value.Trim(), "Top=0"));
                }
            }
        }
    }

    return (theMatch.Value);
}
The ComplexReplace method, which passes this callback to the instance Replace
method, is shown here:
public static void ComplexReplace(string matchPattern, string source)
{
    MatchEvaluator replaceCallback = new MatchEvaluator(MatchHandler);
    Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
    string newString = RE.Replace(source, replaceCallback);

    Console.WriteLine($"Replaced String = {newString}");
}
To use this callback method with the static Replace
method, modify the previous ComplexReplace
method as follows:
public static void ComplexReplace(string matchPattern, string source)
{
    MatchEvaluator replaceCallback = new MatchEvaluator(MatchHandler);
    string newString = Regex.Replace(source, matchPattern, replaceCallback);

    Console.WriteLine("Replaced String = " + newString);
}
where source
is the original string to run the replace operation against, and matchPattern
is the regular expression pattern to match in the source
string.
If the ComplexReplace
method is called from the following code:
public static void TestComplexReplace()
{
    string matchPattern = "(ControlID_.*)";
    string source = @"WindowID=Main
ControlID_TextBox1 Top=-100 Left=0 Text=BLANK
ControlID_Label1 Top=9999990 Left=0 Caption=Enter Name Here
ControlID_Label2 Top= Left=0 Caption=Enter Name Here";

    ComplexReplace(matchPattern, source);
}
only the Top
attributes of the ControlID_*
lines are changed from their original values to 0
.
The result of this replace action will change the Top
attribute value of a ControlID_*
line to 0
if it is less than 0 or greater than 5,000. Any other tag that contains a Top
attribute will remain unchanged. The following three lines of the source
string will be changed from:
ControlID_TextBox1 Top=-100 Left=0 Text=BLANK
ControlID_Label1 Top=9999990 Left=0 Caption=Enter Name Here
ControlID_Label2 Top= Left=0 Caption=Enter Name Here
to:
ControlID_TextBox1 Top=0 Left=0 Text=BLANK
ControlID_Label1 Top=0 Left=0 Caption=Enter Name Here
ControlID_Label2 Top=0 Left=0 Caption=Enter Name Here
The MatchEvaluator
delegate, which is automatically invoked when it is supplied as a parameter to the Regex
class’s Replace
method, allows for custom replacement of each string that conforms to the regular expression pattern.
If the current Match
object is operating on a ControlID_*
line with a Top
attribute that is out of the specified range, the code within the MatchHandler
callback method returns a new modified string. Otherwise, the currently matched string is returned unchanged. This allows you to override the default Replace
functionality by modifying only that part of the source
string that meets certain criteria. The code within this callback method gives you some idea of what you can accomplish using this replacement technique.
To make use of this callback method, you need a way to call it from the ComplexReplace
method. First, a variable of type System.Text.RegularExpressions.MatchEvaluator
is created. This variable (replaceCallback
) is the delegate that is used to call the MatchHandler
method:
MatchEvaluator replaceCallback = new MatchEvaluator(MatchHandler);
Finally, the Replace
method is called with the reference to the MatchEvaluator
delegate passed in as a parameter:
string newString = Regex.Replace(source, matchPattern, replaceCallback);
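A MatchEvaluator does not have to be a separately declared method; a lambda expression converts to the delegate as well. The sketch below is a smaller illustration of the same callback idea, with hypothetical names (EvaluatorSketch, DoubleNumbers) and a deliberately simple rule: every run of ASCII digits is replaced by twice its value.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class EvaluatorSketch
{
    // Replaces each run of ASCII digits with double its numeric value.
    public static string DoubleNumbers(string input)
    {
        return Regex.Replace(input, "[0-9]+", m =>
        {
            // The lambda is invoked once per match; its return value
            // becomes the replacement text for that match.
            long value = long.Parse(m.Value, CultureInfo.InvariantCulture);
            return (value * 2).ToString(CultureInfo.InvariantCulture);
        });
    }
}
```

This sketch assumes the digit runs are short enough to fit in a long; a production version would need to decide what to do with longer runs.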
The “.NET Framework Regular Expressions” topic in the MSDN documentation.
With the Split
method of the Regex
class, you can create a regular expression to indicate the types of tokens and separators that you are interested in gathering. This technique works especially well with equations, since the tokens of an equation are well defined. For example, the code:
using System;
using System.Text.RegularExpressions;

public static string[] Tokenize(string equation)
{
    Regex re = new Regex(@"([\+\-\*\(\)\^\\])");
    return (re.Split(equation));
}
will divide up a string according to the regular expression specified in the Regex
constructor. In other words, the string passed in to the Tokenize
method will be divided up based on the delimiters +, -, *, (, ), ^, and \. The following method will call the Tokenize method to tokenize the equation (y - 3)*(3111*x^21 + x + 320):
public static void TestTokenize()
{
    foreach (string token in Tokenize("(y - 3)*(3111*x^21 + x + 320)"))
        Console.WriteLine("String token = " + token.Trim());
}
which displays the following output:
String token =
String token = (
String token = y
String token = -
String token = 3
String token = )
String token =
String token = *
String token =
String token = (
String token = 3111
String token = *
String token = x
String token = ^
String token = 21
String token = +
String token = x
String token = +
String token = 320
String token = )
String token =
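Notice that each individual operator, parenthesis, and number has been broken out into its own separate token. Empty tokens appear wherever a delimiter sits at the start or end of the string or directly next to another delimiter.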
In real-world projects, you do not always have the luxury of being able to control the set of inputs to your code. By making use of regular expressions, you can take the original tokenizer and make it flexible enough to allow it to be applied to many types or styles of input.
The key method used here is the Split
instance method of the Regex
class. The return value of this method is a string array with elements that include each individual token of the source
string—the equation, in this case.
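If the empty and whitespace-padded elements in the returned array are not wanted, they can be filtered out after the split. This is a minimal sketch with hypothetical names (TokenizerSketch, TokenizeCompact); it reuses the delimiter pattern from Tokenize and is not part of the original recipe.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class TokenizerSketch
{
    // Splits on the same delimiters as Tokenize, then trims each token
    // and drops the empty ones.
    public static string[] TokenizeCompact(string equation)
    {
        return Regex.Split(equation, @"([\+\-\*\(\)\^\\])")
            .Select(token => token.Trim())
            .Where(token => token.Length > 0)
            .ToArray();
    }
}
```

The delimiters stay in the result because the pattern wraps them in a capturing group; Split includes captured text in the returned array.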
Note that the static Split
method allows RegexOptions
enumeration values to be used, while the instance method allows for a starting position to be defined and a maximum number of matches to occur. This may have some bearing on whether you choose the static or instance method.
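The difference between the two kinds of Split overloads can be seen side by side in the sketch below. The class and method names are assumptions for illustration, as are the "x" and "," delimiters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class SplitOverloadsSketch
{
    // Static overload: RegexOptions are supplied with the call.
    public static string[] SplitIgnoringCase(string input)
    {
        return Regex.Split(input, "x", RegexOptions.IgnoreCase);
    }

    // Instance overload: count limits how many substrings are returned,
    // and startAt is the position at which the search for delimiters begins.
    public static string[] SplitLimited(string input, int count, int startAt)
    {
        Regex re = new Regex(",");
        return re.Split(input, count, startAt);
    }
}
```

When the limit is reached, the final element holds the unsplit remainder of the input. Options can still be combined with the instance overloads by passing them to the Regex constructor.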
The “.NET Framework Regular Expressions” topic in the MSDN documentation.
Use the StreamReader.ReadLine
method to obtain each line in a file to run a regular expression against, as shown in Example 7-4.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static List<string> GetLines(string source, string pattern,
    bool isFileName)
{
    List<string> matchedLines = new List<string>();

    // If this is a file, read it one line at a time.
    if (isFileName)
    {
        using (FileStream FS = new FileStream(source, FileMode.Open,
               FileAccess.Read, FileShare.Read))
        {
            using (StreamReader SR = new StreamReader(FS))
            {
                Regex RE = new Regex(pattern, RegexOptions.Multiline);
                string text = "";
                while (text != null)
                {
                    text = SR.ReadLine();
                    if (text != null)
                    {
                        // Run the regex on each line in the string.
                        if (RE.IsMatch(text))
                        {
                            // Get the line if a match was found.
                            matchedLines.Add(text);
                        }
                    }
                }
            }
        }
    }
    else
    {
        // Run the regex once on the entire string.
        Regex RE = new Regex(pattern, RegexOptions.Multiline);
        MatchCollection theMatches = RE.Matches(source);

        // Use these vars to remember the last line added to matchedLines
        // so that we do not add duplicate lines.
        int lastLineStartPos = -1;
        int lastLineEndPos = -1;

        // Get the line for each match.
        foreach (Match m in theMatches)
        {
            int lineStartPos = GetBeginningOfLine(source, m.Index);
            int lineEndPos = GetEndOfLine(source, (m.Index + m.Length - 1));

            // If this is not a duplicate line, add it.
            if (lastLineStartPos != lineStartPos &&
                lastLineEndPos != lineEndPos)
            {
                string line = source.Substring(lineStartPos,
                    lineEndPos - lineStartPos);
                matchedLines.Add(line);

                // Reset line positions.
                lastLineStartPos = lineStartPos;
                lastLineEndPos = lineEndPos;
            }
        }
    }

    return (matchedLines);
}

public static int GetBeginningOfLine(string text, int startPointOfMatch)
{
    if (startPointOfMatch > 0)
    {
        --startPointOfMatch;
    }

    if (startPointOfMatch >= 0 && startPointOfMatch < text?.Length)
    {
        // Move to the left until the first '\n' char is found.
        for (int index = startPointOfMatch; index >= 0; index--)
        {
            if (text?[index] == '\n')
            {
                return (index + 1);
            }
        }
        return (0);
    }

    return (startPointOfMatch);
}

public static int GetEndOfLine(string text, int endPointOfMatch)
{
    if (endPointOfMatch >= 0 && endPointOfMatch < text?.Length)
    {
        // Move to the right until the first '\n' char is found.
        for (int index = endPointOfMatch; index < text.Length; index++)
        {
            if (text?[index] == '\n')
            {
                return (index);
            }
        }
        return (text.Length);
    }

    return (endPointOfMatch);
}
The following method shows how to call the GetLines
method with either a filename or a string:
public static void TestGetLine()
{
    // Get each matching line within the file TestFile.txt as a separate string.
    Console.WriteLine();
    List<string> lines = GetLines(@"C:\TestFile.txt", "Line", true);
    foreach (string s in lines)
        Console.WriteLine($"MatchedLine: {s}");

    // Get the lines matching the text "Line" within the given string.
    Console.WriteLine();
    lines = GetLines("Line1\r\nLine2\r\nLine3\nLine4", "Line", false);
    foreach (string s in lines)
        Console.WriteLine($"MatchedLine: {s}");
}
The GetLines
method accepts three parameters:
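source
    The filename or the string in which to search for the pattern.
pattern
    The regular expression pattern to apply to the contents of the file or to the source string.
isFileName
    Pass in true if source is a filename, or false if source is a string.

This method returns a List<string> that contains each line in which the regular expression match was found.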
The GetLines
method can obtain the lines on which matches occur within a string or a file. When a regular expression is run against a file whose name is passed in to the source
parameter (when isFileName
equals true
) in the GetLines
method, the file is opened and read line by line. The regular expression is run against each line, and if a match is found, that line is stored in the matchedLines List<string>
. Using the ReadLine
method of the StreamReader
object saves you from having to determine where each line starts and ends. Determining where a line starts and ends in a string requires some work, as you will see.
Running the regular expression against a string passed in to the source
parameter (when isFileName
equals false
) in the GetLines
method produces a MatchCollection
. Each Match
object in this collection is used to obtain the line on which it is located in the source
string. We obtain the line by starting at the position of the first character of the match in the source string and moving one character at a time to the left until either a \n character or the beginning of the source string is found (this code is found in the GetBeginningOfLine method). This gives you the beginning of the line, which is placed in the variable lineStartPos. Next, we find the end of the line by starting at the last character of the match in the source string and moving to the right until either a \n character or the end of the source string is found (this code is found in the GetEndOfLine method). This ending position is placed in the lineEndPos variable. All of the text between lineStartPos and lineEndPos is the line in which the match is found. Each of these lines is added to the matchedLines List<string>
and returned to the caller.
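For an in-memory string, the line boundaries can also be left to a StringReader, whose ReadLine method behaves like the StreamReader version used for files. The sketch below is an alternative under stated assumptions, not the recipe's approach: StringLinesSketch and GetMatchingLines are hypothetical names, and because the pattern is applied to one line at a time, a pattern that is meant to span a line break will not match.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class StringLinesSketch
{
    // Returns every line of text on which the pattern matches.
    public static List<string> GetMatchingLines(string text, string pattern)
    {
        List<string> matched = new List<string>();
        Regex re = new Regex(pattern);

        using (StringReader reader = new StringReader(text))
        {
            string line;
            // ReadLine strips the line terminator and returns null
            // once the end of the text is reached.
            while ((line = reader.ReadLine()) != null)
            {
                if (re.IsMatch(line))
                {
                    matched.Add(line);
                }
            }
        }
        return matched;
    }
}
```

Since IsMatch runs once per line, each matching line is added only once, so no duplicate tracking is needed.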
Something interesting you can do with the GetLines
method is to pass in the string "\n"
in the pattern parameter of this method. This trick will effectively return each line of the string or file as a string in the List<string>
. While this will work with strings that already have the CRLF characters embedded in them, it will not work on text returned from a file. The reason is that the ReadLine
method in the preceding GetLines
method will strip off the CRLF characters. To fix this we can simply add these characters back in, as we are performing the match in the GetLines
method:
// It is necessary to add CRLF chars
// since ReadLine() strips off these chars.
if (RE.IsMatch(text + Environment.NewLine))
Finally, note that if more than one match is found on a line, that line is added to the List<string> only once.
Take care when adding line break characters back into the text. If you are using and processing this text exclusively on Windows systems, you won’t have any issues. However, if you are using other systems, or a mix of systems, you need to make sure you add the correct line break characters; for UNIX and OS X, use only the linefeed character (\n).
The “.NET Framework Regular Expressions,” “FileStream Class,” and “StreamReader Class” topics in the MSDN documentation.
To find a particular occurrence of a match in a string, simply index into the MatchCollection returned from Regex.Matches:
public static Match FindOccurrenceOf(string source, string pattern,
    int occurrence)
{
    if (occurrence < 1)
    {
        throw (new ArgumentException("Cannot be less than 1",
            nameof(occurrence)));
    }

    // Make occurrence zero-based.
    --occurrence;

    // Run the regex once on the source string.
    Regex RE = new Regex(pattern, RegexOptions.Multiline);
    MatchCollection theMatches = RE.Matches(source);

    if (occurrence >= theMatches.Count)
    {
        return (null);
    }
    else
    {
        return (theMatches[occurrence]);
    }
}
To find each particular occurrence of a match in a string, build a List<Match>
on the fly:
public static List<Match> FindEachOccurrenceOf(string source, string pattern,
    int occurrence)
{
    if (occurrence < 1)
    {
        throw (new ArgumentException("Cannot be less than 1",
            nameof(occurrence)));
    }

    List<Match> occurrences = new List<Match>();

    // Run the regex once on the source string.
    Regex RE = new Regex(pattern, RegexOptions.Multiline);
    MatchCollection theMatches = RE.Matches(source);

    for (int index = (occurrence - 1); index < theMatches.Count;
         index += occurrence)
    {
        occurrences.Add(theMatches[index]);
    }

    return (occurrences);
}
The following method shows how to invoke the two previous methods:
public static void TestOccurrencesOf()
{
    Match matchResult = FindOccurrenceOf
        ("one two three one two three one two three one"
         + " two three one two three one two three", "two", 2);
    Console.WriteLine($"{matchResult?.ToString()} {matchResult?.Index}");

    Console.WriteLine();
    List<Match> results = FindEachOccurrenceOf
        ("one one two three one two three one "
         + " two three one two three", "one", 2);
    foreach (Match m in results)
        Console.WriteLine($"{m.ToString()} {m.Index}");
}
This recipe contains two similar but distinct methods. The first method, FindOccurrenceOf
, returns a particular occurrence of a regular expression match. The occurrence you want to find is passed in to this method via the occurrence
parameter. If the particular occurrence of the match does not exist—for example, you ask to find the second occurrence, but only one occurrence exists—a null
is returned from this method. Because of this, you should check that the returned object of this method is not null
before using that object. If the particular occurrence exists, the Match
object that holds the match information for that occurrence is returned.
The second method in this recipe, FindEachOccurrenceOf
, works similarly to the FindOccurrenceOf
method, except that it continues to find a particular occurrence of a regular expression match until the end of the string is reached. For example, if you ask to find the second occurrence, this method would return a List<Match>
of zero or more Match
objects. The Match
objects would correspond to the second, fourth, sixth, and eighth occurrences of a match and so on until the end of the string is reached.
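The every-nth selection can also be written as a LINQ query over the MatchCollection. This is a sketch under assumptions: NthMatchSketch and EveryNth are hypothetical names, and the LINQ form is a swapped-in alternative to the for loop in FindEachOccurrenceOf, not the recipe's own code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class NthMatchSketch
{
    // Returns the nth, 2nth, 3nth, ... matches of pattern in source.
    public static List<Match> EveryNth(string source, string pattern, int n)
    {
        if (n < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(n),
                "Cannot be less than 1");
        }

        return Regex.Matches(source, pattern)
            .Cast<Match>()
            // index is zero-based, so the nth match has index n - 1.
            .Where((m, index) => (index + 1) % n == 0)
            .ToList();
    }
}
```

The Cast<Match> call is there because MatchCollection has historically exposed only the nongeneric IEnumerable interface.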
The “.NET Framework Regular Expressions” and “ArrayList Class” topics in the MSDN documentation.
You need a quick list from which to choose regular expression patterns that match standard items. These standard items could be a Social Security number, a zip code, a word containing only characters, an alphanumeric word, an email address, a URL, dates, or one of many other possible items used throughout business applications.
These patterns can be useful in making sure that a user has input the correct data and that it is well formed. These patterns can also be used as an extra security measure to keep hackers from attempting to break your code by entering strange or malformed input (e.g., SQL injection or cross-site-scripting attacks). Note that these regular expressions are not a silver bullet that will stop all attacks on your system; rather, they are an added layer of defense.
Match only alphanumeric characters along with the characters -, +, ., and any whitespace:
^([\w\.+-]|\s)*$
Be careful using the - (hyphen) character within a character class—that is, a regular expression enclosed within [ and ]. That character is also used to specify a range of characters, as in a-z
for “a through z inclusive.” If you want to use a literal - character, either escape it with \ or put it at the end of the expression, as shown in the next examples.
Match only alphanumeric characters along with the characters -, +, ., and any whitespace, with the stipulation that there is at least one of these characters and no more than 10 of these characters:
^([\w\.+-]|\s){1,10}$
Match a person’s name, up to 55 characters:
^[a-zA-Z'\-\s]{1,55}$
Match a positive or negative integer:
^(\+|-)?\d+$
Match a positive or negative floating-point number only; this pattern does not match integers:
^(\+|-)?(\d*\.\d+)$
Match a floating-point or integer number that can have a positive or negative value:
^(\+|-)?(\d*\.)?\d+$
Match a date in the form ##/##/####, where the day and month can be a one- or two-digit value and the year can only be a four-digit value:
^\d{1,2}/\d{1,2}/\d{4}$
Verify if the input is a Social Security number of the form ###-##-####:
^\d{3}-\d{2}-\d{4}$
Match an IPv4 address:
^([0-2]?[0-9]?[0-9]\.){3}[0-2]?[0-9]?[0-9]$
Verify that an email address is in the form name@address where address is not an IP address:
^[A-Za-z0-9_\-\.]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$
Verify that an email address is in the form name@address where address is an IP address:
^[A-Za-z0-9_\-\.]+@([0-2]?[0-9]?[0-9]\.){3}[0-2]?[0-9]?[0-9]$
Match or verify a URL that uses either the HTTP, HTTPS, or FTP protocol. Note that this regular expression will not match relative URLs:
^(http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*$
Match only a dollar amount with the optional $ and + or - preceding characters (note that any number of decimal places may be added):
^\$?[+-]?[\d,]*(\.\d*)?$
This is similar to the previous regular expression, except that no more than two decimal places are allowed:
^\$?[+-]?[\d,]*\.?\d{0,2}$
Match a credit card number to be entered as four sets of four digits separated with a space, -, or no character at all:
^((\d{4}[- ]?){3}\d{4})$
Match a zip code to be entered as five digits with an optional four-digit extension:
^\d{5}(-\d{4})?$
Match a North American phone number with an optional area code and an optional - character to be used in the phone number and no extension:
^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}$
Match a phone number similar to the previous regular expression but allow an optional five-digit extension prefixed with either ext
or extension
:
^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}(\s*ext(ension)?[0-9]{5})?$
Match a full path beginning with the drive letter and optionally match a filename with a three-character extension (note that no .. characters signifying to move up the directory hierarchy are allowed, nor is a directory name with a . followed by an extension):
^[a-zA-Z]:[\\/]([_a-zA-Z0-9]+[\\/]?)*([_a-zA-Z0-9]+\.[_a-zA-Z0-9]{0,3})?$
Verify if the input password string matches some specific rules for entering a password (i.e., the password is between 6 and 25 characters in length and contains at least one digit, one lowercase letter, and one uppercase letter):
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{6,25}$
Determine if any malicious characters were input by the user. Note that this regular expression will not prevent all malicious input, and it also prevents some valid input, such as last names that contain a single quote:
^([^)(<>"'\%&+;][(-{2})])*$
Extract a tag from an XHTML, HTML, or XML string. This regular expression will return the beginning tag and ending tag, including any attributes of the tag.
Note that you will need to replace TAGNAME
with the real tag name you want to search for:
<TAGNAME.*?>(.*?)</TAGNAME>
Extract a comment line from code. The following regular expression extracts HTML comments from a web page. This can be useful in determining if any HTML comments that are leaking sensitive information need to be removed from your code base before it goes into production:
<!--.*?-->
Match a C# single-line comment:
//.*$
Match a C# multiline comment:
/\*.*?\*/
While the four aforementioned regular expressions are great for finding tags and comments, they are not foolproof. To accurately find all tags and comments, you need to use a full parser for the language you are targeting.
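As a sketch of how a few of the patterns in this list might be packaged for input validation, consider the following. The class and method names are hypothetical; the patterns are the zip code, Social Security number, and integer expressions listed above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class InputValidationSketch
{
    private static readonly Regex ZipCode = new Regex(@"^\d{5}(-\d{4})?$");
    private static readonly Regex Ssn = new Regex(@"^\d{3}-\d{2}-\d{4}$");
    private static readonly Regex SignedInteger = new Regex(@"^(\+|-)?\d+$");

    // Each method treats null as invalid instead of letting IsMatch throw.
    public static bool IsZipCode(string input)
    {
        return input != null && ZipCode.IsMatch(input);
    }

    public static bool IsSsn(string input)
    {
        return input != null && Ssn.IsMatch(input);
    }

    public static bool IsSignedInteger(string input)
    {
        return input != null && SignedInteger.IsMatch(input);
    }
}
```

One detail to keep in mind is that in .NET the $ anchor also matches just before a final newline, so input that ends in \n can pass these checks. Replacing $ with \z is one way to require the true end of the string.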
Regular expressions are effective at finding specific information, and they have a wide range of uses. Many applications use them to locate specific information within a larger range of text, as well as to filter out bad input. The filtering action is very useful in tightening the security of an application and preventing an attacker from attempting to use carefully formed input to gain access to a machine on the Internet or a local network. By using a regular expression to allow only good input to be passed to the application, you can reduce the likelihood of many types of attacks, such as SQL injection or cross-site scripting.
The regular expressions presented in this recipe provide only a small cross-section of what you can accomplish with them. You can easily modify these expressions to suit your needs. Take, for example, the following expression, which allows only between 1 and 10 alphanumeric characters, along with a few symbols, as input:
^([\w\.+-]|\s){1,10}$
By changing the {1,10}
part of the regular expression to {0,200}
, you can make this expression match a blank entry or an entry of the specified symbols up to and including 200 characters.
Note the use of the ^
character at the beginning of the expression and the $
character at the end of the expression. These characters start the match at the beginning of the text and match all the way to the end of the text. Adding these characters forces the regular expression to match the entire string or none of it. By removing these characters, you can search for specific text within a larger block of text. For example, the following regular expression matches only a string containing nothing but a US zip code (there can be no leading or trailing spaces):
^\d{5}(-\d{4})?$
This version matches a zip code with optional leading or trailing whitespace (notice the addition of the \s*
to the beginning and ending of the expression):
^\s*\d{5}(-\d{4})?\s*$
However, this modified expression matches a zip code found anywhere within a string (including a string containing just a zip code):
\d{5}(-\d{4})?
Use the regular expressions in this recipe and modify them to suit your needs.
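The effect of the ^ and $ anchors discussed above can be seen by running the anchored and unanchored zip code patterns side by side. This is a minimal sketch; ZipSearchSketch, IsExactlyZip, and FindZips are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class ZipSearchSketch
{
    // Anchored: the whole input must be a zip code.
    public static bool IsExactlyZip(string input)
    {
        return Regex.IsMatch(input, @"^\d{5}(-\d{4})?$");
    }

    // Unanchored: collects every zip code found anywhere in the text.
    public static List<string> FindZips(string text)
    {
        List<string> zips = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\d{5}(-\d{4})?"))
        {
            zips.Add(m.Value);
        }
        return zips;
    }
}
```

Note that the unanchored form will also pick five digits out of a longer run of digits; wrapping the pattern in \b word boundaries is one way to tighten it if that matters for your input.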
Introducing Regular Expressions by Michael Fitzgerald and Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan (both O’Reilly).