Chapter 27: Introducing Perl Regular Expressions

27.1  Introduction

27.2  Describing the Syntax of Regular Expressions

27.3  Testing That Social Security Numbers Are in Standard Form

27.4  Checking for Valid ZIP Codes

27.5  Verifying That Phone Numbers Are in a Standard Form

27.6  Describing the PRXPARSE Function

27.7  Problems

 

27.1 Introduction

Perl regular expressions take their name from Perl, a programming language originally developed for UNIX systems in which pattern matching plays a central role. They were added to SAS starting with SAS 9. Regular expressions are used to describe text patterns. For example, you can write an expression that matches any Social Security number (three digits, a dash, two digits, a dash, and four digits). Therefore, you can use a regular expression to test whether a specific pattern is present.

27.2 Describing the Syntax of Regular Expressions

There are entire books devoted to regular expressions. The goal of this chapter is to describe some basic aspects of regular expressions and provide some examples of how they can be used.

A regular expression (called a regex by programmers) starts with a delimiter (most often a forward slash), followed by a combination of characters and metacharacters. Metacharacters describe character classes such as all digits or all punctuation marks. The expression ends with the same delimiter that you started with. For example, a regular expression for a Social Security number is:

/\d\d\d-\d\d-\d\d\d\d/

\d is the metacharacter for any digit. The two dashes in the expression are just that: dashes. Any character that is not defined as a special character, such as the dashes in the preceding expression, is a literal character. Even spaces count as characters in a regular expression.

Because writing \d four times is tedious, this expression can be rewritten as:

/\d\d\d-\d\d-\d{4}/

You can probably guess that the {4} following the \d says to repeat \d four times.

You can create sets of characters using square brackets ( [ and ] ). All the uppercase letters are represented by [A-Z]. All uppercase and lowercase letters are represented by [A-Za-z].

Likewise, the digits 0-9 can be represented by [0-9]. If you are using ASCII, you can also use \d instead of [0-9].

In SAS, all the regular expression functions begin with the letters PRX (which stands for Perl Regular Expression). Because this chapter is an introduction to the PRX functions, only two of these functions will be described. For a more detailed discussion of the SAS PRX functions, please look at SAS Functions by Example, 2nd edition, written by this author and published by SAS Press.
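As a rough check that the three ways of writing a digit described in this section are interchangeable, the short sketch below uses the PRXMATCH function (described in the next section) to test the same values against each form. The two test values and the variable names are made up for this illustration.

```sas
data _null_;
   length Value $ 11;
   /* Hypothetical test values: one valid number and one containing a letter */
   do Value = "123-45-6789", "123-4X-6789";
      /* \d written out once per digit */
      Long_Form  = prxmatch("/\d\d\d-\d\d-\d\d\d\d/", Value);
      /* \d with a repeat count in braces */
      Short_Form = prxmatch("/\d{3}-\d{2}-\d{4}/", Value);
      /* A bracketed character set in place of \d */
      Set_Form   = prxmatch("/[0-9]{3}-[0-9]{2}-[0-9]{4}/", Value);
      put Value= Long_Form= Short_Form= Set_Form=;
   end;
run;
```

All three variables should be 1 for the first value and 0 for the second, which suggests that the choice among these forms is a matter of readability.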

27.3 Testing That Social Security Numbers Are in Standard Form

Before a discussion of metacharacters and other details of regular expressions, let's look at a simple program to verify that a list of Social Security numbers conforms to the standard form. Here is such a program:

Program 27.1: Using a Regex to Test Social Security Values

title "Checking Social Security Numbers Using a Regular Expression";
data _null_;
   file print;
   input SS $11.;
   /* PRXMATCH returns 0 when the pattern is not found */
   if not prxmatch("/\d\d\d-\d\d-\d\d\d\d/",SS) then
      put "Error for SS Number " SS;
datalines;
123-45-6789
123456789
123-ab-9876
999-888-7777
;

You may not be familiar with the DATALINES statement. Instead of writing an external file and using an INFILE statement, you can place your data right after the DATALINES statement. The INPUT statement will read the values as if they were in an external file.

The PRXMATCH function takes two arguments. The first argument is a regular expression, and the second argument is the string that you are examining. If the string contains a pattern described by the regular expression, the function returns the starting position for the pattern. If the pattern described by the regular expression is not found in the string, the function returns a zero. In Program 27.1, the regular expression is describing a pattern of a typical Social Security number.

 

Here is the output:

Figure 27.1: Output from Program 27.1

Each of the numbers listed does not conform to the standard Social Security number format.
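To make the return value of PRXMATCH more concrete, here is a minimal sketch that writes the result for three strings to the SAS log. The strings and variable names are invented for this illustration and are not part of Program 27.1.

```sas
data _null_;
   /* Hypothetical strings, chosen only to illustrate the return value */
   Pos_Start  = prxmatch("/\d\d\d-\d\d-\d\d\d\d/", "123-45-6789");
   Pos_Middle = prxmatch("/\d\d\d-\d\d-\d\d\d\d/", "SSN: 123-45-6789");
   Pos_None   = prxmatch("/\d\d\d-\d\d-\d\d\d\d/", "123456789");
   put Pos_Start= Pos_Middle= Pos_None=;
run;
```

The values should be 1, 6, and 0: the pattern starts at position 1 in the first string, at position 6 in the second, and is absent from the third. Because SAS treats 0 as false and any other nonmissing number as true, the test IF NOT PRXMATCH(...) in Program 27.1 selects the values that contain no match.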

27.4 Checking for Valid ZIP Codes

You can use a program similar to Program 27.1 to verify that an address contains the valid form of a ZIP code, either a five-digit code or a five-digit code followed by a dash, followed by four more digits (ZIP code +4).

A regular expression to check either the five-digit or ZIP+4 code is:

/\d{5}(-\d{4})?/

The first part of this expression is pretty clear: it says to search for five digits. The expression in parentheses matches a dash followed by four digits. The question mark following the closing parenthesis means to search for zero or one occurrence of the previous expression. Therefore, this expression matches either of the two valid US ZIP code formats.

You can run the following program to test this expression:

Program 27.2: Testing the Regular Expression for US ZIP Codes

title "Testing the Regular Expression for US ZIP Codes";
data _null_;
   file print;
   input Zip $10.;
   /* Five digits, optionally followed by a dash and four more digits */
   if not prxmatch("/\d{5}(-\d{4})?/",Zip) then
      put "Invalid ZIP Code " Zip;
datalines;
12345
78010-5049
12Z44
ABCDE
08822
;

This program reads in the ZIP code (allowing for up to 10 characters) and prints out an error message for any code that does not match the regular expression. Here is the output from Program 27.2:

Figure 27.2: Output from Program 27.2

Notice that all valid ZIP codes, either the five-digit codes or the ZIP+4 codes, were validated.
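One point worth keeping in mind is that the expression in Program 27.2 is satisfied whenever a valid ZIP code appears anywhere in the value, so a six-digit value also matches. The sketch below compares that expression with an anchored variant. The anchored expression is an assumption added for illustration, not something taken from the program above: ^ marks the start of the value, $ marks the end, and \s* allows for the trailing blanks that pad a SAS character variable.

```sas
data _null_;
   input Zip $10.;
   /* Unanchored pattern used in Program 27.2 */
   Loose  = prxmatch("/\d{5}(-\d{4})?/", Zip);
   /* Anchored variant (an assumption for this sketch) */
   Strict = prxmatch("/^\d{5}(-\d{4})?\s*$/", Zip);
   put Zip= Loose= Strict=;
datalines;
12345
78010-5049
123456
12345-67
;
```

Loose should be 1 for all four values, while Strict should be 1 only for the first two. Whether the stricter test is appropriate depends on how clean you expect your data to be.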

27.5 Verifying That Phone Numbers Are in a Standard Form

You want to search a list of phone numbers and identify any numbers that do not conform to the form:

(ddd)ddd-dddd

where d is any digit. Writing a regular expression for this is a bit tricky because both the open and closed parentheses have a special meaning in a regular expression. In the previous example, you may have noticed that the parentheses around the dash and final four digits allowed the question mark (repeat the previous expression 0 or 1 times) to be placed outside the final closed parenthesis. In regular expressions, parentheses indicate a grouping of characters or metacharacters. How do you indicate that you are searching for either the open or closed parenthesis in a standard phone number? The answer is that you place a backward slash (\) before either of these special characters to signify that you mean to treat them as characters and not grouping symbols. The backslash character is sometimes referred to as an escape character. Therefore, the regular expression for a phone number in the form discussed here is:

/\(\d\d\d\)\d\d\d-\d\d\d\d/

Notice the backslash before the open and closed parentheses. Writing the validation program for standard phone numbers is nearly identical to writing Program 27.1 or Program 27.2; you simply substitute the expression for a phone number for the one for Social Security numbers or ZIP codes. Here it is:

Program 27.3: Using a Regex to Check for Phone Numbers in Standard Form

title "Checking that Phone Numbers are in Standard Form";
data _null_;
   file print;
   input Phone $13.;
   /* The backslashes make the parentheses literal characters */
   if not prxmatch("/\(\d\d\d\)\d\d\d-\d\d\d\d/",Phone) then
      put "Invalid Phone Number " Phone;
datalines;
(908)432-1234
800.343.1234
8882324444
(888)456-1324
;

 

Output from this program is shown next:

Figure 27.3: Output from Program 27.3
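To see why the backslashes matter, the following sketch runs the phone number expression twice, once with the parentheses escaped and once without. The second test value and the variable names are hypothetical and serve only to show the difference.

```sas
data _null_;
   length Phone $ 13;
   /* Hypothetical values: with and without literal parentheses */
   do Phone = "(908)432-1234", "908432-1234";
      /* Parentheses escaped: they must appear in the value */
      Escaped   = prxmatch("/\(\d\d\d\)\d\d\d-\d\d\d\d/", Phone);
      /* Parentheses not escaped: they only group the first three digits */
      Unescaped = prxmatch("/(\d\d\d)\d\d\d-\d\d\d\d/", Phone);
      put Phone= Escaped= Unescaped=;
   end;
run;
```

For the first value, Escaped should be 1 and Unescaped should be 0; for the second value, the results are reversed. Without the backslashes, the expression looks for six digits in a row, so it does not match a number written with parentheses.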

It is just about impossible for this author to leave this section without showing you how to convert any phone number that includes an area code into a standard number. In Chapter 12, which focuses on character functions, you saw how the COMPRESS function can be used with a k (keep) modifier to keep selected characters and to remove everything else. You can use the k modifier to your advantage in converting nonstandard phone numbers. Program 27.4 reads the data from Program 27.3 and converts all of the phone numbers to the standard form:

Program 27.4: Converting Phone Numbers to Standard Form

data Standard;
   length Standard $ 13;
   input Phone $13.;
   /* Keep only the digits */
   Digits = compress(Phone,,'kd');
   /* Rebuild the number as (ddd)ddd-dddd */
   Standard = cats('(',substr(Digits,1,3),')',substr(Digits,4,3),
                   '-',substr(Digits,7,4));
   drop Digits;
datalines;
(908)432-1234
800.343.1234
8882324444
(888)456-1324
;

The COMPRESS function with the kd (keep the digits) modifiers extracts the digits from the phone numbers. To create a standard phone number, you use the CATS function to concatenate all the necessary pieces (including the parentheses and the dash), and you use the SUBSTR function to extract the appropriate digits from the original phone numbers. Here is a listing of the resulting data set:

Figure 27.4: Listing of Data Set Standard
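Program 27.4 assumes that every value contains an area code, that is, ten digits in all. As a hedged variation, the sketch below converts a value only when exactly ten digits remain after the COMPRESS function runs, and otherwise leaves Standard blank. The data set name Standard_Checked, the ten-digit rule, and the second test value are assumptions made for this illustration.

```sas
data Standard_Checked;
   length Standard $ 13;
   input Phone $13.;
   Digits = compress(Phone,,'kd');
   /* Assumption: only values with exactly 10 digits are converted */
   if lengthn(Digits) = 10 then
      Standard = cats('(',substr(Digits,1,3),')',substr(Digits,4,3),
                      '-',substr(Digits,7,4));
   else Standard = ' ';
   drop Digits;
datalines;
(908)432-1234
432-1234
;
```

The first value should be converted to (908)432-1234, and the second, which has only seven digits, should be left blank so that it can be reviewed separately.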

27.6 Describing the PRXPARSE Function

Many of the more advanced PRX functions and call routines require that you first run a function called PRXPARSE to obtain a return code that you use in place of a regular expression in these functions and call routines. As a matter of fact, you can use PRXPARSE to create a return code that you can use in place of the regular expression in the PRXMATCH function, which was used in the previous programs in this chapter. To see how this works, here is Program 27.1, rewritten to use a combination of the PRXPARSE and PRXMATCH functions.

Program 27.5: Demonstrating a Combination of PRXPARSE and PRXMATCH Functions

title "Checking Social Security Numbers Using a Regular Expression";
title2 "Using a Combination of PRXPARSE and PRXMATCH Functions";
data _null_;
   file print;
   input SS $11.;
   /* Compile the expression and save the return code */
   Return_Code = prxparse("/\d\d\d-\d\d-\d\d\d\d/");
   if not prxmatch(Return_Code,SS) then
      put "Error for SS Number " SS;
datalines;
123-45-6789
123456789
123-ab-9876
999-888-7777
;

The PRXPARSE function “compiles” the regular expression and assigns a sequential number (starting at 1 and increasing for each use of PRXPARSE in a DATA step) to a variable (called Return_Code in this example). You can use this variable in place of the actual regular expression as the first argument of PRXMATCH.

Although the PRXMATCH function allows you to enter the regular expression as the first argument, many of the other PRX functions and call routines require you to use PRXPARSE first. The output from Program 27.5 is identical to that in Figure 27.1.

It is a good practice to execute the PRXPARSE function only once and retain the return code, although this is not necessary if the argument to the PRXPARSE function is a constant, as it is here. You can rewrite Program 27.5 like this:

 

Program 27.6: Rewriting Program 27.5 to Demonstrate a Program Written by a Compulsive Programmer

title "Checking Social Security Numbers Using a Regular Expression";
title2 "Using a Combination of PRXPARSE and PRXMATCH Functions";
data _null_;
   file print;
   input SS $11.;
   retain Return_Code;
   /* Compile the expression on the first iteration only */
   if _n_ = 1 then
      Return_Code = prxparse("/\d\d\d-\d\d-\d\d\d\d/");
   if not prxmatch(Return_Code,SS) then
      put "Error for SS Number " SS;
datalines;
123-45-6789
123456789
123-ab-9876
999-888-7777
;

Because you execute the PRXPARSE function only once, you need a RETAIN statement so that the variable Return_Code does not get set to Missing at the next iteration of the DATA step.
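A related precaution, sketched below, is to check the return code before using it. PRXPARSE returns a missing value when the regular expression cannot be compiled, so testing for a missing value on the first iteration lets you stop the DATA step with a message of your own. The wording of the message and the shortened data are assumptions for this illustration.

```sas
data _null_;
   file print;
   input SS $11.;
   retain Return_Code;
   if _n_ = 1 then do;
      Return_Code = prxparse("/\d\d\d-\d\d-\d\d\d\d/");
      /* A missing value means the expression could not be compiled */
      if missing(Return_Code) then do;
         put "The regular expression could not be compiled";
         stop;
      end;
   end;
   if not prxmatch(Return_Code,SS) then
      put "Error for SS Number " SS;
datalines;
123-45-6789
123456789
;
```

With the expression shown here, the check never fires and the step should report only the second value. The check is more useful when the expression is built from a variable and might be malformed.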

Hopefully, you will see that regular expressions can be incredibly useful when you need to verify a pattern rather than an exact match of characters. You can use Google to search for regular expressions for almost any term or phrase. Be careful, because some of the results may be incorrect or overly complicated. One site recommended by this author is www.stackoverflow.com.

27.7  Problems

Solutions to odd-numbered problems are located at the back of this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion website at support.sas.com/cody for information about how to obtain the solutions to all problems.

1.       You have a list of license plate numbers and want to extract any number that does not have the form of three uppercase letters followed by three digits.  Use this list of numbers to test your program:

Note: SASMAN is my Texas license plate and SASJEDI belongs to my friend Mark Jordan.

ABC123

SASMAN

SASJEDI

345XYZ

low987

WWW999

 

2.       Use the telephone numbers in Program 27.3 but assume you want the numbers to be in the form ddd.ddd.dddd where d is any digit.  Be aware that a period is a special character in a regular expression.  It is a wildcard that stands for any character.  Therefore, you must precede the period with a backslash when you write your expression.

3.       Repeat Problem 2 using a combination of PRXPARSE and PRXMATCH.
