Performing Pattern Matching with Perl Regular Expressions

A Brief Overview

Perl regular expressions enable you to perform pattern matching by using functions. A regular expression is a sequence of characters that defines a search pattern.
For example, suppose you have the Certadv.NaNumbr data set, which contains phone numbers for the United States, Canada, and Mexico.
Figure 14.3 Certadv.NaNumbr Data Set (partial output)
By using a regular expression, you can find valid values for Phone. The advantage of regular expressions is that a single Perl regular expression function can often accomplish what would otherwise require a combination of traditional SAS functions.
In SAS, you use Perl regular expressions within the functions and CALL routines that start with PRX. The PRX functions use a modified version of the Perl language (Perl 5.6.1) to perform regular expression compilation and matching.

Using Metacharacters

The Perl regular expressions within the PRX functions and call routines are based on using metacharacters. A metacharacter is a character that has a special meaning during pattern processing. You can use metacharacters in regular expressions to define the search criteria and any text manipulations. The following table lists the metacharacters that you can use to match patterns in Perl regular expressions.
Table 14.3 Basic Perl Metacharacters and Their Descriptions
Metacharacter
   Description
   Example
/.../
   Provides the starting and ending delimiter.
   s/ ([a-z]) / X / substitutes X in place of a space followed by a lowercase letter and then a space.
(...)
   Enables grouping.
   f(u|oo)bar matches "fubar" or "foobar".
|
   Denotes the OR situation.
   In f(u|oo)bar, the | separates the alternatives u and oo.
\d
   Matches a digit (0–9).
   \d\d\d\d matches any four-digit string such as "1234" or "6387".
\D
   Matches a non-digit such as a letter or special character.
   \D\D\D\D matches any string of four non-digits such as "WxYz" or "AVG%".
\s
   Matches a whitespace character such as a space, tab, or newline.
   x\sx matches two letters x separated by a space ("x x") or by a tab.
\w
   Matches a word character (a-z, A-Z, 0-9, or an underscore).
   \w\w\w matches any three word characters such as "ab_" or "x29".
.
   Matches any character.
   mi.e matches "mike" and "mice".
[...]
   Matches a character in brackets.
   [dmn]ice matches "dice" or "mice" or "nice".
   \d[6789]\d matches "162" or "574" or "685" or "999".
[^...]
   Matches a character not in brackets.
   d[^a]me matches "dime" or "dome" but not "dame".
^
   Matches the beginning of the string.
   ^ter matches "terminal" but not "winter".
$
   Matches the end of the string.
   ter$ matches "winter" but not "winner" or "terminal".
\b
   Matches a word boundary (a position between a word character and a non-word character, or the start or end of the string).
   bar\b matches "bar food" but not "barfood" or "barter".
\B
   Matches a non-word boundary.
   bar\B matches "barfood" but not "bar food".
*
   Matches the preceding character 0 or more times.
     • zo* matches "z" and "zoo"
     • * is equivalent to {0,}
+
   Matches the preceding character 1 or more times.
     • zo+ matches "zo" and "zoo"
     • zo+ does not match "z"
     • + is equivalent to {1,}
?
   Matches the preceding character 0 or 1 times.
     • do(es)? matches the "do" in "do" or "does"
     • ? is equivalent to {0,1}
{n}
   Matches the preceding character exactly n times.
   fo{2}bar matches "foobar" but not "fobar" or "fooobar".
\
   Overrides the next metacharacter, such as a ( or ?, so that it is treated as a literal character.
   final\. matches "final." because "final" is followed by the literal character '.'.
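As an informal check of a few of these entries, the following sketch passes literal strings to the PRXMATCH function, which is described later in this section. The strings and variable names are made up for illustration and do not come from the Certadv library. String literals are used because a character variable is padded with trailing blanks, which would keep ter$ from matching.

```sas
data _null_;
   /* \d\d\d\d needs four digits in a row; they start at position 7 */
   Digits    = prxmatch('/\d\d\d\d/', 'Order 6387');
   /* [dmn]ice skips "rice" and finds "mice" at position 10 */
   CharClass = prxmatch('/[dmn]ice/', 'rice and mice');
   /* ter$ is anchored to the end of the string */
   EndYes    = prxmatch('/ter$/', 'winter');     /* 4 */
   EndNo     = prxmatch('/ter$/', 'terminal');   /* 0 */
   /* \. matches a literal period instead of any character */
   Period    = prxmatch('/final\./', 'final.');  /* 1 */
   put Digits= CharClass= EndYes= EndNo= Period=;
run;
```

A result of 0 means that the pattern was not found anywhere in the string.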

Example: Using Metacharacters

In this example, a valid United States, Canada, or Mexico phone number contains a three-digit area code, a hyphen (-), a three-digit prefix, another hyphen, and the remaining four digits. In addition, the first digit of the area code and of the prefix cannot be 0 or 1.
A Perl regular expression must start and end with a delimiter. The following example uses parentheses to group each required set of digits. The first two groups each specify a digit 2 through 9 followed by two more digits. The last group specifies four digits. The hyphens between the groups match the literal hyphens that separate the parts of the phone number.
/([2-9]\d\d)-([2-9]\d\d)-(\d{4})/
Output 14.5 Certadv.NaNumbr Data Set (partial output)
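To try the expression without the Certadv library, a small self-contained sketch such as the following might help. The data set name Work.PhoneCheck, the variable Valid, and the three phone numbers are hypothetical values chosen for illustration.

```sas
data work.phonecheck;
   input PhoneNumber $12.;
   /* Loc is the starting position of the match, or 0 if there is none */
   Loc = prxmatch('/([2-9]\d\d)-([2-9]\d\d)-(\d{4})/', PhoneNumber);
   /* Valid is 1 when the pattern was found and 0 otherwise */
   Valid = (Loc > 0);
   datalines;
919-555-0134
019-555-0134
919-155-0134
;
run;

proc print data=work.phonecheck;
run;
```

Only the first value should be flagged as valid, because the second has an area code that starts with 0 and the third has a prefix that starts with 1. Note that the expression is not anchored with ^ or $, so it also accepts a valid number that is embedded in a longer string.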

The PRXMATCH Function

A Brief Overview

The Perl regular expression using metacharacters can be used with the PRX functions. The PRXMATCH function searches for a pattern match and returns the position at which the pattern is found. A value of zero is returned if no match is found. This function has two arguments. The first argument specifies the Perl regular expression that contains your pattern. The second argument is the character constant, column, or expression that you want to search.

PRXMATCH Syntax

Syntax, PRXMATCH function:
PRXMATCH (Perl-regular-expression, source);
Perl-regular-expression
specifies a character value that is a Perl regular expression. The expression can be referenced using a constant, a column, or a pattern identifier number.
source
specifies a character constant, variable, or expression that you want to search.
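The return value can be illustrated with a short sketch that searches two literal strings. The strings and variable names here are hypothetical and are used only to show a nonzero position and a zero result.

```sas
data _null_;
   /* The four digits begin at the 7th character of the source */
   Found    = prxmatch('/\d{4}/', 'Ext: x4821');   /* 7 */
   /* No run of four digits exists, so the result is 0 */
   NotFound = prxmatch('/\d{4}/', 'Ext: none');    /* 0 */
   /* SAS treats 0 as false, so the result can drive an IF statement */
   if Found then put 'Four digits found at position ' Found;
   put Found= NotFound=;
run;
```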

Example: PRXMATCH Function Using a Constant

The PRXMATCH function is commonly used for validating data. The following example uses the PRXMATCH function to validate whether a phone number pattern is present.
If the pattern is present, the function returns the numeric position at which the pattern starts. For this example, the pattern was found in 19 rows.
The example specifies the expression as a hard-coded constant as the first argument of the function. When a constant value is specified, the constant must be in quotation marks (either single or double). When you specify the expression as a constant, the expression is compiled once, and each use of the PRX function reuses the compiled expression. Compiling the expression only once saves time. The compiled version is saved in memory.
data work.matchphn;
   set certadv.nanumbr;
   /* loc holds the starting position of the match, or 0 if there is none */
   loc=prxmatch('/([2-9]\d\d)-([2-9]\d\d)-(\d{4})/',PhoneNumber);
run;
proc print data=work.matchphn;
   where loc>0;
run;
Output 14.6 PROC PRINT Result of Work.MatchPhn

Example: PRXMATCH Function Using a Column

Instead of using the first argument to specify a constant for the regular expression, you can refer to a column that contains the expression. This technique is commonly used when you might need to build or modify the expression in an assignment statement.
When the first argument refers to a column instead of a constant, the expression is compiled for each execution of the function. To avoid compiling the expression each time, specify the O option (lowercase or uppercase) after the ending delimiter of the expression. This makes SAS compile the expression only once. This approach is useful for large data sets because it decreases processing time.
data work.phnumbr (drop=Exp);
   set certadv.nanumbr;
   Exp='/([2-9]\d\d)-([2-9]\d\d)-(\d{4})/o';
   Loc=prxmatch(Exp,PhoneNumber);
run;
proc print data=work.phnumbr;
   where loc>0;
run;
Output 14.7 PROC PRINT Result of Work.PhNumbr

The PRXPARSE Function

A Brief Overview

Another method for specifying the Perl regular expression is to specify a pattern identifier number. Before using PRXMATCH, you can use the PRXPARSE function to create the pattern identifier number. This function references the regular expression either as a constant or a column. The function returns a pattern identifier number. This number can then be passed to PRX functions and call routines to reference the regular expression. It is not required to use the pattern identifier number with the PRXMATCH function, but some of the other PRX functions and call routines do require the pattern identifier number.

PRXPARSE Function Syntax

The PRXPARSE function returns a pattern identifier number that is used by other PRX functions and call routines.
Syntax, PRXPARSE function:
pattern-ID-number=PRXPARSE (Perl-regular-expression);
pattern-ID-number
is a numeric pattern identifier that is returned by the PRXPARSE function.
Perl-regular-expression
specifies a character value that is a Perl regular expression. The expression can be referenced using a constant or a column.

Example: PRXPARSE and PRXMATCH Function Using a Pattern ID Number

In this example, the regular expression is being assigned to the column Exp. The PRXPARSE function is referencing this column. Because the expression ends with the O option, the function compiles the value only once. The PRXPARSE function returns a number that is associated with this expression. In this example, the number is a value of 1, and the value is being stored in the Pid column.
PRXMATCH then references this number in the Pid column as its first argument. If the O option had not been used at the end of the Perl regular expression, the value of Pid would differ for each row.
data work.phnumbr (drop=Exp);
   set certadv.nanumbr;
   Exp='/([2-9]\d\d)-([2-9]\d\d)-(\d{4})/o';
   Pid=prxparse(Exp);
   Loc=prxmatch(Pid,PhoneNumber);
run;
proc print data=work.phnumbr;
run;
Output 14.8 PROC PRINT Output of Work.PhNumbr (partial output)
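A commonly seen alternative to the O option is to call PRXPARSE only on the first iteration of the DATA step and retain the pattern identifier number. The following is a minimal sketch of that idiom rather than the method used in the example above. The data set name Work.PhnParsed and the sample phone numbers are hypothetical.

```sas
data work.phnparsed (drop=Pid);
   /* Compile the expression on the first iteration only */
   if _n_ = 1 then Pid = prxparse('/([2-9]\d\d)-([2-9]\d\d)-(\d{4})/');
   /* Keep the pattern identifier number for the remaining rows */
   retain Pid;
   input PhoneNumber $12.;
   Loc = prxmatch(Pid, PhoneNumber);
   datalines;
919-555-0134
119-555-0134
;
run;

proc print data=work.phnparsed;
run;
```

Because Pid is retained, every row reuses the same compiled expression, and the column is dropped from the output because it is needed only during processing.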

The PRXCHANGE Function

A Brief Overview

The PRXCHANGE function performs a substitution for a pattern match. This function has three arguments. The first argument is the Perl regular expression, which can be specified as a constant, a column, or a pattern identifier number that comes from the PRXPARSE function. The second argument is a numeric value that specifies the number of times to search for a match and replace it. If the value is -1, then matches continue to be replaced until the end of the source is reached. The third argument is the character constant, column, or expression that you want to search.

PRXCHANGE Function Syntax

The PRXCHANGE function performs a substitution for a pattern match.
Syntax, PRXCHANGE function:
PRXCHANGE (Perl-regular-expression, times, source)
Perl-regular-expression
specifies a character value that is a Perl regular expression. The expression can be referenced using a constant, a column, or a pattern identifier number.
times
is a numeric constant, variable, or expression that specifies the number of times to search for a match and replace a matching pattern.
source
specifies a character constant, variable, or expression that you want to search.
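The effect of the times argument can be seen in a small sketch that replaces hyphens with plus signs. The source string and the variable names FirstOnly and AllMatches are made up for illustration.

```sas
data _null_;
   length FirstOnly AllMatches $20;
   Source = 'a-b-c-d';
   /* times=1 replaces only the first match */
   FirstOnly  = prxchange('s/-/+/', 1, Source);    /* a+b-c-d */
   /* times=-1 replaces every match up to the end of the source */
   AllMatches = prxchange('s/-/+/', -1, Source);   /* a+b+c+d */
   put FirstOnly= AllMatches=;
run;
```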

Example: Using the PRXCHANGE Function to Standardize Data

The PRXCHANGE function is commonly used to standardize data. For example, the Certadv.SocialAcct data set contains social media preference data for users between the ages of 18 and 50. The goal is to standardize the Certadv.SocialAcct data set by substituting Facebook for Fb and FB as well as Instagram for IG.
Figure 14.4 Certadv.SocialAcct (partial output)
When you are writing the Perl regular expression for substitution, start the expression with a lowercase s. The lowercase s signifies that substitution needs to happen instead of matching.
Following the lowercase s, place a forward slash as the beginning delimiter, and place another forward slash at the end of the expression as the ending delimiter. A third forward slash between them separates the search pattern from the replacement.
Before the middle forward slash, specify the pattern that you are searching for. This example encloses the pattern in parentheses. After the middle forward slash, specify the replacement text.
In this example, you are looking for the letters FB and IG in both the Social_Media_Pref1 and Social_Media_Pref2 variables. If the pattern is found, it is replaced with Facebook or Instagram, respectively. The i modifier after the ending delimiter ignores the case of the pattern that you are searching for, so both Fb and FB match.
data work.prxsocial;
   set certadv.socialacct;
   Social_Media_Pref1=prxchange('s/(FB)/Facebook/i',-1,Social_Media_Pref1);
   Social_Media_Pref1=prxchange('s/(IG)/Instagram/i',-1,Social_Media_Pref1);
   Social_Media_Pref2=prxchange('s/(FB)/Facebook/i',-1,Social_Media_Pref2);
   Social_Media_Pref2=prxchange('s/(IG)/Instagram/i',-1,Social_Media_Pref2);
run;
proc print data=work.prxsocial;
run;
Output 14.9 PROC PRINT Output of Work.PrxSocial (partial output)

Example: Changing the Order Using the PRXCHANGE Function

Suppose you have the Certadv.SurvNames data set with names from the self-reported survey. Every 50th survey taker is given a gift card that is to be mailed to the survey taker's home. The names are stored with the last name first, followed by a comma and the first name, and you are asked to quickly reverse them. You can use the PRXCHANGE function to reverse the order of the names.
data work.revname;
   set certadv.survnames;
   ReverseName=prxchange('s/(\w+), (\w+)/$2 $1/', -1, name);
run;
proc print data=work.revname;
run;
Output 14.10 PROC PRINT Result of Work.RevName
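The same substitution can be tried on in-stream data. In this sketch, the data set name Work.RevDemo and the two names are hypothetical, and the names are assumed to be stored as a last name, a comma, a space, and a first name.

```sas
data work.revdemo;
   length Name ReverseName $40;
   input Name $40.;
   /* $1 is the text before the comma and $2 is the text after it */
   ReverseName = prxchange('s/(\w+), (\w+)/$2 $1/', -1, Name);
   datalines;
Garcia, Maria
Chen, Wei
;
run;

proc print data=work.revdemo;
run;
```

Because \w matches only letters, digits, and underscores, a name that contains a hyphen, an apostrophe, or an embedded space would likely need a broader pattern than the one shown here.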

Example: Capture Buffers for Substitution Using the PRXCHANGE Function

Suppose you have the data set Certadv.Email with email addresses, longitude, and latitude of those who have visited the company website. You are asked to reorder the longitude and latitude values to latitude and longitude.
When specifying a substitution value, you might need to rearrange pieces of the found pattern. This is possible using capture buffers.
In an earlier section, parentheses were used to represent grouping. When you use parentheses for grouping, you are creating capture buffers. Each capture buffer is referenced with a sequential number starting at 1. The first set of parentheses is for capture buffer 1. The second set of parentheses is for capture buffer 2, and so on.
When referencing a capture buffer, use a dollar sign in front of the capture buffer number. In the following example, the replacement places the third buffer first and the first buffer last, which swaps the two values around the @ separator.
data work.latlong;
   set certadv.email;
   LatLong=prxchange('s/(-?\d+\.\d*)(@)(-?\d+\.\d*)/$3$2$1/', -1, LongLat);
run;
proc print data=work.latlong;
run;
Output 14.11 PROC PRINT Output of Work.LatLong (partial output)
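To see the capture buffers at work on a single value, consider the following sketch. The coordinate pair is a made-up value, and the longitude@latitude layout is an assumption based on the pattern in the example.

```sas
data _null_;
   length LatLong $40;
   LongLat = '-78.6382@35.7796';
   /* Buffer 1 is the longitude, buffer 2 is the @, buffer 3 is the latitude */
   LatLong = prxchange('s/(-?\d+\.\d*)(@)(-?\d+\.\d*)/$3$2$1/', -1, LongLat);
   put LatLong=;   /* 35.7796@-78.6382 */
run;
```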
Last updated: October 16, 2019