grep and regular expressions

grep() in R works much like the grep utility found on UNIX-like operating systems. R documents grep() together with several closely related functions and treats them as a group. In this section, only grep(), grepl(), gregexpr(), and gsub() will be examined. What all these functions have in common is that they perform an action based on a string pattern. The descriptions of the functions are as follows:

  • grep(): This returns the indexes of the elements in a vector that match a string pattern. If the value argument is set to TRUE, it returns the matching elements instead of their indexes.
  • grepl(): This returns a logical vector of the same length as the input vector, denoting whether the sought pattern was found or not.
  • gsub(): This replaces every match of the pattern with the replacement argument in the specified vector.
  • gregexpr(): This returns a list of vectors, one per input element, holding the starting positions where the pattern was matched in the text. The length of each match is attached as the match.length attribute.
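As a quick illustration of the four functions, the following sketch applies each of them to the same small vector. The fruits vector and its contents are made up for this illustration and do not come from the text.

```r
# Hypothetical example vector (an assumption, not from the original text)
fruits <- c("apple", "banana", "cherry", "date")

# Indexes of the elements that contain the letter "a"
idx <- grep("a", fruits)

# The matching elements themselves, via value = TRUE
vals <- grep("a", fruits, value = TRUE)

# One logical value per input element
found <- grepl("a", fruits)

# Replace every "a" with "A"
replaced <- gsub("a", "A", fruits)

# Starting positions of every "a" in the second element ("banana")
pos <- gregexpr("a", fruits[2])[[1]]

print(idx)
print(vals)
print(found)
print(replaced)
print(as.integer(pos))
```

Note that grep() and grepl() answer the same question in two shapes: the first gives positions (or values), the second a logical vector suitable for subsetting.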

A brief introduction to regular expressions

All these functions work with regular expressions, which are strings that describe string patterns. By default, R uses POSIX extended regular expressions, which do not differ very much from the Perl standard (see Perl Compatible Regular Expressions for more information). However, there are some differences. This is why the logical perl argument can be specified in the functions discussed in this section. It defaults to FALSE.
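One case where the perl argument matters is a construct that only exists in the Perl flavor. The sketch below uses a lookahead, (?=...), which is not covered in this section and is shown only to illustrate the switch. The vector x is a made-up input.

```r
# Hypothetical input vector (an assumption for illustration)
x <- c("foobar", "foobaz")

# (?=bar) is a Perl-style lookahead: "foo" must be followed by "bar".
# It is only understood when perl = TRUE is set.
foo_before_bar <- grepl("foo(?=bar)", x, perl = TRUE)

print(foo_before_bar)
```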

Patterns can be built from the following elements:

  • Sets
  • Non-printable characters
  • Negation
  • Alternation
  • Quantifiers
  • Anchors
  • Expressions
  • Escapes

Sets

Sets are a group of characters enclosed by []. This means that the sought pattern can meet any of the characters contained inside the brackets. Letters can be abbreviated by determining a range with a hyphen. For example, a-e would match a, b, c, d, and e. Analogously, numbers can be abbreviated in the same way:

> gregexpr("[a-z]", "string 01 A")
[[1]]
[1] 1 2 3 4 5 6
attr(,"match.length")
[1] 1 1 1 1 1 1
attr(,"useBytes")
[1] TRUE

The [a-z] pattern is matched at positions 1, 2, 3, 4, 5, and 6 of the string "string 01 A", and the length of each of these matches is 1. (In this case, it could not have been greater because the pattern matches exactly one character. For further details, see the Quantifiers section of this chapter.) The spaces, the digits, and the capital A are not matched because the pattern only matches lowercase letters from a to z.

Within a set, more than one abbreviation can be used. In fact, any character can be included. The logic remains the same, that is, the set matches any single character that it contains. Back to the example, if instead of [a-z], the pattern was [a-z0], the result would be as follows:

> gregexpr("[a-z0]", "string 01 A")
[[1]]
[1] 1 2 3 4 5 6 8
attr(,"match.length")
[1] 1 1 1 1 1 1 1
attr(,"useBytes")
[1] TRUE

The eighth character (0) is matched.

Shortcuts

R provides a series of shortcuts to work with regular expressions. They can be found at http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html. They are mainly common groups of characters and work the same as any other abbreviation.

The following is the same example but with [[:alnum:]] (that is, alphanumeric characters) as pattern:

> gregexpr("[[:alnum:]]", "string 01 A")
[[1]]
[1] 1 2 3 4 5 6 8 9 11
attr(,"match.length")
[1] 1 1 1 1 1 1 1 1 1
attr(,"useBytes")
[1] TRUE

The pattern matches all the characters except the spaces.

Note

Notice that the double brackets are needed because the set must be enclosed in brackets and the abbreviation itself also has a bracket structure. In fact, something like [[:alnum:]_] (a pattern that matches any letter or digit plus the _ sign) is also possible.
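A short sketch of how the shortcuts combine with other set members, using gsub() to make the matched characters visible. The input strings are chosen for illustration only.

```r
# Replace every digit with "#"
masked <- gsub("[[:digit:]]", "#", "string 01 A")

# A shortcut plus an extra character inside the same set:
# letters, digits, and the underscore are replaced; the space is left alone
marked <- gsub("[[:alnum:]_]", "x", "a_b c")

print(masked)
print(marked)
```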

Dot

The dot matches any single character. With perl = TRUE, it does not match a newline (newline is a non-printable character; for further information, see the next item in this section). Unlike the other expressions seen previously, the dot character does not need to be enclosed in []:

> gregexpr(".", "string 01 A")
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11
attr(,"match.length")
[1] 1 1 1 1 1 1 1 1 1 1 1
attr(,"useBytes")
[1] TRUE

Dot matches all the characters in the string.

Non-printable characters

Non-printable characters are standard special characters that indicate something about the text but are not actually printed. Among them, the most important ones are tab, newline, and carriage return. To match them in R, the escape is written with a double backslash inside the pattern string (for example, "\\n"), so that the regular expression engine receives \n:

> gregexpr("\\n", "string 01
+          A")
[[1]]
[1] 10
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE
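The same double-backslash form applies to tab (\\t) and carriage return (\\r). A minimal sketch, assuming a made-up string x that contains one tab and one newline:

```r
# Hypothetical text with a tab and a newline (an assumption for illustration)
x <- "a\tb\nc"

# Is there a tab? Is there a carriage return?
has_tab <- grepl("\\t", x)
has_cr  <- grepl("\\r", x)

# Replace the tab and then the newline with commas
flat <- gsub("\\n", ",", gsub("\\t", ",", x))

print(has_tab)
print(has_cr)
print(flat)
```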

Negation

Regular expressions also permit negation. This is denoted by a ^ placed as the first character within a set (that is, just inside the opening bracket). A negated set matches any character except the ones listed:

> gregexpr("[^a-z]", "string 01 A")
[[1]]
[1] 7 8 9 10 11
attr(,"match.length")
[1] 1 1 1 1 1
attr(,"useBytes")
[1] TRUE
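A negated set is handy together with gsub() for stripping unwanted characters. The sketch below keeps only the digits of a phone-like string; the input is an illustrative value.

```r
# Remove every character that is not a digit
digits_only <- gsub("[^0-9]", "", "+1 453-2341")

print(digits_only)
```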

Alternation

The character for alternation is |. Its use is similar to the use of or in any language:

> gregexpr("r|n", "string 01 A")
[[1]]
[1] 3 5
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE

The pattern r|n matches positions 3 and 5.
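For single characters, alternation gives the same positions as a set, while alternation also works for longer alternatives. A small sketch, where the second vector is made up for illustration:

```r
s <- "string 01 A"

# Single-character alternatives: r|n and [rn] find the same positions
alt_pos <- as.integer(gregexpr("r|n", s)[[1]])
set_pos <- as.integer(gregexpr("[rn]", s)[[1]])

# Multi-character alternatives cannot be written as a set
# (hypothetical input vector)
pets <- grep("cat|dog", c("catalog", "hotdog", "bird"), value = TRUE)

print(alt_pos)
print(set_pos)
print(pets)
```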

Quantifiers

Quantifiers express the number of times that the preceding subpattern should be repeated. Their syntax is {min,max}, where min and max denote the minimum and maximum number of times the subpattern can be repeated, respectively:

> gregexpr("[a-z]{2,4}", "string 01 A")
[[1]]
[1] 1 5
attr(,"match.length")
[1] 4 2
attr(,"useBytes")
[1] TRUE

There are two important things to consider in this example. Firstly, the engine will always try to match the longest string possible. So, if the quantifier is set to {2,4}, the engine will first try to match a 4-character string. Secondly, the matches do not overlap. This is the reason why there is no match of length 4 at position 2.

Special quantifiers

There are special characters that denote quantity, such as *, ?, and +. The asterisk denotes that the preceding pattern can match 0 to infinite times, the question mark implies optionality, which is equivalent to saying that the preceding pattern can be matched 0 or 1 times, and finally, the plus means that the preceding pattern can match 1 to infinite times.
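The difference between the three special quantifiers can be seen by applying them to the same letter. In this sketch, the words vector is a made-up input and the quantifier always applies to the u that precedes it.

```r
# Hypothetical input vector (an assumption for illustration)
words <- c("color", "colour", "colouur", "colr")

# ? : zero or one "u"
optional_u <- grepl("colou?r", words)

# * : zero or more "u"
any_u <- grepl("colou*r", words)

# + : one or more "u"
some_u <- grepl("colou+r", words)

print(optional_u)
print(any_u)
print(some_u)
```

The last element never matches because it lacks the second o, which none of the quantifiers makes optional.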

Anchors

Anchors specify the starting and ending point of the matches. They are expressed with the ^ and $ signs respectively. For example, the tri pattern is matched with string and triangle but ^tri is matched only with triangle:

> grep("tri", c("string","triangle"), value=T)
[1] "string" "triangle"
> grep("^tri", c("string","triangle"), value=T)
[1] "triangle"

The combination of both anchors results in an exact match of the pattern, as it specifies where the match starts and where it ends. First, consider the starting anchor alone:

> grep("^triangle", c("string","triangle","triangles"), value=T)
[1] "triangle" "triangles"

The ^triangle pattern matches triangle and triangles because it only specifies the beginning of the string. However, the ^triangle$ pattern only matches the triangle string:

> grep("^triangle$", c("string","triangle","triangles"), value=T)
[1] "triangle"
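The ending anchor can also be used on its own. A minimal sketch, with an input vector extended by one made-up word (single):

```r
# Elements that end in "gle", whatever they start with
ends_gle <- grep("gle$", c("triangle", "triangles", "single"), value = TRUE)

print(ends_gle)
```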

Expressions

Imagine a situation where a pattern must be used that matches any string starting with pri or tri. From what we have seen so far, the first attempt would be something like ^pri|tri. However, this would not give the desired outcome:

> grep("^pri|tri", c("triangle","triangles","price","priority","string"), value=T)
[1] "triangle" "triangles" "price" "priority" "string"

Here, string is matched, although it does not start with tri or pri. This is actually because the regex in the preceding code is expressing a string starting with pri or containing tri. In order to return only strings beginning with pri or tri, an expression must be used, that is, something that denotes that all the blocks must be evaluated together. This is done by enclosing the expression in parentheses:

> grep("^(pri|tri)", c("triangle","triangles","price","priority","string"), value=T)
[1] "triangle" "triangles" "price" "priority"
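Grouping here is a shorter way of writing the anchor once for both alternatives. The sketch below checks that ^(pri|tri) and ^pri|^tri select the same elements of the vector used above.

```r
words <- c("triangle", "triangles", "price", "priority", "string")

# The anchor applies to the whole group
grouped <- grep("^(pri|tri)", words, value = TRUE)

# The anchor is repeated in each alternative
repeated <- grep("^pri|^tri", words, value = TRUE)

print(grouped)
print(identical(grouped, repeated))
```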

Escapes

When the pattern must include one of the special characters (dots, question marks, pluses, and so on) without its special function, the character must be escaped (that is, deprived of its original meaning in regular expressions and treated as a normal character) with a backslash. Inside an R string, the backslash itself has to be doubled, so the escape is written as \\:

> gregexpr("\\?","what is this?")
[[1]]
[1] 13
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE
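The effect of escaping is easiest to see with the dot. The sketch below compares an unescaped dot, an escaped dot, and the fixed = TRUE argument, which tells grepl() to treat the whole pattern as a literal string rather than a regular expression. The files vector is a made-up input.

```r
# Hypothetical input vector (an assumption for illustration)
files <- c("data.csv", "datacsv", "data_csv")

# Unescaped: the dot stands for any single character
any_char <- grepl("data.csv", files)

# Escaped: only a literal dot is accepted
literal_dot <- grepl("data\\.csv", files)

# fixed = TRUE: no character in the pattern is special
fixed_match <- grepl("data.csv", files, fixed = TRUE)

print(any_char)
print(literal_dot)
print(fixed_match)
```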

Examples

After this extended explanation of regular expressions, let's look at some commented examples.

Example 1

This example finds phone numbers within a string vector, with or without a hyphen, and with or without an international code (+1 for the USA, for example). It is assumed that the country code is separated from the phone number by a space, that the prefix is 3 digits long, and that the suffix is 4 digits long:

> numbers <- c("+1 453-2341","5342673","55578274982","74683029873","25","+442 5421611")
> grep("^(\\+[0-9]+\\s)?[0-9]{3}\\-?[0-9]{4}$",numbers,value=T)
[1] "+1 453-2341" "5342673" "+442 5421611"

For this example, the pattern should match an optional starting country code, which consists of the plus sign (escaped because it is sought literally), one or more digits, and a space (\s). As the entire subpattern is optional, it is enclosed in parentheses and followed by ?. After this, the pattern looks for three digits, an optional hyphen, and finally four more digits.

The anchors at the beginning and at the end ensure that only strings with exactly seven digits, optionally preceded by a country code, are matched: the string must begin with the three-digit prefix (or the country code) and end with the four-digit suffix. If the ending anchor is omitted, the result is as follows:

> grep("^(\\+[0-9]+\\s)?[0-9]{3}\\-?[0-9]{4}",numbers,value=T)
[1] "+1 453-2341" "5342673" "55578274982" "74683029873" "+442 5421611"

The third and fourth elements of the output were matched because they do start with three and then four digits (that is, they begin with at least seven digits), and nothing in the pattern forbids further characters after them.

Example 2

Another task that can be accomplished with regular expressions is finding, for example, entire sentences that are questions within a text. In this case, it is assumed that punctuation is respected, that is, a capital letter is used at the beginning of each sentence and there is a space after each full stop:

> example.text <- "This is a text. What is this exactly? A text. Are you sure?"
> greps <- gregexpr("\\s[A-Z]{1}[^\\?\\.]*\\?",example.text)
> regmatches(example.text,greps)
[[1]]
[1] " What is this exactly?" " Are you sure?"

Note

regmatches() is a very useful function that returns the text matched by gregexpr() by passing the original text and the gregexpr() output.

This regular expression would be read as follows: a space, one capital letter, anything but a question mark or a stop repeated zero or more times, and finally a question mark. It is interesting to note what happens if "anything but a question mark or a stop" is replaced by "anything":

> example.text <- "This is a text. What is this exactly? A text. Are you sure?"
> greps <- gregexpr("\\s[A-Z]{1}.*\\?",example.text)
> regmatches(example.text,greps)
[[1]]
[1] " What is this exactly? A text. Are you sure?"

There is only one match, which starts at the beginning of the first question but ends at the end of the original string. As .* matches any character, it also matches the question mark and the full stop, so it keeps matching characters until the end of the string. By contrast, by specifying that . and ? are not desired, the [^\?\.]* subpattern stops at the question mark, which is then matched with \?. There is a very complete and clear explanation of this at http://www.regular-expressions.info/repeat.html.
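A related option, not used in the text above, is a lazy quantifier such as .*?, which matches as few characters as possible. The sketch below is an illustration under the assumption that perl = TRUE is set. It shows that laziness alone does not reproduce the result of the negated set here, because the second match still swallows the sentence "A text." on its way to the next question mark.

```r
example.text <- "This is a text. What is this exactly? A text. Are you sure?"

# Lazy version: .*? stops at the first question mark it can reach
lazy <- regmatches(
  example.text,
  gregexpr("\\s[A-Z].*?\\?", example.text, perl = TRUE)
)[[1]]

# Negated-set version from the text: cannot cross a full stop
negated <- regmatches(
  example.text,
  gregexpr("\\s[A-Z][^?.]*\\?", example.text, perl = TRUE)
)[[1]]

print(lazy)
print(negated)
```

Inside a set, ? and . have no special meaning, which is why this sketch writes [^?.] without backslashes.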
