A regular expression, or regex, is a sequence of characters and special metacharacters used to match a set of character strings. Regular expressions allow you to be more expressive with string-matching operations than just providing a simple substring. You can think of it as a pattern that you want to match with strings of different lengths, made up of different characters.
In the str.contains() method, we supplied the regular expression, wigg|drew. In this case, the vertical bar | is a metacharacter that acts as the OR operator, so this regular expression matches any string that contains the substring wigg or drew.
Metacharacters let you change how you make matches. When you provide a regular expression that contains no metacharacters, it simply matches the exact substring. For instance, Wiggins would only match strings containing the exact substring, Wiggins.
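As a minimal sketch of that difference, the following compares a pattern with no metacharacters against one that uses the OR operator. The `names` series is made-up sample data, not the chapter's dataset:

```python
import pandas as pd

# Hypothetical sample data for illustration only
names = pd.Series(["Andrew Wiggins", "andrew wiggins", "Drew Gooden", "Karl Malone"])

# No metacharacters: matches only the exact, case-sensitive substring
exact = names.str.contains("Wiggins")

# The | metacharacter matches strings containing either wigg or drew
either = names.str.contains("wigg|drew")

print(exact.tolist())
print(either.tolist())
```

Note that matching is case-sensitive by default, so "Drew Gooden" matches neither lowercase alternative.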
Here is a list of basic metacharacters, and what they do:
- ".": The period is a metacharacter that matches any character other than a newline, as illustrated in the following code block:
# Match any single character followed by abb
my_words = pd.Series(["abaa","cabb","Abaa","sabb","dcbb"])
my_words.str.contains(".abb")
- "[ ]": Square brackets specify a set of characters to match. Look at the following example snippet and compare your output with the notebook given with this chapter:
my_words.str.contains("[Aa]abb")
- "^": Outside of square brackets, the caret symbol searches for matches at the beginning of a string, as illustrated in the following code block:
Sentence_series = pd.Series(["Where did he go", "He went to the shop", "he is good"])
Sentence_series.str.contains("^(He|he)")
- "( )": Parentheses in regular expressions are used for grouping and to enforce the proper order of operations, just as they are used in math and logical expressions. In the preceding example, the parentheses group the OR expression so that the "^" symbol applies to the entire OR statement rather than only to its first alternative. The same holds for the "$" symbol, which searches for matches at the end of a string.
- "*": An asterisk matches 0 or more copies of the preceding character.
- "?": A question mark matches 0 or 1 copy of the preceding character.
- "+": A plus sign matches 1 or more copies of the preceding character.
- "{ }": Curly braces match a preceding character for a specified number of repetitions:
  - "{m}": The preceding element is matched m times.
  - "{m,}": The preceding element is matched m times or more.
  - "{m,n}": The preceding element is matched between m and n times.
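The repetition metacharacters and the effect of grouping on "^" can be sketched as follows. This example uses Python's built-in re module, on the assumption that the pattern syntax is the same as in the pandas string methods; the sample strings and the `matching` helper are hypothetical:

```python
import re

# Hypothetical sample strings for illustration only
samples = ["ct", "cat", "caat", "caaat"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ca*t")             # zero or more "a"
question = matching(r"ca?t")         # zero or one "a"
plus = matching(r"ca+t")             # one or more "a"
exactly_two = matching(r"ca{2}t")    # exactly two "a"
two_or_more = matching(r"ca{2,}t")   # two or more "a"
one_to_two = matching(r"ca{1,2}t")   # between one and two "a"

# With parentheses, ^ applies to both alternatives;
# without them, ^ applies only to "He" and "he" may match anywhere
grouped = bool(re.search(r"^(He|he)", "Where did he go"))
ungrouped = bool(re.search(r"^He|he", "Where did he go"))

print(star, question, plus)
print(exactly_two, two_or_more, one_to_two)
print(grouped, ungrouped)
```

Here `ungrouped` is True only because "he" occurs inside "Where", which is why the grouped form is the safer way to anchor an OR expression.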
Regular expressions include several special character sets that allow us to quickly specify certain common character types. They include the following:
- [a-z]: Match any lowercase letter.
- [A-Z]: Match any uppercase letter.
- [0-9]: Match any digit.
- [a-zA-Z0-9]: Match any letter or digit.
- Adding the "^" symbol inside the square brackets matches any characters not in the set:
  - [^a-z]: Match any character that is not a lowercase letter.
  - [^A-Z]: Match any character that is not an uppercase letter.
  - [^0-9]: Match any character that is not a digit.
  - [^a-zA-Z0-9]: Match any character that is not a letter or digit.
- Python regular expressions also include a shorthand for specifying common sequences:
  - \d: Match any digit.
  - \D: Match any non-digit.
  - \w: Match a word character.
  - \W: Match a non-word character.
  - \s: Match whitespace (spaces, tabs, newlines, and so on).
  - \S: Match non-whitespace.
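A short sketch of how the character sets and their shorthand forms behave, again using Python's built-in re module on a made-up sample string:

```python
import re

# Hypothetical sample string for illustration only
sample = "Room 42, Floor 7!"

digits = re.findall(r"\d", sample)             # same result as [0-9] on this ASCII text
uppercase = re.findall(r"[A-Z]", sample)       # uppercase letters only
non_word = re.findall(r"\W", sample)           # not a letter, digit, or underscore
not_alnum = re.findall(r"[^a-zA-Z0-9]", sample)
words = re.findall(r"\w+", sample)             # runs of one or more word characters

print(digits)
print(uppercase)
print(non_word)
print(words)
```

One detail worth knowing: \w also matches the underscore, so it is slightly broader than [a-zA-Z0-9].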
Recall that we used escape sequences when formatting strings. Likewise, you must escape a metacharacter with a backslash, "\", when you want to match the metacharacter symbol itself.
For instance, if you want to match periods, you can't use "." because it is a metacharacter that matches any character. Instead, you'd use "\." to escape the period's metacharacter behavior and match the period itself. This is illustrated in the following code block:
# Match a single period and then a space
# (the r prefix marks a raw string, explained below)
Word_series3 = pd.Series(["Mr. SK","Dr. Deepak","MissMrs Gaire."])
Word_series3.str.contains(r"\. ")
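To see why the escape matters, this sketch compares an unescaped and an escaped period on a few made-up strings, using Python's built-in re module:

```python
import re

# Hypothetical sample strings for illustration only
titles = ["Mr. SK", "Mrs Gaire", "Dr.Deepak"]

# Unescaped: "." matches any character, so the "s " in "Mrs Gaire" also matches
unescaped = [bool(re.search(r". ", t)) for t in titles]

# Escaped: only a literal period followed by a space matches
escaped = [bool(re.search(r"\. ", t)) for t in titles]

print(unescaped)
print(escaped)
```

The unescaped pattern reports a match for any string that has some character before a space, which is rarely what is intended.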
If you want to match the escape character itself, you either have to use four backslashes, "\\\\", or encode the string as a raw string in the form r"mystring" and then use two backslashes, r"\\". Raw strings are an alternate string representation in Python in which backslashes are not treated as escape characters, which simplifies writing regular expressions, as illustrated in the following code snippet:
# Match strings containing a backslash
Word_series3.str.contains(r"\\")
While dealing with special string characters in regular expressions, a raw string is often used because it avoids issues that may arise with those special characters.
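The equivalence between the two spellings can be checked directly. This is a small sketch with hypothetical sample strings, using Python's built-in re module:

```python
import re

# Both spellings produce the same two-character regex: backslash, backslash
normal_pattern = "\\\\"   # four backslashes in a normal string
raw_pattern = r"\\"       # two backslashes in a raw string
same = normal_pattern == raw_pattern

# Hypothetical sample strings; the first contains one literal backslash
paths = ["C:\\temp", "no backslash here"]
has_backslash = [bool(re.search(raw_pattern, p)) for p in paths]

print(same)
print(has_backslash)
```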
Regular expressions are commonly used to match patterns such as phone numbers, email addresses, and web addresses embedded in text. Pandas has several string functions that accept regular expression patterns. We are already familiar with two of them: series.str.contains() and series.str.replace().
Now, let's use some more functions in our dataset of comments.
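Use series.str.count() to count the occurrences of a pattern in each string, as follows:
text.str.count(r"[Ww]olves").head(8)
Use series.str.findall() to return the matched substrings in each string as a list, as follows:
text.str.findall(r"[Ww]olves").head(8)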
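As a rough sketch of how counting and listing matches differ, the following runs both methods on a small made-up series. The `comments` series is a hypothetical stand-in for the comments dataset:

```python
import pandas as pd

# Hypothetical stand-in for the comments dataset used in this chapter
comments = pd.Series([
    "Wolves win! Go wolves",
    "Great game",
    "The Wolves looked tired",
])

# Number of matches per string
counts = comments.str.count(r"[Ww]olves")

# The matched substrings themselves, one list per string
matches = comments.str.findall(r"[Ww]olves")

print(counts.tolist())
print(matches.tolist())
```

Strings with no match get a count of 0 and an empty list, so both results keep the same length and index as the original series.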
There are endless ways in which a string can be manipulated. We chose to illustrate the most basic ways in order to make it simple for you to understand.