19
String Localization and Regular Expressions

LOCALIZATION

When you’re learning how to program in C or C++, it’s useful to think of a character as equivalent to a byte and to treat all characters as members of the ASCII (American Standard Code for Information Interchange) character set. ASCII is a 7-bit set usually stored in an 8-bit char type. In reality, experienced C++ programmers recognize that successful programs are used throughout the world. Even if you don’t initially write your program with international audiences in mind, you shouldn’t prevent yourself from localizing, or making the software locale aware, at a later date.

Localizing String Literals

A critical aspect of localization is that you should never put any native-language string literals in your source code, except maybe for debug strings targeted at the developer. In Microsoft Windows applications, this is accomplished by putting the strings in STRINGTABLE resources. Most other platforms offer similar capabilities. If you need to translate your application to another language, translating those resources should be all you need to do, without requiring any source changes. There are tools available that will help you with this translation process.

To make your source code localizable, you should not compose sentences out of string literals, even if the individual literals can be localized. Here is an example:

cout << "Read " << n << " bytes" << endl;

This statement cannot be localized to Dutch because it requires a reordering of the words. The Dutch translation is as follows:

cout << n << " bytes gelezen" << endl;

To make sure you can properly localize this string, you could implement something like this:

cout << Format(IDS_TRANSFERRED, n) << endl;

IDS_TRANSFERRED is the name of an entry in a string resource table. For the English version, IDS_TRANSFERRED could be defined as “Read $1 bytes”, while the Dutch version of the resource could be defined as “$1 bytes gelezen”. The Format() function loads the string resource, and substitutes $1 with the value of n.
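
The text does not show how Format() is implemented, so the following is only a minimal sketch of the idea. It assumes the string table is an in-memory std::map instead of a real platform resource; the StringId enumeration, the englishTable() and dutchTable() helpers, and the extra table parameter are all hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a platform string table (e.g. STRINGTABLE resources).
enum StringId { IDS_TRANSFERRED };

using StringTable = std::map<StringId, std::string>;

// Assumed English resources.
const StringTable& englishTable()
{
    static const StringTable table{ { IDS_TRANSFERRED, "Read $1 bytes" } };
    return table;
}

// Assumed Dutch resources: the placeholder moves, the source code does not change.
const StringTable& dutchTable()
{
    static const StringTable table{ { IDS_TRANSFERRED, "$1 bytes gelezen" } };
    return table;
}

// Loads the pattern for 'id' from 'table' and replaces every "$1" with 'n'.
std::string Format(const StringTable& table, StringId id, long long n)
{
    std::string result = table.at(id);
    const std::string placeholder = "$1";
    const std::string value = std::to_string(n);
    std::size_t pos = 0;
    while ((pos = result.find(placeholder, pos)) != std::string::npos) {
        result.replace(pos, placeholder.size(), value);
        pos += value.size();  // Skip past the text that was just inserted.
    }
    return result;
}
```

With this sketch, cout << Format(dutchTable(), IDS_TRANSFERRED, n) << endl; produces the Dutch word order without touching the calling code.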

Wide Characters

The problem with viewing a character as a byte is that not all languages, or character sets, can be fully represented in 8 bits, or 1 byte. C++ has a built-in type called wchar_t that holds a wide character. Languages with non-ASCII (U.S.) characters, such as Japanese and Arabic, can be represented in C++ with wchar_t. However, the C++ standard does not define a size for wchar_t. Some compilers use 16 bits while others use 32 bits. To write cross-platform code, it is not safe to assume that wchar_t is of a particular size.

If there is any chance that your program will be used in a non-Western character set context (hint: there is!), you should use wide characters from the beginning. When working with wchar_t, string and character literals are prefixed with the letter L to indicate that a wide-character encoding should be used. For example, to initialize a wchar_t character to the letter m, you write it like this:

wchar_t myWideCharacter = L'm';

There are wide-character versions of most of your favorite types and classes. The wide string class is wstring. The “prefix letter w” pattern applies to streams as well. Wide-character file output streams are handled with wofstream, and input is handled with wifstream. The joy of pronouncing these class names (woof-stream? whiff-stream?) is reason enough to make your programs locale aware! Streams are discussed in detail in Chapter 13.

There are also wide versions of cout, cin, cerr, and clog available, called wcout, wcin, wcerr, and wclog. Using them is no different from using the non-wide versions:

wcout << L"I am a wide-character string literal." << endl;

Non-Western Character Sets

Wide characters are a great step forward because they increase the amount of space available to define a single character. The next step is to figure out how that space is used. In wide character sets, just like in ASCII, characters are represented by numbers, now called code points. The only difference is that each number does not fit in 8 bits. The map of characters to code points is quite a bit larger because it handles many different character sets in addition to the characters that English-speaking programmers are familiar with.

The Universal Character Set (UCS)—defined by the International Standard ISO 10646—and Unicode are both standardized sets of characters. They contain around one hundred thousand abstract characters, each identified by an unambiguous name and a code point. The same characters with the same numbers exist in both standards. Both have specific encodings that you can use. For example, UTF-8 is an example of a Unicode encoding where Unicode characters are encoded using one to four 8-bit bytes. UTF-16 encodes Unicode characters as one or two 16-bit values, and UTF-32 encodes Unicode characters as exactly 32 bits.
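
To make the "one to four 8-bit bytes" rule concrete, here is a hand-written sketch of a UTF-8 encoder for a single code point. The function name encodeUtf8() is invented for this illustration; production code should rely on a tested library such as ICU instead.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8 (one to four bytes).
// Returns an empty string for values that are not valid code points
// (surrogates and anything above U+10FFFF).
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        // 7 bits: a single byte, identical to ASCII.
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 11 bits: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // Surrogates are reserved for UTF-16.
        }
        // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, the letter A (U+0041) needs one byte, the Greek letter pi (U+03C0) needs two, the euro sign (U+20AC) needs three, and characters beyond U+FFFF need four.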

Different applications can use different encodings. Unfortunately, the C++ standard does not specify a size for wide characters (wchar_t). On Windows it is 16 bits, while on other platforms it could be 32 bits. You need to be aware of this when using wide characters for character encoding in cross-platform code. To help solve this issue, there are two other character types: char16_t and char32_t. The following list gives an overview of all available character types.

  • char: Stores at least 8 bits (exactly 8 on virtually all current platforms). This type can be used to store ASCII characters, or as a basic building block for storing UTF-8 encoded Unicode characters, where one Unicode character is encoded as one to four chars.
  • char16_t: Stores at least 16 bits. This type can be used as the basic building block for UTF-16 encoded Unicode characters, where one Unicode character is encoded as one or two char16_ts.
  • char32_t: Stores at least 32 bits. This type can be used for storing UTF-32 encoded Unicode characters as one char32_t.
  • wchar_t: Stores a wide character of a compiler-specific size and encoding.

The benefit of using char16_t and char32_t instead of wchar_t is that the size of char16_t is guaranteed to be at least 16 bits, and the size of char32_t is guaranteed to be at least 32 bits, independent of the compiler. There is no minimum size guaranteed for wchar_t.

The standard also defines the following macros.

  • __STDC_UTF_32__: If defined by the compiler, then the type char32_t uses UTF-32 encoding. If it is not defined, char32_t has a compiler-dependent encoding.
  • __STDC_UTF_16__: If defined by the compiler, then the type char16_t uses UTF-16 encoding. If it is not defined, char16_t has a compiler-dependent encoding.

String literals can have a string prefix to turn them into a specific type. The complete set of supported string prefixes is as follows.

  • u8: A char string literal with UTF-8 encoding.
  • u: A char16_t string literal, which can be UTF-16 if __STDC_UTF_16__ is defined by the compiler.
  • U: A char32_t string literal, which can be UTF-32 if __STDC_UTF_32__ is defined by the compiler.
  • L: A wchar_t string literal with a compiler-dependent encoding.

All of these string literals can be combined with the raw string literal prefix, R, discussed in Chapter 2. Here are some examples:

const char* s1 = u8R"(Raw UTF-8 encoded string literal)";
const wchar_t* s2 = LR"(Raw wide string literal)";
const char16_t* s3 = uR"(Raw char16_t string literal)";
const char32_t* s4 = UR"(Raw char32_t string literal)";

If you are using Unicode encoding, for example, by using u8 UTF-8 string literals, or if your compiler defines __STDC_UTF_16__ or __STDC_UTF_32__, you can insert a specific Unicode code point in your non-raw string literal by using the \uABCD notation. For example, \u03C0 represents the pi character, and \u00B2 represents the ‘squared’ character. The following formula string represents “π r²”:

const char* formula = u8"\u03C0 r\u00B2";

Similarly, character literals can also have a prefix to turn them into specific types. The prefixes u, U, and L are supported, and C++17 adds the u8 prefix for character literals. Here are some examples: u'a', U'a', L'a', and u8'a'.

Besides the std::string class, there is also support for wstring, u16string, and u32string. They are all defined as follows:

  • using string = basic_string<char>;
  • using wstring = basic_string<wchar_t>;
  • using u16string = basic_string<char16_t>;
  • using u32string = basic_string<char32_t>;
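
The difference between these string types shows up in how many elements are needed for one character. The following sketch stores the same code point, U+1F600, in three of them. It assumes a compiler on which u and U literals are UTF-16 and UTF-32 encoded, and it spells out the UTF-8 bytes by hand so that it does not depend on the type of u8 literals; the variable names are only illustrative.

```cpp
#include <string>

// U+1F600 lies outside the 16-bit range, so it needs more than one
// element in every encoding except UTF-32.
const std::string    utf8Text  = "\xF0\x9F\x98\x80";  // Four 8-bit code units.
const std::u16string utf16Text = u"\U0001F600";       // Two 16-bit code units.
const std::u32string utf32Text = U"\U0001F600";       // One 32-bit code unit.
```

Note that size() on each of these strings returns the number of code units (4, 2, and 1, respectively), not the number of characters a user would see.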

Multibyte characters are characters composed of one or more bytes with a compiler-dependent encoding, similar to how Unicode can be represented with one to four bytes using UTF-8, or with one or two 16-bit values using UTF-16. There are conversion functions to convert between char16_t/char32_t and multibyte characters, and vice versa: mbrtoc16, c16rtomb, mbrtoc32, and c32rtomb.

Unfortunately, the support for char16_t and char32_t doesn’t go much further. There are some conversion classes available (see the next section), but, for example, there is nothing like a version of cout or cin that supports char16_t and char32_t; this makes it difficult to print such strings to a console or to read them from user input. If you want to do more with char16_t and char32_t strings, you need to resort to third-party libraries. ICU—International Components for Unicode—is one well-known library that provides Unicode and globalization support for your applications.

Conversions

The C++ standard provides the codecvt class template to help with converting between different encodings. The <locale> header defines the following four encoding conversion classes.

CLASS                               DESCRIPTION
codecvt<char,char,mbstate_t>        Identity conversion, that is, no conversion
codecvt<char16_t,char,mbstate_t>    Conversion between UTF-16 and UTF-8
codecvt<char32_t,char,mbstate_t>    Conversion between UTF-32 and UTF-8
codecvt<wchar_t,char,mbstate_t>     Conversion between wide (implementation-specific) and narrow character encodings

Before C++17, the following three code conversion facets were defined in <codecvt>: codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16. These could be used with two convenience conversion interfaces: wstring_convert and wbuffer_convert. However, C++17 has deprecated those three conversion facets (the entire <codecvt> header) and the two convenience interfaces, so they are not further discussed in this book. The C++ Standards Committee decided to deprecate this functionality because it does not handle errors well. Ill-formed Unicode strings are a security risk, and in fact can be, and have been, used as an attack vector to compromise the security of systems. Also, the API is too obscure, and too hard to understand. I recommend using third-party libraries, such as ICU, to work correctly with Unicode strings until the Standards Committee comes up with a suitable, safe, and easier-to-use replacement for the deprecated functionality.

Locales and Facets

Character sets are only one of the differences in data representation between countries. Even countries that use similar character sets, such as Great Britain and the United States, still differ in how they represent certain data, such as dates and money.

The standard C++ mechanism that groups specific data about a particular set of cultural parameters is called a locale. An individual component of a locale, such as date format, time format, number format, and so on, is called a facet. An example of a locale is U.S. English. An example of a facet is the format used to display a date. There are several built-in facets that are common to all locales. C++ also provides a way to customize or add facets.
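
As a minimal sketch of customizing a facet, the following code derives from the standard std::numpunct facet to control digit grouping, and installs it into a copy of the classic locale. The names GroupedDigits and formatGrouped() are invented for this illustration.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical custom facet: groups digits in threes with a chosen separator.
class GroupedDigits : public std::numpunct<char>
{
public:
    explicit GroupedDigits(char separator) : m_separator(separator) {}

protected:
    char do_thousands_sep() const override { return m_separator; }
    // "\3" means: every group contains three digits.
    std::string do_grouping() const override { return "\3"; }

private:
    char m_separator;
};

// Formats 'value' with a locale that is the classic locale plus the custom facet.
std::string formatGrouped(long value, char separator)
{
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when it is no longer used.
    out.imbue(std::locale(std::locale::classic(), new GroupedDigits(separator)));
    out << value;
    return out.str();
}
```

With this sketch, formatGrouped(32767, ',') yields 32,767 and formatGrouped(32767, '.') yields 32.767, independent of which named locales are installed on the system.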

There are third-party libraries available that make it easier to work with locales. One example is boost.locale, which is able to use ICU as its backend, supporting collations and conversions, converting strings to uppercase (instead of converting character by character to uppercase), and so on.

Using Locales

When using I/O streams, data is formatted according to a particular locale. Locales are objects that can be attached to a stream, and they are defined in the <locale> header file. Locale names are implementation specific. The POSIX standard is to separate a language and an area into two-letter sections with an optional encoding. For example, the locale for the English language as spoken in the U.S. is en_US, while the locale for the English language as spoken in Great Britain is en_GB. The locale for Japanese spoken in Japan with Japanese Industrial Standard encoding is ja_JP.jis.

Locale names on Windows can have two formats. The preferred format is very similar to the POSIX format, but uses a dash instead of an underscore. The second, old format, looks like this:

lang[_country_region[.code_page]]

Everything between the square brackets is optional. The following table lists some examples.

LANGUAGE                 POSIX    WINDOWS    WINDOWS OLD
U.S. English             en_US    en-US      English_United States
Great Britain English    en_GB    en-GB      English_Great Britain

Most operating systems have a mechanism to determine the locale as defined by the user. In C++, you can pass an empty string to the std::locale object constructor to create a locale from the user’s environment. Once this object is created, you can use it to query the locale, possibly making programmatic decisions based on it. The following code demonstrates how to use the user’s locale for a stream by calling the imbue() method on the stream. The result is that everything that is sent to wcout is formatted according to the formatting rules for your environment:

wcout.imbue(locale(""));
wcout << 32767 << endl;

This means that if your system locale is English United States and you output the number 32767, the number is displayed as 32,767; however, if your system locale is Dutch Belgium, the same number is displayed as 32.767.

The default locale is the classic/neutral locale, and not the user’s locale. The classic locale uses ANSI C conventions, and has the name C. The classic C locale is similar to U.S. English, but there are slight differences. For example, numbers are handled without any punctuation:

wcout.imbue(locale("C"));
wcout << 32767 << endl;

The output of this code is as follows:

32767

The following code manually sets the U.S. English locale, so the number 32767 is formatted with U.S. English punctuation, independent of your system locale:

wcout.imbue(locale("en-US")); // "en_US" for POSIX
wcout << 32767 << endl;

The output of this code is as follows:

32,767
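
Because locale names are implementation specific, a name such as en-US or en_US may not be available on every system. The std::locale constructor throws a std::runtime_error when it does not recognize a name. The following sketch, with the hypothetical helper name localeOrClassic(), shows one possible way to fall back to the classic locale in that case.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Returns the named locale, or the classic "C" locale if the name is not
// supported on this system.
std::locale localeOrClassic(const std::string& name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

A call such as wcout.imbue(localeOrClassic("en-US")); then never throws; on a system without that locale, numbers are simply formatted according to the classic locale.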

A locale object allows you to query information about the locale. For example, the following program creates a locale matching the user’s environment. The name() method is used to get a C++ string that describes the locale. Then, the find() method is used on the string object to find a given substring, which returns string::npos when the given substring is not found. The code checks for the Windows name and the POSIX name. One of two messages is output, depending on whether or not the locale appears to be U.S. English:

locale loc("");
if (loc.name().find("en_US") == string::npos &&
    loc.name().find("en-US") == string::npos) {
    wcout << L"Welcome non-U.S. English speaker!" << endl;
} else {
    wcout << L"Welcome U.S. English speaker!" << endl;
}

Character Classification

The <locale> header contains the following character classification functions: std::isspace(), isblank(), iscntrl(), isupper(), islower(), isalpha(), isdigit(), ispunct(), isxdigit(), isalnum(), isprint(), isgraph(). They all accept two parameters: the character to classify, and the locale to use for the classification. Here is an example of isupper() using the user’s environment locale:

bool result = isupper('A', locale(""));

Character Conversion

The <locale> header also defines two character conversion functions: std::toupper() and tolower(). They accept two parameters: the character to convert, and the locale to use for the conversion.
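
As a small sketch, the two-parameter versions of isupper() and toupper() can be applied to each character of a string as follows. The function names countUppercase() and toUpperCopy() are invented for this illustration. Keep in mind that converting character by character is only adequate for simple cases; it does not handle conversions in which one character maps to several.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Counts the characters that 'loc' classifies as uppercase.
std::size_t countUppercase(const std::string& text, const std::locale& loc)
{
    std::size_t count = 0;
    for (char c : text) {
        if (std::isupper(c, loc)) {
            ++count;
        }
    }
    return count;
}

// Returns a copy of 'text' with every character converted according to 'loc'.
std::string toUpperCopy(std::string text, const std::locale& loc)
{
    for (char& c : text) {
        c = std::toupper(c, loc);
    }
    return text;
}
```

For example, with the classic locale, toUpperCopy("Hello, World", locale::classic()) returns HELLO, WORLD.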

Using Facets

You can use the std::use_facet() function to obtain a particular facet in a particular locale. The argument to use_facet() is a locale. For example, the following expression retrieves the standard monetary punctuation facet of the British English locale using the POSIX locale name:

use_facet<moneypunct<wchar_t>>(locale("en_GB"));

Note that the innermost template type determines the character type to use. This is usually wchar_t or char. The use of nested template classes is unfortunate, but once you get past the syntax, the result is an object that contains all the information you want to know about British money punctuation. The data available in the standard facets is defined in the <locale> header and its associated files. The following table lists the facet categories defined by the standard. Consult a Standard Library reference for details about the individual facets.

FACET         DESCRIPTION
ctype         Character classification facets.
codecvt       Conversion facets, see earlier in this chapter.
collate       Comparing strings lexicographically.
time_get      Parsing dates and times.
time_put      Formatting dates and times.
num_get       Parsing numeric values.
num_put       Formatting numeric values.
numpunct      Defines the formatting parameters for numeric values.
money_get     Parsing monetary values.
money_put     Formatting monetary values.
moneypunct    Defines formatting parameters for monetary values.

The following program brings together locales and facets by printing out the currency symbol in both U.S. English and British English. Note that, depending on your environment, the British currency symbol may appear as a question mark, a box, or not at all. If your environment is set up to handle it, you may actually get the British pound symbol:

locale locUSEng("en-US");       // For Windows
//locale locUSEng("en_US");     // For Linux
locale locBritEng("en-GB");     // For Windows
//locale locBritEng("en_GB");   // For Linux

wstring dollars = use_facet<moneypunct<wchar_t>>(locUSEng).curr_symbol();
wstring pounds = use_facet<moneypunct<wchar_t>>(locBritEng).curr_symbol();

wcout << L"In the US, the currency symbol is " << dollars << endl;
wcout << L"In Great Britain, the currency symbol is " << pounds << endl;
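
The same use_facet() technique works with narrow characters and with any locale object, including the classic locale, which is always available. The following sketch wraps two facet queries in small helper functions; the names currencySymbolOf() and decimalPointOf() are invented for this illustration.

```cpp
#include <locale>
#include <string>

// Returns the currency symbol that 'loc' uses for monetary values.
std::string currencySymbolOf(const std::locale& loc)
{
    return std::use_facet<std::moneypunct<char>>(loc).curr_symbol();
}

// Returns the character that 'loc' uses as the decimal point in numbers.
char decimalPointOf(const std::locale& loc)
{
    return std::use_facet<std::numpunct<char>>(loc).decimal_point();
}
```

For the classic locale, the currency symbol is an empty string and the decimal point is a period. For other locales, the results depend on what your system provides.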

REGULAR EXPRESSIONS

Regular expressions, defined in the <regex> header, are a powerful feature of the Standard Library. They are a special mini-language for string processing. They might seem complicated at first, but once you get to know them, they make working with strings easier. Regular expressions can be used for several string-related operations.

  • Validation: Check if an input string is well formed. For example: Is the input string a well-formed phone number?
  • Decision: Check what kind of string an input represents. For example: Is the input string the name of a JPEG or a PNG file?
  • Parsing: Extract information from an input string. For example: From a full filename, extract the filename part without the full path and without its extension.
  • Transformation: Search substrings and replace them with a new formatted substring. For example: Search all occurrences of “C++17” and replace them with “C++”.
  • Iteration: Search all occurrences of a substring. For example: Extract all phone numbers from an input string.
  • Tokenization: Split a string into substrings based on a set of delimiters. For example: Split a string on whitespace, commas, periods, and so on to extract the individual words.

Of course, you could write your own code to perform any of the preceding operations on your strings, but using the regular expressions functionality is highly recommended, because writing correct and safe code to process strings can be tricky.

Before I can go into more detail on regular expressions, there is some important terminology you need to know. The following terms are used throughout the discussion.

  • Pattern: The actual regular expression is a pattern represented by a string.
  • Match: Determines whether there is a match between a given regular expression and all of the characters in a given sequence [first, last).
  • Search: Determines whether there is some substring within a given sequence [first, last) that matches a given regular expression.
  • Replace: Identifies substrings in a given sequence, and replaces them with a corresponding new substring computed from another pattern, called a substitution pattern.

If you look around on the Internet, you will find several different grammars for regular expressions. For this reason, C++ includes support for several of these grammars:

  • ECMAScript: The grammar based on the ECMAScript standard. ECMAScript is a scripting language standardized by ECMA-262. JavaScript, ActionScript, JScript, and so on, all use the ECMAScript language standard at their core.
  • basic: The basic POSIX grammar.
  • extended: The extended POSIX grammar.
  • awk: The grammar used by the POSIX awk utility.
  • grep: The grammar used by the POSIX grep utility.
  • egrep: The grammar used by the POSIX grep utility with the -E parameter.

If you already know any of these regular expression grammars, you can use it straight away in C++ by telling the regular expression library to use that specific syntax (syntax_option_type). The default grammar in C++ is ECMAScript, whose syntax is explained in detail in the following section. It is also the most powerful grammar, so it’s recommended to use ECMAScript instead of one of the other more limited grammars. Explaining the other regular expression grammars falls outside the scope of this book.

ECMAScript Syntax

A regular expression pattern is a sequence of characters representing what you want to match. Any character in the regular expression matches itself except for the following special characters:

^ $ \ . * + ? ( ) [ ] { } |

These special characters are explained throughout the following discussion. If you need to match one of these special characters, you need to escape it using the \ character, as in this example:

\[ or \. or \* or \\

Anchors

The special characters ^ and $ are called anchors. The ^ character matches the position immediately following a line termination character, and $ matches the position of a line termination character. By default, ^ and $ also match the beginning and ending of a string, respectively, but this behavior can be disabled.

For example, ^test$ matches only the string test, and not strings that contain test somewhere in the line, such as 1test, test2, test abc, and so on.
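
As a brief sketch, the effect of the anchors can be observed with std::regex_search(), which looks for a match anywhere in its input. The helper names isExactlyTest() and containsTest() are invented for this illustration.

```cpp
#include <regex>
#include <string>

// With anchors: the whole input must be "test".
bool isExactlyTest(const std::string& s)
{
    static const std::regex pattern("^test$");
    return std::regex_search(s, pattern);
}

// Without anchors: "test" may appear anywhere in the input.
bool containsTest(const std::string& s)
{
    static const std::regex pattern("test");
    return std::regex_search(s, pattern);
}
```

Here, isExactlyTest() accepts only test, while containsTest() also accepts 1test, test2, and test abc.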

Wildcards

The wildcard character . can be used to match any character except a newline character. For example, the regular expression a.c will match abc and a5c, but will not match ab5c, ac, and so on.

Alternation

The | character can be used to specify the “or” relationship. For example, a|b matches a or b.
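
The wildcard and alternation examples can be checked with std::regex_match(), which succeeds only when the entire input matches the pattern. This is just a small sketch; the helper names are invented for this illustration.

```cpp
#include <regex>
#include <string>

// True if 's' is an a, followed by any single character, followed by a c.
bool matchesWildcard(const std::string& s)
{
    static const std::regex pattern("a.c");
    return std::regex_match(s, pattern);
}

// True if 's' is exactly "a" or exactly "b".
bool matchesEither(const std::string& s)
{
    static const std::regex pattern("a|b");
    return std::regex_match(s, pattern);
}
```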

Grouping

Parentheses () are used to mark subexpressions, also called capture groups. Capture groups can be used for several purposes:

  • Capture groups can be used to identify individual subsequences of the original string; each marked subexpression (capture group) is returned in the result. For example, take the following regular expression: (.)(ab|cd)(.). It has three marked subexpressions. Running a regex_search() with this regular expression on 1cd4 results in a match with four entries. The first entry is the entire match, 1cd4, followed by three entries for the three marked subexpressions. These three entries are 1, cd, and 4. The details on how to use the regex_search() algorithm are shown in a later section.
  • Capture groups can be used during matching for a purpose called back references (explained later).
  • Capture groups can be used to identify components during replace operations (explained later).
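
As a preview, the 1cd4 example from the first bullet can be sketched as follows. The helper name searchGroups() is invented for this illustration; it collects the entries of the std::smatch result into a vector.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the entire match followed by one entry per capture group,
// or an empty vector when the pattern is not found in 'input'.
std::vector<std::string> searchGroups(const std::string& input,
                                      const std::string& pattern)
{
    std::vector<std::string> result;
    const std::regex r(pattern);
    std::smatch m;
    if (std::regex_search(input, m, r)) {
        for (const auto& sub : m) {
            result.push_back(sub.str());
        }
    }
    return result;
}
```

Calling searchGroups("1cd4", "(.)(ab|cd)(.)") returns four entries: 1cd4, 1, cd, and 4.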

Repetition

Parts of a regular expression can be repeated by using one of four repeats:

  • * matches the preceding part zero or more times. For example, a*b matches b, ab, aab, aaaab, and so on.
  • + matches the preceding part one or more times. For example, a+b matches ab, aab, aaaab, and so on, but not b.
  • ? matches the preceding part zero or one time. For example, a?b matches b and ab, but nothing else.
  • {…} represents a bounded repeat. a{n} matches a repeated exactly n times; a{n,} matches a repeated n times or more; and a{n,m} matches a repeated between n and m times inclusive. For example, a{3,4} matches aaa and aaaa but not a, aa, aaaaa, and so on.

The repeats described in the previous list are called greedy because they find the longest match while still matching the remainder of the regular expression. To make them non-greedy, a ? can be added behind the repeat, as in *?, +?, ??, and {…}?. A non-greedy repetition repeats its pattern as few times as possible while still matching the remainder of the regular expression.

For example, the following table shows a greedy and a non-greedy regular expression, and the resulting submatches when running them on the input sequence aaabbb.

REGULAR EXPRESSION             SUBMATCHES
Greedy: (a+)(ab)*(b+)          "aaa" "" "bbb"
Non-greedy: (a+?)(ab)*(b+)     "aa" "ab" "bb"
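
The submatches for the greedy and non-greedy expressions can be reproduced with a short sketch. The helper name submatches() is invented for this illustration; it returns only the capture groups, not the entire match.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the text of each capture group when 'pattern' matches all of 'input',
// or an empty vector otherwise. A group that did not participate in the match
// is reported as an empty string.
std::vector<std::string> submatches(const std::string& input,
                                    const std::string& pattern)
{
    std::vector<std::string> result;
    const std::regex r(pattern);
    std::smatch m;
    if (std::regex_match(input, m, r)) {
        for (std::size_t i = 1; i < m.size(); ++i) {
            result.push_back(m[i].str());
        }
    }
    return result;
}
```

In the non-greedy case, (a+?) first tries a single a, but the rest of the expression cannot match from there, so it is extended to aa. This leaves ab for the second group and bb for the third.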

Precedence

Just as with mathematical formulas, it’s important to know the precedence of regular expression elements. Precedence is as follows:

  • Elements like a are the basic building blocks of a regular expression.
  • Quantifiers like +, *, ?, and {…} bind tightly to the element on the left; for example, b+.
  • Concatenation like ab+c binds after quantifiers.
  • Alternation like | binds last.

For example, take the regular expression ab+c|d. This matches abc, abbc, abbbc, and so on, and also d. Parentheses can be used to change these precedence rules. For example, ab+(c|d) matches abc, abbc, abbbc, …, abd, abbd, abbbd, and so on. However, by using parentheses you also mark it as a subexpression or capture group. It is possible to change the precedence rules without creating new capture groups by using (?:…). For example, ab+(?:c|d) matches the same as the preceding ab+(c|d) but does not create an additional capture group.
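
The following sketch checks these precedence rules and uses the mark_count() method of std::regex to show the number of capture groups in a pattern. The helper names fullMatch() and captureGroupCount() are invented for this illustration.

```cpp
#include <regex>
#include <string>

// True if 'pattern' matches the entire string 's'.
bool fullMatch(const std::string& s, const std::string& pattern)
{
    return std::regex_match(s, std::regex(pattern));
}

// Number of marked subexpressions (capture groups) in 'pattern'.
unsigned captureGroupCount(const std::string& pattern)
{
    return std::regex(pattern).mark_count();
}
```

With these helpers, captureGroupCount("ab+(c|d)") is 1, while captureGroupCount("ab+(?:c|d)") is 0, even though both patterns match the same strings.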

Character Set Matches

Instead of having to write (a|b|c|…|z), which is clumsy and introduces a capture group, a special syntax for specifying sets of characters or ranges of characters is available. In addition, a “not” form of the match is also available. A character set is specified between square brackets, and allows you to write [c1c2cn], which matches any of the characters c1, c2 , …, or cn. For example, [abc] matches any character a, b, or c. If the first character is ^, it means “any but”:

  • ab[cde] matches abc, abd, and abe.
  • ab[^cde] matches abf, abp, and so on but not abc, abd, and abe.

If you need to match the ^, [, or ] characters themselves, you need to escape them; for example, [\[\^\]] matches the characters [, ^, or ].

If you want to specify all letters, you could use a character set like [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]; however, this is clumsy and doing this several times is awkward, especially if you make a typo and omit one of the letters accidentally. There are two solutions to this.

One solution is to use the range specification in square brackets; this allows you to write [a-zA-Z], which recognizes all the letters in the range a to z and A to Z. If you need to match a hyphen, you need to escape it; for example, [a-zA-Z\-]+ matches any word including a hyphenated word.

Another solution is to use one of the character classes. These are used to denote specific types of characters and are represented as [:name:]. Which character classes are available depends on the locale, but the names listed in the following table are always recognized. The exact meaning of these character classes is also dependent on the locale. This table assumes the standard C locale.

CHARACTER CLASS NAME    DESCRIPTION
digit                   Digits
d                       Same as digit
xdigit                  Digits (digit) and the following letters used in hexadecimal numbers: 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C', 'D', 'E', 'F'.
alpha                   Alphabetic characters. For the C locale, these are all lowercase and uppercase letters.
alnum                   A combination of the alpha class and the digit class
w                       Same as alnum
lower                   Lowercase letters, if applicable to the locale
upper                   Uppercase letters, if applicable to the locale
blank                   A blank character is a whitespace character used to separate words within a line of text. For the C locale, these are ' ' and '\t' (tab).
space                   Whitespace characters. For the C locale, these are ' ', '\t', '\n', '\r', '\v', and '\f'.
s                       Same as space
print                   Printable characters. These occupy a printing position (for example, on a display) and are the opposite of control characters (cntrl). Examples are lowercase letters, uppercase letters, digits, punctuation characters, and space characters.
cntrl                   Control characters. These are the opposite of printable characters (print), and don’t occupy a printing position, for example, on a display. Some examples for the C locale are '\f' (form feed), '\n' (new line), and '\r' (carriage return).
graph                   Characters with a graphical representation. These are all characters that are printable (print), except the space character ' '.
punct                   Punctuation characters. For the C locale, these are all graphical characters (graph) that are not alphanumeric (alnum). Some examples are '!', '#', '@', '}', and so on.

Character classes are used within character sets; for example, [[:alpha:]]* in English means the same as [a-zA-Z]*.

Because certain concepts like matching digits are so common, there are shorthand patterns for them. For example, [:digit:] and [:d:] mean the same thing as [0-9]. Some classes have an even shorter pattern using the escape notation \. For example, \d means [:digit:]. Therefore, to recognize a sequence of one or more numbers, you can write any of the following patterns:

  • [0-9]+
  • [[:digit:]]+
  • [[:d:]]+
  • \d+

The following table lists the available escape notations for character classes.

ESCAPE NOTATION    EQUIVALENT TO
\d                 [[:d:]]
\D                 [^[:d:]]
\s                 [[:s:]]
\S                 [^[:s:]]
\w                 [_[:w:]]
\W                 [^_[:w:]]

Here are some examples:

  • Test[5-8] matches Test5, Test6, Test7, and Test8.
  • [[:lower:]] matches a, b, and so on, but not A, B, and so on.
  • [^[:lower:]] matches any character except lowercase letters like a, b, and so on.
  • [[:lower:]5-7] matches any lowercase letter like a, b, and so on, and the numbers 5, 6, and 7.
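
These character set examples can be verified with a short sketch. The helper name matchesPattern() is invented for this illustration, and the results assume the classic C locale. Raw string literals are used for the patterns that contain a backslash.

```cpp
#include <regex>
#include <string>

// True if 'pattern' matches the entire string 's'.
bool matchesPattern(const std::string& s, const std::string& pattern)
{
    return std::regex_match(s, std::regex(pattern));
}
```

For example, matchesPattern("Test7", "Test[5-8]") is true, matchesPattern("8", "[[:lower:]5-7]") is false, and matchesPattern("2024", R"(\d+)") is true.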

Word Boundaries

A word boundary can mean the following:

  • The beginning of the source string if the first character of the source string is one of the word characters, that is, a letter, digit, or an underscore. For the standard C locale this is equal to [A-Za-z0-9_]. Matching the beginning of the source string is enabled by default, but you can disable it (regex_constants::match_not_bow).
  • The end of the source string if the last character of the source string is one of the word characters. Matching the end of the source string is enabled by default, but you can disable it (regex_constants::match_not_eow).
  • The first character of a word, which is one of the word characters, while the preceding character is not a word character.
  • The end of a word, which is a non-word character, while the preceding character is a word character.

You can use \b to match a word boundary, and \B to match anything except a word boundary.
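
As a minimal sketch, word boundaries make it possible to count only the stand-alone occurrences of a word. The helper name countWholeWord() is invented for this illustration, and it assumes that the word itself contains no special regular expression characters.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts how often 'word' occurs in 'text' as a complete word.
// Assumption: 'word' contains only word characters (no regex specials).
std::size_t countWholeWord(const std::string& text, const std::string& word)
{
    // "\\b" in a normal string literal is the regular expression \b.
    const std::regex pattern("\\b" + word + "\\b");
    const std::sregex_iterator first(text.begin(), text.end(), pattern);
    const std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

For the text cat concat cat, category and the word cat, the result is 2: the occurrences inside concat and category are not surrounded by word boundaries on both sides.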

Back References

Back references allow you to reference a captured group inside the regular expression itself: \n refers to the n-th captured group, with n > 0. For example, the regular expression (\d+)-.*-\1 matches a string that has the following format:

  • one or more digits captured in a capture group (\d+)
  • followed by a dash -
  • followed by zero or more characters .*
  • followed by another dash -
  • followed by exactly the same digits captured by the first capture group \1

This regular expression matches 123-abc-123, 1234-a-1234, and so on but does not match 123-abc-1234, 123-abc-321, and so on.
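
This back reference example can be tried out with the following sketch. The helper name hasMatchingIds() is invented for this illustration; the pattern is written as a raw string literal so that the backslashes do not need to be doubled.

```cpp
#include <regex>
#include <string>

// True if 's' starts and ends with the same sequence of digits,
// separated from the middle part by dashes.
bool hasMatchingIds(const std::string& s)
{
    static const std::regex pattern(R"((\d+)-.*-\1)");
    return std::regex_match(s, pattern);
}
```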

Lookahead

Regular expressions support positive lookahead (which uses ?=pattern) and negative lookahead (which uses ?!pattern). The characters following the lookahead must match (positive), or not match (negative) the lookahead pattern, but those characters are not yet consumed.

For example: the pattern a(?!b) contains a negative lookahead to match a letter a not followed by a b. The pattern a(?=b) contains a positive lookahead to match a letter a followed by a b, but b is not consumed so it does not become part of the match.

The following is a more complicated example. The regular expression matches an input sequence that consists of at least one lowercase letter, at least one uppercase letter, at least one punctuation character, and is at least eight characters long. Such a regular expression can, for example, be used to enforce that passwords satisfy certain criteria.

 (?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{8,}
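A minimal sketch of how this password pattern could be applied follows. The function name satisfiesPasswordRules() is an assumption made for illustration. Because the lookaheads do not consume any characters, all three of them are tested from the start of the string before .{8,} consumes the input.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: checks a candidate password against the pattern
// above (at least one lowercase letter, one uppercase letter, one
// punctuation character, and at least eight characters in total).
bool satisfiesPasswordRules(const std::string& candidate)
{
    static const std::regex r(
        "(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{8,}");
    return std::regex_match(candidate, r);
}
```

For example, satisfiesPasswordRules("Passw0rd!") returns true, while "password!" (no uppercase letter), "Password1" (no punctuation), and "Pa!" (too short) are all rejected.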

Regular Expressions and Raw String Literals

As you saw in the preceding sections, regular expressions often use special characters that should be escaped in normal C++ string literals. For example, if you write \d in a regular expression, it matches any digit. However, because \ is a special character in C++, you need to escape it in your regular expression string literal as \\d; otherwise, your C++ compiler tries to interpret the \d. It can get more complicated if you want your regular expression to match a single back-slash character \. Because \ is a special character in the regular expression syntax itself, you need to escape it as \\. The \ character is also a special character in C++ string literals, so you need to escape it in your C++ string literal, resulting in \\\\.

You can use raw string literals to make a complicated regular expression easier to read in your C++ source code. (Raw string literals are discussed in Chapter 2.) For example, take the following regular expression:

"( |\\n|\\r|\\\\)"

This regular expression matches spaces, newlines, carriage returns, and back slashes. As you can see, you need a lot of escape characters. Using raw string literals, this can be replaced with the following more readable regular expression:

R"(( |\n|\r|\\))"

The raw string literal starts with R"( and ends with )". Everything in between is the regular expression. Of course, you still need a double back slash at the end because the back slash needs to be escaped in the regular expression itself.
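To make the equivalence concrete, the following sketch returns the pattern once as an escaped literal and once as a raw string literal, and uses it to strip the matched characters from a string. All three function names are hypothetical; the replacement uses regex_replace(), which is explained later in this chapter.

```cpp
#include <regex>
#include <string>

// The pattern as an ordinary string literal: every back slash that has to
// reach the regular expression engine is doubled.
std::string escapedPattern() { return "( |\\n|\\r|\\\\)"; }

// The same pattern as a raw string literal: no C++-level escaping needed.
std::string rawPattern() { return R"(( |\n|\r|\\))"; }

// Hypothetical helper: removes spaces, newlines, carriage returns, and
// back slashes from the input by replacing each match with nothing.
std::string stripSeparators(const std::string& input)
{
    const std::regex r(rawPattern());
    return std::regex_replace(input, r, "");
}
```

Both functions return exactly the same sequence of characters, so escapedPattern() == rawPattern() holds, and either can be passed to the regex constructor.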

This concludes a brief description of the ECMAScript grammar. The following sections explain how to actually use regular expressions in your C++ code.

The regex Library

Everything for the regular expression library is in the <regex> header file and in the std namespace. The basic template types defined by the regular expression library are

  • basic_regex: An object representing a specific regular expression.
  • match_results: A substring that matched a regular expression, including all the captured groups. It is a collection of sub_matches.
  • sub_match: An object containing a pair of iterators into the input sequence. These iterators represent the matched capture group. The pair is an iterator pointing to the first character of a matched capture group and an iterator pointing to one-past-the-last character of the matched capture group. It has an str() method that returns the matched capture group as a string.

The library provides three key algorithms: regex_match(), regex_search(), and regex_replace(). All of these algorithms have different versions that allow you to specify the source string as a string, a character array, or as a begin/end iterator pair. The iterators can be any of the following:

  • const char*
  • const wchar_t*
  • string::const_iterator
  • wstring::const_iterator

In fact, any iterator that behaves as a bidirectional iterator can be used. See Chapters 17 and 18 for details on iterators.
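As a small illustration of that last point, the sketch below runs regex_match() over the iterators of a std::list<char>, which are bidirectional but not random access. The function name isAllDigits() is hypothetical; the choice of std::list is only meant to show that the input does not have to be a string.

```cpp
#include <list>
#include <regex>

// Hypothetical helper: returns true if the list holds one or more digit
// characters and nothing else. The iterator-pair overload of regex_match()
// is used, so no std::string is involved.
bool isAllDigits(const std::list<char>& characters)
{
    static const std::regex r("\\d+");
    return std::regex_match(characters.cbegin(), characters.cend(), r);
}
```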

The library also defines the following two types for regular expression iterators, which are very important if you want to find all occurrences of a pattern in a source string.

  • regex_iterator: iterates over all the occurrences of a pattern in a source string.
  • regex_token_iterator: iterates over all the capture groups of all occurrences of a pattern in a source string.

To make the library easier to use, the standard defines a number of type aliases for the preceding templates:

using regex  = basic_regex<char>;
using wregex = basic_regex<wchar_t>;

using csub_match  = sub_match<const char*>;
using wcsub_match = sub_match<const wchar_t*>;
using ssub_match  = sub_match<string::const_iterator>;
using wssub_match = sub_match<wstring::const_iterator>;

using cmatch  = match_results<const char*>;
using wcmatch = match_results<const wchar_t*>;
using smatch  = match_results<string::const_iterator>;
using wsmatch = match_results<wstring::const_iterator>;

using cregex_iterator  = regex_iterator<const char*>;
using wcregex_iterator = regex_iterator<const wchar_t*>;
using sregex_iterator  = regex_iterator<string::const_iterator>;
using wsregex_iterator = regex_iterator<wstring::const_iterator>;

using cregex_token_iterator  = regex_token_iterator<const char*>;
using wcregex_token_iterator = regex_token_iterator<const wchar_t*>;
using sregex_token_iterator  = regex_token_iterator<string::const_iterator>;
using wsregex_token_iterator = regex_token_iterator<wstring::const_iterator>;

The following sections explain the regex_match(), regex_search(), and regex_replace() algorithms, and the regex_iterator and regex_token_iterator classes.

regex_match()

The regex_match() algorithm can be used to compare a given source string with a regular expression pattern. It returns true if the pattern matches the entire source string, and false otherwise. It is very easy to use. There are six versions of the regex_match() algorithm accepting different kinds of arguments. They all have the following form:

template<…>
bool regex_match(InputSequence[, MatchResults], RegEx[, Flags]);

The InputSequence can be represented as follows:

  • A start and end iterator into a source string
  • An std::string
  • A C-style string

The optional MatchResults parameter is a reference to a match_results and receives the match. If regex_match() returns false, you are only allowed to call match_results::empty() or match_results::size(); anything else is undefined. If regex_match() returns true, a match is found and you can inspect the match_results object for what exactly got matched. This is explained with examples in the following subsections.

The RegEx parameter is the regular expression that needs to be matched. The optional Flags parameter specifies options for the matching algorithm. In most cases you can keep the default. For more details, consult a Standard Library Reference (see Appendix B).

regex_match() Example

Suppose you want to write a program that asks the user to enter a date in the format year/month/day, where year is four digits, month is a number between 1 and 12, and day is a number between 1 and 31. You can use a regular expression together with the regex_match() algorithm to validate the user input as follows. The details of the regular expression are explained after the code.

regex r("\\d{4}/(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[1-2][0-9]|3[0-1])");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;

    if (regex_match(str, r))
        cout << "  Valid date." << endl;
    else
        cout << "  Invalid date!" << endl;
}

The first line creates the regular expression. The expression consists of three parts separated by a forward slash (/) character: one part for year, one for month, and one for day. The following list explains these parts.

  • \d{4}: This matches any combination of four digits; for example, 1234, 2010, and so on.
  • (?:0?[1-9]|1[0-2]): This subpart of the regular expression is wrapped inside parentheses to make sure the precedence is correct. You don’t need a capture group, so (?:…) is used. The inner expression consists of an alternation of two parts separated by the | character.
      • 0?[1-9]: This matches any number from 1 to 9 with an optional 0 in front of it. For example, it matches 1, 2, 9, 03, 04, and so on. It does not match 0, 10, 11, and so on.
      • 1[0-2]: This matches 10, 11, or 12, and nothing else.
  • (?:0?[1-9]|[1-2][0-9]|3[0-1]): This subpart is also wrapped inside a non-capture group and consists of an alternation of three parts.
      • 0?[1-9]: This matches any number from 1 to 9 with an optional 0 in front of it. For example, it matches 1, 2, 9, 03, 04, and so on. It does not match 0, 10, 11, and so on.
      • [1-2][0-9]: This matches any number between 10 and 29 inclusive and nothing else.
      • 3[0-1]: This matches 30 or 31 and nothing else.

The example then enters an infinite loop to ask the user to enter a date. Each date entered is given to the regex_match() algorithm. When regex_match() returns true, the user has entered a date that matches the date regular expression pattern.

This example can be extended a bit by asking the regex_match() algorithm to return captured subexpressions in a results object. You first have to understand what a capture group does. By specifying a match_results object like smatch in a call to regex_match(), the elements of the match_results object are filled in when the regular expression matches the input string. To be able to extract these substrings, you must create capture groups using parentheses.

The first element, [0], in a match_results object contains the string that matched the entire pattern. When using regex_match() and a match is found, this is the entire source sequence. When using regex_search(), discussed in the next section, this is a substring in the source sequence that matches the regular expression. Element [1] is the substring matched by the first capture group, [2] by the second capture group, and so on. To get a string representation of a capture group, you can write m[i] as in the following code, or write m[i].str(), where i is the index of the capture group and m is a match_results object.

The following code extracts the year, month, and day digits into three separate integer variables. The regular expression in the revised example has a few small changes. The first part matching the year is wrapped in a capture group, while the month and day parts are now also capture groups instead of non-capture groups. The call to regex_match() includes a smatch parameter, which receives the matched capture groups. Here is the adapted example:

regex r("(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;

    smatch m;
    if (regex_match(str, m, r)) {
        int year = stoi(m[1]);
        int month = stoi(m[2]);
        int day = stoi(m[3]);
        cout << "  Valid date: Year=" << year
             << ", month=" << month
             << ", day=" << day << endl;
    } else {
        cout << "  Invalid date!" << endl;
    }
}

In this example, there are four elements in the smatch results objects:

  • [0]: the string matching the full regular expression, which is the full date in this example
  • [1]: the year
  • [2]: the month
  • [3]: the day

When you execute this example, you can get the following output:

Enter a date (year/month/day) (q=quit): 2011/12/01
  Valid date: Year=2011, month=12, day=1
Enter a date (year/month/day) (q=quit): 11/12/01
  Invalid date!

regex_search()

The regex_match() algorithm discussed in the previous section returns true if the entire source string matches the regular expression, and false otherwise. It cannot be used to find a matching substring. Instead, you need to use the regex_search() algorithm, which allows you to search for a substring that matches a certain pattern. There are six versions of the regex_search() algorithm, and they all have the following form:

template<…>
bool regex_search(InputSequence[, MatchResults], RegEx[, Flags]);

All variations return true when a match is found somewhere in the input sequence, and false otherwise. The parameters are similar to the parameters for regex_match().

Two versions of the regex_search() algorithm accept a begin and end iterator as the input sequence that you want to process. You might be tempted to use this version of regex_search() in a loop to find all occurrences of a pattern in a source string by manipulating these begin and end iterators for each regex_search() call. Never do this! It can cause problems when your regular expression uses anchors (^ or $), word boundaries, and so on. It can also cause an infinite loop due to empty matches. Use the regex_iterator or regex_token_iterator as explained later in this chapter to extract all occurrences of a pattern from a source string.

regex_search() Example

The regex_search() algorithm can be used to extract matching substrings from an input sequence. The following example extracts code comments from input lines. The regular expression searches for a substring that starts with // followed by some optional whitespace \s* followed by one or more characters captured in a capture group (.+). This capture group captures only the comment substring. The smatch object m receives the search results. If successful, m[1] contains the comment that was found. You can check the m[1].first and m[1].second iterators to see where exactly the comment was found in the source string.

regex r("//\\s*(.+)$");
while (true) {
    cout << "Enter a string with optional code comments (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;

    smatch m;
    if (regex_search(str, m, r))
        cout << "  Found comment '" << m[1] << "'" << endl;
    else
        cout << "  No comment found!" << endl;
}

The output of this program can look as follows:

Enter a string with optional code comments (q=quit): std::string str;   // Our source string
  Found comment 'Our source string'
Enter a string with optional code comments (q=quit): int a; // A comment with // in the middle
  Found comment 'A comment with // in the middle'
Enter a string with optional code comments (q=quit): float f; // A comment with a       (tab) character
  Found comment 'A comment with a       (tab) character'

The match_results object also has a prefix() and suffix() method, which return the string preceding or following the match, respectively.
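A minimal sketch of prefix() and of the first iterator of a sub_match follows. The struct CommentParts and the function splitComment() are hypothetical names introduced only for this illustration; the regular expression is the one from the comment example.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct CommentParts
{
    bool found;                  // true if a // comment was found
    std::string code;            // text preceding the match (prefix())
    std::string comment;         // text of the first capture group
    std::ptrdiff_t commentPos;   // offset of the comment text in the line
};

// Hypothetical helper: splits a line into the code before a // comment and
// the comment text itself.
CommentParts splitComment(const std::string& line)
{
    static const std::regex r("//\\s*(.+)$");
    std::smatch m;
    if (!std::regex_search(line, m, r))
        return { false, line, "", 0 };

    // m[1].first points at the first character of the captured comment.
    const auto pos = std::distance(line.cbegin(), m[1].first);
    return { true, m.prefix().str(), m[1].str(), pos };
}
```

For the line "int a; // note", the result has code equal to "int a; ", comment equal to "note", and commentPos equal to 10. Because the pattern ends with $, suffix() is always empty for a successful match in this particular example.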

regex_iterator

As explained in the previous section, you should never use regex_search() in a loop to extract all occurrences of a pattern from a source sequence. Instead, you should use a regex_iterator or regex_token_iterator. They work similarly to iterators for Standard Library containers.

regex_iterator Example

The following example asks the user to enter a source string, extracts every word from the string, and prints all words between quotes. The regular expression in this case is [\w]+, which searches for one or more word-letters. This example uses std::string as source, so it uses sregex_iterator for the iterators. A standard iterator loop is used, but in this case, the end iterator is done slightly differently from the end iterators of ordinary Standard Library containers. Normally, you specify an end iterator for a particular container, but for regex_iterator, there is only one “end” iterator. You can get this end iterator by declaring a regex_iterator type using the default constructor.

The for loop creates a start iterator called iter, which accepts a begin and end iterator into the source string together with the regular expression. The loop body is called for every match found, which is every word in this example. The sregex_iterator iterates over all the matches. By dereferencing an sregex_iterator, you get a smatch object. Accessing the first element of this smatch object, [0], gives you the matched substring:

regex reg("[\\w]+");
while (true) {
    cout << "Enter a string to split (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;

    const sregex_iterator end;
    for (sregex_iterator iter(cbegin(str), cend(str), reg);
        iter != end; ++iter) {
        cout << "\"" << (*iter)[0] << "\"" << endl;
    }
}

The output of this program can look as follows:

Enter a string to split (q=quit): This, is    a test.
"This"
"is"
"a"
"test"

As this example demonstrates, even simple regular expressions can do some powerful string manipulation!

Note that both a regex_iterator and a regex_token_iterator internally contain a pointer to the given regular expression. They both explicitly delete constructors accepting an rvalue regular expression, so you cannot construct them with a temporary regex object. For example, the following does not compile:

for (sregex_iterator iter(cbegin(str), cend(str), regex("[\\w]+"));
    iter != end; ++iter) { … }
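One way to stay clear of the temporary-regex problem is to keep the regular expression in a named object that outlives the iterators. The following sketch does that and collects the words into a vector; the function name extractWords() is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every word found in `text`.
std::vector<std::string> extractWords(const std::string& text)
{
    // A named (here: static) regex object, not a temporary, so that the
    // iterator's internal pointer to it stays valid during the loop.
    static const std::regex reg("[\\w]+");

    std::vector<std::string> words;
    const std::sregex_iterator end;
    for (std::sregex_iterator iter(text.cbegin(), text.cend(), reg);
        iter != end; ++iter) {
        words.push_back(iter->str());   // same as (*iter)[0].str()
    }
    return words;
}
```

Calling extractWords("This, is    a test.") returns a vector containing This, is, a, and test.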

regex_token_iterator

The previous section describes regex_iterator, which iterates through every matched pattern. In each iteration of the loop you get a match_results object, which you can use to extract subexpressions for that match that are captured by capture groups.

A regex_token_iterator can be used to automatically iterate over all or selected capture groups across all matched patterns. There are four constructors with the following format:

regex_token_iterator(BidirectionalIterator a,
                     BidirectionalIterator b,
                     const regex_type& re
                     [, SubMatches
                     [, Flags]]);

All of them require a begin and end iterator as input sequence, and a regular expression. The optional SubMatches parameter is used to specify which capture groups should be iterated over. SubMatches can be specified in four ways:

  • As a single integer representing the index of the capture group that you want to iterate over.
  • As a vector with integers representing the indices of the capture groups that you want to iterate over.
  • As an initializer_list with capture group indices.
  • As a C-style array with capture group indices.

When you omit SubMatches or when you specify a 0 for SubMatches, you get an iterator that iterates over all capture groups with index 0, which are the substrings matching the full regular expression. The optional Flags parameter specifies options for the matching algorithm. In most cases you can keep the default. Consult a Standard Library Reference for more details.
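As a sketch of the initializer_list form of SubMatches, the following function iterates over capture groups 1 and 2 of every match. The function name keysAndValues() and the key=value input format are assumptions made for this illustration only.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: for input such as "a=1 b=2", returns the keys and
// values in the order in which they appear: a, 1, b, 2.
std::vector<std::string> keysAndValues(const std::string& text)
{
    static const std::regex reg("(\\w+)=(\\w+)");

    std::vector<std::string> result;
    const std::sregex_token_iterator end;
    // { 1, 2 } is passed as an initializer_list with capture group indices.
    for (std::sregex_token_iterator iter(text.cbegin(), text.cend(),
                                         reg, { 1, 2 });
        iter != end; ++iter) {
        result.push_back(iter->str());
    }
    return result;
}
```

For each match, the iterator first yields capture group 1 and then capture group 2 before moving on to the next match.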

regex_token_iterator Examples

The previous regex_iterator example can be rewritten using a regex_token_iterator as follows. Note that *iter is used in the loop body instead of (*iter)[0] as in the regex_iterator example, because the token iterator with 0 as the default submatch index automatically iterates over all capture groups with index 0. The output of this code is exactly the same as the output generated by the regex_iterator example:

regex reg("[\\w]+");
while (true) {
    cout << "Enter a string to split (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;

    const sregex_token_iterator end;
    for (sregex_token_iterator iter(cbegin(str), cend(str), reg);
        iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}

The following example asks the user to enter a date and then uses a regex_token_iterator to iterate over the second and third capture groups (month and day), which are specified as a vector of integers. The regular expression used for dates is explained earlier in this chapter. The only difference is that ^ and $ anchors are added since we want to match the entire source sequence. Earlier, that was not necessary, because regex_match() automatically matches the entire input string.

regex reg("^(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])$");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;

    vector<int> indices{ 2, 3 };
    const sregex_token_iterator end;
    for (sregex_token_iterator iter(cbegin(str), cend(str), reg, indices);
        iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}

This code prints only the month and day of valid dates. Output generated by this example can look like this:

Enter a date (year/month/day) (q=quit): 2011/1/13
"1"
"13"
Enter a date (year/month/day) (q=quit): 2011/1/32
Enter a date (year/month/day) (q=quit): 2011/12/5
"12"
"5"

The regex_token_iterator can also be used to perform a so-called field splitting or tokenization. It is a much safer and more flexible alternative to using the old strtok() function from C. Tokenization is triggered in the regex_token_iterator constructor by specifying -1 as the capture group index to iterate over. When in tokenization mode, the iterator iterates over all substrings of the input sequence that do not match the regular expression. The following code demonstrates this by tokenizing a string on the delimiters , and ; with zero or more whitespace characters before or after the delimiters:

regex reg(R"(\s*[,;]\s*)");
while (true) {
    cout << "Enter a string to split on ',' and ';' (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;

    const sregex_token_iterator end;
    for (sregex_token_iterator iter(cbegin(str), cend(str), reg, -1);
        iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}

The regular expression in this example is specified as a raw string literal and searches for patterns that match the following:

  • zero or more whitespace characters
  • followed by a , or ; character
  • followed by zero or more whitespace characters

The output can be as follows:

Enter a string to split on ',' and ';' (q=quit): This is,   a; test string.
"This is"
"a"
"test string."

As you can see from this output, the string is split on , and ;. All whitespace characters around the , and ; are removed, because the tokenization iterator iterates over all substrings that do not match the regular expression, and because the regular expression matches , and ; with whitespace around them.
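The same tokenization mode can feed further processing. The following sketch splits a list of numbers on , and ; and converts each token to an int. The function name parseNumbers() is hypothetical, and the sketch assumes that every token is a valid integer; stoi() throws an exception otherwise.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: parses input such as "10, 20;  30" into integers.
// Assumption: every field between the delimiters is a valid integer.
std::vector<int> parseNumbers(const std::string& text)
{
    static const std::regex reg(R"(\s*[,;]\s*)");

    std::vector<int> numbers;
    const std::sregex_token_iterator end;
    // -1 selects tokenization mode: iterate over what does NOT match.
    for (std::sregex_token_iterator iter(text.cbegin(), text.cend(), reg, -1);
        iter != end; ++iter) {
        numbers.push_back(std::stoi(iter->str()));
    }
    return numbers;
}
```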

regex_replace()

The regex_replace() algorithm requires a regular expression, and a formatting string that is used to replace matching substrings. This formatting string can reference part of the matched substrings by using the escape sequences in the following table.

ESCAPE SEQUENCE   REPLACED WITH
$n                The string matching the n-th capture group; for example, $1 for the first capture group, $2 for the second, and so on. n must be greater than 0.
$&                The string matching the entire regular expression.
$`                The part of the input sequence that appears to the left of the substring matching the regular expression.
$'                The part of the input sequence that appears to the right of the substring matching the regular expression.
$$                A single dollar sign.
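Two of these escape sequences can be tried out with a short sketch. The function names bracketNumbers() and addCurrencySign() are hypothetical; both rely on the string-returning form of regex_replace() described next.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: wraps every run of digits in square brackets.
// $& stands for the text matched by the entire regular expression.
std::string bracketNumbers(const std::string& input)
{
    static const std::regex r("\\d+");
    return std::regex_replace(input, r, "[$&]");
}

// Hypothetical helper: puts a dollar sign in front of every run of digits.
// $$ produces a single dollar sign, and $1 is the first capture group.
std::string addCurrencySign(const std::string& input)
{
    static const std::regex r("(\\d+)");
    return std::regex_replace(input, r, "$$$1");
}
```

For example, bracketNumbers("a1b22") returns a[1]b[22], and addCurrencySign("cost 5") returns cost $5.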

There are six versions of the regex_replace() algorithm. The difference between them is in the type of arguments. Four of them have the following format:

string regex_replace(InputSequence, RegEx, FormatString[, Flags]);

These four versions return the resulting string after performing the replacement. Both the InputSequence and the FormatString can be an std::string or a C-style string. The RegEx parameter is the regular expression that needs to be matched. The optional Flags parameter specifies options for the replace algorithm.

Two versions of the regex_replace() algorithm have the following format:

OutputIterator regex_replace(OutputIterator,
                             BidirectionalIterator first,
                             BidirectionalIterator last,
                             RegEx, FormatString[, Flags]);

These two versions write the resulting string to the given output iterator and return this output iterator. The input sequence is given as a begin and end iterator. The other parameters are identical to the other four versions of regex_replace().
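A minimal sketch of the output-iterator form follows. It appends the result to a string through back_inserter(); the function name maskDigits() is hypothetical.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: replaces every digit in `input` with '#'.
std::string maskDigits(const std::string& input)
{
    static const std::regex digit("\\d");

    std::string output;
    // The result is written through the output iterator instead of being
    // returned as a new string.
    std::regex_replace(std::back_inserter(output),
                       input.cbegin(), input.cend(), digit, "#");
    return output;
}
```

For example, maskDigits("call 555-1234") returns call ###-####. This form can be convenient when the result has to be appended to an existing container or written to a stream iterator.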

regex_replace() Examples

As a first example, take the following HTML source string,

<body><h1>Header</h1><p>Some text</p></body>

and the regular expression,

<h1>(.*)</h1><p>(.*)</p>

The following table shows the different escape sequences and what they will be replaced with.

ESCAPE SEQUENCE   REPLACED WITH
$1                Header
$2                Some text
$&                <h1>Header</h1><p>Some text</p>
$`                <body>
$'                </body>

The following code demonstrates the use of regex_replace():

const string str("<body><h1>Header</h1><p>Some text</p></body>");
regex r("<h1>(.*)</h1><p>(.*)</p>");

const string format("H1=$1 and P=$2");  // See above table
string result = regex_replace(str, r, format);

cout << "Original string: '" << str << "'" << endl;
cout << "New string     : '" << result << "'" << endl;

The output of this program is as follows:

Original string: '<body><h1>Header</h1><p>Some text</p></body>'
New string     : '<body>H1=Header and P=Some text</body>'

The regex_replace() algorithm accepts a number of flags that can be used to manipulate how it is working. The most important flags are given in the following table.

FLAG                DESCRIPTION
format_default      The default is to replace all occurrences of the pattern, and to also copy everything to the output that does not match the pattern.
format_no_copy      Replaces all occurrences of the pattern, but does not copy anything to the output that does not match the pattern.
format_first_only   Replaces only the first occurrence of the pattern.

The following example modifies the previous code to use the format_no_copy flag:

const string str("<body><h1>Header</h1><p>Some text</p></body>");
regex r("<h1>(.*)</h1><p>(.*)</p>");

const string format("H1=$1 and P=$2");
string result = regex_replace(str, r, format,
    regex_constants::format_no_copy);

cout << "Original string: '" << str << "'" << endl;
cout << "New string     : '" << result << "'" << endl;

The output is as follows. Compare this with the output of the previous version.

Original string: '<body><h1>Header</h1><p>Some text</p></body>'
New string     : 'H1=Header and P=Some text'
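The format_first_only flag from the table can be sketched in the same way. The function name replaceFirstWord() and the replacement text X are assumptions chosen only for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces only the first word in `input` with "X".
// Everything that does not match is still copied to the result, because
// format_no_copy is not specified.
std::string replaceFirstWord(const std::string& input)
{
    static const std::regex word("\\w+");
    return std::regex_replace(input, word, "X",
        std::regex_constants::format_first_only);
}
```

For example, replaceFirstWord("one two three") returns X two three; without the flag, the result would be X X X.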

Another example is to get an input string and replace each word boundary with a newline so that the output contains only one word per line. The following example demonstrates this without using any loops to process a given input string. The code first creates a regular expression that matches individual words. When a match is found, it is replaced with $1 where $1 is replaced with the matched word. Note also the use of the format_no_copy flag to prevent copying whitespace and other non-word characters from the source string to the output.

regex reg("([\\w]+)");
const string format("$1\n");
while (true) {
    cout << "Enter a string to split over multiple lines (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;

    cout << regex_replace(str, reg, format,
        regex_constants::format_no_copy) << endl;
}

The output of this program can be as follows:

Enter a string to split over multiple lines (q=quit):   This is   a test.
This
is
a
test

SUMMARY

This chapter gave you an appreciation for coding with localization in mind. As anyone who has been through a localization effort will tell you, adding support for a new language or locale is infinitely easier if you have planned ahead; for example, by using Unicode characters and being mindful of locales.

The second part of this chapter explained the regular expressions library. Once you know the syntax of regular expressions, it becomes much easier to work with strings. Regular expressions allow you to validate strings, search for substrings inside an input sequence, perform find-and-replace operations, and so on. It is highly recommended that you get to know regular expressions and start using them instead of writing your own string manipulation routines. They will make your life easier.
