When you’re learning how to program in C or C++, it’s useful to think of a character as equivalent to a byte and to treat all characters as members of the ASCII (American Standard Code for Information Interchange) character set. ASCII is a 7-bit set usually stored in an 8-bit char type. In reality, experienced C++ programmers recognize that successful programs are used throughout the world. Even if you don’t initially write your program with international audiences in mind, you shouldn’t prevent yourself from localizing, or making the software locale aware, at a later date.
A critical aspect of localization is that you should never put any native-language string literals in your source code, except maybe for debug strings targeted at the developer. In Microsoft Windows applications, this is accomplished by putting the strings in STRINGTABLE resources. Most other platforms offer similar capabilities. If you need to translate your application to another language, translating those resources should be all you need to do, without requiring any source changes. There are tools available that will help you with this translation process.
To make your source code localizable, you should not compose sentences out of string literals, even if the individual literals can be localized. Here is an example:
cout << "Read " << n << " bytes" << endl;
This statement cannot be localized to Dutch because it requires a reordering of the words. The Dutch translation is as follows:
cout << n << " bytes gelezen" << endl;
To make sure you can properly localize this string, you could implement something like this:
cout << Format(IDS_TRANSFERRED, n) << endl;
IDS_TRANSFERRED is the name of an entry in a string resource table. For the English version, IDS_TRANSFERRED could be defined as “Read $1 bytes”, while the Dutch version of the resource could be defined as “$1 bytes gelezen”. The Format() function loads the string resource, and substitutes $1 with the value of n.
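The text does not show how Format() is implemented, so the following is only a minimal sketch of the idea. It assumes a hypothetical in-memory string table keyed by language (getStringTable(), the StringId enum, and the language parameter are all invented for illustration); a real application would load the strings from STRINGTABLE resources or a similar platform mechanism.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical resource IDs; a real application would generate these
// from its resource files.
enum StringId { IDS_TRANSFERRED };

// Hypothetical stand-in for a platform string resource table.
const std::map<StringId, std::string>& getStringTable(const std::string& language)
{
    static const std::map<StringId, std::string> english{
        { IDS_TRANSFERRED, "Read $1 bytes" } };
    static const std::map<StringId, std::string> dutch{
        { IDS_TRANSFERRED, "$1 bytes gelezen" } };
    return language == "nl" ? dutch : english;
}

// Loads the string for 'id' and substitutes every "$1" with 'value'.
std::string Format(StringId id, int value, const std::string& language = "en")
{
    std::string result = getStringTable(language).at(id);
    const std::string placeholder = "$1";
    const std::string replacement = std::to_string(value);
    std::size_t pos = 0;
    while ((pos = result.find(placeholder, pos)) != std::string::npos) {
        result.replace(pos, placeholder.size(), replacement);
        pos += replacement.size(); // Skip past the text that was just inserted.
    }
    return result;
}
```

With this sketch, Format(IDS_TRANSFERRED, 42) yields the English sentence and Format(IDS_TRANSFERRED, 42, "nl") yields the Dutch one, without any change to the calling code.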
The problem with viewing a character as a byte is that not all languages, or character sets, can be fully represented in 8 bits, or 1 byte. C++ has a built-in type called wchar_t that holds a wide character. Languages with non-ASCII (U.S.) characters, such as Japanese and Arabic, can be represented in C++ with wchar_t. However, the C++ standard does not define a size for wchar_t. Some compilers use 16 bits while others use 32 bits. To write cross-platform code, it is not safe to assume that wchar_t is of a particular size.
If there is any chance that your program will be used in a non-Western character set context (hint: there is!), you should use wide characters from the beginning. When working with wchar_t, string and character literals are prefixed with the letter L to indicate that a wide-character encoding should be used. For example, to initialize a wchar_t character to the letter m, you write it like this:
wchar_t myWideCharacter = L'm';
There are wide-character versions of most of your favorite types and classes. The wide string class is wstring. The “prefix letter w” pattern applies to streams as well. Wide-character file output streams are handled with wofstream, and input is handled with wifstream. The joy of pronouncing these class names (woof-stream? whiff-stream?) is reason enough to make your programs locale aware! Streams are discussed in detail in Chapter 13.
There are also wide versions of cout, cin, cerr, and clog available, called wcout, wcin, wcerr, and wclog. Using them is no different than using the non-wide versions:
wcout << L"I am a wide-character string literal." << endl;
Wide characters are a great step forward because they increase the amount of space available to define a single character. The next step is to figure out how that space is used. In wide character sets, just like in ASCII, characters are represented by numbers, now called code points. The only difference is that each number does not fit in 8 bits. The map of characters to code points is quite a bit larger because it handles many different character sets in addition to the characters that English-speaking programmers are familiar with.
The Universal Character Set (UCS), defined by the International Standard ISO 10646, and Unicode are both standardized sets of characters. They contain around one hundred thousand abstract characters, each identified by an unambiguous name and a code point. The same characters with the same numbers exist in both standards. Both have specific encodings that you can use. For example, UTF-8 is a Unicode encoding in which each Unicode character is encoded using one to four 8-bit bytes. UTF-16 encodes Unicode characters as one or two 16-bit values, and UTF-32 encodes each Unicode character as exactly 32 bits.
Different applications can use different encodings. Unfortunately, the C++ standard does not specify a size for wide characters (wchar_t). On Windows it is 16 bits, while on other platforms it could be 32 bits. You need to be aware of this when using wide characters for character encoding in cross-platform code. To help solve this issue, there are two other character types: char16_t and char32_t. The following list gives an overview of all available character types.
char: Stores at least 8 bits (exactly 8 on most platforms). This type can be used to store ASCII characters, or as a basic building block for storing UTF-8 encoded Unicode characters, where one Unicode character is encoded as one to four chars.
char16_t: Stores at least 16 bits. This type can be used as the basic building block for UTF-16 encoded Unicode characters, where one Unicode character is encoded as one or two char16_ts.
char32_t: Stores at least 32 bits. This type can be used for storing UTF-32 encoded Unicode characters as one char32_t.
wchar_t: Stores a wide character of a compiler-specific size and encoding.
The benefit of using char16_t and char32_t instead of wchar_t is that the size of char16_t is guaranteed to be at least 16 bits, and the size of char32_t is guaranteed to be at least 32 bits, independent of the compiler. There is no minimum size guaranteed for wchar_t.
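The size guarantees can be checked at compile time. The following is a small sketch; the helper name wideCharBits() is invented for illustration, and the value it returns depends on your compiler and platform.

```cpp
#include <climits>
#include <cstddef>

// These hold on every conforming compiler, so they never fire.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds at least 16 bits");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds at least 32 bits");

// Hypothetical helper: the number of bits in wchar_t on the current platform.
// No static_assert is possible here beyond the size of a byte, because the
// standard guarantees no particular minimum for wchar_t.
constexpr std::size_t wideCharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Code that needs a specific width can branch on wideCharBits() or, more simply, use char16_t or char32_t instead of wchar_t.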
The standard also defines the following macros.
__STDC_UTF_32__: If defined by the compiler, then the type char32_t uses UTF-32 encoding. If it is not defined, char32_t has a compiler-dependent encoding.
__STDC_UTF_16__: If defined by the compiler, then the type char16_t uses UTF-16 encoding. If it is not defined, char16_t has a compiler-dependent encoding.
String literals can have a string prefix to turn them into a specific type. The complete set of supported string prefixes is as follows.
u8: A char string literal with UTF-8 encoding.
u: A char16_t string literal, which can be UTF-16 if __STDC_UTF_16__ is defined by the compiler.
U: A char32_t string literal, which can be UTF-32 if __STDC_UTF_32__ is defined by the compiler.
L: A wchar_t string literal with a compiler-dependent encoding.
All of these string literals can be combined with the raw string literal prefix, R, discussed in Chapter 2. Here are some examples:
const char* s1 = u8R"(Raw UTF-8 encoded string literal)";
const wchar_t* s2 = LR"(Raw wide string literal)";
const char16_t* s3 = uR"(Raw char16_t string literal)";
const char32_t* s4 = UR"(Raw char32_t string literal)";
If you are using Unicode encoding, for example, by using u8 UTF-8 string literals, or if your compiler defines __STDC_UTF_16__ or __STDC_UTF_32__, you can insert a specific Unicode code point in your non-raw string literal by using the \uABCD notation. For example, \u03C0 represents the pi character, and \u00B2 represents the ‘squared’ character. The following formula string represents “π r²”:
const char* formula = u8"\u03C0 r\u00B2";
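To see how the same code points occupy a different number of code units in different encodings, consider the following sketch. It assumes the compiler encodes u and U literals as UTF-16 and UTF-32 (that is, it defines __STDC_UTF_16__ and __STDC_UTF_32__), and it uses the eight-digit \U notation for a code point that does not fit in four hexadecimal digits. The variable names are invented for illustration.

```cpp
#include <string>

// The formula "pi r squared" as UTF-16 and UTF-32 strings.
// Each of its four characters fits in a single code unit in both encodings.
const std::u16string formula16 = u"\u03C0 r\u00B2";
const std::u32string formula32 = U"\u03C0 r\u00B2";

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs
// two 16-bit values (a surrogate pair), while UTF-32 needs only one value.
const std::u16string emoji16 = u"\U0001F600";
const std::u32string emoji32 = U"\U0001F600";
```

The size() method of these strings counts code units, not characters, which is why emoji16.size() and emoji32.size() differ even though both hold a single character.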
Similarly, character literals can also have a prefix to turn them into specific types. The prefixes u, U, and L are supported, and C++17 adds the u8 prefix for character literals. Here are some examples: u'a', U'a', L'a', and u8'a'.
Besides the std::string class, there is also support for wstring, u16string, and u32string. They are all defined as follows:
using string = basic_string<char>;
using wstring = basic_string<wchar_t>;
using u16string = basic_string<char16_t>;
using u32string = basic_string<char32_t>;
Multibyte characters are characters composed of one or more bytes with a compiler-dependent encoding, similar to how Unicode can be represented with one to four bytes using UTF-8, or with one or two 16-bit values using UTF-16. There are conversion functions to convert between char16_t/char32_t and multibyte characters, and vice versa: mbrtoc16, c16rtomb, mbrtoc32, and c32rtomb.
Unfortunately, the support for char16_t and char32_t doesn’t go much further. There are some conversion classes available (see the next section), but, for example, there is nothing like a version of cout or cin that supports char16_t and char32_t; this makes it difficult to print such strings to a console or to read them from user input. If you want to do more with char16_t and char32_t strings, you need to resort to third-party libraries. ICU (International Components for Unicode) is one well-known library that provides Unicode and globalization support for your applications.
The C++ standard provides the codecvt class template to help with converting between different encodings. The <locale> header defines the following four encoding conversion classes.
CLASS | DESCRIPTION
codecvt<char,char,mbstate_t> | Identity conversion, that is, no conversion
codecvt<char16_t,char,mbstate_t> | Conversion between UTF-16 and UTF-8
codecvt<char32_t,char,mbstate_t> | Conversion between UTF-32 and UTF-8
codecvt<wchar_t,char,mbstate_t> | Conversion between wide (implementation-specific) and narrow character encodings
Before C++17, the following three code conversion facets were defined in <codecvt>: codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16. These could be used with two convenience conversion interfaces: wstring_convert and wbuffer_convert. However, C++17 has deprecated those three conversion facets (the entire <codecvt> header) and the two convenience interfaces, so they are not further discussed in this book. The C++ Standards Committee decided to deprecate this functionality because it does not handle errors well. Ill-formed Unicode strings are a security risk, and in fact can be, and have been, used as an attack vector to compromise the security of systems. Also, the API is too obscure, and too hard to understand. I recommend using third-party libraries, such as ICU, to work correctly with Unicode strings until the Standards Committee comes up with a suitable, safe, and easier-to-use replacement for the deprecated functionality.
Character sets are only one of the differences in data representation between countries. Even countries that use similar character sets, such as Great Britain and the United States, still differ in how they represent certain data, such as dates and money.
The standard C++ mechanism that groups specific data about a particular set of cultural parameters is called a locale. An individual component of a locale, such as date format, time format, number format, and so on, is called a facet. An example of a locale is U.S. English. An example of a facet is the format used to display a date. There are several built-in facets that are common to all locales. C++ also provides a way to customize or add facets.
There are third-party libraries available that make it easier to work with locales. One example is boost.locale, which is able to use ICU as its backend, supporting collations and conversions, converting strings to uppercase (instead of converting character by character to uppercase), and so on.
When using I/O streams, data is formatted according to a particular locale. Locales are objects that can be attached to a stream, and they are defined in the <locale> header file. Locale names are implementation specific. The POSIX standard is to separate a language and an area into two-letter sections with an optional encoding. For example, the locale for the English language as spoken in the U.S. is en_US, while the locale for the English language as spoken in Great Britain is en_GB. The locale for Japanese spoken in Japan with Japanese Industrial Standard encoding is ja_JP.jis.
Locale names on Windows can have two formats. The preferred format is very similar to the POSIX format, but uses a dash instead of an underscore. The second, old format, looks like this:
lang[_country_region[.code_page]]
Everything between the square brackets is optional. The following table lists some examples.
LANGUAGE | POSIX | WINDOWS | WINDOWS OLD
U.S. English | en_US | en-US | English_United States
Great Britain English | en_GB | en-GB | English_Great Britain
Most operating systems have a mechanism to determine the locale as defined by the user. In C++, you can pass an empty string to the std::locale object constructor to create a locale from the user’s environment. Once this object is created, you can use it to query the locale, possibly making programmatic decisions based on it. The following code demonstrates how to use the user’s locale for a stream by calling the imbue() method on the stream. The result is that everything that is sent to wcout is formatted according to the formatting rules for your environment:
wcout.imbue(locale(""));
wcout << 32767 << endl;
This means that if your system locale is English United States and you output the number 32767, the number is displayed as 32,767; however, if your system locale is Dutch Belgium, the same number is displayed as 32.767.
The default locale is the classic/neutral locale, and not the user’s locale. The classic locale uses ANSI C conventions, and has the name C. The classic C locale is similar to U.S. English, but there are slight differences. For example, numbers are handled without any punctuation:
wcout.imbue(locale("C"));
wcout << 32767 << endl;
The output of this code is as follows:
32767
The following code manually sets the U.S. English locale, so the number 32767 is formatted with U.S. English punctuation, independent of your system locale:
wcout.imbue(locale("en-US")); // "en_US" for POSIX
wcout << 32767 << endl;
The output of this code is as follows:
32,767
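Named locales such as "en-US" are not installed on every system, in which case the locale constructor throws an exception. When you only need one specific formatting rule, one alternative is to derive your own facet and attach it to the classic locale. The following is a sketch of that technique; the names CommaGrouping and formatWithGrouping() are invented for illustration.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet that groups digits in threes, separated by a comma,
// similar to what a U.S. English locale typically does.
class CommaGrouping : public std::numpunct<char>
{
protected:
    char do_thousands_sep() const override { return ','; }
    // A single element with value 3 means "groups of three, repeated".
    std::string do_grouping() const override { return "\3"; }
};

// Formats 'value' using the classic locale extended with CommaGrouping.
std::string formatWithGrouping(int value)
{
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when it is
    // no longer referenced.
    out.imbue(std::locale(std::locale::classic(), new CommaGrouping));
    out << value;
    return out.str();
}
```

Calling formatWithGrouping(32767) produces the same text as the U.S. English example above, but it does not depend on which locales happen to be installed.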
A locale object allows you to query information about the locale. For example, the following program creates a locale matching the user’s environment. The name() method is used to get a C++ string that describes the locale. Then, the find() method is used on the string object to find a given substring, which returns string::npos when the given substring is not found. The code checks for the Windows name and the POSIX name. One of two messages is output, depending on whether or not the locale appears to be U.S. English:
locale loc("");
if (loc.name().find("en_US") == string::npos &&
    loc.name().find("en-US") == string::npos) {
    wcout << L"Welcome non-U.S. English speaker!" << endl;
} else {
    wcout << L"Welcome U.S. English speaker!" << endl;
}
The <locale> header contains the following character classification functions: std::isspace(), isblank(), iscntrl(), isupper(), islower(), isalpha(), isdigit(), ispunct(), isxdigit(), isalnum(), isprint(), isgraph(). They all accept two parameters: the character to classify, and the locale to use for the classification. Here is an example of isupper() using the user’s environment locale:
bool result = isupper('A', locale(""));
The <locale> header also defines two character conversion functions: std::toupper() and tolower(). They accept two parameters: the character to convert, and the locale to use for the conversion.
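As an illustration of these two-parameter functions, the following sketch applies them to every character of a string. The helper names toUpperCopy() and countUpper() are invented for illustration, and the classic locale is used as the default so that the result does not depend on the user's environment. Keep in mind that converting character by character is a simplification that does not work for every language.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>

// Returns a copy of 'text' with each character converted by std::toupper()
// according to the given locale.
std::string toUpperCopy(std::string text,
    const std::locale& loc = std::locale::classic())
{
    std::transform(text.begin(), text.end(), text.begin(),
        [&loc](char c) { return std::toupper(c, loc); });
    return text;
}

// Counts the characters that std::isupper() classifies as uppercase
// according to the given locale.
std::size_t countUpper(const std::string& text,
    const std::locale& loc = std::locale::classic())
{
    return static_cast<std::size_t>(std::count_if(text.begin(), text.end(),
        [&loc](char c) { return std::isupper(c, loc); }));
}
```

To use the user's environment instead, pass locale("") as the second argument to either helper.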
You can use the std::use_facet() function to obtain a particular facet in a particular locale. The argument to use_facet() is a locale. For example, the following expression retrieves the standard monetary punctuation facet of the British English locale using the POSIX locale name:
use_facet<moneypunct<wchar_t>>(locale("en_GB"));
Note that the innermost template type determines the character type to use. This is usually wchar_t or char. The use of nested template classes is unfortunate, but once you get past the syntax, the result is an object that contains all the information you want to know about British money punctuation. The data available in the standard facets is defined in the <locale> header and its associated files. The following table lists the facet categories defined by the standard. Consult a Standard Library reference for details about the individual facets.
FACET | DESCRIPTION
ctype | Character classification facets.
codecvt | Conversion facets, see earlier in this chapter.
collate | Comparing strings lexicographically.
time_get | Parsing dates and times.
time_put | Formatting dates and times.
num_get | Parsing numeric values.
num_put | Formatting numeric values.
numpunct | Defines the formatting parameters for numeric values.
money_get | Parsing monetary values.
money_put | Formatting monetary values.
moneypunct | Defines formatting parameters for monetary values.
The following program brings together locales and facets by printing out the currency symbol in both U.S. English and British English. Note that, depending on your environment, the British currency symbol may appear as a question mark, a box, or not at all. If your environment is set up to handle it, you may actually get the British pound symbol:
locale locUSEng("en-US"); // For Windows
//locale locUSEng("en_US"); // For Linux
locale locBritEng("en-GB"); // For Windows
//locale locBritEng("en_GB"); // For Linux
wstring dollars = use_facet<moneypunct<wchar_t>>(locUSEng).curr_symbol();
wstring pounds = use_facet<moneypunct<wchar_t>>(locBritEng).curr_symbol();
wcout << L"In the US, the currency symbol is " << dollars << endl;
wcout << L"In Great Britain, the currency symbol is " << pounds << endl;
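The same use_facet() pattern works for the other facets in the table. The following sketch queries the numpunct facet; the struct NumberPunctuation and the function queryPunctuation() are invented for illustration. It is shown with the classic locale because that locale is available on every system, whereas named locales might not be installed.

```cpp
#include <locale>
#include <string>

// Hypothetical aggregate collecting the values of a numpunct facet.
struct NumberPunctuation
{
    char decimalPoint;
    char thousandsSeparator;
    std::string grouping;
};

// Retrieves the numeric punctuation rules of the given locale.
NumberPunctuation queryPunctuation(const std::locale& loc)
{
    const auto& facet = std::use_facet<std::numpunct<char>>(loc);
    return { facet.decimal_point(), facet.thousands_sep(), facet.grouping() };
}
```

For the classic locale, the decimal point is '.' and the grouping string is empty, which is why numbers are output without any thousands separator when no other locale is imbued.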
Regular expressions, defined in the <regex> header, are a powerful feature of the Standard Library. They are a special mini-language for string processing. They might seem complicated at first, but once you get to know them, they make working with strings easier. Regular expressions can be used for several string-related operations.
Of course, you could write your own code to perform any of the preceding operations on your strings, but using the regular expressions functionality is highly recommended, because writing correct and safe code to process strings can be tricky.
Before I can go into more detail on regular expressions, there is some important terminology you need to know. The following terms are used throughout the discussion.
If you look around on the Internet, you will find several different grammars for regular expressions. For this reason, C++ includes support for several of these grammars: ECMAScript, basic, extended, awk, grep, and egrep.
If you already know any of these regular expression grammars, you can use it straight away in C++ by telling the regular expression library to use that specific syntax (syntax_option_type). The default grammar in C++ is ECMAScript, whose syntax is explained in detail in the following section. It is also the most powerful grammar, so it’s recommended to use ECMAScript instead of one of the other more limited grammars. Explaining the other regular expression grammars falls outside the scope of this book.
A regular expression pattern is a sequence of characters representing what you want to match. Any character in the regular expression matches itself except for the following special characters:
^ $ . * + ? ( ) [ ] { } |
These special characters are explained throughout the following discussion. If you need to match one of these special characters, you need to escape it using the \ character, as in this example:
\[ or \. or \* or \\
Anchors
The special characters ^ and $ are called anchors. The ^ character matches the position immediately following a line termination character, and $ matches the position of a line termination character. By default, ^ and $ also match the beginning and ending of a string, respectively, but this behavior can be disabled.
For example, ^test$ matches only the string test, and not strings that contain test somewhere in the line, such as 1test, test2, testabc, and so on.
Wildcards
The wildcard character . can be used to match any character except a newline character. For example, the regular expression a.c will match abc and a5c, but will not match ab5c, ac, and so on.
Alternation
The | character can be used to specify the “or” relationship. For example, a|b matches a or b.
Grouping
Parentheses () are used to mark subexpressions, also called capture groups. Capture groups can be used for several purposes, such as identifying individual subsequences of a source string and, as explained later, back references. For example, take the regular expression (.)(ab|cd)(.). It has three marked subexpressions. Running a regex_search() with this regular expression on 1cd4 results in a match with four entries. The first entry is the entire match, 1cd4, followed by three entries for the three marked subexpressions. These three entries are 1, cd, and 4. The details on how to use the regex_search() algorithm are shown in a later section.
Repetition
Parts of a regular expression can be repeated by using one of four repeats:
* matches the preceding part zero or more times. For example, a*b matches b, ab, aab, aaaab, and so on.
+ matches the preceding part one or more times. For example, a+b matches ab, aab, aaaab, and so on, but not b.
? matches the preceding part zero or one time. For example, a?b matches b and ab, but nothing else.
{…} represents a bounded repeat. a{n} matches a repeated exactly n times; a{n,} matches a repeated n times or more; and a{n,m} matches a repeated between n and m times inclusive. For example, a{3,4} matches aaa and aaaa but not a, aa, aaaaa, and so on.
The repeats described in the previous list are called greedy because they find the longest match while still matching the remainder of the regular expression. To make them non-greedy, a ? can be added behind the repeat, as in *?, +?, ??, and {…}?. A non-greedy repetition repeats its pattern as few times as possible while still matching the remainder of the regular expression.
For example, the following table shows a greedy and a non-greedy regular expression, and the resulting submatches when running them on the input sequence aaabbb.
REGULAR EXPRESSION | SUBMATCHES
Greedy: (a+)(ab)*(b+) | "aaa" "" "bbb"
Non-greedy: (a+?)(ab)*(b+) | "aa" "ab" "bb"
Precedence
Just as with mathematical formulas, it’s important to know the precedence of regular expression elements. Precedence is as follows:
Elements like a are the basic building blocks of a regular expression.
Quantifiers like +, *, ?, and {…} bind tightly to the element on the left; for example, b+.
Concatenation like ab+c binds after quantifiers.
Alternation like | binds last.
For example, take the regular expression ab+c|d. This matches abc, abbc, abbbc, and so on, and also d. Parentheses can be used to change these precedence rules. For example, ab+(c|d) matches abc, abbc, abbbc, …, abd, abbd, abbbd, and so on. However, by using parentheses you also mark it as a subexpression or capture group. It is possible to change the precedence rules without creating new capture groups by using (?:…). For example, ab+(?:c|d) matches the same as the preceding ab+(c|d) but does not create an additional capture group.
Character Set Matches
Instead of having to write (a|b|c|…|z), which is clumsy and introduces a capture group, a special syntax for specifying sets of characters or ranges of characters is available. In addition, a “not” form of the match is also available. A character set is specified between square brackets, and allows you to write [c1c2…cn], which matches any of the characters c1, c2, …, or cn. For example, [abc] matches any character a, b, or c. If the first character is ^, it means “any but”:
ab[cde] matches abc, abd, and abe.
ab[^cde] matches abf, abp, and so on but not abc, abd, and abe.
If you need to match the ^, [, or ] characters themselves, you need to escape them; for example, [\[\^\]] matches the characters [, ^, or ].
If you want to specify all letters, you could use a character set like [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]; however, this is clumsy and doing this several times is awkward, especially if you make a typo and omit one of the letters accidentally. There are two solutions to this.
One solution is to use the range specification in square brackets; this allows you to write [a-zA-Z], which recognizes all the letters in the range a to z and A to Z. If you need to match a hyphen, you need to escape it; for example, [a-zA-Z\-]+ matches any word including a hyphenated word.
Another solution is to use one of the character classes. These are used to denote specific types of characters and are represented as [:name:]. Which character classes are available depends on the locale, but the names listed in the following table are always recognized. The exact meaning of these character classes is also dependent on the locale. This table assumes the standard C locale.
CHARACTER CLASS NAME | DESCRIPTION
digit | Digits
d | Same as digit
xdigit | Digits (digit) and the following letters used in hexadecimal numbers: ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’.
alpha | Alphabetic characters. For the C locale, these are all lowercase and uppercase letters.
alnum | A combination of the alpha class and the digit class
w | Same as alnum
lower | Lowercase letters, if applicable to the locale
upper | Uppercase letters, if applicable to the locale
blank | A blank character is a whitespace character used to separate words within a line of text. For the C locale, these are ‘ ’ and ‘\t’ (tab).
space | Whitespace characters. For the C locale, these are ‘ ’, ‘\t’, ‘\n’, ‘\r’, ‘\v’, and ‘\f’.
s | Same as space
print | Printable characters. These occupy a printing position (for example, on a display) and are the opposite of control characters (cntrl). Examples are lowercase letters, uppercase letters, digits, punctuation characters, and space characters.
cntrl | Control characters. These are the opposite of printable characters (print), and don’t occupy a printing position, for example, on a display. Some examples for the C locale are ‘\f’ (form feed), ‘\n’ (new line), and ‘\r’ (carriage return).
graph | Characters with a graphical representation. These are all characters that are printable (print), except the space character ‘ ’.
punct | Punctuation characters. For the C locale, these are all graphical characters (graph) that are not alphanumeric (alnum). Some examples are ‘!’, ‘#’, ‘@’, ‘}’, and so on.
Character classes are used within character sets; for example, [[:alpha:]]* in English means the same as [a-zA-Z]*.
Because certain concepts like matching digits are so common, there are shorthand patterns for them. For example, [:digit:] and [:d:] mean the same thing as [0-9]. Some classes have an even shorter pattern using the escape notation \. For example, \d means [:digit:]. Therefore, to recognize a sequence of one or more numbers, you can write any of the following patterns:
[0-9]+
[[:digit:]]+
[[:d:]]+
\d+
The following table lists the available escape notations for character classes.
ESCAPE NOTATION | EQUIVALENT TO
\d | [[:d:]]
\D | [^[:d:]]
\s | [[:s:]]
\S | [^[:s:]]
\w | [_[:w:]]
\W | [^_[:w:]]
Here are some examples:
Test[5-8] matches Test5, Test6, Test7, and Test8.
[[:lower:]] matches a, b, and so on, but not A, B, and so on.
[^[:lower:]] matches any character except lowercase letters like a, b, and so on.
[[:lower:]5-7] matches any lowercase letter like a, b, and so on, and the numbers 5, 6, and 7.
Word Boundaries
A word boundary can mean the following:
The beginning of the source string, if the first character of the source string is a word character. For the standard C locale, the word characters are [A-Za-z0-9_]. Matching the beginning of the source string is enabled by default, but you can disable it (regex_constants::match_not_bow).
The end of the source string, if the last character of the source string is a word character. Matching the end of the source string is enabled by default, but you can disable it (regex_constants::match_not_eow).
A position inside the source string between a word character and a non-word character.
You can use \b to match a word boundary, and \B to match anything except a word boundary.
Back References
Back references allow you to reference a captured group inside the regular expression itself: \n refers to the n-th captured group, with n > 0. For example, the regular expression (\d+)-.*-\1 matches a string that has the following format:
One or more digits captured in a capture group: (\d+)
Followed by a dash: -
Followed by zero or more characters: .*
Followed by another dash: -
Followed by exactly the same digits captured by the first capture group: \1
This regular expression matches 123-abc-123, 1234-a-1234, and so on but does not match 123-abc-1234, 123-abc-321, and so on.
Lookahead
Regular expressions support positive lookahead (which uses ?=pattern) and negative lookahead (which uses ?!pattern). The characters following the lookahead must match (positive), or not match (negative) the lookahead pattern, but those characters are not yet consumed.
For example: the pattern a(?!b) contains a negative lookahead to match a letter a not followed by a b. The pattern a(?=b) contains a positive lookahead to match a letter a followed by a b, but b is not consumed so it does not become part of the match.
The following is a more complicated example. The regular expression matches an input sequence that consists of at least one lowercase letter, at least one uppercase letter, at least one punctuation character, and is at least eight characters long. Such a regular expression can, for example, be used to enforce that passwords satisfy certain criteria.
(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{8,}
Regular Expressions and Raw String Literals
As you saw in the preceding sections, regular expressions often use special characters that should be escaped in normal C++ string literals. For example, if you write \d in a regular expression, it matches any digit. However, because \ is a special character in C++, you need to escape it in your regular expression string literal as \\d; otherwise, your C++ compiler tries to interpret the \d. It can get more complicated if you want your regular expression to match a single back-slash character \. Because \ is a special character in the regular expression syntax itself, you need to escape it as \\. The \ character is also a special character in C++ string literals, so you need to escape it in your C++ string literal, resulting in \\\\.
You can use raw string literals to make a complicated regular expression easier to read in your C++ source code. (Raw string literals are discussed in Chapter 2.) For example, take the following regular expression:
"( |\\n|\\r|\\\\)"
This regular expression matches spaces, newlines, carriage returns, and back slashes. As you can see, you need a lot of escape characters. Using raw string literals, this can be replaced with the following more readable regular expression:
R"(( |\n|\r|\\))"
The raw string literal starts with R"( and ends with )". Everything in between is the regular expression. Of course, you still need a double back slash at the end because the back slash needs to be escaped in the regular expression itself.
This concludes a brief description of the ECMAScript grammar. The following sections explain how to actually use regular expressions in your C++ code.
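Several of the grammar features described in this section can be tried out with a couple of small helpers. The following is only a sketch: the function names captureGroups() and matchesWhole() are invented for illustration, and the regex_search() and regex_match() algorithms they rely on are explained in detail in later sections. Raw string literals are used for all patterns to avoid doubling the back slashes.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the text of capture groups 1..n for the first match of
// 'pattern' in 'input', or an empty vector if there is no match.
std::vector<std::string> captureGroups(const std::string& pattern,
    const std::string& input)
{
    std::vector<std::string> groups;
    std::smatch m;
    if (std::regex_search(input, m, std::regex(pattern))) {
        // Entry 0 is the entire match, so start at 1.
        for (std::size_t i = 1; i < m.size(); ++i) {
            groups.push_back(m[i].str());
        }
    }
    return groups;
}

// Returns true if 'pattern' matches the entire 'input'.
bool matchesWhole(const std::string& pattern, const std::string& input)
{
    return std::regex_match(input, std::regex(pattern));
}
```

For example, captureGroups(R"((a+?)(ab)*(b+))", "aaabbb") returns the three non-greedy submatches from the earlier table, matchesWhole(R"((\d+)-.*-\1)", "123-abc-123") exercises the back reference, and passing the lookahead pattern shown earlier to matchesWhole() checks a candidate password.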
Everything for the regular expression library is in the <regex> header file and in the std namespace. The basic template types defined by the regular expression library are as follows:
basic_regex: An object representing a specific regular expression.
match_results: A substring that matched a regular expression, including all the captured groups. It is a collection of sub_matches.
sub_match: An object containing a pair of iterators into the input sequence. These iterators represent the matched capture group. The pair is an iterator pointing to the first character of a matched capture group and an iterator pointing to one-past-the-last character of the matched capture group. It has an str() method that returns the matched capture group as a string.
The library provides three key algorithms: regex_match(), regex_search(), and regex_replace(). All of these algorithms have different versions that allow you to specify the source string as a string, a character array, or as a begin/end iterator pair. The iterators can be any of the following:
const char*
const wchar_t*
string::const_iterator
wstring::const_iterator
In fact, any iterator that behaves as a bidirectional iterator can be used. See Chapters 17 and 18 for details on iterators.
The library also defines the following two types for regular expression iterators, which are very important if you want to find all occurrences of a pattern in a source string:
regex_iterator: Iterates over all the occurrences of a pattern in a source string.
regex_token_iterator: Iterates over all the capture groups of all occurrences of a pattern in a source string.
To make the library easier to use, the standard defines a number of type aliases for the preceding templates:
using regex = basic_regex<char>;
using wregex = basic_regex<wchar_t>;
using csub_match = sub_match<const char*>;
using wcsub_match = sub_match<const wchar_t*>;
using ssub_match = sub_match<string::const_iterator>;
using wssub_match = sub_match<wstring::const_iterator>;
using cmatch = match_results<const char*>;
using wcmatch = match_results<const wchar_t*>;
using smatch = match_results<string::const_iterator>;
using wsmatch = match_results<wstring::const_iterator>;
using cregex_iterator = regex_iterator<const char*>;
using wcregex_iterator = regex_iterator<const wchar_t*>;
using sregex_iterator = regex_iterator<string::const_iterator>;
using wsregex_iterator = regex_iterator<wstring::const_iterator>;
using cregex_token_iterator = regex_token_iterator<const char*>;
using wcregex_token_iterator = regex_token_iterator<const wchar_t*>;
using sregex_token_iterator = regex_token_iterator<string::const_iterator>;
using wsregex_token_iterator = regex_token_iterator<wstring::const_iterator>;
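Which alias you need follows from the type of the input sequence. The following sketch, with a hypothetical overloaded helper called firstNumber(), pairs each kind of input with the matching regex and match_results aliases. It is only an illustration of how the aliases fit together, not a recommended interface.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers: each returns the first run of digits in its input,
// or an empty string if there is none.

// std::string input pairs with regex and smatch.
std::string firstNumber(const std::string& text)
{
    static const std::regex r("[0-9]+");
    std::smatch m;
    return std::regex_search(text, m, r) ? m[0].str() : std::string();
}

// C-style string input pairs with regex and cmatch.
std::string firstNumber(const char* text)
{
    static const std::regex r("[0-9]+");
    std::cmatch m;
    return std::regex_search(text, m, r) ? m[0].str() : std::string();
}

// std::wstring input pairs with wregex and wsmatch.
std::wstring firstNumber(const std::wstring& text)
{
    static const std::wregex r(L"[0-9]+");
    std::wsmatch m;
    return std::regex_search(text, m, r) ? m[0].str() : std::wstring();
}
```

Mixing the aliases, for example passing an smatch together with a C-style string, does not compile, because the match_results type must use the same iterator type as the input sequence.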
The following sections explain the regex_match()
, regex_search()
, and regex_replace()
algorithms, and the regex_iterator
and regex_token_iterator
classes.
The regex_match()
algorithm can be used to compare a given source string with a regular expression pattern. It returns true
if the pattern matches the entire source string, and false
otherwise. It is very easy to use. There are six versions of the regex_match()
algorithm accepting different kinds of arguments. They all have the following form:
template<…>
bool regex_match(InputSequence[, MatchResults], RegEx[, Flags]);
The InputSequence
can be represented as follows:
A begin and end iterator into a source string
An std::string
A C-style string
The optional MatchResults
parameter is a reference to a match_results
and receives the match. If regex_match()
returns false
, you are only allowed to call match_results::empty()
or match_results::size()
; anything else is undefined. If regex_match()
returns true
, a match is found and you can inspect the match_results
object for what exactly got matched. This is explained with examples in the following subsections.
The RegEx
parameter is the regular expression that needs to be matched. The optional Flags
parameter specifies options for the matching algorithm. In most cases you can keep the default. For more details, consult a Standard Library reference (see Appendix B).
Suppose you want to write a program that asks the user to enter a date in the format year/month/day, where year is four digits, month is a number between 1 and 12, and day is a number between 1 and 31. You can use a regular expression together with the regex_match()
algorithm to validate the user input as follows. The details of the regular expression are explained after the code.
// The back slash of \d is doubled because this is an ordinary string literal.
regex r("\\d{4}/(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[1-2][0-9]|3[0-1])");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    if (regex_match(str, r))
        cout << " Valid date." << endl;
    else
        cout << " Invalid date!" << endl;
}
The first line creates the regular expression. The expression consists of three parts separated by a forward slash (/
) character: one part for year, one for month, and one for day. The following list explains these parts.
\d{4}: This matches any combination of four digits; for example, 1234, 2010, and so on.
(?:0?[1-9]|1[0-2]): This subpart of the regular expression is wrapped inside parentheses to make sure the precedence is correct. You don’t need a capture group, so (?:…) is used. The inner expression consists of an alternation of two parts separated by the | character.
    0?[1-9]: This matches any number from 1 to 9 with an optional 0 in front of it. For example, it matches 1, 2, 9, 03, 04, and so on. It does not match 0, 10, 11, and so on.
    1[0-2]: This matches 10, 11, or 12, and nothing else.
(?:0?[1-9]|[1-2][0-9]|3[0-1]): This subpart is also wrapped inside a non-capture group and consists of an alternation of three parts.
    0?[1-9]: This matches any number from 1 to 9 with an optional 0 in front of it. For example, it matches 1, 2, 9, 03, 04, and so on. It does not match 0, 10, 11, and so on.
    [1-2][0-9]: This matches any number between 10 and 29 inclusive and nothing else.
    3[0-1]: This matches 30 or 31 and nothing else.
The example then enters an infinite loop to ask the user to enter a date. Each date entered is given to the regex_match()
algorithm. When regex_match()
returns true
, the user has entered a date that matches the date regular expression pattern.
This example can be extended a bit by asking the regex_match()
algorithm to return captured subexpressions in a results object. You first have to understand what a capture group does. By specifying a match_results
object like smatch
in a call to regex_match()
, the elements of the match_results
object are filled in when the regular expression matches the input string. To be able to extract these substrings, you must create capture groups using parentheses.
The first element, [0]
, in a match_results
object contains the string that matched the entire pattern. When using regex_match()
and a match is found, this is the entire source sequence. When using regex_search()
, discussed in the next section, this is a substring in the source sequence that matches the regular expression. Element [1]
is the substring matched by the first capture group, [2]
by the second capture group, and so on. To get a string representation of a capture group, you can write m[i]
as in the following code, or write m[i].str()
, where i
is the index of the capture group and m
is a match_results
object.
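The two ways of getting a string out of a capture group can be sketched as follows. The helper swapKeyValue() and its key=value pattern are hypothetical and exist only to show m[i] and m[i].str() side by side.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: turns "key=value" into "value=key".
// Returns an empty string if the text does not have that shape.
std::string swapKeyValue(const std::string& text)
{
    static const std::regex r("(\\w+)=(\\w+)");
    std::smatch m;
    if (!std::regex_match(text, m, r))
        return {};
    // m[0] is the whole match; m[1] and m[2] are the two capture groups.
    const std::string key = m[1];          // implicit conversion to string
    const std::string value = m[2].str();  // explicit conversion, same result
    return value + "=" + key;
}
```

For the input width=640, this sketch returns 640=width. An input such as width = 640 does not match, because \w does not match spaces and regex_match() requires the entire string to match.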
The following code extracts the year, month, and day digits into three separate integer variables. The regular expression in the revised example has a few small changes. The first part matching the year is wrapped in a capture group, while the month and day parts are now also capture groups instead of non-capture groups. The call to regex_match()
includes a smatch
parameter, which receives the matched capture groups. Here is the adapted example:
regex r("(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    smatch m;
    if (regex_match(str, m, r)) {
        // m[1], m[2], and m[3] are the year, month, and day capture groups.
        int year = stoi(m[1]);
        int month = stoi(m[2]);
        int day = stoi(m[3]);
        cout << " Valid date: Year=" << year
             << ", month=" << month
             << ", day=" << day << endl;
    } else {
        cout << " Invalid date!" << endl;
    }
}
In this example, there are four elements in the smatch
results object:
[0]: the string matching the full regular expression, which is the full date in this example
[1]: the year
[2]: the month
[3]: the day
When you execute this example, you can get the following output:
Enter a date (year/month/day) (q=quit): 2011/12/01
Valid date: Year=2011, month=12, day=1
Enter a date (year/month/day) (q=quit): 11/12/01
Invalid date!
The regex_match()
algorithm discussed in the previous section returns true
if the entire source string matches the regular expression, and false
otherwise. It cannot be used to find a matching substring. Instead, you need to use the regex_search()
algorithm, which allows you to search for a substring that matches a certain pattern. There are six versions of the regex_search()
algorithm, and they all have the following form:
template<…>
bool regex_search(InputSequence[, MatchResults], RegEx[, Flags]);
All variations return true
when a match is found somewhere in the input sequence, and false
otherwise. The parameters are similar to the parameters for regex_match()
.
Two versions of the regex_search()
algorithm accept a begin and end iterator as the input sequence that you want to process. You might be tempted to use this version of regex_search()
in a loop to find all occurrences of a pattern in a source string by manipulating these begin and end iterators for each regex_search()
call. Never do this! It can cause problems when your regular expression uses anchors (^
or $
), word boundaries, and so on. It can also cause an infinite loop due to empty matches. Use the regex_iterator
or regex_token_iterator
as explained later in this chapter to extract all occurrences of a pattern from a source string.
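To see why the manual loop is a problem, the following sketch counts matches in both ways. The function names are hypothetical, and the first function is the anti-pattern, shown only for comparison.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Anti-pattern, shown only to illustrate the problem: every call sees a
// fresh "start of input", so an anchor such as ^ can match again and again.
std::size_t countWithSearchLoop(const std::string& text, const std::regex& r)
{
    std::size_t count = 0;
    auto first = text.cbegin();
    std::smatch m;
    while (first != text.cend() && std::regex_search(first, text.cend(), m, r)) {
        ++count;
        if (m[0].first == m[0].second)
            break;              // an empty match would otherwise loop forever
        first = m[0].second;    // continue right after the previous match
    }
    return count;
}

// Recommended: regex_iterator keeps track of the fact that earlier input
// exists, so ^ only matches at the real beginning of the text.
std::size_t countWithIterator(const std::string& text, const std::regex& r)
{
    const auto n = std::distance(
        std::sregex_iterator(text.cbegin(), text.cend(), r),
        std::sregex_iterator());
    return static_cast<std::size_t>(n);
}
```

With the pattern ^a and the input aaa, the manual loop reports three matches because each call treats its begin iterator as the start of the input. The iterator-based version reports one match on a conforming implementation, which is the answer you would expect.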
The regex_search()
algorithm can be used to extract matching substrings from an input sequence. The following example extracts code comments from input lines. The regular expression searches for a substring that starts with //
followed by some optional whitespace \s*
followed by one or more characters captured in a capture group (.+)
. This capture group captures only the comment substring. The smatch
object m
receives the search results. If successful, m[1]
contains the comment that was found. You can check the m[1].first
and m[1].second
iterators to see where exactly the comment was found in the source string.
regex r("//\\s*(.+)$");
while (true) {
    cout << "Enter a string with optional code comments (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    smatch m;
    if (regex_search(str, m, r))
        cout << " Found comment '" << m[1] << "'" << endl;
    else
        cout << " No comment found!" << endl;
}
The output of this program can look as follows:
Enter a string with optional code comments (q=quit): std::string str; // Our source string
 Found comment 'Our source string'
Enter a string with optional code comments (q=quit): int a; // A comment with // in the middle
 Found comment 'A comment with // in the middle'
Enter a string with optional code comments (q=quit): float f; // A comment with a (tab) character
 Found comment 'A comment with a (tab) character'
The match_results
object also has a prefix()
and suffix()
method, which return the string preceding or following the match, respectively.
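As a small sketch of prefix() and suffix(), the following two hypothetical helpers return the text before and after a match. The helper names and patterns are assumptions made for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the code that precedes a // comment,
// or the whole line if there is no comment.
std::string stripComment(const std::string& line)
{
    static const std::regex r("\\s*//.*$");
    std::smatch m;
    if (std::regex_search(line, m, r))
        return m.prefix().str();   // everything before the match
    return line;
}

// Hypothetical helper: returns the text that follows the first '=' sign
// and its surrounding whitespace, or an empty string if there is no '='.
std::string valueAfterEquals(const std::string& text)
{
    static const std::regex r("\\s*=\\s*");
    std::smatch m;
    return std::regex_search(text, m, r) ? m.suffix().str() : std::string();
}
```

Both prefix() and suffix() return a sub_match, so str() is used here to copy the characters into a string before the match_results object goes out of scope.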
As explained in the previous section, you should never use regex_search()
in a loop to extract all occurrences of a pattern from a source sequence. Instead, you should use a regex_iterator
or regex_token_iterator
. They work similarly to iterators for Standard Library containers.
The following example asks the user to enter a source string, extracts every word from the string, and prints all words between quotes. The regular expression in this case is [\w]+
, which searches for one or more word-letters. This example uses std::string
as source, so it uses sregex_iterator
for the iterators. A standard iterator loop is used, but in this case, the end iterator is done slightly differently from the end iterators of ordinary Standard Library containers. Normally, you specify an end iterator for a particular container, but for regex_iterator
, there is only one “end” iterator. You can get this end iterator by declaring a regex_iterator
type using the default constructor.
The for
loop creates a start iterator called iter
, which accepts a begin and end iterator into the source string together with the regular expression. The loop body is called for every match found, which is every word in this example. The sregex_iterator
iterates over all the matches. By dereferencing an sregex_iterator
, you get a smatch
object. Accessing the first element of this smatch
object, [0]
, gives you the matched substring:
regex reg("[\\w]+");
while (true) {
    cout << "Enter a string to split (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    const sregex_iterator end;  // default-constructed: the "end" iterator
    for (sregex_iterator iter(cbegin(str), cend(str), reg);
         iter != end; ++iter) {
        cout << "\"" << (*iter)[0] << "\"" << endl;
    }
}
The output of this program can look as follows:
Enter a string to split (q=quit): This, is a test.
"This"
"is"
"a"
"test"
As this example demonstrates, even simple regular expressions can do some powerful string manipulation!
Note that both a regex_iterator
and a regex_token_iterator
internally contain a pointer to the given regular expression. They both explicitly delete constructors accepting an rvalue regular expression, so you cannot construct them with a temporary regex
object. For example, the following does not compile:
for (sregex_iterator iter(cbegin(str), cend(str), regex("[\\w]+"));
     iter != end; ++iter) { … }
The previous section describes regex_iterator
, which iterates through every matched pattern. In each iteration of the loop you get a match_results
object, which you can use to extract subexpressions for that match that are captured by capture groups.
A regex_token_iterator
can be used to automatically iterate over all or selected capture groups across all matched patterns. There are four constructors with the following format:
regex_token_iterator(BidirectionalIterator a,
BidirectionalIterator b,
const regex_type& re
[, SubMatches
[, Flags]]);
All of them require a begin and end iterator as input sequence, and a regular expression. The optional SubMatches
parameter is used to specify which capture groups should be iterated over. SubMatches
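can be specified in four ways:
As a single integer representing the index of the capture group that you want to iterate over.
As a vector with integers representing the indices of the capture groups that you want to iterate over.
As an initializer_list with capture group indices.
As a C-style array with capture group indices.
When you omit SubMatches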
or when you specify a 0 for SubMatches
, you get an iterator that iterates over all capture groups with index 0, which are the substrings matching the full regular expression. The optional Flags
parameter specifies options for the matching algorithm. In most cases you can keep the default. Consult a Standard Library Reference for more details.
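The following sketch shows three of the ways to pass SubMatches. The helper collect(), the other function names, and the simplified date pattern dateRegex are assumptions made for this illustration; the pattern is deliberately looser than the date expression used elsewhere in this chapter.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects whatever a token iterator yields.
std::vector<std::string> collect(std::sregex_token_iterator iter)
{
    std::vector<std::string> result;
    for (const std::sregex_token_iterator end; iter != end; ++iter)
        result.push_back(iter->str());
    return result;
}

// Simplified date pattern, assumed for this sketch only.
const std::regex dateRegex("(\\d{4})/(\\d{1,2})/(\\d{1,2})");

// Single integer: only capture group 1 (the year) of every match.
std::vector<std::string> years(const std::string& s)
{
    return collect(std::sregex_token_iterator(
        s.cbegin(), s.cend(), dateRegex, 1));
}

// initializer_list: capture groups 2 and 3 (month and day) of every match.
std::vector<std::string> monthsAndDays(const std::string& s)
{
    return collect(std::sregex_token_iterator(
        s.cbegin(), s.cend(), dateRegex, { 2, 3 }));
}

// C-style array: the same groups, given as an array of indices.
std::vector<std::string> monthsAndDaysFromArray(const std::string& s)
{
    const int indices[] = { 2, 3 };
    return collect(std::sregex_token_iterator(
        s.cbegin(), s.cend(), dateRegex, indices));
}
```

When several indices are given, the iterator visits them in the given order for the first match, then for the second match, and so on.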
The previous regex_iterator
example can be rewritten using a regex_token_iterator
as follows. Note that *iter
is used in the loop body instead of (*iter)[0]
as in the regex_iterator
example, because the token iterator with 0 as the default submatch
index automatically iterates over all capture groups with index 0. The output of this code is exactly the same as the output generated by the regex_iterator
example:
regex reg("[\\w]+");
while (true) {
    cout << "Enter a string to split (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    const sregex_token_iterator end;
    for (sregex_token_iterator iter(cbegin(str), cend(str), reg);
         iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}
The following example asks the user to enter a date and then uses a regex_token_iterator
to iterate over the second and third capture groups (month and day), which are specified as a vector
of integers. The regular expression used for dates is explained earlier in this chapter. The only difference is that ^
and $
anchors are added since we want to match the entire source sequence. Earlier, that was not necessary, because regex_match()
automatically matches the entire input string.
regex reg("^(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])$");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    vector<int> indices{ 2, 3 };  // iterate over month and day only
    const sregex_token_iterator end;
    for (sregex_token_iterator iter(cbegin(str), cend(str), reg, indices);
         iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}
This code prints only the month and day of valid dates. Output generated by this example can look like this:
Enter a date (year/month/day) (q=quit): 2011/1/13
"1"
"13"
Enter a date (year/month/day) (q=quit): 2011/1/32
Enter a date (year/month/day) (q=quit): 2011/12/5
"12"
"5"
The regex_token_iterator
can also be used to perform a so-called field splitting or tokenization. It is a much safer and more flexible alternative to using the old strtok()
function from C. Tokenization is triggered in the regex_token_iterator
constructor by specifying -1
as the capture group index to iterate over. When in tokenization mode, the iterator iterates over all substrings of the input sequence that do not match the regular expression. The following code demonstrates this by tokenizing a string on the delimiters ,
and ;
with zero or more whitespace characters before or after the delimiters:
regex reg(R"(\s*[,;]\s*)");
while (true) {
    cout << "Enter a string to split on ',' and ';' (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    const sregex_token_iterator end;
    // -1 selects tokenization mode: iterate over the non-matching parts.
    for (sregex_token_iterator iter(cbegin(str), cend(str), reg, -1);
         iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}
The regular expression in this example is specified as a raw string literal and searches for patterns that match the following:
Zero or more whitespace characters
Followed by a , or ; character
Followed by zero or more whitespace characters
The output can be as follows:
Enter a string to split on ',' and ';' (q=quit): This is, a; test string.
"This is"
"a"
"test string."
As you can see from this output, the string is split on ,
and ;
. All whitespace characters around the ,
and ;
are removed, because the tokenization iterator iterates over all substrings that do not match the regular expression, and because the regular expression matches ,
and ;
with whitespace around them.
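If you need the tokens in a container rather than on the console, the same technique can be wrapped in a small function. The following is a minimal sketch; the name splitFields() is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits text on ',' and ';', dropping the whitespace
// around each delimiter.
std::vector<std::string> splitFields(const std::string& text)
{
    static const std::regex delimiter(R"(\s*[,;]\s*)");
    std::vector<std::string> fields;
    const std::sregex_token_iterator end;
    // -1 selects the parts of the input that do NOT match the delimiter.
    for (std::sregex_token_iterator iter(text.cbegin(), text.cend(), delimiter, -1);
         iter != end; ++iter) {
        fields.push_back(iter->str());
    }
    return fields;
}
```

The regular expression is a static local variable so that it is compiled only once, and so that it is an lvalue that outlives the iterators referring to it.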
The regex_replace()
algorithm requires a regular expression, and a formatting string that is used to replace matching substrings. This formatting string can reference part of the matched substrings by using the escape sequences in the following table.
ESCAPE SEQUENCE | REPLACED WITH |
$n | The string matching the n-th capture group; for example, $1 for the first capture group, $2 for the second, and so on. n must be greater than 0. |
$& | The string matching the entire regular expression. |
$` | The part of the input sequence that appears to the left of the substring matching the regular expression. |
$' | The part of the input sequence that appears to the right of the substring matching the regular expression. |
$$ | A single dollar sign. |
There are six versions of the regex_replace()
algorithm. The difference between them is in the type of arguments. Four of them have the following format:
string regex_replace(InputSequence, RegEx, FormatString[, Flags]);
These four versions return the resulting string after performing the replacement. Both the InputSequence
and the FormatString
can be an std::string
or a C-style string. The RegEx
parameter is the regular expression that needs to be matched. The optional Flags
parameter specifies options for the replace algorithm.
Two versions of the regex_replace()
algorithm have the following format:
OutputIterator regex_replace(OutputIterator,
BidirectionalIterator first,
BidirectionalIterator last,
RegEx, FormatString[, Flags]);
These two versions write the resulting string to the given output iterator and return this output iterator. The input sequence is given as a begin and end iterator. The other parameters are identical to the other four versions of regex_replace()
.
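The output-iterator form is convenient when you want to append to an existing string or write to a stream without creating an intermediate string. The following is a minimal sketch; the helper appendMasked() and its digit pattern are assumptions made for this illustration.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: appends a copy of text to out, with every run of
// digits replaced by '#'.
void appendMasked(std::string& out, const std::string& text)
{
    static const std::regex digits("[0-9]+");
    // The result is written through the back_inserter instead of being
    // returned as a new string.
    std::regex_replace(std::back_inserter(out),
                       text.cbegin(), text.cend(), digits, "#");
}
```

Any output iterator works here; for example, an ostream_iterator would write the result directly to a stream.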
As a first example, take the following HTML source string,
<body><h1>Header</h1><p>Some text</p></body>
and the regular expression,
<h1>(.*)</h1><p>(.*)</p>
The following table shows the different escape sequences and what they will be replaced with.
ESCAPE SEQUENCE | REPLACED WITH |
$1 | Header |
$2 | Some text |
$& | <h1>Header</h1><p>Some text</p> |
$` | <body> |
$' | </body> |
The following code demonstrates the use of regex_replace()
:
const string str("<body><h1>Header</h1><p>Some text</p></body>");
regex r("<h1>(.*)</h1><p>(.*)</p>");
const string format("H1=$1 and P=$2"); // See above table
string result = regex_replace(str, r, format);
cout << "Original string: '" << str << "'" << endl;
cout << "New string : '" << result << "'" << endl;
The output of this program is as follows:
Original string: '<body><h1>Header</h1><p>Some text</p></body>'
New string : '<body>H1=Header and P=Some text</body>'
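The individual escape sequences can also be checked in isolation with the format() method of match_results, which expands a format string against a single match. The following sketch uses the same source string and regular expression as above; the helper name expand() is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: expands one format string against the first match
// of the pattern in the HTML source string used in this section.
std::string expand(const std::string& format)
{
    static const std::string html("<body><h1>Header</h1><p>Some text</p></body>");
    static const std::regex r("<h1>(.*)</h1><p>(.*)</p>");
    std::smatch m;
    if (!std::regex_search(html, m, r))
        return {};
    return m.format(format);
}
```

For example, expand("$1") yields Header, and expand("$&") yields the entire matched substring, without the surrounding body tags.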
The regex_replace()
algorithm accepts a number of flags that can be used to manipulate how it is working. The most important flags are given in the following table.
FLAG | DESCRIPTION |
format_default | The default is to replace all occurrences of the pattern, and to also copy everything to the output that does not match the pattern. |
format_no_copy | Replaces all occurrences of the pattern, but does not copy anything to the output that does not match the pattern. |
format_first_only | Replaces only the first occurrence of the pattern. |
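The effect of these flags is easiest to see on a short input. The following sketch uses a hypothetical helper, maskDigits(), that replaces every digit with # and lets the caller choose the flags.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces digits in text with '#', using the given
// formatting flags.
std::string maskDigits(const std::string& text,
    std::regex_constants::match_flag_type flags =
        std::regex_constants::format_default)
{
    static const std::regex digit("[0-9]");
    return std::regex_replace(text, digit, "#", flags);
}
```

For the input a1b2c3, the default flags give a#b#c#, format_no_copy gives ###, and format_first_only gives a#b2c3.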
The following example modifies the previous code to use the format_no_copy
flag:
const string str("<body><h1>Header</h1><p>Some text</p></body>");
regex r("<h1>(.*)</h1><p>(.*)</p>");
const string format("H1=$1 and P=$2");
string result = regex_replace(str, r, format,
regex_constants::format_no_copy);
cout << "Original string: '" << str << "'" << endl;
cout << "New string : '" << result << "'" << endl;
The output is as follows. Compare this with the output of the previous version.
Original string: '<body><h1>Header</h1><p>Some text</p></body>'
New string : 'H1=Header and P=Some text'
Another example is to get an input string and replace each word boundary with a newline so that the output contains only one word per line. The following example demonstrates this without using any loops to process a given input string. The code first creates a regular expression that matches individual words. When a match is found, it is replaced with $1\n
where $1
is replaced with the matched word. Note also the use of the format_no_copy
flag to prevent copying whitespace and other non-word characters from the source string to the output.
regex reg("([\\w]+)");
const string format("$1\n");  // the matched word followed by a newline
while (true) {
    cout << "Enter a string to split over multiple lines (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    cout << regex_replace(str, reg, format,
                          regex_constants::format_no_copy) << endl;
}
The output of this program can be as follows:
Enter a string to split over multiple lines (q=quit): This is a test.
This
is
a
test
This chapter gave you an appreciation for coding with localization in mind. As anyone who has been through a localization effort will tell you, adding support for a new language or locale is infinitely easier if you have planned ahead; for example, by using Unicode characters and being mindful of locales.
The second part of this chapter explained the regular expressions library. Once you know the syntax of regular expressions, it becomes much easier to work with strings. Regular expressions allow you to validate strings, search for substrings inside an input sequence, perform find-and-replace operations, and so on. It is highly recommended that you get to know regular expressions and start using them instead of writing your own string manipulation routines. They will make your life easier.