© Peter Van Weert and Marc Gregoire 2016
Peter Van Weert and Marc Gregoire, C++ Standard Library Quick Reference, DOI 10.1007/978-1-4842-1876-1_6

6. Characters and Strings

Peter Van Weert (1) and Marc Gregoire (2)
(1) Kessel-Lo, Belgium
(2) Meldert, Belgium
 

Strings    <string>

The Standard defines four different string types, each for a different char-like type:

String Type                      Characters  Typical Character Size
Narrow strings   std::string     char        8 bit
Wide strings     std::wstring    wchar_t     16 or 32 bit
UTF-16 strings   std::u16string  char16_t    16 bit
UTF-32 strings   std::u32string  char32_t    32 bit

The names in the first column are purely indicative, because strings are completely agnostic about the character encoding used for the char-like items—or code units, as they are technically called—they contain. Narrow strings, for example, may be used to store ASCII strings, as well as strings encoded using UTF-8 or DBCS.
To illustrate, we will mostly use std::string. Everything in this section, though, applies equally well to all types. The locale and regular expression functionalities discussed thereafter are, unless otherwise noted, only required to be implemented for narrow and wide strings.
All four string types are instantiations of the same class template, std::basic_string<CharT>. A basic_string<CharT> is essentially a vector<CharT> with extra functions and overloads either to facilitate common string operations or for compatibility with C-style strings (const CharT*). All members of vector are provided for strings as well, except for the emplacement functions (which are of little use for characters). This implies that, unlike in other mainstream languages such as .NET, Python, and Java, strings in C++ are mutable. It also means, for example, that strings can readily be used with all algorithms seen in Chapter 4:
[Code listing not reproduced.]
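As a minimal sketch of this point (an illustration only, not the book's original listing; the helper names count_char and replace_first are hypothetical), generic algorithms such as std::count() and std::search() can be applied to a string through its iterators:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts occurrences of a character with std::count().
std::size_t count_char(const std::string& s, char c) {
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), c));
}

// Hypothetical helper: replaces the first occurrence of `from` with `to`,
// using only iterator-based facilities.
std::string replace_first(std::string s, const std::string& from,
                          const std::string& to) {
    auto found = std::search(s.begin(), s.end(), from.begin(), from.end());
    if (found != s.end())
        s.replace(found, found + from.size(), to.begin(), to.end());
    return s;
}
```

For instance, replace_first("To be or not", "be", "are") yields "To are or not".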
The remainder of this section focuses on the functionality that strings add compared to vectors. For the functions that strings have in common with vector, we refer to Chapter 3. One thing to note is that string-specific functions and overloads are mostly index-based rather than iterator-based. The last three lines in the previous example, for instance, may be written more conveniently as
[Code listing not reproduced.]
or
[Code listing not reproduced.]
The equivalent of the end() iterator when working with string indices is basic_string::npos. This constant is consistently used to represent half open-ended ranges (that is, to denote “until the end of the string”), and, as you see next, as the “not found” return value for find()-like functions.
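The following sketch illustrates both roles of npos with index-based members. The helper names are hypothetical, and the explicit check against npos is our addition: passing npos as the start index of replace() would throw std::out_of_range.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: index-based search and replace of the first match.
std::string replace_first_by_index(std::string s, const std::string& from,
                                   const std::string& to) {
    const std::size_t pos = s.find(from);
    if (pos != std::string::npos)          // npos signals "not found"
        s.replace(pos, from.length(), to);
    return s;
}

// Hypothetical helper: npos as a length means "until the end of the string".
std::string tail_from(const std::string& s, std::size_t pos) {
    return s.substr(pos, std::string::npos);
}
```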

Searching in Strings

Strings offer six member functions to search for substrings or characters: find() and rfind(), find_first_of() and find_last_of(), and find_first_not_of() and find_last_not_of(). These always come in pairs: one to search from front to back, and one to search from back to front. All also have the same four overloads of the following form:
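size_t find(const string& str, size_t pos = 0) const noexcept;
size_t find(const char* s, size_t pos = 0) const;
size_t find(const char* s, size_t pos, size_t n) const;
size_t find(char c, size_t pos = 0) const noexcept;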
The pattern to search for is either a single character or a string, with the latter represented as a C++ string, a null-terminated C-string, or a character buffer of which the first n values are used. The (r)find() functions search for an occurrence of the full pattern, and the find_xxx_of() / find_xxx_not_of() family of functions search for any single character that occurs / does not occur in the pattern. The result is the index of the (start of the) first occurrence starting from either the beginning or end, or npos if no match is found.
The mostly optional pos parameter is the index at which the search should start. For the functions searching backward, the default value for pos is npos.
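To make the differences between these functions concrete, here is a small sketch. The SearchResults structure and the search_demo() function are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical container for the results of several searches on one string.
struct SearchResults {
    std::size_t first_the;       // find(): first occurrence of the full pattern
    std::size_t next_the;        // find() with an explicit start index
    std::size_t last_the;        // rfind(): last occurrence of the full pattern
    std::size_t first_vowel;     // find_first_of(): first char that occurs in the pattern
    std::size_t last_non_space;  // find_last_not_of(): last char that is not in the pattern
    std::size_t missing;         // npos when there is no match
};

SearchResults search_demo(const std::string& text) {
    SearchResults r;
    r.first_the = text.find("the");
    r.next_the = text.find("the", 1);           // start searching at index 1
    r.last_the = text.rfind("the");             // pos defaults to npos
    r.first_vowel = text.find_first_of("aeiou");
    r.last_non_space = text.find_last_not_of(' ');
    r.missing = text.find("xyz");
    return r;
}
```

A typical use of find_last_not_of() is trimming trailing whitespace: text.erase(text.find_last_not_of(' ') + 1) works even if nothing matches, because npos + 1 wraps around to 0.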

Modifying Strings

To modify a string, you can use all members known already from vector, including erase(), clear(), push_back(), and so on (see Chapter 3). Additional functions or functions with string-specific overloads are assign(), insert(), append(), +=, and replace(). Their behavior should be obvious; only replace() may need some explanation. First though, let’s introduce the multitude of useful overloads these five functions have. These are generally of this form:
[Code listing not reproduced.]
For moving a string, assign(string&&) is defined as well. Because the += operator inherently only has a single parameter, naturally only the C++ string, C-style string, and initializer-list overloads are possible.
Analogous to its vector counterpart, for insert() the overloads marked with (*) return an iterator rather than a string. Likely for the same reason, the insert() function has these two additional overloads:
[Code listing not reproduced.]
Only insert() and replace() need a Position. For insert(), this is usually an index (a size_t), except for the last two overloads, where it is an iterator (analogous again to vector::insert()). For replace(), the Position is a range, specified either using two const_iterators (not available for the substring overload) or using a start index and a length (not for the last two overloads).
In other words, replace() does not, as you may expect, replace occurrences of a given character or string with another. Instead, it replaces a specified subrange with a new sequence—a string, substring, fill pattern, and so on—possibly of different length. You saw an example of its use earlier (2 is the length of the replaced range):
s.replace(s.find("be"), 2, "are");
To replace all occurrences of substrings or given patterns, you can use regular expressions and the std::regex_replace() function explained later in this chapter. To replace individual characters, the generic std::replace() and replace_if() algorithms from Chapter 4 are an option as well.
A final modifying function with a noteworthy difference from its vector counterpart is erase(): in addition to the two iterator-based overloads, it has one that works with indices. Use it to erase the tail or a subrange or, if you like, to clear() it:
string& erase(size_t pos = 0, size_t len = npos);
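A short sketch that chains several of these modifying functions may help. The function name modify_demo and the sample text are assumptions made for illustration; the comments show the string after each step.

```cpp
#include <cassert>
#include <string>

// Hypothetical demo: applies index-based modifiers one after another.
std::string modify_demo() {
    std::string s = "string";
    s.insert(0, "C++ ");       // index-based insert:         "C++ string"
    s.append(3, '!');          // fill overload:              "C++ string!!!"
    s += " ok";                // C-style string overload:    "C++ string!!! ok"
    s.replace(4, 6, "text");   // 6 chars from index 4:       "C++ text!!! ok"
    s.erase(s.find('!'));      // erase from first '!' on:    "C++ text"
    return s;
}
```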

Constructing Strings

In addition to the default constructor, which creates an empty string, the constructor has the same seven overloads as the functions in the previous subsection, plus of course one for string&&. (Like other containers, all string constructors have an optional argument for custom allocators as well, but this is for advanced use only.)
As of C++14, basic_string objects of various character types can also be constructed from corresponding string literals by appending the suffix s. This literal operator is defined in the std::literals::string_literals namespace:
[Code listing not reproduced.]
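A minimal sketch of the s suffix (requires C++14; the function names are hypothetical, and this is not the book's original listing). One practical difference with construction from a const char* is that the literal operator receives the length of the literal, so embedded null characters are preserved:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

using namespace std::literals::string_literals;

// The suffix selects the string type that matches the literal's character type.
static_assert(std::is_same<decltype("x"s), std::string>::value, "narrow literal");
static_assert(std::is_same<decltype(u"x"s), std::u16string>::value, "UTF-16 literal");

// Hypothetical helper: length of a std::string literal with an embedded '\0'.
std::size_t literal_length() {
    auto s = "ab\0cd"s;          // all 5 chars are kept
    return s.size();
}

// Hypothetical helper: same characters, but constructed from a const char*.
std::size_t c_string_length() {
    std::string s = "ab\0cd";    // construction stops at the first '\0'
    return s.size();
}
```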

String Length

To get a string’s length, you can use either the typical container member size() or its string-specific alias length(). Both return the number of char-like elements the string contains. Take care, though: C++ strings are agnostic on the character encoding used, so their length equals what is technically called the number of code units, which may be larger than the number of code points or characters. Well-known encodings where not all characters are represented as a single code unit are the variable-length Unicode encodings UTF-8 and UTF-16:
[Code listing not reproduced.]
One way to get the number of code points is to convert to a UTF-32 encoded string first, using the character-encoding conversion facilities introduced later in this chapter.
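For UTF-8 specifically, a conversion is not strictly needed. The sketch below is an alternative technique, not the conversion-based approach mentioned above: it counts code points by skipping continuation bytes, and it assumes the input is valid UTF-8. The function name utf8_code_points is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx and are not counted.
std::size_t utf8_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}
```

For the string "caf\xC3\xA9" (the word café encoded as UTF-8), size() returns 5 whereas utf8_code_points() returns 4.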

Copying (Sub)Strings

Another vector function (next to size()) that has a string-specific alias is data(), with its equivalent c_str(). Both return a const pointer to the internal character array (without copying). To copy the string to a C-style string instead, use copy():
size_t copy(char* out, size_t len, size_type pos = 0) const;
This copies len char values starting at pos to out. That is, it may be used to copy a substring as well. To create a substring as a C++ string, use substr():
string substr(size_t pos = 0, size_t len = npos) const;
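The next sketch shows copy() in action. Note that copy() does not append a null terminator, so the sketch adds one itself. The helper name copy_via_buffer and the fixed buffer size are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copies up to `len` chars of `s`, starting at `pos`,
// into a local buffer and returns the buffer's contents as a new string.
std::string copy_via_buffer(const std::string& s, std::size_t pos, std::size_t len) {
    char buffer[64] = {};
    if (len > sizeof(buffer) - 1)
        len = sizeof(buffer) - 1;                         // leave room for the terminator
    const std::size_t copied = s.copy(buffer, len, pos);  // returns the number of chars copied
    buffer[copied] = '\0';                                // copy() does not terminate the buffer
    return std::string(buffer);
}
```

Both copy() and substr() throw std::out_of_range if pos is larger than the string's length.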

Comparing Strings

Strings may be compared lexicographically with other C++ strings or C-style strings using either the non-member comparison operators (==, <, >=, and so on), or their compare() member. The latter has the following overloads:
int compare(const string& str) const noexcept;
int compare(size_type pos1, size_type n1, const string& str
            [, size_type pos2, size_type n2 = npos]) const;
int compare(const char* s) const;
int compare(size_type pos1, size_type n1, const char* s
            [, size_type n2]) const;
pos1/pos2 is the position in the first/second string where the comparison should start, and n1/n2 is the number of characters to compare from the first/second string. The return value is zero if both strings are equal or a negative/positive number if the first string is less/greater than the second.
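Because only the sign of compare()'s result is meaningful, it is often convenient to normalize it. The following sketch uses two hypothetical helpers, sign_of_compare() and sub_equals(), to illustrate the whole-string and substring overloads:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: maps compare()'s result onto -1, 0, or +1.
int sign_of_compare(const std::string& a, const std::string& b) {
    const int r = a.compare(b);
    return (r > 0) - (r < 0);
}

// Hypothetical helper: compares the n1 chars of `a` starting at pos1 with all of `b`.
bool sub_equals(const std::string& a, std::size_t pos1, std::size_t n1,
                const std::string& b) {
    return a.compare(pos1, n1, b) == 0;
}
```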

String Conversions

To parse various types of integral numbers from a string, a series of non-member functions of the following form has been defined:
int stoi(const (w)string&, size_t* index = nullptr, int base = 10);
The following variants exist: stoi(), stol(), stoll(), stoul(), and stoull(), where i stands for int, l for long, and u for unsigned. These functions skip all leading whitespace characters, after which as many characters are parsed as allowed by the syntax determined by the base. If an index pointer is provided, it receives the index of the first character that is not converted.
Similarly, to parse floating-point numbers, a set of functions exists of the following form:
float stof(const (w)string&, size_t* index = nullptr);
stof(), stod(), and stold() are provided to convert to float, double, and long double, respectively.
To perform the opposite conversion and convert from numerical types to a (w)string, functions to_(w)string( X ) are provided, where X can be int, unsigned, long, unsigned long, long long, unsigned long long, float, double, or long double. The returned value is a std::(w)string.
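As a brief sketch of these conversion functions (the ParsedInt structure and the parse_int() helper are hypothetical names), the index parameter can be used to find out where parsing stopped. Keep in mind that the sto*() functions throw std::invalid_argument if no conversion could be performed, and std::out_of_range if the value does not fit in the result type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type: the parsed value plus the index of the first
// character that was not converted.
struct ParsedInt {
    int value;
    std::size_t next;
};

ParsedInt parse_int(const std::string& text, int base = 10) {
    std::size_t index = 0;
    const int value = std::stoi(text, &index, base);  // may throw, see above
    return { value, index };
}
```

For example, parse_int("  42px") skips the two leading spaces, parses 42, and reports index 4, the position of 'p'.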

Character Classification    <cctype>, <cwctype>

The <cctype> and <cwctype> headers offer a series of functions to classify, respectively, char and wchar_t characters. These functions are named std::isclass(int) (defined only for ints that represent chars) and std::iswclass(wint_t) (analogous; wint_t is an integral typedef), where class is replaced by one of the values in Table 6-1 (for example, std::isdigit() or std::iswalpha()). All functions return a non-zero int if the given character belongs to the class, or zero otherwise.
Table 6-1. The 12 Standard Character Classes
Class   Description
cntrl   Control characters: all non-print characters. Includes '\0', '\t', '\n', '\r', and so on.
print   Printable characters: digits, letters, space, punctuation marks, and so on.
graph   Characters with graphical representation: all print characters except ' '.
blank   Whitespace characters that separate words on a line. At least ' ' and '\t'.
space   Whitespace characters: at least all blank characters, '\n', '\r', '\v', and '\f'. Never alpha characters.
digit   Decimal digits (0-9).
xdigit  Hexadecimal digits (0-9, A-F, a-f).
alpha   Letter characters. At least all lowercase and uppercase characters, and never any of the cntrl, digit, punct, and space characters.
lower   Lowercase alpha letters (a-z for the default locale).
upper   Uppercase alpha letters (A-Z for the default locale).
alnum   Alphanumeric characters: union of all alpha and digit characters.
punct   Punctuation marks (! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ for the default locale). Never a space or alnum character.
The same headers also offer the tolower() / toupper() and towlower() / towupper() functions for converting between lowercase and uppercase characters. Characters are again represented using the integral int and wint_t types. If the conversion is not defined or possible, these functions simply return their input value.
The exact behavior of all character classifications and transformations depends on the active C locale. Locales are explained in detail later in this chapter, but essentially this means the active language and regional settings may result in different sets of characters to be considered letters, lowercase or uppercase, digits, whitespace, and so on. Table 6-1 lists all general properties of and relations between the different character classes and gives some examples for the default C locale.
Note
In the “Localization” section, you also see that the C++ <locale> header offers a list of overloads for the std::isclass() functions and for std::tolower() / toupper() (all templated on the character type) that use a given locale rather than the active C locale.
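A small sketch of the <cctype> functions applied to a std::string follows. The helper names are hypothetical. The casts to unsigned char are deliberate: as noted earlier, these functions are defined only for ints that represent chars, and a plain char with a negative value does not qualify.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: uppercases a narrow string using the active C locale.
std::string to_upper_copy(std::string s) {
    for (char& c : s)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}

// Hypothetical helper: counts the decimal digits in a narrow string.
std::size_t count_digits(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if (std::isdigit(c))
            ++n;
    return n;
}
```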

Character-Encoding Conversion    <locale>, <codecvt>

A character encoding determines how code points (many but not all code points are characters) are represented as binary code units. Examples include ASCII (classical encoding with 7-bit code units), the fixed-length UCS-2 and UCS-4 encodings (16-bit and 32-bit code units, respectively), and the three main Unicode encodings: the fixed-length UTF-32 (using a single 32-bit code unit for each code point) and variable-length UTF-8 and UTF-16 encodings (representing each code point as one or more 8- or 16-bit code units, respectively; up to 4 units for UTF-8, and 2 for UTF-16). The details of Unicode and the various character encodings and conversions could fill a book; we explain here what you need to know in practice to convert between encodings.
The class template for objects that contain the low-level encoding-conversion logic is std::codecvt<CharType1, CharType2, State> (cvt is likely short for converter). It is defined in <locale> (as you see in the next section, this is actually a locale facet). The first two parameters are the C++ character types used to represent the code units of both encodings. For all standard instantiations, CharType2 is char. State is an advanced parameter we do not explain further (all standard specializations use std::mbstate_t from <cwchar>).
The four codecvt specializations listed in Table 6-2 are defined in <locale>. Additionally, the <codecvt> header defines the three std::codecvt subclasses listed in Table 6-3.¹ For these, CharT corresponds to the CharType1 parameter of the codecvt base class; as stated earlier, CharType2 is always char.
Table 6-2. Character-Encoding Conversion Classes Defined in <locale>
codecvt<char,char,mbstate_t>      Identity conversion
codecvt<char16_t,char,mbstate_t>  Conversion between UTF-16 and UTF-8
codecvt<char32_t,char,mbstate_t>  Conversion between UTF-32 and UTF-8
codecvt<wchar_t,char,mbstate_t>   Conversion between native wide and narrow character encodings (implementation specific)
Table 6-3. Character-Encoding Conversion Classes Defined in <codecvt>
codecvt_utf8<CharT>, codecvt_utf16<CharT>  Conversion between UCS-2 (for 16-bit CharTs) or UCS-4 (for 32-bit CharTs) and UTF-8 / UTF-16. The UTF-16 string is represented using 8-bit chars as well, so this is intended for binary UTF-16 encodings.
codecvt_utf8_utf16<CharT>                  Conversion between UTF-16 and UTF-8 (CharT must be at least 16-bit).
Although codecvt instances could in theory be used directly, it is far easier to use the std::wstring_convert<CodecvtT, WCharT=wchar_t> class from <locale>. This helper class facilitates conversions between char strings and strings of a (generally wider) character type WCharT in both directions. Despite its misleading (outdated) name, wstring_convert can also convert from and to, for example, u16strings or u32strings, not just wstrings. These members are provided:
Method         Description
(constructor)  Constructors exist that take a pointer to an existing CodecvtT (of which wstring_convert takes ownership) and an initial state (not discussed further). Both are optional. A final constructor accepts two error strings: one to be returned by to_bytes() upon failure, and one by from_bytes() (the latter is optional).
from_bytes()   Converts either a single char or a string of chars (a C-style char* string, a std::string, or a sequence bounded by two char* pointers) to a std::basic_string<WCharT>, and returns the result. Throws std::range_error upon failure, unless an error string was provided upon construction: in that case, this error string is returned.
to_bytes()     Opposite conversion from WCharT to char, with analogous overloads.
converted()    Returns the number of input characters processed by the last from_bytes() or to_bytes() conversion.
state()        Returns the current state (mostly mbstate_t: not discussed further).
Recall the following example from the section on std::string lengths:
[Code listing not reproduced.]
To convert this string to UTF-32, you would hope the following is possible:
[Code listing not reproduced.]
Unfortunately, this does not compile. For the converter subclasses defined in <codecvt>, this would compile. But the destructor of the codecvt base class is protected (like all standard locale facets: discussed later), and the wstring_convert destructor calls it to delete the converter instance it owns. This design defect can be circumvented using a helper wrapper such as the following (similar tricks can be applied to make any protected function publicly accessible, not just a destructor):
[Code listing not reproduced.]
To make the code compile, you then replace the first line with the following:²
typedef deletable<std::codecvt<char32_t,char,std::mbstate_t>> cvt;
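Putting the pieces together, a wrapper of this kind and its use with wstring_convert could look as follows. This is a sketch under assumptions, not the book's listing: the wrapper is named deletable_facet here to keep it distinct from the book's deletable, and utf8_to_utf32() is a hypothetical helper. Also be aware that wstring_convert and the <codecvt> header are deprecated as of C++17, so compilers may emit deprecation warnings.

```cpp
#include <cassert>
#include <codecvt>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical wrapper: derives from a facet and exposes a public destructor,
// so that wstring_convert is able to delete the instance it owns.
template <typename Facet>
struct deletable_facet : Facet {
    using Facet::Facet;      // inherit the facet's constructors
    ~deletable_facet() {}    // public, unlike the protected facet destructor
};

// Hypothetical helper: converts a UTF-8 encoded string to UTF-32.
// Throws std::range_error if the input is not valid UTF-8.
std::u32string utf8_to_utf32(const std::string& utf8) {
    using cvt = deletable_facet<std::codecvt<char32_t, char, std::mbstate_t>>;
    std::wstring_convert<cvt, char32_t> converter;
    return converter.from_bytes(utf8);
}
```

The size() of the resulting std::u32string equals the number of code points, for instance 4 for "caf\xC3\xA9" (café in UTF-8).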
To use the potentially locale-specific variants of these converters (see the next section), use the following (other locale name besides "" may be used as well):
typedef deletable<std::codecvt_byname<char32_t,char,std::mbstate_t>> cvt;
std::wstring_convert<cvt, char32_t> convertor(new cvt(""));
A related class is wbuffer_convert<CodecvtT, WCharT=wchar_t>, which wraps a basic_streambuf<char> and makes it act as a basic_streambuf<WCharT> (stream buffers are very briefly explained in Chapter 5). A wbuffer_convert instance is constructed with an optional basic_streambuf<char>*, CodecvtT*, and state. Both the getter and setter for the wrapped buffer are called rdbuf(), and the current conversion state may be obtained using state(). The following constructs a stream that accepts wide character strings, but writes them to a UTF-8 encoded file (needs <fstream>):
[Code listing not reproduced.]

Localization    <locale>

Textual representations of dates, monetary values, and numbers are governed by regional and cultural conventions. To illustrate, the following three sentences are analogous but written using local currencies, numeric, and date formats:
In the U.S., John Doe has won $100,000.00 on the lottery on 3/14/2015.
In India, Ashok Kumar has won ₹1,00,000.00 on the lottery on 14-03-2015.
En France, Monsieur Brun a gagné 100.000,00 € à la loterie sur 14/3/2015.
In C++, all parameters and functionality related to processing text in a locale-specific manner—that is, adapted to local conditions—are contained in a std::locale object. These include not only formatting of numeric values and dates as just illustrated, but also locale-specific sorting and conversions of strings.

Locale Names

Standard locale objects are constructed from a locale name:
std::locale(const char* locale_name);
std::locale(const std::string& locale_name);
These names commonly consist of a two-letter ISO-639 language code followed by a two-letter ISO-3166 country code. The precise format, however, is platform-specific: on Windows, for instance, the name for the English-American locale is "en-US", whereas on POSIX-based systems it is "en_US". Most platforms support, or sometimes require, additional specifications such as region codes, character encodings, and so on. Consult your platform’s documentation for a full list of supported locale names and options.
There are only two portable locale names, "" and "C":
  • With "", you construct a std::locale with the user’s preferred regional and language settings, taken from the program’s execution environment (that is, the operating system).
  • The "C" locale denotes the classic or neutral locale, which is the standardized, portable locale that all C and C++ programs use by default.
Using the "C" locale, the earlier example sentence becomes
Anywhere, a C/C++ programmer may win 100000 on the lottery on 3/14/2015.
Tip
When writing to a file intended to be read by computer programs (configuration files, numeric data output, and so on), it is highly recommended that you use the neutral "C" locale, to avoid problems during parsing. When displaying values to the user, you should consider using a locale based on the user’s preferences ("").

The Global Locale

The active global locale affects various Standard C++ functions that format or parse text, most directly the regular expression algorithms discussed later in this chapter and the I/O streams seen in Chapter 5. It is implementation dependent whether there is one program-wide global locale instance or one per thread of execution.
The global locale always starts out as the classic "C" locale. To set the global locale, you use the static std::locale::global() function. To get a copy of the currently active global locale, simply default-construct a std::locale. For example:
[Code listing not reproduced.]
Note
To avoid race conditions, Standard C++ objects (such as newly created stream or regex objects) always copy the global locale upon construction. Calling global() therefore does not affect existing objects, including std::cout and the other standard streams of <iostream>. To change their locale, you must call their imbue() member.
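The following sketch uses only the portable classic locale, so it should behave the same on all platforms. The three function names are hypothetical. Note that std::locale::global() returns the previous global locale.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: name of the current global locale.
std::string global_locale_name() {
    return std::locale().name();   // default construction copies the global locale
}

// Hypothetical helper: makes the classic locale global again and
// returns the name of the locale that was global before.
std::string reset_global_to_classic() {
    const std::locale previous = std::locale::global(std::locale::classic());
    return previous.name();
}

// Hypothetical helper: a stream copies the global locale when it is constructed;
// imbue() is what changes the locale of an existing stream.
std::string stream_locale_after_imbue() {
    std::ostringstream out;
    out.imbue(std::locale::classic());
    return out.getloc().name();
}
```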

Basic std::locale Members

The following table lists most basic functions offered by a std::locale, not including the copy members. More advanced members to combine or customize locales are discussed near the end of the section:
Member        Description
global()      Static function to set the active global locale (discussed earlier).
classic()     Static function returning a constant reference to a classic "C" locale.
locale()      Default constructor: creates a copy of the global locale.
locale(name)  Construction from locale name, as discussed earlier. Throws a std::runtime_error if a nonexistent name is passed.
name()        Returns the locale name, if any. If the locale represents a customized or combined locale (discussed later), "*" is returned.
== / !=       Compares two locale objects. Customized or combined locales are equal only if they are the same object or one is a copy of the other.

Locale Facets

As is obvious from the previous subsection, the std::locale public interface does not offer much functionality. All localization facilities are instead offered in the form of facets. Each locale object encapsulates a number of such facets, a reference to which may be obtained via the std::use_facet<FacetType>() function. The following example, for instance, uses the classic locale’s numeric punctuation facet to print out the locale’s decimal mark for formatting floating-point numbers:
[Code listing not reproduced.]
For all standard facets, the instance referred to by the result of use_facet() cannot be copied, moved, swapped, or deleted. This facet is (co-)owned by the given locale and is deleted together with the (last) locale that owns it. When requesting a FacetType the given locale does not own, a bad_cast exception is raised. To verify the presence of a facet, you can use std::has_facet<FacetType>().
Caution
Never do something like auto& f = use_facet<...>(std::locale("..."));: the facet f was owned by the temporary locale object, so using it will likely crash.
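A safe way to work with use_facet() is to make sure the locale outlives the facet reference, for instance by passing the locale in by reference. The helpers below are hypothetical and only sketch this pattern:

```cpp
#include <cassert>
#include <locale>

// Hypothetical helper: returns the decimal mark of the given locale.
char decimal_mark(const std::locale& loc) {
    // `loc` outlives `punct`, so the reference stays valid while it is used.
    const auto& punct = std::use_facet<std::numpunct<char>>(loc);
    return punct.decimal_point();
}

// Hypothetical helper: checks for the facet before requesting it.
bool has_numeric_punctuation(const std::locale& loc) {
    return std::has_facet<std::numpunct<char>>(loc);
}
```

For the classic locale, decimal_mark(std::locale::classic()) returns '.'.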
By default, locales contain specializations of all the facets introduced in the remainder of this section, each in turn specialized for at least the char and wchar_t character types (additional minimal requirements are discussed throughout the section). Implementations may include more facets, and programs can even add custom facets themselves, as explained later.
We now discuss the 12 standard facet classes listed in Table 6-4 in order, grouped in sections by category. Afterwards, we show how to combine facets of different locales and create customized facets. Although this is perhaps not something most programmers will use regularly, occasionally the need does arise to customize facets. Regardless, it is worth knowing the scope and various effects of localization and to keep them in mind when developing programs that show or process user text (that is, most programs).
Table 6-4. Overview of the 12 Basic Facet Classes, Grouped by Category
Category  Facets
numeric   numpunct<C>, num_put<C>, num_get<C>
monetary  moneypunct<C, International>, money_put<C>, money_get<C>
time      time_put<C>, time_get<C>
ctype     ctype<C>, codecvt<C1, C2, State>
collate   collate<C>
messages  messages<C>

Numeric Formatting

The facets of the numeric and monetary category follow the same pattern: there is one punct facet (short for punctuation) with the locale-specific formatting parameters, plus both a put and a get facet responsible for the actual formatting and parsing of values, respectively. The latter two facets are mostly intended to be used by the stream objects introduced in Chapter 5. The concrete format they use to read or write values is determined by a combination of the parameters set in the punct facet and others set using the stream’s members or stream manipulators.
Numeric Punctuation
The std::numpunct<CharT> facet offers functions to retrieve the following information related to the formatting of numeric and Boolean values:
  • decimal_point(): Returns the decimal separator
  • thousands_sep(): Returns the thousands separator character
  • grouping(): Returns a std::string encoding the digit grouping
  • truename() and falsename(): Return basic_string<CharT>s with textual representations for Boolean values
In the lottery example at the beginning of the section, a numeric value of 100000.00 was formatted using three different locales: "100,000.00", "1,00,000.00", and "100.000,00". The first two locales use a comma (,) and dot (.) as thousands and decimal separator, respectively, whereas for the third it is the other way around.
The digit grouping() is encoded as a sequence of char values indicating the number of digits in each group, starting with the number in the rightmost group. The last char in the sequence is used for all subsequent groups as well. Most locales group digits in threes, for example, which is encoded as "\3". (Note: do not use "3", because the '3' ASCII character results in a char with value 51; that is, '3' == 51.) For Indian locales, however, as seen in "1,00,000.00", only the rightmost group contains three digits; all other groups contain only two. This is encoded as "\3\2". To indicate an infinite group, a std::numeric_limits<char>::max() value may be used in the last position. An empty grouping() string denotes that no grouping should be used at all, which is the case, for instance, for the classic "C" locale.
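To see the grouping encoding at work without depending on which named locales a platform provides, the sketch below derives a custom facet from std::numpunct<char> and overrides its protected do_grouping() and do_thousands_sep() members. The facet name indian_grouping and the helper format_grouped() are hypothetical, and this sketch relies on the facet customization technique rather than on an actual Indian locale.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: Indian-style digit grouping on top of the classic locale.
struct indian_grouping : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    // char values 3 and 2, not the characters '3' and '2'
    std::string do_grouping() const override { return "\3\2"; }
};

// Hypothetical helper: formats an integer through a stream that uses the facet.
std::string format_grouped(long value) {
    std::ostringstream out;
    // The new locale takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(), new indian_grouping));
    out << value;
    return out.str();
}
```

With this facet, format_grouped(100000) produces "1,00,000".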
Formatting and Parsing of Numeric Values
The std::num_put and num_get facets constitute the implementation of the << and >> stream operators described in Chapter 5 and provide two sets of methods with the following signature:
Iter put(Iter target, ios_base& stream, char fill, X value)
Iter get(Iter begin, Iter end, ios_base& stream, iostate& error, X& result)
Here X can be bool, long, long long, unsigned int, unsigned long, unsigned long long, double, long double, or a void pointer. For get(), unsigned short and float are also possible. These methods either format a given numeric value or try to parse the characters in the range [begin, end). In both cases, the ios_base parameter is a reference to a stream from which locale and formatting information is taken (including, for example, the stream’s formatting flags and precision: see Chapter 5).
All put() functions simply return target after writing the formatted character sequence there. The fill character is used for padding if the formatted length is less than stream.width() (see Chapter 5 for the padding rules).
If parsing succeeds, get() stores the numeric value in result. If the input did not match the format, result is set to zero and the failbit is set in the iostate parameter (see Chapter 5). If the parsed value is too large/small for type X, the failbit is set as well, and result is set to std::numeric_limits<X>::max()/lowest() (see Chapter 1). If the end of the input was reached (can be a success or a failure), the eofbit is set. An iterator to the first character after the parsed sequence is returned.
We do not show example code here, but these facets are analogous to the monetary formatting facets introduced next, for which we do include a full example.

Monetary Formatting

Monetary Punctuation
The std::moneypunct<CharType, International=false> facet offers functions to retrieve the following information related to formatting monetary values:
  • decimal_point(), thousands_sep(), and grouping(): Analogous to the numeric punctuation members seen earlier.
  • frac_digits(): Returns the number of digits after the decimal separator. A typical value is 2.
  • curr_symbol(): Returns the currency symbol, such as '€', if the International template parameter is false, and the international currency code (usually three letters) followed by a space, such as "EUR ", if International is true.
  • pos_format() and neg_format(): Return a money_base::pattern structure (discussed later) describing how positive and negative monetary values are to be formatted.
  • positive_sign() and negative_sign(): Return a formatting string for positive and negative monetary values.
The latter four members need more explanation. They use types defined in std::money_base, a base class of moneypunct. The money_base::pattern structure, defined as struct pattern{ char field[4]; }, is an array containing four values of the money_base::part enumeration, with these supported values:
part    Description
none    Optional whitespace characters, except when none appears last.
space   At least one whitespace character.
symbol  The currency symbol, curr_symbol().
sign    The first character returned by positive_sign() or negative_sign(). Additional characters appear at the end of the formatted monetary value.
value   The monetary value.
For example, assume that the neg_format() pattern is {none, symbol, sign, value}, that the currency symbol is '$', that negative_sign() returns "()", and that frac_digits() returns 2. Then the value -123456 is formatted as "$(1,234.56)".
Note
For American and many European locales, frac_digits() equals 2, meaning unformatted values are to be expressed in, for example, cents rather than dollars or euros. This is not always the case, though: for the Japanese locale, for example, frac_digits() is 0.
Formatting and Parsing of Monetary Values
The facets std::money_put and money_get handle formatting and parsing of monetary values and are mainly intended to be used by the put_money() and get_money() I/O manipulators discussed in Chapter 5. The facets offer methods of this form:
Iter put(Iter target, bool intl, ios_base& stream, char fill, X value)
Iter get(Iter begin, Iter end, bool intl, ios_base& stream,
                                            iostate& error, X& result)
Here X is either std::string or long double. The behavior and meaning of the parameters is similar to that discussed for num_put and num_get earlier. If intl is false, currency symbols like $ are used; otherwise, strings like USD are used.
The following illustrates how these facets can be used, although you normally simply use std::put_/get_money() (uses <cassert> and <sstream>):
[Code listing not reproduced.]
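As a platform-independent sketch (not the book's original listing), the code below installs a hypothetical demo_moneypunct facet with a '$' currency symbol, "()" as negative sign, and two fractional digits, and then formats a value both through the money_put facet directly and through std::put_money(). It uses the pattern {symbol, sign, none, value}; since none produces no output when no field width is set, the result should be the same text as in the neg_format() example given earlier. The currency symbol is only written if the stream's showbase flag is set.

```cpp
#include <cassert>
#include <iomanip>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical punctuation facet with hard-coded, US-like conventions.
struct demo_moneypunct : std::moneypunct<char, false> {
protected:
    char do_decimal_point() const override { return '.'; }
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
    std::string do_curr_symbol() const override { return "$"; }
    std::string do_negative_sign() const override { return "()"; }
    int do_frac_digits() const override { return 2; }
    pattern do_neg_format() const override { return { { symbol, sign, none, value } }; }
};

// Formats `cents` by calling the money_put facet directly.
std::string format_with_facet(long double cents) {
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new demo_moneypunct));
    out << std::showbase;   // required for the currency symbol to appear
    const auto& facet = std::use_facet<std::money_put<char>>(out.getloc());
    facet.put(std::ostreambuf_iterator<char>(out), false, out, ' ', cents);
    return out.str();
}

// Formats `cents` with the std::put_money() manipulator instead.
std::string format_with_manipulator(long double cents) {
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new demo_moneypunct));
    out << std::showbase << std::put_money(cents);
    return out.str();
}
```

Both functions format -123456 as "$(1,234.56)": the value is expressed in cents because frac_digits() is 2.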

Time and Date Formatting

The two facets std::time_get and time_put handle parsing and formatting of time and dates and power the get_time() and put_time() manipulators seen in Chapter 5. They provide methods with the following signatures:
Iter put(Iter target, ios_base& stream, char fill, tm* value, <format>)
Iter get(Iter begin, Iter end, ios_base& stream, iostate& error, tm* result,
         <format>)
The <format> is either 'const char* from, const char* to', pointing to a time-formatting pattern expressed using the same syntax as explained for strftime() in Chapter 2, or a single time format specifier of the same grammar with optional modifier 'char format, char modifier'. The behavior and meaning of the parameters is analogous to those for the numeric and monetary formatting facets. The std::tm structure is explained in Chapter 2 as well. Only those members of the passed tm are used / written that are mentioned in the formatting pattern.
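A minimal sketch of the pattern-based put() overload follows, using the classic locale so that the output does not depend on the platform. The helpers make_date() and format_date() are hypothetical names.

```cpp
#include <cassert>
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: builds a tm holding only a calendar date.
std::tm make_date(int year, int month, int day) {
    std::tm date{};               // zero-initialize all members
    date.tm_year = year - 1900;   // tm_year counts years since 1900
    date.tm_mon = month - 1;      // tm_mon is zero-based
    date.tm_mday = day;
    return date;
}

// Hypothetical helper: formats `date` with a strftime()-style pattern.
std::string format_date(const std::tm& date, const std::string& pattern) {
    std::ostringstream out;
    out.imbue(std::locale::classic());
    const auto& facet = std::use_facet<std::time_put<char>>(out.getloc());
    facet.put(std::ostreambuf_iterator<char>(out), out, ' ', &date,
              pattern.data(), pattern.data() + pattern.size());
    return out.str();
}
```

For example, format_date(make_date(2015, 3, 14), "%Y-%m-%d") gives "2015-03-14". Only the tm members mentioned in the pattern are read.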
In addition to the generic get() functions, the time_get facet has a series of more restricted parsing functions, all with the following signature:
Iter get_x(Iter begin, Iter end, ios_base& stream, iostate& error, tm*)
Member
Description
get_time()
Tries to parse a time as %H:%M:%S.
get_date()
Tries to parse a date using a format that depends on the value of the facet’s date_order() member: either no_order: %m%d%y, dmy: %d%m%y, mdy: %m%d%y, ymd: %y%m%d, or ydm: %y%d%m. This date_order() enumeration value reflects the locale’s %x date format.
get_weekday()
get_monthname()
Tries to parse a name for a weekday or month, possibly abbreviated.
get_year()
Tries to parse a year. Whether two-digit year numbers are supported depends on your implementation.
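To make the pattern-based put() overload concrete, here is a small sketch that formats a std::tm with the time_put facet. The helper names make_date and format_date are assumptions made for this example only.

```cpp
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: builds a std::tm for a calendar date; all other fields are zero.
std::tm make_date(int year, int month, int day) {
    std::tm t{};
    t.tm_year = year - 1900;  // tm_year counts years since 1900
    t.tm_mon = month - 1;     // tm_mon is zero-based
    t.tm_mday = day;
    return t;
}

// Hypothetical helper: formats date according to a strftime()-style pattern,
// using the time_put facet of the given locale.
std::string format_date(const std::tm& date, const std::string& pattern,
                        const std::locale& loc = std::locale::classic()) {
    std::ostringstream out;
    out.imbue(loc);
    const auto& facet = std::use_facet<std::time_put<char>>(loc);
    const char* from = pattern.data();
    facet.put(std::ostreambuf_iterator<char>(out), out, out.fill(), &date,
              from, from + pattern.size());
    return out.str();
}
```

For example, format_date(make_date(2016, 1, 5), "%Y-%m-%d") yields "2016-01-05". Only the tm members named by the pattern are read, so leaving the others at zero is harmless here.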

Character Classification, Transformation, and Conversion

Character Classification and Transformation
The ctype<CharType> facets offer a series of locale-dependent character-classification and -transformation functions, including equivalents for those of the <cctype> and <cwctype> headers seen earlier.
For use in the character-classification functions listed next, 12 member constants of a bitmask type ctype_base::mask are defined (ctype_base is a base class of ctype), one for each character class. Their names equal the class names given in Table 6-1. Although their values are unspecified, alnum == alpha|digit and graph == alnum|punct. The following table lists all classification functions (input character ranges are represented using two CharType* pointers b and e):
Member
Description
is(mask,c)
Checks whether a given character c belongs to any of the character classes specified by mask.
is(b,e,mask*)
Identifies for each character in the range [b, e) the complete mask value that encodes all classes it belongs to, and stores the result in the output range pointed to by the last argument. Returns e.
scan_is(mask,b,e)
scan_not(mask,b,e)
Scans the character range [b, e), and returns a pointer to the first character that belongs / does not belong to any of the classes specified by mask. If none is found, the result is e.
The same facets also offer these transformation functions:
Member
Description
tolower(c)
toupper(c)
tolower(b,e)
toupper(b,e)
Performs upper-to-lower transformation or vice versa on a single character (result is returned) or a character range [b, e) (transformed in-place; e is returned). Characters that cannot be transformed are left unchanged.
widen(c)
widen(b,e,o)
Transforms char values to the facet’s character type on a single character (result is returned) or a character range [b, e) (transformed characters are put in the output range starting at *o; e is returned). Transformed characters never belong to a class their source characters did not belong to.
narrow(c,d)
narrow(b,e,d,o)
Transformation to char; opposite of widen(). However, only for the 96 basic source characters (all space and printable ASCII characters except $, `, and @) the relation widen(narrow(c,0)) == c is guaranteed to hold. If no transformed character is readily available, the given default char d is used.
The <locale> header defines a series of convenience functions for those functions of the ctype facets that also exist in <cctype> and <cwctype>: std::isclass(c, locale&), with class a name from Table 6-1, and tolower(c, locale&) / toupper(c, locale&). Their implementations all have the following form (the return type is either bool or CharT):
template <typename CharT> ... function(CharT c, const std::locale& l) {
   return std::use_facet<std::ctype<CharT>>(l).function(c);
}
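The range-based members have no such convenience wrappers, so they are called on the facet itself. The following sketch (with the hypothetical helper names to_upper and first_digit) shows toupper(b,e) and scan_is(mask,b,e) applied to a std::string:

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: uppercases text in place with ctype::toupper(b, e).
std::string to_upper(std::string text,
                     const std::locale& loc = std::locale::classic()) {
    const auto& ct = std::use_facet<std::ctype<char>>(loc);
    if (!text.empty()) {
        char* b = &text[0];
        ct.toupper(b, b + text.size());  // transforms the range in place
    }
    return text;
}

// Hypothetical helper: index of the first decimal digit in text,
// or text.size() if there is none (scan_is() then returns e).
std::size_t first_digit(const std::string& text,
                        const std::locale& loc = std::locale::classic()) {
    const auto& ct = std::use_facet<std::ctype<char>>(loc);
    const char* b = text.data();
    const char* e = b + text.size();
    return static_cast<std::size_t>(ct.scan_is(std::ctype_base::digit, b, e) - b);
}
```

With the classic locale, to_upper("abc1") gives "ABC1" and first_digit("abc123") gives 3.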
Character-Encoding Conversions
A std::codecvt facet converts character sequences between two character encodings. This is explained earlier in “Character-Encoding Conversion,” because these facets are useful also outside the context of locales. Each std::locale contains at least instances of the four codecvt specializations listed in Table 6-2, which implement potentially locale-specific converters. These are used implicitly by the streams of Chapter 5 when converting, for example, between wide and narrow strings. Because directly using these low-level facets is not recommended, we do not explain their members here. Always use the helper classes discussed in the “Character-Encoding Conversion” section instead.

String Ordering and Hashing

The std::collate<CharType> facet implements the following locale-dependent string-ordering comparisons and hashing functions. All character sequences are specified using begin (inclusive) and end (exclusive) CharType* pointers:
Member
Description
compare()
Locale-dependent three-way comparison of two character sequences, returning -1 if the first precedes the second, 0 if both are equivalent, and +1 otherwise. Not necessarily the same as naïve lexicographical sequence comparison.
transform()
Transforms a given character sequence to a specific normalized form, which is returned as a basic_string<CharType>. Applying naïve lexicographical ordering on two transformed strings (as with their operator<) returns the same result as applying the facet’s compare() function on the untransformed sequences.
hash()
Returns a long hash value for the given sequence (see Chapter 3 for hashing) that is the same for all sequences that transform() to the same normalized form.
A std::locale itself is a std::less<std::basic_string<CharT>>-like functor (see Chapter 2) that compares two basic_string<CharT>s using its collate<CharT> facet’s compare() function. The following example sorts French strings lexicographically, using the classic locale, and using a French locale (the locale name to use is platform specific). In addition to <locale>, it needs <vector>, <string>, and <algorithm>:
[Code listing not reproduced (image A417649_1_En_6_Figp_HTML.gif)]
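A portable sketch of the same idea, restricted to the classic locale because names of other locales are platform specific: a std::locale is passed directly as the comparison functor to std::sort(), and the collate facet is also called explicitly. The helper names sorted_with and collate_compare are hypothetical.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: sorts words using the locale itself as comparator,
// which delegates to the compare() function of its collate<char> facet.
std::vector<std::string> sorted_with(std::vector<std::string> words,
                                     const std::locale& loc) {
    std::sort(words.begin(), words.end(), loc);
    return words;
}

// Hypothetical helper: three-way comparison through the collate facet directly.
int collate_compare(const std::string& a, const std::string& b,
                    const std::locale& loc) {
    const auto& coll = std::use_facet<std::collate<char>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

In the classic locale, uppercase letters precede lowercase ones, so {"pear", "Zebra", "apple"} sorts to {"Zebra", "apple", "pear"}. A locale with language-aware collation would typically order these differently.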

Message Retrieval

The std::messages<CharT> facet facilitates retrieval of textual messages from message catalogs. These catalogs are essentially associative arrays that map a series of integers to a localized string. This could in principle be used, for instance, to retrieve translated error messages based on, for example, their error category and code (see Chapter 8). Which catalogs are available, and how they are structured, is entirely platform specific. For some, standardized message catalog APIs are used (such as POSIX’s catgets() or GNU’s gettext()), whereas others may not offer any catalogs (this is typically the case for Windows). The facet offers these functions:
Member
Description
open(n,l)
Opens a catalog based on a given platform-specific string n (a basic_string<CharT>), and for the given std::locale l. Returns a unique identifier of some signed integer type catalog.
get(c,set,id,def)
Retrieves from the catalog with given catalog identifier c, the message identified by set and id (two int values whose interpretation is catalog specific), and returns it as a basic_string<CharT>. Returns def if no such message is found.
close(c)
Closes the catalog with the given catalog identifier c.

Combining and Customizing Locales

The constructs of the <locale> library are designed to be very flexible when it comes to combining or customizing locale facets.

Combining Facets

std::locale provides combine<FacetType>(const locale& c), which returns a copy of the locale on which combine() is called, except for the FacetType facet, which is copied from the given argument. Here is an example (using namespace std is assumed):
[Code listing not reproduced (image A417649_1_En_6_Figq_HTML.gif)]
Alternatively, std::locale has a constructor accepting a base locale and an overriding facet that does the same as combine(). For example, the creation of combined in the previous example can be expressed as follows:
locale combined(locale::classic(), &use_facet<moneypunct<char>>(chinese));
std::locale moreover has a number of constructors to override all facets of one or more categories at once (String is either a std::string or a C-style string representing the name of a specific locale):
locale(const locale& base, String name, category cat)
locale(const locale& base, const locale& overrides, category cat)
For each of the six categories listed in Table 6-4, std::locale defines a constant with that name. The std::locale::category type is a bitmask type, meaning categories can be combined using bitwise operators. The all constant, for example, is defined as collate | ctype | monetary | numeric | time | messages. These constructors can be used to create a combined locale similar to the one earlier:
locale combined(locale::classic(), chinese, locale::monetary);
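The following sketch shows combine() without relying on platform-specific locale names. As an assumption made purely for portability, the donor locale carries a tiny hypothetical facet, comma_numpunct, that uses a comma as decimal point; the helper names make_combined and format_with are hypothetical as well.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet, used here only as a portable source of a non-default numpunct.
struct comma_numpunct : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Classic locale, except for the numpunct<char> facet, which is taken from donor.
std::locale make_combined() {
    const std::locale donor(std::locale::classic(), new comma_numpunct);
    return std::locale::classic().combine<std::numpunct<char>>(donor);
}

// Hypothetical helper: formats value with a stream imbued with loc.
std::string format_with(double value, const std::locale& loc) {
    std::ostringstream out;
    out.imbue(loc);
    out << value;
    return out.str();
}
```

Here format_with(1.5, make_combined()) produces "1,5", whereas the classic locale produces "1.5". Because facets are reference counted, the combined locale remains valid after donor is destroyed.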

Custom Facets

All public functions func() of the facets simply call a protected virtual method on the facet called do_func(). 3 You can implement custom facets by inheriting from existing ones and overriding these do-methods.
This first simple example changes the behavior of the numpunct facet to use the strings "yes" and "no" instead of "true" and "false" for Boolean input and output:
class yes_no_numpunct : public std::numpunct<char> {
protected:
   virtual string_type do_truename() const override { return "yes"; }
   virtual string_type do_falsename() const override { return "no"; }
};
You can use this custom facet, for instance, by imbuing it on a stream. The following prints "yes / no" to the console:
std::cout.imbue(std::locale(std::cout.getloc(), new yes_no_numpunct));
std::cout << std::boolalpha << true << " / " << false << std::endl;
Recall that facets are reference counted and that the destructor of the std::locale hence properly cleans up your custom facet.
The disadvantage of deriving from facets such as numpunct and moneypunct is that those generic base classes implement locale-independent behavior. To start from a locale-specific facet instead, facet classes such as numpunct_byname are available. For all facets seen so far, except the numeric and monetary put and get facets, a facet subclass exists with the same name but appended with _byname. They are constructed passing a locale name (const char* or std::string) and then behave as if taken from the corresponding locale. You can derive from these facets to modify only specific aspects of a facet for a given locale.
The next example modifies the monetary punctuation facet to facilitate output using a format standard in accounting: negative numbers are put between parentheses, and padding is done in a particular way. You do so without overriding a locale’s currency symbol or most other settings by starting from std::moneypunct_byname (string_type is defined in std::moneypunct):
[Code listing not reproduced (image A417649_1_En_6_Figr_HTML.gif)]
This facet may then be used as follows (see Chapter 5 for details on the stream I/O manipulators of <iomanip>):
[Code listing not reproduced (image A417649_1_En_6_Figs_HTML.gif)]
The output of this program should be
$   1,000.00
$     (5.00)
You can in theory create a new facet class by directly inheriting from std::locale::facet and add it to a locale using the same constructor to use it in your own library code later. The only additional requirement is that you define a default-constructed static member named id of type std::locale::id.
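A minimal sketch of such a brand-new facet follows. The class unit_facet, its suffix() member, and the helper with_unit are all hypothetical names invented for this illustration.

```cpp
#include <cstddef>
#include <locale>
#include <string>
#include <utility>

// Hypothetical facet that stores a unit suffix for use by library code.
class unit_facet : public std::locale::facet {
public:
    static std::locale::id id;  // required: identifies the facet within a locale

    explicit unit_facet(std::string suffix, std::size_t refs = 0)
        : std::locale::facet(refs), suffix_(std::move(suffix)) {}

    const std::string& suffix() const { return suffix_; }

private:
    std::string suffix_;
};

std::locale::id unit_facet::id;  // default-constructed definition of the static member

// Hypothetical helper: returns a copy of base with a unit_facet added.
std::locale with_unit(const std::locale& base, std::string suffix) {
    return std::locale(base, new unit_facet(std::move(suffix)));
}
```

Library code can then test for the facet with std::has_facet<unit_facet>(loc) and retrieve it with std::use_facet<unit_facet>(loc). The latter throws std::bad_cast if the facet is absent.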

C Locales     <clocale>

Locale-sensitive functions from the C Standard Library (including most functions in <cctype> and the I/O operations of <cstdio> and <ctime>) are not directly affected by the global C++ locale. Instead, they are governed by a corresponding C locale. This C locale is changed by one of two functions:
  • std::locale::global() is guaranteed to modify the C locale to match the given C++ locale, as long as the latter has a name. Otherwise, its effect on the C locale, if any, is implementation-defined.
  • Using the std::setlocale() function of <clocale>. This does not affect the C++ global locale in any way.
In other words, when using standard locales, a C++ program should simply call std::locale::global(). To write portable code when combining multiple locales, however, you have to call both the C++ and the C function because not all implementations set the C locale as expected when changing the global() C++ locale to a combined locale. This is done as follows:
[Code listing not reproduced (image A417649_1_En_6_Figt_HTML.gif)]
The setlocale() function takes a single category number (not a bitmask type; supported values include at least LC_ALL, LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, and LC_TIME) and a locale name, all analogous to their C++ equivalents. It returns the name of the active C locale upon success as a char* pointer into a reused, global buffer, or nullptr upon failure. If nullptr is passed for the locale name, the C locale is not modified.
Unfortunately, the C locale functionality is far less powerful than the C++ one: customized facets or selecting individual facets for combining is not possible, making the use of such advanced locales impossible with portable code in general.
The <clocale> header has one more function: std::localeconv(). It returns a pointer to a global std::lconv struct with public members equivalent to the functions of the std::numpunct (decimal_point, thousands_sep, grouping) and std::moneypunct facets (mon_decimal_point, mon_thousands_sep, mon_grouping, positive_sign, negative_sign, currency_symbol, frac_digits, and so on). These values should be treated as read-only: writing to them results in undefined behavior.
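As a short sketch of the two <clocale> functions, the helpers below (hypothetical names current_c_locale and c_decimal_point) query the active C locale without modifying it and read one field of the lconv struct:

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: name of the active C locale.
// Passing nullptr as the locale name only queries; nothing is modified.
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";  // copy at once: the buffer is reused by later calls
}

// Hypothetical helper: decimal point of the active C locale.
std::string c_decimal_point() {
    const std::lconv* conv = std::localeconv();  // treat as read-only
    return conv->decimal_point;
}
```

After std::setlocale(LC_ALL, "C"), these return "C" and "." respectively. Copying the results into std::string objects avoids holding pointers into the global buffers.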

Regular Expressions    <regex>

A regular expression is a textual representation of a pattern or patterns to be matched against a target sequence of characters. The regular expression ab*a, for instance, matches any target sequence starting with the character a, followed by zero or more bs, and ending again with an a. Regular expressions can be used to search for or replace particular patterns in the target, or to verify that it matches a desired pattern. You see how to perform these operations using the <regex> library later; first we introduce how to form and create regular expressions.

The ECMAScript Regular Expression Grammar

The syntax used to express patterns in textual form is defined by a grammar. By default, <regex> uses a modified version of the grammar used by the ECMAScript scripting language (best known for its widely used dialects JavaScript, JScript, and ActionScript). What follows is a concise, comprehensive reference for this grammar.
A regular expression pattern is a disjunction of sequences of terms, with each term either an atom, an assertion, or a quantified atom. Supported atoms and assertions are listed in Table 6-5 and Table 6-6, and Table 6-7 shows how atoms are quantified to express repetitive patterns. These terms are concatenated without separators and then optionally combined into disjunctions using the | operator. Empty disjuncts are allowed, with pattern | matching either the given pattern or the empty sequence. Some examples should clarify:
Table 6-5.
All Atoms with a Special Meaning in the ECMAScript Grammar
Atom
Matches
.
Any single character except line terminators4.
\0, \f, \n, \r, \t, \v
One of the common control characters: null, form feed (FF), line feed (LF), carriage return (CR), horizontal tab (HT), and vertical tab (VT).
\cletter
The control character whose code unit equals that of the given ASCII lowercase or uppercase letter modulo 32. E.g. \cj == \cJ == \n (LF) as (code unit of j or J) % 32 = (106 or 74) % 32 = 10 = code unit of LF.
\xhh
The ASCII character with hexadecimal code unit hh (exactly two hexadecimal digits). E.g. \x0A == \n (LF), and \x6A == j.
\uhhhh
The Unicode character with hexadecimal code unit hhhh (exactly four hexadecimal digits). E.g. \u006A == j, and \u03c0 == π (Greek letter pi).
[ class ]
A character of a given class (see main text): [abc], [a-z], [[:alpha:]], and so on.
[^ class ]
A character not of a given class (see main text). E.g.: [^0-9], [^[:s:]], and so on.
\d
A decimal digit character (short for [[:d:]] or [[:digit:]]).
\s
A whitespace character (short for [[:s:]] or [[:space:]]).
\w
A word character, that is: an alphanumeric or underscore character (short for [[:w:]] or [_[:alnum:]]).
\D, \S, \W
Complement of \d, \s, \w. In other words, any character that is not a decimal digit, whitespace, or word character, respectively (short for [^[:d:]] and so on).
\character
The given character, as is. Required only for . * + ? ^ $ ( ) [ ] { } | because without escaping, these have special meaning; but any character may be used as long as \character has no special meaning.
( pattern )
Matches pattern and creates a marked sub-expression, turning it into an atom that can be quantified, for one. The sequence it matches (called a submatch) can be retrieved from a match_results or referred to using a back reference (discussed later), either further in the surrounding pattern or in the replacement pattern when using regex_replace().
(?: pattern )
Same as previous, but the sub-expression is unmarked, meaning the sub-match is not stored in a match_results, nor can it be referred to.
\integer
A back reference: matches the exact same sequence as the marked sub-expression with index integer did earlier. Sub-expressions are counted left to right in the order their opening parentheses appear in the full pattern, starting from one (recall: \0 matches the null character).
Table 6-6.
Assertions Supported by the ECMAScript Grammar
Assertion
Matches If the Current Position Is ...
^
The beginning of the target (unless match_not_bol is specified), or a position that immediately follows a line-terminator character.4
$
The end of the target (unless match_not_eol is specified), or the position of a line-terminator character.
\b
A word boundary: the next character is a word character5 whereas the previous is not, or vice versa. The beginning and end of the target are also word boundaries if the target begins/ends with a word character (and match_not_bow/match_not_eow is not specified, respectively).
\B
Not a word boundary: both the previous and next character are either word or non-word characters. See \b for when the beginning and end of the target are word boundaries.
(?= pattern )
A position at which the given pattern could be matched next. This is called a positive lookahead.
(?! pattern )
A position at which the given pattern would not be matched next. This is called a negative lookahead.
Table 6-7.
Quantifiers That Can Be Used for Repeated Matches of Atoms
Quantifier
Meaning
atom *
Greedily matches atom zero or more times.
atom +
Greedily matches atom one or more times.
atom ?
Greedily matches atom zero or one time.
atom {i}
Greedily matches atom exactly i times.
atom {i,}
Greedily matches atom i or more times.
atom {i,j}
Greedily matches atom between i and j times.
  • \r\n?|\n matches line-break sequences for all major platforms (that is, \r, \r\n, or \n).
  • <(.+)>(.*)</\1> matches XML-like sequences of the form <TAG>anything</TAG>, using a back reference for matching the closing tag, and extra grouping in the middle to allow retrieval of the second submatch (discussed later).
  • (?:\d{1,3}\.){3}\d{1,3} matches IPv4 addresses. This naïve version also matches illegal addresses, though, such as 999.0.0.1, and the poor grouping prohibits the four matched numbers from being retrieved afterward. Note that without the ?:, \1 still would only refer to the third matched number.
Tip
When entering regular expressions as string literals in a C++ program, all backslashes have to be escaped. The first example becomes "\\r\\n?|\\n". Because this is both tedious and obscuring, we recommend using raw string literals instead: for instance, R"(\r\n?|\n)". Remember that the surrounding parentheses are part of the raw string literal notation and do not constitute a regular expression group.
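The sketch below writes the three example patterns as raw string literals and checks complete strings against them with std::regex_match(), which is discussed later. The function names is_line_break, is_tagged, and looks_like_ipv4 are hypothetical.

```cpp
#include <regex>
#include <string>

// \r\n?|\n : a line break as used on all major platforms.
bool is_line_break(const std::string& s) {
    static const std::regex re(R"(\r\n?|\n)");
    return std::regex_match(s, re);
}

// <(.+)>(.*)</\1> : an XML-like element whose closing tag repeats the opening tag.
bool is_tagged(const std::string& s) {
    static const std::regex re(R"(<(.+)>(.*)</\1>)");
    return std::regex_match(s, re);
}

// (?:\d{1,3}\.){3}\d{1,3} : the naive IPv4 pattern; it also accepts 999.0.0.1.
bool looks_like_ipv4(const std::string& s) {
    static const std::regex re(R"((?:\d{1,3}\.){3}\d{1,3})");
    return std::regex_match(s, re);
}
```

The escaped literal "\\r\\n?|\\n" and the raw literal R"(\r\n?|\n)" denote exactly the same characters, so either may be passed to the regex constructor.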
The difference between an atom and an assertion is that the former consumes characters from the target sequence (typically one), whereas the latter does not. The (quantified) atoms in a pattern consume target characters one by one, simultaneously progressing left to right through both the pattern and target sequences. For an assertion to match, a specific condition must hold on the current position in the target (think of it as the caret position when typing text).
Most of the atoms in Table 6-5 match a single character; only subexpressions and back references may match a sequence. Any other single character is also an atom that matches simply that character. The match_xxx flags mentioned in Table 6-6 are optionally passed to the matching functions or iterators discussed later.

Character Classes

A character class is a [ d ] or [^ d ] atom that defines a set of characters it may (for [ d ]) or may not ([^ d ]) match. The class definition d is a sequence of class atoms, each one either
  • An individual character.
  • A character range of the form from - to (bounds are inclusive).
  • Starting with a backslash (\): the equivalent of any atom from Table 6-5 except back references, with the obvious meaning. Note that characters such as * + . $ do not need escaping in this context, but characters - [ ] : ^ may. Also, inside class definitions, \b denotes the backspace character (\u0008).
  • One of three types of special character class atoms enclosed between nested square brackets (described shortly).
The descriptors are concatenated without separators. For example: [_a-zA-Z] matches either an underscore or a single character in the range a–z or A–Z, whereas [^\d] matches any single character that is not a decimal digit.
The first special class atom has form [: name :]. At least the following names are supported: equivalents of all 12 character classes explained in the section on character classification (alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, and xdigit), and d, s, and w. Of the latter, d and s are short for digit and space, and w is the class of word characters with [:w:] equivalent to _[:alnum:] (mind the underscore!). That is, for the classic "C" locale, [[:w:]] == [_a-zA-Z0-9]. As another example, [\D] == [^\d] == [^[:d:]] == [^[:digit:]] == [^0-9].
The second type of special class atoms looks like [. name .], where name is a locale- and implementation-specific collating element name. This name can be a single character c, in which case [[. c .]] is equivalent to [ c ]. Similarly, [[.comma.]] may equal [,]. Some names refer to multicharacter collating elements: that is, multiple characters that are considered a single character in a specific alphabet and its sorting order. Possible names for the latter include those of digraphs: ae, ch, dz, ll, lj, nj, ss, and so on. For instance, [[.ae.]] matches two characters, whereas [ae] matches one.
Class atoms of the form [= name =], finally, are similar to [. name .], except that they match all characters that are part of the same primary equivalence class as the named collating element. Essentially, this means [=e=] in French should match not only e, but also é, è, ê, E, É, and so on. Similarly, [=ss=] in German should match the digraph ss, but also the Eszett character (ß).
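To illustrate a few of these class atoms, the following sketch combines a named class, an escaped class inside brackets, and a negated class. The function names is_identifier and has_no_digits are hypothetical, and the results stated below assume the default classic locale.

```cpp
#include <regex>
#include <string>

// An underscore or letter, followed by any number of word characters.
bool is_identifier(const std::string& s) {
    static const std::regex re(R"([_[:alpha:]]\w*)");
    return std::regex_match(s, re);
}

// Zero or more characters, none of which is a decimal digit.
bool has_no_digits(const std::string& s) {
    static const std::regex re(R"([^\d]*)");
    return std::regex_match(s, re);
}
```

For example, is_identifier("foo_1") is true and is_identifier("1abc") is false, while has_no_digits("a1") is false.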

Greedy vs. Non-Greedy Quantification

By default, quantified atoms as defined in Table 6-7 are greedy: they first match sequences that are as long as possible and only try shorter sequences if that does not lead to a successful match. To make them non-greedy—that is, to make them try the shortest possible sequences first—add a question mark (?) after the quantifier.
Recall, for example, the earlier example "<(.+)>(.*)</\1>". When searching for or replacing its first match in "<b>Bold</b>, not bold, <b>bold again</b>", this pattern matches the full sequence. The non-greedy version, "<(.+)>(.*?)</\1>", instead matches only the desired "<b>Bold</b>".
As an alternative to a non-greedy quantifier, a negative character class may be considered as well (it may be more efficient), such as "<(.+)>([^<]*)</\1>".
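The difference can be observed with std::regex_search(), which is covered later in this chapter. In this sketch, first_match is a hypothetical helper that returns the first matched sequence, or an empty string if there is none.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the first sequence in text matched by pattern.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch results;
    return std::regex_search(text, results, re) ? results.str() : std::string();
}
```

With text equal to "<b>Bold</b>, not bold, <b>bold again</b>", the greedy pattern R"(<(.+)>(.*)</\1>)" returns the entire text, whereas both R"(<(.+)>(.*?)</\1>)" and R"(<(.+)>([^<]*)</\1>)" return "<b>Bold</b>".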

Regular Expression Objects

The <regex> library models regular expressions as std::basic_regex<CharT> objects. Of this, at least two specializations are available for use with narrow strings (char sequences) and wide strings (wchar_t sequences): std::regex and std::wregex. The examples use regex, but wregex is completely analogous.

Construction and Syntax Options

A default-constructed regex does not match any sequence. More useful regular expressions are created using the constructors of the following form:
regex(Pattern, regex::flag_type flags = regex::ECMAScript);
The desired regular expression Pattern may be represented as either a std::string, a null-terminated char* array, a char* buffer with a size_t length (the number of chars to be read from the buffer), an initializer_list<char>, or a range formed by a beginning and end iterator.
When the given pattern is invalid (mismatched parentheses, a bad back reference, and so on), a std::regex_error is thrown. This is a std::runtime_error with an additional code() member returning one of 13 error codes of type std::regex_constants::error_type (error_paren, error_backref, and so on).
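Because construction may throw, code that accepts patterns from users typically wraps it in a try block. A minimal sketch, with the hypothetical helper name is_valid_pattern:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if pattern is accepted by the default ECMAScript grammar.
bool is_valid_pattern(const std::string& pattern) {
    try {
        const std::regex re(pattern);  // throws std::regex_error if pattern is invalid
        return true;
    } catch (const std::regex_error&) {
        return false;  // the caught exception's code() would identify the kind of error
    }
}
```

For instance, is_valid_pattern("ab*a") is true, whereas is_valid_pattern("(ab") is false because of the unmatched parenthesis.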
The last argument determines which grammar is used and may be used to toggle certain syntax options. The flag_type values are aliases for those of std::regex_constants::syntax_option_type. Because it is a bitmask type, its values may be combined using the | operator. The following syntax options are supported:
Option
Effect
collate
Character ranges of form [a-z] become locale sensitive. For a French locale, for instance, [a-z] should then match é, è, and so on.
icase
Character matches are done in a case-insensitive manner.
nosubs
No submatches against sub-expressions are stored in match_results (discussed later). Back references will likely fail as well.
optimize
Hints the implementation to prefer improved matching speed over performance during construction of regular expression objects.
ECMAScript
Uses the ECMAScript-based regular expression grammar (default).
basic
Uses the POSIX basic regular expression grammar (BRE).
extended
Uses the POSIX extended regular expression grammar (ERE).
grep
Uses the grammar of the POSIX utility grep (a BRE variant).
egrep
Uses the grammar of the POSIX utility grep -E (an ERE variant).
awk
Uses the grammar of the POSIX utility awk (another ERE variant).
Of the last six options, only one is allowed to be specified; if none is specified, ECMAScript is used by default. All POSIX grammars are older and less powerful than the ECMAScript grammar. The only reason to use them would therefore be that you are already familiar with them, or have preexisting regular expressions. Either way, there is no reason to detail these grammars here.

Basic Member Functions

A regex object is primarily intended to be passed to one of the global functions or iterator adapters explained later, so not many member functions operate on it:
  • A regex can be copied, moved, and swapped.
  • It can be (re)initialized with a new regular expression and optional syntax options using assign(), which has the exact same set of signatures as its nondefault constructors.
  • The flags() member returns the syntax options flag it was initialized with, and mark_count() returns the number of marked sub-expressions in its regular expression (see Table 6-5).
  • The regex std::locale is returned by getloc(). This affects matching behavior in several ways and is initialized with the active global C++ locale upon construction. After construction, it may be changed using the imbue() function.
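These members can be exercised as follows. The struct regex_info and the functions describe and make_case_insensitive are hypothetical names used only in this sketch.

```cpp
#include <regex>
#include <string>

// Hypothetical summary of a regex object's properties.
struct regex_info {
    unsigned marks;  // number of marked sub-expressions
    bool icase;      // whether the icase syntax option is set
};

regex_info describe(const std::regex& re) {
    return { re.mark_count(), (re.flags() & std::regex::icase) != 0 };
}

// Reinitializes re with a new pattern and the icase syntax option.
void make_case_insensitive(std::regex& re, const std::string& pattern) {
    re.assign(pattern, std::regex::icase);
}
```

For a regex constructed from "(a)(?:b)(c)", describe() reports two marked sub-expressions, because the (?: ) group is unmarked. After make_case_insensitive(re, "hello"), the same object matches "HeLLo".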

Matching and Searching Patterns

The std::regex_match() function verifies that the full target sequence matches a given pattern, whereas the similar std::regex_search() searches for a first occurrence of a pattern in the target. Both return false if no successful match is found. These function templates have an analogous set of overloads, all with signatures of this form:
bool regex_match (Target [, Results&], const Regex&, match_flag_type = 0);
bool regex_search(Target [, Results&], const Regex&, match_flag_type = 0);
All but the last argument are templated on the same character type CharT, with implementations available for at least char and wchar_t. As for the arguments:
  • A typical combination for the first three arguments is (w)string, (w)smatch, (w)regex.
  • Instead of a basic_string<CharT>, the Target sequence may also be represented as a null-terminated CharT* array (used also for string literals) or a pair of bidirectional iterators that mark the bounds of a CharT sequence. In both these cases, the normal Results type becomes std::(w)cmatch.
  • The w?[sc]match types used for the optional match Results output argument are discussed in the next subsection.
  • The Regex object passed is not copied, so these functions must not (ideally cannot) be called using a temporary object.
  • To control matching behavior, a value of the bitmask type std::regex_constants::match_flag_type may be passed. Supported values are shown in the following table:
Match Flag
Effect
match_default
Use default matching behavior (this constant has value zero).
match_not_bol
match_not_eol
match_not_bow
match_not_eow
The first or last position in the target sequence is no longer considered the beginning/end of a line/word. Affects the ^, $, \b, and \B assertions as explained in Table 6-6.
match_any
If multiple disjuncts of a disjunction match, it is not required to find the longest match among them: any match will do (for example, the first one found, if that speeds things up). Not relevant for the ECMAScript grammar, because this already prescribes the use of the leftmost successful match for disjunctions.
match_not_null
The pattern will not match the empty sequence.
match_continuous
The pattern only matches sequences that start at the beginning of the target sequence (implied for regex_match()).
match_prev_avail
When deciding on line and word boundaries for ^, $, \b, and \B assertions, matching algorithms look at the character at --first, with first pointing to the start of the target sequence. When set, match_not_bol and match_not_bow are ignored. Useful when repeatedly calling regex_search() on consecutive target subsequences. The iterators explained later do this correctly and are the recommended way to enumerate matches.
If either algorithm fails, a std::regex_error is raised. Because the regular expression’s syntax is already verified upon construction of the regex object (see earlier), this only rarely occurs for very complex expressions if the algorithm runs out of resources.
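The following sketch contrasts the two algorithms and one of the match flags, using the overloads without a Results argument. The wrapper names matches_fully, contains, and starts_with_match are hypothetical.

```cpp
#include <regex>
#include <string>

// True if the entire target matches the pattern.
bool matches_fully(const std::string& target, const std::regex& re) {
    return std::regex_match(target, re);
}

// True if the pattern occurs anywhere in the target.
bool contains(const std::string& target, const std::regex& re) {
    return std::regex_search(target, re);
}

// True if a match starts at the very beginning of the target.
bool starts_with_match(const std::string& target, const std::regex& re) {
    return std::regex_search(target, re,
                             std::regex_constants::match_continuous);
}
```

With the pattern ab*a, matches_fully("abba") is true but matches_fully("xabbay") is false, and contains("xabbay") is true. With match_continuous, the search succeeds for "abbay" but fails for "xabbay".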

Match Results

A std::match_results<CharIter> is effectively a sequential container (see Chapter 3) of sub_match<CharIter> elements, which are std::pairs of bidirectional CharIters pointing into the target sequence marking the bounds of the submatch sequences. At index 0, there is a sub_match for the full match, followed by one sub_match per marked sub-expression in the order their opening parentheses appear in the regular expression (see Table 6-5). The following template specializations are provided:
Target | match_results | sub_match | CharIter
std::string | std::smatch | std::ssub_match | std::string::const_iterator
std::wstring | std::wsmatch | std::wssub_match | std::wstring::const_iterator
const char* | std::cmatch | std::csub_match | const char*
const wchar_t* | std::wcmatch | std::wcsub_match | const wchar_t*
std::sub_match
In addition to the first and second members inherited from std::pair, sub_matches have a third member variable called matched. This Boolean is false if the match failed or if the corresponding sub-expression did not participate in the match. The latter occurs, for example, if the sub-expression was part of a nonmatched disjunct, or of a nonmatched atom quantified with, for example, ?, *, or {0,n}. When matching "(a)?b|(c)" against "b", for instance, the match succeeds with a match_results that contains two empty sub_matches with matched == false.
The operations available for sub_matches are summarized in this table:
Operation
Description
length()
The length of the match sequence (0 if not matched)
str() /
cast operator
Returns the match sequence as a std::basic_string
compare()
Returns 0 if the sub_match compares equal to, and a positive / negative number if it compares greater / smaller than, a given sub_match, basic_string or null-terminated character array
==, !=,
<, <=, >, >=
Non-member operators for compare()ing between a sub_match and a sub_match, basic_string or character array, or vice versa
<<
Nonmember operator for streaming to an output stream
std::match_results
A match_results can be copied, moved, swapped, and compared for equality using == and !=. In addition to those operations, the following member functions are available (functions related to custom allocators are omitted). Note that, unlike for strings, size() and length() are not equivalent here:
Operation
Description
ready()
A default-constructed match_results is not ready and becomes ready after execution of a match algorithm.
empty()
Returns size()==0 (true if not ready() or after a failed match).
size()
Returns the number of sub_matches contained (one plus the number of marked sub-expressions) if ready() and the match was successful, or zero otherwise.
max_size()
The theoretical maximum size() due to implementation or memory limitations.
operator[]
Returns the sub_match with specified index n (see earlier) or an empty sub_match sub with sub.matched == false if n >= size().
length(size_t=0)
results.length( n ) is equivalent to results[ n ].length().
str(size_t=0)
results.str( n ) is equivalent to results[ n ].str().
position(size_t=0)
The distance between the start of the target sequence and results[ n ].first.
prefix()
Returns a sub_match ranging from the start of the target sequence (inclusive) until that of the match (non-inclusive). Always empty for regex_match(). Undefined if not ready().
suffix()
Returns a sub_match ranging from the end of the full match (non-inclusive) until the end of the target sequence (inclusive). Always empty for regex_match(). Undefined if not ready().
begin(), cbegin(), end(), cend()
Return iterators pointing to the first or one-past-the-last sub_match contained in the match_results.
format()
Formats the matched sequence according to a specified format. The different overloads (either string- or iterator-based) have output, pattern, and format flag arguments analogous to those of the std::regex_replace() function explained later. Any match_xxx flags are ignored; only format_yyy flags are used.
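As a rough illustration of several of these members, the following sketch combines prefix(), str(), position(), suffix(), and format() into one description string. The function name describe_first_match and the output layout are assumptions made for this example only:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: describes the first match of `pattern` in `target` as
// "[prefix][match@position][suffix] -> formatted", or "no match".
std::string describe_first_match(const std::string& target,
                                 const std::regex& pattern) {
    std::smatch results;
    if (!std::regex_search(target, results, pattern)) return "no match";
    return "[" + results.prefix().str() + "]"
         + "[" + results.str() + "@" + std::to_string(results.position()) + "]"
         + "[" + results.suffix().str() + "]"
         + " -> " + results.format("$2/$1");  // swap sub-expressions 1 and 2
}
```

With the pattern (\d+)-(\d+) and the target "on 2016-03 we", the result is "[on ][2016-03@3][ we] -> 03/2016".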

Example

The following example illustrates the use of regex_match(), regex_search(), and match_results (smatch):
[Code listing (figure A417649_1_En_6_Figu_HTML.gif) is not reproduced here.]
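Because the original listing is not available in this text, here is a minimal sketch of what such an example could look like. The pattern, the target strings, and the function names are assumptions chosen to agree with the outputs quoted later in this section, not the original code:

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumed pattern (hypothetical, not taken from the original listing):
// sub-expression 1 is a tag name, sub-expression 2 the tag's content, and
// the back reference \1 requires the closing tag to repeat the name.
static const std::regex kTagPattern(R"(<(\w+)>(.*?)</\1>)");

// regex_match() succeeds only if the pattern matches the entire target.
bool matches_entirely(const std::string& target) {
    std::smatch results;
    return std::regex_match(target, results, kTagPattern);
}

// regex_search() finds the first match anywhere in [first, last). Calling it
// repeatedly, resuming after the previous full match, enumerates all matches.
std::vector<std::string> contents_by_search(const std::string& target) {
    std::vector<std::string> contents;
    std::smatch results;
    auto first = target.cbegin();
    while (std::regex_search(first, target.cend(), results, kTagPattern)) {
        contents.push_back(results.str(2));  // content of the matched tag
        first = results[0].second;           // resume after the full match
    }
    return contents;
}
```

For the target "<b>Bold</b>, not bold, <b>bold again</b>.", matches_entirely() returns false, whereas contents_by_search() collects "Bold" and "bold again". Note that this naive loop does not terminate for patterns that can match an empty sequence.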
But the preferred way of enumerating all matches is to use the iterators discussed in the next subsection.

Match Iterators

The std::regex_iterator and regex_token_iterator classes facilitate traversing all matches of a pattern in a target sequence. Like match_results, both are templated with a type of character iterator (CharIter). Four analogous typedefs also exist for the most common cases: the iterator type prefixed with s, ws, c, or wc (for example, std::sregex_iterator or std::wcregex_token_iterator). The while loop from the example at the end of the previous subsection, for instance, may be rewritten as follows:
[Code listing (figure A417649_1_En_6_Figv_HTML.gif) is not reproduced here.]
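The original listing is not available in this text. A minimal sketch of such a for_each() loop, assuming a hypothetical tag pattern whose second marked sub-expression is the tag content, might look as follows:

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects sub_match 2 of every match in `target`.
std::vector<std::string> contents_by_iterator(const std::string& target) {
    // Assumed pattern; the regex must outlive the iterators that refer to it.
    static const std::regex pattern(R"(<(\w+)>(.*?)</\1>)");
    std::vector<std::string> contents;
    std::sregex_iterator begin(target.cbegin(), target.cend(), pattern);
    std::sregex_iterator end;  // default-constructed: end-of-sequence iterator
    std::for_each(begin, end, [&contents](const std::smatch& match) {
        contents.push_back(match.str(2));
    });
    return contents;
}
```

Dereferencing a std::sregex_iterator yields a std::smatch, so each iteration has access to the same members as after a call to regex_search().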
In other words, a regex_iterator is a forward iterator that enumerates all matches of a pattern, each as a match_results, as if found by repeatedly calling regex_search(). The previous for_each() loop is not only shorter and clearer, though; it is also more correct in general than our naïve while loop: the iterator, for one, sets the match_prev_avail flag after the first iteration. Only one nontrivial constructor is available, creating a regex_iterator<CharIter> pointing to the first match (if any) of a given Regex in the target sequence bounded by two bidirectional CharIters:
regex_iterator(CharIter, CharIter, const Regex&, match_flag_type = 0);
Analogous to a regex_iterator, which enumerates match_results, a regex_token_iterator enumerates all or specific sub_matches contained in these match_results. The same example, for instance, may be written as
[Code listing (figure A417649_1_En_6_Figw_HTML.gif) is not reproduced here.]
The constructors of regex_token_iterator are analogous to the constructor of regex_iterator but have an extra argument indicating which sub_matches to enumerate. Overloads are defined for a single int (as in the example), vector<int>, int[ N ], and initializer_list<int>. Replacing the 2 in the example with {0,1}, for example, outputs "<b>Bold</b>", "b", "<b>bold again</b>", and then "b". When omitted, this argument defaults to 0, indicating only full pattern sub_matches are to be enumerated (the example then prints "<b>Bold</b>" and "<b>bold again</b>").
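A minimal sketch of this selection mechanism, using the constructor overload that takes a vector<int>, could look like this. The pattern, the target, and the function name tokens are assumptions chosen to reproduce the outputs quoted above:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: enumerates the sub_matches with the given indices for
// every match of an assumed tag pattern in `target`.
std::vector<std::string> tokens(const std::string& target,
                                const std::vector<int>& submatches) {
    static const std::regex pattern(R"(<(\w+)>(.*?)</\1>)");
    std::vector<std::string> result;
    std::sregex_token_iterator it(target.cbegin(), target.cend(),
                                  pattern, submatches);
    const std::sregex_token_iterator end;  // end-of-sequence iterator
    for (; it != end; ++it) {
        result.push_back(it->str());       // *it is a std::ssub_match
    }
    return result;
}
```

For the target "<b>Bold</b>, not bold, <b>bold again</b>.", passing {2} yields "Bold" and "bold again", whereas {0, 1} yields "<b>Bold</b>", "b", "<b>bold again</b>", and "b".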
The sub_match argument of a regex_token_iterator can also be -1, which turns it into a field splitter or tokenizer. This is a safe alternative to the C function strtok() from <cstring>. In this mode, a regex_token_iterator iterates over all subsequences that do not match the regular expression pattern. It can, for example, be used to split a comma-separated string into its different fields (or tokens). The regular expression used in that case is simply ",".
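The tokenizer mode can be sketched as follows. The function name split_fields is hypothetical and only serves this illustration:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits a comma-separated line into its fields by
// enumerating the subsequences that do NOT match the separator pattern.
std::vector<std::string> split_fields(const std::string& line) {
    static const std::regex separator(",");
    std::vector<std::string> fields;
    std::sregex_token_iterator it(line.cbegin(), line.cend(), separator, -1);
    const std::sregex_token_iterator end;  // end-of-sequence iterator
    for (; it != end; ++it) {
        fields.push_back(it->str());
    }
    return fields;
}
```

Calling split_fields("alpha,beta,gamma") returns the three fields "alpha", "beta", and "gamma". Unlike strtok(), this leaves the input string unmodified and keeps no hidden global state.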

Replacing Patterns

The final regular expression algorithm, std::regex_replace(), replaces all matches of a given pattern with another. The signatures are as follows:
String regex_replace(Target, Regex&, Format, match_flag_type = 0);
Out regex_replace(Out, Begin, End, Regex&, Format, match_flag_type = 0);
As before, argument types are templated on the same character type CharT, with support for at least char and wchar_t. The replacement Format is represented as either a (w)string or a null-terminated C-style string. For the target sequence, there are two groups of overloads. Those in the first represent the Target as a (w)string or a C-style string and return the result as a (w)string. Those in the second denote the target using bidirectional Begin and End character iterators and copy the result into an output iterator Out. The return value for the latter is an iterator pointing to one past the last character that was written.
All matches of the given Regex are replaced with the Format sequence, which by default may contain the following special character sequences:
Format
Replacement
$n
A copy of the nth marked sub-expression of the match, where n > 0 is counted as with back references: see Table 6-5.
$&
A copy of the entire match.
$`
A copy of the prefix, the part of the target that precedes the match.
$'
A copy of the suffix, the part of the target that follows the match.
$$
A $ character (this is the only escaping required).
As before, a std::regex_error is thrown only if the algorithm has insufficient resources to evaluate the match.
The following code, for example, prints "d*v*w*l*d" and "debolded":
std::regex vowels("[aeiou]");
std::cout << std::regex_replace("devoweled", vowels, "*") << ' ';
std::regex bolds("<b>(.*?)</b>");
std::string target = "<b>debolded</b>";
std::ostream_iterator<char> out(std::cout);
std::regex_replace(out, target.cbegin(), target.cend(), bolds, "$1");
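The other special sequences work along the same lines. The following sketch wraps the string-based overload in a hypothetical helper, replace_all, to show them in isolation:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every match of `pattern` in `target` using
// the default (ECMAScript) interpretation of the `format` sequence.
std::string replace_all(const std::string& target,
                        const std::string& pattern,
                        const std::string& format) {
    return std::regex_replace(target, std::regex(pattern), format);
}
```

For example, replace_all("abc", "b", "[$`|$&|$']") returns "a[a|b|c]c" (prefix, full match, and suffix of the single match), and replace_all("cost 5", "(\\d+)", "$$$1") returns "cost $5", where $$ produces the literal $ and $1 the digits.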
The final argument is again a std::regex_constants::match_flag_type, which for regex_replace() can be used to tweak both the matching behavior of the regular expression (using the same match_xxx values as listed earlier) and the formatting of the replacement. For the latter, the following values are supported:
Format Flag
Effect
format_default
Use default formatting (this constant has value zero).
format_sed
Use the same syntax as the POSIX utility sed for the Format.
format_no_copy
Parts of the Target sequence that are not matches of the regular expression pattern are not copied to the output.
format_first_only
Only the first occurrence of the pattern is replaced.
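The effect of these flags can be sketched by reusing the vowel pattern from the earlier example. The wrapper function replace_vowels is hypothetical:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces the vowels in `target` according to `format`,
// with the given format flags (default formatting when omitted).
std::string replace_vowels(
        const std::string& target, const std::string& format,
        std::regex_constants::match_flag_type flags =
            std::regex_constants::format_default) {
    static const std::regex vowels("[aeiou]");
    return std::regex_replace(target, vowels, format, flags);
}
```

With the target "devoweled" and the format "*", the default flags give "d*v*w*l*d", whereas format_first_only gives "d*voweled". With the format "[$&]" and format_no_copy, only the formatted matches remain: "[e][o][e][e]". The flags are bitmask values and can be combined with operator|.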
Footnotes
1
These classes have two more optional template parameters: a number specifying the largest code point to output without error, and a codecvt_mode bitmask value (default 0) with possible values little_endian (output encoding) and consume_header / generate_header (read/write initial BOM header to determine endianness).
 
2
This example does not work in Visual Studio 2015. It compiles after replacing char32_t with __int32 and u32string with basic_string<__int32>, but the result is wrong.
 
3
Nearly all functions: for performance, is(), scan_is(), and scan_not() of the ctype<char> specialization do not call a virtual function, but perform lookups in a mask* array (ctype::classic_table() for the "C" locale). A custom instance may be created by passing a custom lookup array to the facet’s constructor.
 
4
A line terminator is one of four characters: line feed (\n), carriage return (\r), line separator (\u2028), or paragraph separator (\u2029).
 
5
A word character is any character in the [[:w:]] or [_[:alnum:]] class: that is, an underscore or any alphabetic or numerical digit character.
 