Strings <string>
The Standard defines four different string types, each for a different char-like type:
String Type | Class | Characters | Typical Character Size |
---|---|---|---|
Narrow strings | std::string | char | 8 bit |
Wide strings | std::wstring | wchar_t | 16 or 32 bit |
UTF-16 strings | std::u16string | char16_t | 16 bit |
UTF-32 strings | std::u32string | char32_t | 32 bit |
The names in the first column are purely indicative, because strings are completely agnostic about the character encoding used for the char-like items—or code units, as they are technically called—they contain. Narrow strings, for example, may be used to store ASCII strings, as well as strings encoded using UTF-8 or DBCS.
To illustrate, we will mostly use std::string. Everything in this section, though, applies equally well to all types. The locale and regular expression functionalities discussed thereafter are, unless otherwise noted, only required to be implemented for narrow and wide strings.
All four string types are instantiations of the same class template, std::basic_string<CharT>. A basic_string<CharT> is essentially a vector<CharT> with extra functions and overloads either to facilitate common string operations or for compatibility with C-style strings (const CharT*). All members of vector are provided for strings as well, except for the emplacement functions (which are of little use for characters). This implies that, unlike in other mainstream languages such as .NET, Python, and Java, strings in C++ are mutable. It also means, for example, that strings can readily be used with all algorithms seen in Chapter 4:
The remainder of this section focuses on the functionality that strings add compared to vectors. For the functions that strings have in common with vector, we refer to Chapter 3. One thing to note is that string-specific functions and overloads are mostly index-based rather than iterator-based. The last three lines in the previous example, for instance, may be written more conveniently as
or
The equivalent of the end() iterator when working with string indices is basic_string::npos. This constant is consistently used to represent half open-ended ranges (that is, to denote “until the end of the string”), and, as you see next, as the “not found” return value for find()-like functions.
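As a minimal sketch of the two styles (an illustration written for this purpose, not one of the chapter's listings; the helper names are made up), both functions below replace the first occurrence of a word. The iterator-based one treats end() as "not found", and the index-based one uses npos:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Iterator-based: std::search returns end() when the pattern is absent.
std::string replace_first_iter(std::string s, const std::string& from,
                               const std::string& to) {
    auto it = std::search(s.begin(), s.end(), from.begin(), from.end());
    if (it != s.end())
        s.replace(it, it + from.size(), to);
    return s;
}

// Index-based: find() returns npos when the pattern is absent.
std::string replace_first_index(std::string s, const std::string& from,
                                const std::string& to) {
    const auto pos = s.find(from);
    if (pos != std::string::npos)
        s.replace(pos, from.size(), to);
    return s;
}
```

Both produce the same result; the index-based version simply needs less ceremony.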
Searching in Strings
Strings offer six member functions to search for substrings or characters: find() and rfind(), find_first_of() and find_last_of(), and find_first_not_of() and find_last_not_of(). These always come in pairs: one to search from front to back, and one to search from back to front. All also have the same four overloads of the following form:
The pattern to search for is either a single character or a string, with the latter represented as a C++ string, a null-terminated C-string, or a character buffer of which the first n values are used. The (r)find() functions search for an occurrence of the full pattern, and the find_xxx_of() / find_xxx_not_of() family of functions search for any single character that occurs / does not occur in the pattern. The result is the index of the (start of the) first occurrence starting from either the beginning or end, or npos if no match is found.
The mostly optional pos parameter is the index at which the search should start. For the functions searching backward, the default value for pos is npos.
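The following sketch shows how the search functions might be combined in practice. The helper names count_occurrences and trim, and the default set of trimmed characters, are assumptions made for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts non-overlapping occurrences of `pattern` in `text`, using the pos
// parameter to resume the search after each match.
std::size_t count_occurrences(const std::string& text, const std::string& pattern) {
    if (pattern.empty()) return 0;
    std::size_t count = 0;
    for (auto pos = text.find(pattern); pos != std::string::npos;
         pos = text.find(pattern, pos + pattern.size()))
        ++count;
    return count;
}

// Strips leading and trailing characters that occur in `chars`.
std::string trim(const std::string& text, const std::string& chars = " \t\n") {
    const auto first = text.find_first_not_of(chars);
    if (first == std::string::npos) return "";  // nothing but trimmable characters
    const auto last = text.find_last_not_of(chars);
    return text.substr(first, last - first + 1);
}
```

Note how every loop or branch compares against npos rather than against the string's size.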
Modifying Strings
To modify a string, you can use all members known already from vector, including erase(), clear(), push_back(), and so on (see Chapter 3). Additional functions or functions with string-specific overloads are assign(), insert(), append(), +=, and replace(). Their behavior should be obvious; only replace() may need some explanation. First though, let’s introduce the multitude of useful overloads these five functions have. These are generally of this form:
For moving a string, assign(string&&) is defined as well. Because the += operator inherently only has a single parameter, naturally only the C++ string, C-style string, and initializer-list overloads are possible.
Analogous to its vector counterpart, for insert() the overloads marked with (*) return an iterator rather than a string. Likely for the same reason, the insert() function has these two additional overloads:
Only insert() and replace() need a Position. For insert(), this is usually an index (a size_t), except for the last two overloads, where it is an iterator (analogous again to vector::insert()). For replace(), the Position is a range, specified either using two const_iterators (not available for the substring overload) or using a start index and a length (not for the last two overloads).
In other words, replace() does not, as you may expect, replace occurrences of a given character or string with another. Instead, it replaces a specified subrange with a new sequence—a string, substring, fill pattern, and so on—possibly of different length. You saw an example of its use earlier (2 is the length of the replaced range):
s.replace(s.find("be"), 2, "are");
To replace all occurrences of substrings or given patterns, you can use regular expressions and the std::regex_replace() function explained later in this chapter. To replace individual characters, the generic std::replace() and replace_if() algorithms from Chapter 4 are an option as well.
A final modifying function with a noteworthy difference from its vector counterpart is erase(): in addition to the two iterator-based overloads, it has one that works with indices. Use it to erase the tail or a subrange or, if you like, to clear() it:
string& erase(size_t pos = 0, size_t len = npos);
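As a rough sketch of how these modifiers behave, the first function below replaces all occurrences of a substring with a plain find()/replace() loop (one possible alternative to regular expressions), and the second applies a few arbitrary edits. Both function names are invented for this example:

```cpp
#include <cassert>
#include <string>

// Replaces every occurrence of `from` with `to` by repeatedly combining
// find() with the index-based replace(pos, len, str) overload.
std::string replace_all(std::string s, const std::string& from, const std::string& to) {
    if (from.empty()) return s;  // avoid an endless loop on an empty pattern
    for (auto pos = s.find(from); pos != std::string::npos;
         pos = s.find(from, pos + to.size()))
        s.replace(pos, from.size(), to);
    return s;
}

// Exercises a few other modifiers; the input and edits are arbitrary.
std::string edit_demo() {
    std::string s = "be";
    s.insert(0, "To ");      // "To be"
    s.append(3, '.');        // "To be..."
    s += " or not";          // "To be... or not"
    s.erase(5, 3);           // "To be or not"
    s.erase(s.find(" or"));  // erase the tail: "To be"
    return s;
}
```

Resuming the search at pos + to.size() keeps the loop from rescanning text it has just inserted.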
Constructing Strings
In addition to the default constructor, which creates an empty string, the constructor has the same seven overloads as the functions in the previous subsection, plus of course one for string&&. (Like other containers, all string constructors have an optional argument for custom allocators as well, but this is for advanced use only.)
As of C++14, basic_string objects of various character types can also be constructed from corresponding string literals by appending the suffix s. This literal operator is defined in the std::literals::string_literals namespace:
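A short sketch of the literal suffix in use follows; the variable and function names are arbitrary:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using namespace std::literals::string_literals;

// With the s suffix, auto deduces a std::basic_string rather than a pointer.
auto narrow = "narrow"s;   // std::string
auto wide   = L"wide"s;    // std::wstring
auto utf16  = u"UTF-16"s;  // std::u16string
auto utf32  = U"UTF-32"s;  // std::u32string

// Unlike construction from a plain literal, an s-literal keeps embedded
// null characters, because the literal operator receives the length.
std::size_t embedded_null_length() { return "ab\0cd"s.size(); }
```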
String Length
To get a string’s length, you can use either the typical container member size() or its string-specific alias length(). Both return the number of char-like elements the string contains. Take care, though: C++ strings are agnostic on the character encoding used, so their length equals what is technically called the number of code units, which may be larger than the number of code points or characters. Well-known encodings where not all characters are represented as a single code unit are the variable-length Unicode encodings UTF-8 and UTF-16:
One way to get the number of code points is to convert to a UTF-32 encoded string first, using the character-encoding conversion facilities introduced later in this chapter.
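To make the difference between code units and code points concrete, here is a small sketch. The strings are spelled with explicit escapes so the example does not depend on the source file's encoding. The utf8_code_points helper is not a Standard Library function; it is an assumption of this example and only works for valid UTF-8 input:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx. Assumes valid UTF-8.
std::size_t utf8_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80) ++count;
    return count;
}

// "cafe" with an accented e (U+00E9), written as its two UTF-8 code units.
const std::string cafe = "caf\xC3\xA9";
// A single code point outside the BMP (U+1F600) needs two UTF-16 code units.
const std::u16string smiley16 = u"\U0001F600";
const std::u32string smiley32 = U"\U0001F600";
```

Here cafe.size() is 5 although the text has four code points, and smiley16.size() is 2 for a single code point.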
Copying (Sub)Strings
Another vector function (next to size()) that has a string-specific alias is data(), with its equivalent c_str(). Both return a const pointer to the internal character array (without copying). To copy the string to a C-style string instead, use copy():
size_t copy(char* out, size_t len, size_type pos = 0) const;
This copies up to len char values starting at pos to out; no terminating null character is appended. It can therefore also be used to copy a substring. To create a substring as a C++ string, use substr():
string substr(size_t pos = 0, size_t len = npos) const;
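A minimal sketch of copy() follows; the function name and the buffer size of five are arbitrary choices for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Copies up to 5 characters of `s`, starting at `pos`, into a fixed buffer
// and returns them as a new string. copy() does not write a terminating
// null character, so the returned count is used to delimit the result.
std::string copy_five(const std::string& s, std::size_t pos) {
    char buffer[5];
    const std::size_t copied = s.copy(buffer, sizeof(buffer), pos);
    return std::string(buffer, copied);
}
```

When the result is needed as a C++ string anyway, s.substr(pos, 5) does the same in one call.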
Comparing Strings
Strings may be compared lexicographically with other C++ strings or C-style strings using either the non-member comparison operators (==, <, >=, and so on), or their compare() member. The latter has the following overloads:
int compare(const string& str) const noexcept;
int compare(size_type pos1, size_type n1, const string& str
[, size_type pos2, size_type n2 = npos]) const;
int compare(const char* s) const;
int compare(size_type pos1, size_type n1, const char* s
[, size_type n2]) const;
pos1/pos2 is the position in the first/second string where the comparison should start, and n1/n2 is the number of characters to compare from the first/second string. The return value is zero if both strings are equal or a negative/positive number if the first string is less/greater than the second.
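The sketch below illustrates two typical uses of compare(). The helpers sign_of and starts_with are invented for this example; only the sign of compare()'s result is meaningful, so it is normalized before being inspected:

```cpp
#include <cassert>
#include <string>

// Reduces the result of compare() to -1, 0, or +1.
int sign_of(int value) { return (value > 0) - (value < 0); }

// Checks whether `s` starts with `prefix` by comparing only the first
// prefix.size() characters of `s` against the whole prefix.
bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}
```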
String Conversions
To parse various types of integral numbers from a string, a series of non-member functions of the following form has been defined:
int stoi(const (w)string&, size_t* index = nullptr, int base = 10);
The following variants exist: stoi(), stol(), stoll(), stoul(), and stoull(), where i stands for int, l for long, and u for unsigned. These functions skip all leading whitespace characters, after which as many characters are parsed as allowed by the syntax determined by the base. If an index pointer is provided, it receives the index of the first character that is not converted.
Similarly, to parse floating-point numbers, a set of functions exists of the following form:
float stof(const (w)string&, size_t* index = nullptr);
stof(), stod(), and stold() are provided to convert to float, double, and long double, respectively.
To perform the opposite conversion and convert from numerical types to a (w)string, functions to_(w)string(X) are provided, where X can be int, unsigned, long, unsigned long, long long, unsigned long long, float, double, or long double. The returned value is a std::(w)string.
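The following sketch shows the index and base parameters at work. The "42px" input format and both helper names are made up for illustration. When no conversion is possible, the sto*() functions throw std::invalid_argument (and std::out_of_range if the value does not fit the result type):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Parses a length such as "42px": returns the number and stores the
// unparsed remainder (the unit) in `unit`.
int parse_length(const std::string& text, std::string& unit) {
    std::size_t index = 0;
    const int value = std::stoi(text, &index);  // index: first unconverted character
    unit = text.substr(index);
    return value;
}

// Returns true if `text` does not start with a number.
bool is_not_a_number(const std::string& text) {
    try {
        std::stoi(text);
        return false;
    } catch (const std::invalid_argument&) {
        return true;
    }
}
```

A base of 0 lets the function detect the base from the usual prefixes, so std::stoi("0x1A", nullptr, 0) parses a hexadecimal number.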
Character Classification <cctype>, <cwctype>
The <cctype> and <cwctype> headers offer a series of functions to classify, respectively, char and wchar_t characters. These functions are std::isclass(int) (defined only for ints that represent chars) and std::iswclass(wint_t) (analogous; wint_t is an integral typedef), where class equals one of the values in Table 6-1. All functions return a non-zero int if the given character belongs to the class, or zero otherwise.
Table 6-1. The 12 Standard Character Classes
Class | Description |
---|---|
cntrl | Control characters: all non-print characters. Includes: '\0', '\t', '\n', '\r', and so on. |
print | Printable characters: digits, letters, space, punctuation marks, and so on. |
graph | Characters with graphical representation: all print characters except ' '. |
blank | Whitespace characters that separate words on a line. At least ' ' and '\t'. |
space | Whitespace characters: at least all blank characters, '\n', '\r', '\v', and '\f'. Never alpha characters. |
digit | Decimal digits (0–9). |
xdigit | Hexadecimal digits (0–9, A–F, a–f). |
alpha | Letter characters. At least all lowercase and uppercase characters, and never any of the cntrl, digit, punct, and space characters. |
lower | Lowercase alpha letters (a–z for the default locale). |
upper | Uppercase alpha letters (A–Z for the default locale). |
alnum | Alphanumeric characters: union of all alpha and digit characters. |
punct | Punctuation marks (! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { \| } ~ for the default locale). Never a space or alnum character. |
The same headers also offer the tolower() / toupper() and towlower() / towupper() functions for converting between lowercase and uppercase characters. Characters are again represented using the integral int and wint_t types. If the conversion is not defined or possible, these functions simply return their input value.
The exact behavior of all character classifications and transformations depends on the active C locale. Locales are explained in detail later in this chapter, but essentially this means the active language and regional settings may result in different sets of characters to be considered letters, lowercase or uppercase, digits, whitespace, and so on. Table 6-1 lists all general properties of and relations between the different character classes and gives some examples for the default C locale.
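As a small sketch of how the <cctype> functions are typically applied to std::string contents (the helper names are assumptions of this example), note the conversion to unsigned char: plain char may be signed, and the functions are only defined for values representable as unsigned char (and EOF):

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Returns an uppercase copy of `s` according to the active C locale.
std::string to_upper(std::string s) {
    for (char& c : s)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}

// Counts the decimal digits in `s`.
std::size_t count_digits(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s)
        if (std::isdigit(c)) ++count;
    return count;
}
```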
Note
In the “Localization” section, you also see that the C++ <locale> header offers a list of overloads for std::isclass() and std::tolower() / toupper() (all templated on the character type) that use a given locale rather than the active C locale.
Character-Encoding Conversion <locale>, <codecvt>
A character encoding determines how code points (many but not all code points are characters) are represented as binary code units. Examples include ASCII (classical encoding with 7-bit code units), the fixed-length UCS-2 and UCS-4 encodings (16-bit and 32-bit code units, respectively), and the three main Unicode encodings: the fixed-length UTF-32 (using a single 32-bit code unit for each code point) and variable-length UTF-8 and UTF-16 encodings (representing each code point as one or more 8- or 16-bit code units, respectively; up to 4 units for UTF-8, and 2 for UTF-16). The details of Unicode and the various character encodings and conversions could fill a book; we explain here what you need to know in practice to convert between encodings.
The class template for objects that contain the low-level encoding-conversion logic is std::codecvt<CharType1, CharType2, State> (cvt is likely short for converter). It is defined in <locale> (as you see in the next section, this is actually a locale facet). The first two parameters are the C++ character types used to represent the code units of both encodings. For all standard instantiations, CharType2 is char. State is an advanced parameter we do not explain further (all standard specializations use std::mbstate_t from <cwchar>).
The four codecvt specializations listed in Table 6-2 are defined in <locale>. Additionally, the <codecvt> header defines the three std::codecvt subclasses listed in Table 6-3. For these, CharT corresponds to the CharType1 parameter of the codecvt base class; as stated earlier, CharType2 is always char.
Table 6-2. Character-Encoding Conversion Classes Defined in <locale>
Class | Description |
---|---|
codecvt<char,char,mbstate_t> | Identity conversion |
codecvt<char16_t,char,mbstate_t> | Conversion between UTF-16 and UTF-8 |
codecvt<char32_t,char,mbstate_t> | Conversion between UTF-32 and UTF-8 |
codecvt<wchar_t,char,mbstate_t> | Conversion between native wide and narrow character encodings (implementation specific) |
Table 6-3. Character-Encoding Conversion Classes Defined in <codecvt>
Class | Description |
---|---|
codecvt_utf8<CharT>, codecvt_utf16<CharT> | Conversion between UCS-2 (for 16-bit CharTs) or UCS-4 (for 32-bit CharTs) and UTF-8 / UTF-16. The UTF-16 string is represented using 8-bit chars as well, so this is intended for binary UTF-16 encodings. |
codecvt_utf8_utf16<CharT> | Conversion between UTF-16 and UTF-8 (CharT must be at least 16-bit). |
Although codecvt instances could in theory be used directly, it is far easier to use the std::wstring_convert<CodecvtT, WCharT=wchar_t> class from <locale>. This helper class facilitates conversions between char strings and strings of a (generally wider) character type WCharT in both directions. Despite its misleading (outdated) name, wstring_convert can also convert from and to, for example, u16strings or u32strings, not just wstrings. These members are provided:
Method | Description |
---|---|
(constructor) | Constructors exist that take a pointer to an existing CodecvtT (of which wstring_convert takes ownership) and an initial state (not discussed further). Both are optional. A final constructor accepts two error strings: one to be returned by to_bytes() upon failure, and one by from_bytes() (the latter is optional). |
from_bytes() | Converts either a single char or a string of chars (a C-style char* string, a std::string, or a sequence bounded by two char* pointers) to a std::basic_string<WCharT>, and returns the result. Throws std::range_error upon failure, unless an error string was provided upon construction: in that case, this error string is returned. |
to_bytes() | Opposite conversion from WCharT to char, with analogous overloads. |
converted() | Returns the number of input characters processed by the last from_bytes() or to_bytes() conversion. |
state() | Returns the current state (mostly mbstate_t: not discussed further). |
Recall the following example from the section on std::string lengths:
To convert this string to UTF-32, you would hope the following is possible:
Unfortunately, this does not compile. For the converter subclasses defined in <codecvt>, this would compile. But the destructor of the codecvt base class is protected (like all standard locale facets: discussed later), and the wstring_convert destructor calls it to delete the converter instance it owns. This design defect can be circumvented using a helper wrapper such as the following (similar tricks can be applied to make any protected function publicly accessible, not just a destructor):
To make the code compile, you then replace the first line with the following:
typedef deletable<std::codecvt<char32_t,char,std::mbstate_t>> cvt;
To use the potentially locale-specific variants of these converters (see the next section), use the following (other locale names besides "" may be used as well):
typedef deletable<std::codecvt_byname<char32_t,char,std::mbstate_t>> cvt;
std::wstring_convert<cvt, char32_t> convertor(new cvt(""));
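Putting the pieces together, a wrapper of this kind and its use with wstring_convert might look as follows. This is a reconstruction of a common idiom under the name deletable used in the text, not necessarily the exact helper the text refers to, and utf8_to_utf32 is a made-up function name. These conversion facilities were deprecated in C++17, so recent compilers may emit warnings:

```cpp
#include <cassert>
#include <codecvt>
#include <cwchar>
#include <locale>
#include <string>
#include <utility>

// Derives from a facet and gives it a public destructor, so that
// wstring_convert is able to delete the instance it owns.
template <typename Facet>
class deletable : public Facet {
public:
    template <typename... Args>
    deletable(Args&&... args) : Facet(std::forward<Args>(args)...) {}
    ~deletable() {}
};

// Converts a UTF-8 encoded narrow string to UTF-32.
// Throws std::range_error if the input is not valid UTF-8.
std::u32string utf8_to_utf32(const std::string& utf8) {
    typedef deletable<std::codecvt<char32_t, char, std::mbstate_t>> cvt;
    std::wstring_convert<cvt, char32_t> convertor;
    return convertor.from_bytes(utf8);
}
```

The size() of the resulting std::u32string is then the number of code points in the input.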
A related class is wbuffer_convert<CodecvtT, WCharT=wchar_t>, which wraps a basic_streambuf<char> and makes it act as a basic_streambuf<WCharT> (stream buffers are very briefly explained in Chapter 5). A wbuffer_convert instance is constructed with an optional basic_streambuf<char>*, CodecvtT*, and state. Both the getter and setter for the wrapped buffer are called rdbuf(), and the current conversion state may be obtained using state(). The following constructs a stream that accepts wide character strings, but writes them to a UTF-8 encoded file (needs <fstream>):
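A sketch of this setup is shown below. To keep it independent of the file system, the hypothetical write_utf8 function accepts any narrow stream buffer; for a file, one would pass the rdbuf() of a std::ofstream opened in binary mode. Like wstring_convert, wbuffer_convert is deprecated as of C++17:

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Writes `text` as UTF-8 bytes to the narrow stream buffer `target`.
void write_utf8(std::streambuf& target, const std::wstring& text) {
    // By default, wbuffer_convert creates and owns its own codecvt instance.
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> converter(&target);
    std::wostream out(&converter);
    out << text;
    out.flush();  // push any buffered, converted bytes into `target`
}
```

Because codecvt_utf8 comes from <codecvt>, its destructor is public and no deletable wrapper is needed here.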
Localization <locale>
Textual representations of dates, monetary values, and numbers are governed by regional and cultural conventions. To illustrate, the following three sentences are analogous but written using local currency, numeric, and date formats:
In the U.S., John Doe has won $100,000.00 on the lottery on 3/14/2015.
In India, Ashok Kumar has won ₹1,00,000.00 on the lottery on 14-03-2015.
En France, Monsieur Brun a gagné 100.000,00 € à la loterie sur 14/3/2015.
In C++, all parameters and functionality related to processing text in a locale-specific manner—that is, adapted to local conditions—are contained in a std::locale object. These include not only formatting of numeric values and dates as just illustrated, but also locale-specific sorting and conversions of strings.
Locale Names
Standard locale objects are constructed from a locale name:
std::locale(const char* locale_name);
std::locale(const std::string& locale_name);
These names commonly consist of a two-letter ISO-639 language code followed by a two-letter ISO-3166 country code. The precise format, however, is platform-specific: on Windows, for instance, the name for the English-American locale is "en-US", whereas on POSIX-based systems it is "en_US". Most platforms support, or sometimes require, additional specifications such as region codes, character encodings, and so on. Consult your platform’s documentation for a full list of supported locale names and options.
There are only two portable locale names, "" and "C":
- With "", you construct a std::locale with the user’s preferred regional and language settings, taken from the program’s execution environment (that is, the operating system).
- The "C" locale denotes the classic or neutral locale, which is the standardized, portable locale that all C and C++ programs use by default.
Using the "C" locale, the earlier example sentence becomes
Anywhere, a C/C++ programmer may win 100000 on the lottery on 3/14/2015.
Tip
When writing to a file intended to be read by computer programs (configuration files, numeric data output, and so on), it is highly recommended that you use the neutral "C" locale, to avoid problems during parsing. When displaying values to the user, you should consider using a locale based on the user’s preferences ("").
The Global Locale
The active global locale affects various Standard C++ functions that format or parse text, most directly the regular expression algorithms discussed later in this chapter and the I/O streams seen in Chapter 5. It is implementation dependent whether there is one program-wide global locale instance or one per thread of execution.
The global locale always starts out as the classic "C" locale. To set the global locale, you use the static std::locale::global() function. To get a copy of the currently active global locale, simply default-construct a std::locale. For example:
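A minimal sketch of setting and querying the global locale could look like this. The function name is invented, and the example sticks to the portable classic locale; in a real program you would more typically call std::locale::global(std::locale("")) once at startup:

```cpp
#include <cassert>
#include <locale>

// Temporarily installs `loc` as the global locale and reports whether a
// default-constructed std::locale then compares equal to it.
bool global_round_trip(const std::locale& loc) {
    const std::locale previous = std::locale::global(loc);  // returns the old global locale
    const std::locale current;                              // copy of the new global locale
    std::locale::global(previous);                          // restore the original
    return current == loc;
}
```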
Note
To avoid race conditions, Standard C++ objects (such as newly created stream or regex objects) always copy the global locale upon construction. Calling global() therefore does not affect existing objects, including std::cout and the other standard streams of <iostream>. To change their locale, you must call their imbue() member.
Basic std::locale Members
The following table lists most basic functions offered by a std::locale, not including the copy members. More advanced members to combine or customize locales are discussed near the end of the section:
Member | Description |
---|---|
global() | Static function to set the active global locale (discussed earlier). |
classic() | Static function returning a constant reference to a classic "C" locale. |
locale() | Default constructor: creates a copy of the global locale. |
locale(name) | Construction from locale name, as discussed earlier. Throws a std::runtime_error if a nonexistent name is passed. |
name() | Returns the locale name, if any. If the locale represents a customized or combined locale (discussed later), "*" is returned. |
== / != | Compares two locale objects. Customized or combined locales are equal only if they are the same object or one is a copy of the other. |
Locale Facets
As is obvious from the previous subsection, the std::locale public interface does not offer much functionality. All localization facilities are instead offered in the form of facets. Each locale object encapsulates a number of such facets, a reference to which may be obtained via the std::use_facet<FacetType>() function. The following example, for instance, uses the classic locale’s numeric punctuation facet to print out the locale’s decimal mark for formatting floating-point numbers:
For all standard facets, the instance referred to by the result of use_facet() cannot be copied, moved, swapped, or deleted. This facet is (co-)owned by the given locale and is deleted together with the (last) locale that owns it. When requesting a FacetType the given locale does not own, a bad_cast exception is raised. To verify the presence of a facet, you can use std::has_facet<FacetType>().
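As a brief sketch of both functions, the helper below (its name and the '\0' fallback are choices made for this example) looks up the decimal mark of a given locale after checking that the facet is present:

```cpp
#include <cassert>
#include <locale>

// Returns the decimal mark of `loc`, or '\0' if the locale lacks the facet.
char decimal_mark(const std::locale& loc) {
    typedef std::numpunct<char> punct;
    if (!std::has_facet<punct>(loc))
        return '\0';
    // The facet reference is only used while `loc` is alive.
    return std::use_facet<punct>(loc).decimal_point();
}
```

Taking the locale by reference and using the facet immediately keeps the facet's owner alive for the whole call.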
Caution
Never do something like auto& f = use_facet<...>(std::locale("..."));: the facet f was owned by the temporary locale object, so using it will likely crash.
By default, locales contain specializations of all the facets introduced in the remainder of this section, each in turn specialized for at least the char and wchar_t character types (additional minimal requirements are discussed throughout the section). Implementations may include more facets, and programs can even add custom facets themselves, as explained later.
We now discuss the 12 standard facet classes listed in Table 6-4 in order, grouped in sections by category. Afterwards, we show how to combine facets of different locales and create customized facets. Although this is perhaps not something most programmers will use regularly, occasionally the need does arise to customize facets. Regardless, it is worth knowing the scope and various effects of localization and to keep them in mind when developing programs that show or process user text (that is, most programs).
Table 6-4. Overview of the 12 Basic Facet Classes, Grouped by Category
Category | Facets |
---|---|
numeric | numpunct<C>, num_put<C>, num_get<C> |
monetary | moneypunct<C, International>, money_put<C>, money_get<C> |
time | time_put<C>, time_get<C> |
ctype | ctype<C>, codecvt<C1, C2, State> |
collate | collate<C> |
messages | messages<C> |
Numeric Formatting
The facets of the numeric and monetary category follow the same pattern: there is one punct facet (short for punctuation) with the locale-specific formatting parameters, plus both a put and a get facet responsible for the actual formatting and parsing of values, respectively. The latter two facets are mostly intended to be used by the stream objects introduced in Chapter 5. The concrete format they use to read or write values is determined by a combination of the parameters set in the punct facet and others set using the stream’s members or stream manipulators.
Numeric Punctuation
The std::numpunct<CharT> facet offers functions to retrieve the following information related to the formatting of numeric and Boolean values:
- decimal_point(): Returns the decimal separator
- thousands_sep(): Returns the thousands separator character
- grouping(): Returns a std::string encoding the digit grouping
- truename() and falsename(): Return basic_string<CharT>s with textual representations for Boolean values
In the lottery example at the beginning of the section, a numeric value of 100000.00 was formatted using three different locales: "100,000.00", "1,00,000.00", and "100.000,00". The first two locales use a comma (,) and dot (.) as thousands and decimal separator, respectively, whereas for the third it is the other way around.
The digit grouping() is encoded as a sequence of char values indicating the number of digits in each group, starting with the number in the rightmost group. The last char in the sequence is used for all subsequent groups as well. Most locales group digits in threes, for example, which is encoded as "\3". (Note: do not use "3", because the '3' ASCII character results in a char with value 51; that is, '3' == 51.) For Indian locales, however, as seen in "1,00,000.00", only the rightmost group contains three digits; all other groups contain only two. This is encoded as "\3\2". To indicate an infinite group, a std::numeric_limits<char>::max() value may be used in the last position. An empty grouping() string denotes that no grouping should be used at all, which is the case, for instance, for the classic "C" locale.
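To see the grouping encoding in action without relying on platform-specific locale names, the sketch below installs a custom numpunct facet into a copy of the classic locale. The class name indian_grouping and the function format_grouped are hypothetical, and the facet only mimics the digit grouping of Indian locales:

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// A custom facet that groups the last three digits, then pairs of digits.
class indian_grouping : public std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3\2"; }
};

// Formats `value` using a classic locale whose numpunct facet is replaced.
std::string format_grouped(long value) {
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new indian_grouping));
    out << value;
    return out.str();
}
```

With this facet, the value 100000 is written as 1,00,000.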
Formatting and Parsing of Numeric Values
The std::num_put and num_get facets constitute the implementation of the << and >> stream operators described in Chapter 5 and provide two sets of methods with the following signature:
Iter put(Iter target, ios_base& stream, char fill, X value)
Iter get(Iter begin, Iter end, ios_base& stream, iostate& error, X& result)
Here X can be bool, long, long long, unsigned long, unsigned long long, double, long double, or a void pointer. For get(), unsigned short, unsigned int, and float are also possible. These methods either format a given numeric value or try to parse the characters in the range [begin, end). In both cases, the ios_base parameter is a reference to a stream from which locale and formatting information is taken (including, for example, the stream’s formatting flags and precision: see Chapter 5).
All put() functions simply return target after writing the formatted character sequence there. The fill character is used for padding if the formatted length is less than stream.width() (see Chapter 5 for the padding rules).
If parsing succeeds, get() stores the numeric value in result. If the input did not match the format, result is set to zero and the failbit is set in the iostate parameter (see Chapter 5). If the parsed value is too large/small for type X, the failbit is set as well, and result is set to std::numeric_limits<X>::max()/lowest() (see Chapter 1). If the end of the input was reached (can be a success or a failure), the eofbit is set. An iterator to the first character after the parsed sequence is returned.
We do not show example code here, but these facets are analogous to the monetary formatting facets introduced next, for which we do include a full example.
Monetary Formatting
Monetary Punctuation
The std::moneypunct<CharType, International=false> facet offers functions to retrieve the following information related to formatting monetary values:
- decimal_point(), thousands_sep(), and grouping(): Analogous to the numeric punctuation members seen earlier.
- frac_digits(): Returns the number of digits after the decimal separator. A typical value is 2.
- curr_symbol(): Returns the currency symbol, such as '€', if the International template parameter is false, and the international currency code (usually three letters) followed by a space, such as "EUR", if International is true.
- pos_format() and neg_format(): Return a money_base::pattern structure (discussed later) describing how positive and negative monetary values are to be formatted.
- positive_sign() and negative_sign(): Return a formatting string for positive and negative monetary values.
The latter four members need more explanation. They use types defined in std::money_base, a base class of moneypunct. The money_base::pattern structure, defined as struct pattern{ char field[4]; }, is an array containing four values of the money_base::part enumeration, with these supported values:
part | Description |
---|---|
none | Optional whitespace characters, except when none appears last. |
space | At least one whitespace character. |
symbol | The currency symbol, curr_symbol(). |
sign | The first character returned by positive_sign() or negative_sign(). Additional characters appear at the end of the formatted monetary value. |
value | The monetary value. |
For example, assume that the neg_format() pattern is {none, symbol, sign, value}, that the currency symbol is '$', that negative_sign() returns "()", and that frac_digits() returns 2. Then the value -123456 is formatted as "$(1,234.56)".
Note
For American and many European locales, frac_digits() equals 2, meaning unformatted values are to be expressed in, for example, cents rather than dollars or euros. This is not always the case, though: for the Japanese locale, for example, frac_digits() is 0.
Formatting and Parsing of Monetary Values
The facets std::money_put and money_get handle formatting and parsing of monetary values and are mainly intended to be used by the put_money() and get_money() I/O manipulators discussed in Chapter 5. The facets offer methods of this form:
Iter put(Iter target, bool intl, ios_base& stream, char fill, X value)
Iter get(Iter begin, Iter end, bool intl, ios_base& stream,
iostate& error, X& result)
Here X is either std::string or long double. The behavior and meaning of the parameters is similar to that discussed for num_put and num_get earlier. If intl is false, currency symbols like $ are used; otherwise, strings like USD are used.
The following illustrates how these facets can be used, although you normally simply use std::put_/get_money() (uses <cassert> and <sstream>):
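One way such direct use of money_put could look is sketched here. To stay independent of the locale names available on a given platform, it installs a hypothetical dollar_punct facet that reproduces the neg_format() example given earlier; the class and function names are assumptions of this sketch:

```cpp
#include <cassert>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Punctuation facet with pattern {none, symbol, sign, value} for negatives.
class dollar_punct : public std::moneypunct<char, false> {
protected:
    char do_decimal_point() const override { return '.'; }
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
    std::string do_curr_symbol() const override { return "$"; }
    std::string do_negative_sign() const override { return "()"; }
    int do_frac_digits() const override { return 2; }
    pattern do_neg_format() const override {
        pattern p;
        p.field[0] = none; p.field[1] = symbol;
        p.field[2] = sign; p.field[3] = value;
        return p;
    }
};

// Formats an amount given in cents by calling the money_put facet directly.
std::string format_money(long double cents) {
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new dollar_punct));
    out.setf(std::ios_base::showbase);  // without showbase, the symbol is omitted
    const auto& facet = std::use_facet<std::money_put<char>>(out.getloc());
    facet.put(std::ostreambuf_iterator<char>(out), false, out, out.fill(), cents);
    return out.str();
}
```

The same output is obtained more conveniently with out << std::showbase << std::put_money(cents).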
Time and Date Formatting
The two facets std::time_get and time_put handle parsing and formatting of time and dates and power the get_time() and put_time() manipulators seen in Chapter 5. They provide methods with the following signatures:
Iter put(Iter target, ios_base& stream, char fill, tm* value, <format>)
Iter get(Iter begin, Iter end, ios_base& stream, iostate& error, tm* result,
<format>)
The <format> is either 'const char* from, const char* to', pointing to a time-formatting pattern expressed using the same syntax as explained for strftime() in Chapter 2, or a single time format specifier of the same grammar with optional modifier 'char format, char modifier'. The behavior and meaning of the parameters is analogous to those for the numeric and monetary formatting facets. The std::tm structure is explained in Chapter 2 as well. Only those members of the passed tm are used / written that are mentioned in the formatting pattern.
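As an illustration of the pattern-based put() overload, the sketch below formats a date with the time_put facet of the classic locale. The function name format_date is made up for this example:

```cpp
#include <cassert>
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Formats the fields of `when` according to `pattern`, using the
// time_put facet of the classic locale.
std::string format_date(const std::tm& when, const std::string& pattern) {
    std::ostringstream out;
    out.imbue(std::locale::classic());
    const auto& facet = std::use_facet<std::time_put<char>>(out.getloc());
    facet.put(std::ostreambuf_iterator<char>(out), out, out.fill(), &when,
              pattern.data(), pattern.data() + pattern.size());
    return out.str();
}
```

For a tm with tm_year = 115, tm_mon = 2, and tm_mday = 14, the pattern "%Y-%m-%d" reads only those three members.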
In addition to the generic get() functions, the time_get facet has a series of more restricted parsing functions, all with the following signature:
Iter get_x(Iter begin, Iter end, ios_base& stream, iostate& error, tm*)
Member | Description |
---|---|
get_time() | Tries to parse a time as %H:%M:%S. |
get_date() | Tries to parse a date using a format that depends on the value of the facet’s date_order() member: either no_order: %m%d%y, dmy: %d%m%y, mdy: %m%d%y, ymd: %y%m%d, or ydm: %y%d%m. This date_order() enumeration value reflects the locale’s %x date format. |
get_weekday(), get_monthname() | Tries to parse a name for a weekday or month, possibly abbreviated. |
get_year() | Tries to parse a year. Whether two-digit year numbers are supported depends on your implementation. |
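The following sketch shows one way to call these facets directly. The helper names format_time() and parse_time_of_day() are hypothetical; the pattern-based put() overload and the get_time() member are the ones described above:

```cpp
#include <cassert>
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: formats a std::tm with a strftime()-style pattern
// using the time_put facet of the given locale.
std::string format_time(const std::tm& value, const std::string& pattern,
                        const std::locale& loc) {
    std::ostringstream stream;
    stream.imbue(loc);
    const auto& facet = std::use_facet<std::time_put<char>>(loc);
    facet.put(std::ostreambuf_iterator<char>(stream), stream, ' ', &value,
              pattern.data(), pattern.data() + pattern.size());
    return stream.str();
}

// Hypothetical helper: parses a time of day with time_get::get_time().
// Returns false if the facet reported a failure.
bool parse_time_of_day(const std::string& text, const std::locale& loc,
                       std::tm& result) {
    std::istringstream stream(text);
    stream.imbue(loc);
    std::ios_base::iostate error = std::ios_base::goodbit;
    const auto& facet = std::use_facet<std::time_get<char>>(loc);
    facet.get_time(std::istreambuf_iterator<char>(stream),
                   std::istreambuf_iterator<char>(), stream, error, &result);
    return (error & std::ios_base::failbit) == 0;
}
```

For a zero-initialized std::tm with tm_year = 124, tm_mon = 2, and tm_mday = 9, format_time(date, "%Y-%m-%d", std::locale::classic()) yields 2024-03-09. Only tm_year, tm_mon, and tm_mday are read, because only those are mentioned in the pattern.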
Character Classification, Transformation, and Conversion
Character Classification and Transformation
The ctype<CharType> facets offer a series of locale-dependent character-classification and -transformation functions, including equivalents for those of the <cctype> and <cwctype> headers seen earlier.
For use in the character-classification functions listed next, 12 member constants of a bitmask type ctype_base::mask are defined (ctype_base is a base class of ctype), one for each character class. Their names equal the class names given in Table 6-1. Although their values are unspecified, alnum == alpha|digit and graph == alnum|punct. The following table lists all classification functions (input character ranges are represented using two CharType* pointers b and e):
Member | Description |
---|---|
is(mask,c) | Checks whether a given character c belongs to any of the character classes specified by mask. |
is(b,e,mask*) | Identifies for each character in the range [b, e) the complete mask value that encodes all classes it belongs to, and stores the result in the output range pointed to by the last argument. Returns e. |
scan_is(mask,b,e), scan_not(mask,b,e) | Scans the character range [b, e), and returns a pointer to the first character that belongs / does not belong to any of the classes specified by mask. If none is found, the result is e. |
The same facets also offer these transformation functions:
Member | Description |
---|---|
tolower(c), toupper(c), tolower(b,e), toupper(b,e) | Performs upper-to-lower transformation or vice versa on a single character (result is returned) or a character range [b, e) (transformed in-place; e is returned). Characters that cannot be transformed are left unchanged. |
widen(c), widen(b,e,o) | Transforms char values to the facet’s character type on a single character (result is returned) or a character range [b, e) (transformed characters are put in the output range starting at *o; e is returned). Transformed characters never belong to a class their source characters did not belong to. |
narrow(c,d), narrow(b,e,d,o) | Transformation to char; opposite of widen(). However, the relation widen(narrow(c,0)) == c is guaranteed to hold only for the 96 basic source characters (all space and printable ASCII characters except $, `, and @). If no transformed character is readily available, the given default char d is used. |
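As an illustration, the classification and transformation members can be used as in this sketch. The helper names first_digit_index(), to_upper(), and is_alnum() are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>

// Index of the first decimal digit in text, or text.size() if there is none.
std::size_t first_digit_index(const std::string& text, const std::locale& loc) {
    const auto& facet = std::use_facet<std::ctype<char>>(loc);
    const char* begin = text.data();
    const char* end = begin + text.size();
    // scan_is() returns a pointer to the first match, or end if none is found.
    return static_cast<std::size_t>(
        facet.scan_is(std::ctype_base::digit, begin, end) - begin);
}

// Returns an uppercase copy of text, transformed with the range overload.
std::string to_upper(std::string text, const std::locale& loc) {
    const auto& facet = std::use_facet<std::ctype<char>>(loc);
    if (!text.empty())
        facet.toupper(&text[0], &text[0] + text.size());  // in-place
    return text;
}

// True if c is a letter or a digit according to the given locale.
bool is_alnum(char c, const std::locale& loc) {
    return std::use_facet<std::ctype<char>>(loc).is(std::ctype_base::alnum, c);
}
```

In the classic locale, first_digit_index("abc123", loc) is 3 and to_upper("abc1", loc) is "ABC1"; the digit is left unchanged because it has no uppercase form.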
The <locale> header defines a series of convenience functions for those functions of the ctype facets that also exist in <cctype> and <cwctype>: std::isclass(c, locale&), with class a name from Table 6-1 (std::isalpha(c, locale&), for example), and tolower(c, locale&) / toupper(c, locale&). Their implementations all have the following form (the return type is either bool or CharT):
template <typename CharT> ... function(CharT c, const std::locale& l) {
return std::use_facet<std::ctype<CharT>>(l).function(c);
}
Character-Encoding Conversions
A std::codecvt facet converts character sequences between two character encodings. This is explained earlier in “Character-Encoding Conversion,” because these facets are useful also outside the context of locales. Each std::locale contains at least instances of the four codecvt specializations listed in Table 6-2, which implement potentially locale-specific converters. These are used implicitly by the streams of Chapter 5 when converting, for example, between wide and narrow strings. Because directly using these low-level facets is not recommended, we do not explain their members here. Always use the helper classes discussed in the “Character-Encoding Conversion” section instead.
String Ordering and Hashing
The std::collate<CharType> facet implements the following locale-dependent string-ordering comparisons and hashing functions. All character sequences are specified using begin (inclusive) and end (exclusive) CharType* pointers:
Member | Description |
---|---|
compare() | Locale-dependent three-way comparison of two character sequences, returning -1 if the first precedes the second, 0 if both are equivalent, and +1 otherwise. Not necessarily the same as naïve lexicographical sequence comparison. |
transform() | Transforms a given character sequence to a specific normalized form, which is returned as a basic_string<CharType>. Applying naïve lexicographical ordering on two transformed strings (as with their operator<) returns the same result as applying the facet’s compare() function on the untransformed sequences. |
hash() | Returns a long hash value for the given sequence (see Chapter 3 for hashing) that is the same for all sequences that transform() to the same normalized form. |
A std::locale itself is a std::less<std::basic_string<CharT>>-like functor (see Chapter 2) that compares two basic_string<CharT>s using its collate<CharT> facet’s compare() function. The following example sorts French strings lexicographically, using the classic locale, and using a French locale (the locale name to use is platform specific). In addition to <locale>, it needs <vector>, <string>, and <algorithm>:
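A minimal sketch of this technique follows. The helper name sorted_with() is hypothetical; the only library feature it relies on is that a std::locale can be passed directly as the comparison functor:

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: returns the words sorted according to the collate
// facet of the given locale, by passing the locale itself as the comparator.
std::vector<std::string> sorted_with(std::vector<std::string> words,
                                     const std::locale& loc) {
    std::sort(words.begin(), words.end(), loc);
    return words;
}
```

Calling sorted_with(words, std::locale::classic()) orders by plain character values, so all uppercase ASCII letters sort before the lowercase ones. For French ordering you would pass a French locale, for instance std::locale("fr_FR.UTF-8"); that name is an assumption that holds on many Linux systems only, and the constructor throws std::runtime_error if the name is not supported.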
Message Retrieval
The std::messages<CharT> facet facilitates retrieval of textual messages from message catalogs. These catalogs are essentially associative arrays that map a series of integers to a localized string. This could in principle be used, for instance, to retrieve translated error messages based on, for example, their error category and code (see Chapter 8). Which catalogs are available, and how they are structured, is entirely platform specific. For some, standardized message catalog APIs are used (such as POSIX’s catgets() or GNU’s gettext()), whereas others may not offer any catalogs (this is typically the case for Windows). The facet offers these functions:
Member | Description |
---|---|
open(n,l) | Opens a catalog based on a given platform-specific string n (a basic_string<CharT>), and for the given std::locale l. Returns a unique identifier of some signed integer type catalog. |
get(c,set,id,def) | Retrieves from the catalog with given catalog identifier c, the message identified by set and id (two int values whose interpretation is catalog specific), and returns it as a basic_string<CharT>. Returns def if no such message is found. |
close(c) | Closes the catalog with the given catalog identifier c. |
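The three functions are typically used together, as in the following sketch. The helper name lookup_message() is hypothetical, and so is any catalog name you pass to it, because catalog naming is platform specific. The sketch relies on open() returning a negative identifier when the catalog cannot be opened:

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: looks up one message and falls back to the given
// default text if the catalog or the message is not available.
std::string lookup_message(const std::string& catalog_name, int set, int id,
                           const std::string& fallback,
                           const std::locale& loc) {
    const auto& facet = std::use_facet<std::messages<char>>(loc);
    const std::messages_base::catalog cat = facet.open(catalog_name, loc);
    if (cat < 0)
        return fallback;  // no such catalog on this platform
    const std::string text = facet.get(cat, set, id, fallback);
    facet.close(cat);
    return text;
}
```

How set and id are interpreted, and whether they are used at all, depends on the underlying catalog API, so treat the returned text as platform-specific.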
Combining and Customizing Locales
The constructs of the <locale> library are designed to be very flexible when it comes to combining or customizing locale facets.
Combining Facets
std::locale provides combine<FacetType>(const locale& c), which returns a copy of the locale on which combine() is called, except for the FacetType facet, which is copied from the given argument. Here is an example (using namespace std is assumed):
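A minimal sketch, written with explicit std:: qualification so that it stands on its own. The function name make_combined() is hypothetical; its parameter plays the role of a Chinese locale, whose name (for example "zh_CN.UTF-8") is platform specific:

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: returns a copy of the classic locale whose monetary
// punctuation facet is taken from the given locale instead.
std::locale make_combined(const std::locale& chinese) {
    std::locale combined =
        std::locale::classic().combine<std::moneypunct<char>>(chinese);
    return combined;
}
```

The result formats numbers, dates, and so on like the classic locale, but uses the currency symbol and monetary separators of the argument. combine() throws std::runtime_error if the argument does not contain the requested facet.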
Alternatively, std::locale has a constructor accepting a base locale and an overriding facet that does the same as combine(). For example, the creation of combined in the previous example can be expressed as follows:
locale combined(locale::classic(), &use_facet<moneypunct<char>>(chinese));
std::locale moreover has a number of constructors to override all facets of one or more categories at once (String is either a std::string or a C-style string representing the name of a specific locale):
locale(const locale& base, String name, category cat)
locale(const locale& base, const locale& overrides, category cat)
For each of the six categories listed in Table 6-4, std::locale defines a constant with that name. The std::locale::category type is a bitmask type, meaning categories can be combined using bitwise operators. The all constant, for example, is defined as collate | ctype | monetary | numeric | time | messages. These constructors can be used to create a combined locale similar to the one earlier:
locale combined(locale::classic(), chinese, locale::monetary);
Custom Facets
All public functions func() of the facets simply call a protected virtual method on the facet called do_func(). You can implement custom facets by inheriting from existing ones and overriding these do-methods.
This first simple example changes the behavior of the numpunct facet to use the strings "yes" and "no" instead of "true" and "false" for Boolean input and output:
class yes_no_numpunct : public std::numpunct<char> {
protected:
virtual string_type do_truename() const override { return "yes"; }
virtual string_type do_falsename() const override { return "no"; }
};
You can use this custom facet, for instance, by imbuing it on a stream. The following prints "yes / no" to the console:
std::cout.imbue(std::locale(std::cout.getloc(), new yes_no_numpunct));
std::cout << std::boolalpha << true << " / " << false << std::endl;
Recall that facets are reference counted and that the destructor of the std::locale hence properly cleans up your custom facet.
The disadvantage of deriving from facets such as numpunct and moneypunct is that those generic base classes implement locale-independent behavior. To start from a locale-specific facet instead, facet classes such as numpunct_byname are available. For all facets seen so far, except the numeric and monetary put and get facets, a facet subclass exists with the same name but appended with _byname. They are constructed passing a locale name (const char* or std::string) and then behave as if taken from the corresponding locale. You can derive from these facets to modify only specific aspects of a facet for a given locale.
The next example modifies the monetary punctuation facet to facilitate output using a format standard in accounting: negative numbers are put between parentheses, and padding is done in a particular way. You do so without overriding a locale’s currency symbol or most other settings by starting from std::moneypunct_byname (string_type is defined in std::moneypunct):
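One way to write such a facet is sketched here. The class name accounting_moneypunct and the exact pattern chosen are assumptions made for illustration, not necessarily the layout used elsewhere in this section:

```cpp
#include <cassert>
#include <locale>
#include <string>

// Sketch of an accounting-style facet: everything is inherited from the
// named locale except the negative sign and the field layout.
class accounting_moneypunct : public std::moneypunct_byname<char> {
public:
    explicit accounting_moneypunct(const std::string& name)
        : std::moneypunct_byname<char>(name) {}

protected:
    // A two-character sign: the first character is written at the position
    // of `sign` in the pattern, the remainder after all other fields.
    string_type do_negative_sign() const override { return "()"; }

    // Same layout for positive and negative values: currency symbol,
    // space (which also receives padding with std::internal), sign, value.
    pattern do_pos_format() const override {
        return { { symbol, space, sign, value } };
    }
    pattern do_neg_format() const override {
        return { { symbol, space, sign, value } };
    }
};
```

The constructor throws std::runtime_error if the given locale name is not supported on the platform.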
This facet may then be used as follows (see Chapter 5 for details on the stream I/O manipulators of <iomanip>):
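A sketch of such use, written as a hypothetical helper accounting_line() that accepts any locale, so that it applies equally to a locale carrying the accounting facet described above. The width parameter is an assumption about how much padding you want:

```cpp
#include <cassert>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: formats one amount with the monetary facets of the
// given locale, padded internally to the requested width.
std::string accounting_line(long double units, const std::locale& loc,
                            int width) {
    std::ostringstream stream;
    stream.imbue(loc);
    stream << std::showbase    // include the currency symbol
           << std::internal    // pad inside the value, not before or after it
           << std::setw(width) << std::put_money(units);
    return stream.str();
}
```

To obtain a suitable locale, construct one from a base locale and a new instance of the custom facet, both created with the same platform-specific locale name, and pass it to the helper once for each amount to print.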
The output of this program should be
$ 1,000.00
$ (5.00)
You can in theory create a new facet class by directly inheriting from std::locale::facet and add it to a locale using the same constructor to use it in your own library code later. The only additional requirement is that you define a default-constructed, public static data member named id of type std::locale::id.
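A minimal sketch of such a new facet follows. The facet length_unit and the helper length_unit_of() are hypothetical names invented for this illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>
#include <utility>

// Hypothetical facet: remembers the preferred unit of length for a locale.
class length_unit : public std::locale::facet {
public:
    static std::locale::id id;  // required: identifies this facet type

    explicit length_unit(std::string name, std::size_t refs = 0)
        : std::locale::facet(refs), name_(std::move(name)) {}

    const std::string& name() const { return name_; }

private:
    std::string name_;
};

std::locale::id length_unit::id;  // default-constructed definition

// Returns the unit stored in the locale, or "unknown" if the locale does
// not carry a length_unit facet.
std::string length_unit_of(const std::locale& loc) {
    return std::has_facet<length_unit>(loc)
               ? std::use_facet<length_unit>(loc).name()
               : std::string("unknown");
}
```

A locale carrying the facet is created with std::locale(base, new length_unit("metre")). Standard locales do not contain it, which is why the helper checks has_facet() first; use_facet() would throw std::bad_cast otherwise.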
C Locales <clocale>
Locale-sensitive functions from the C Standard Library (including most functions in <cctype> and the I/O operations of <cstdio> and <ctime>) are not directly affected by the global C++ locale. Instead, they are governed by a corresponding C locale. This C locale is changed in one of two ways:
- Calling std::locale::global(), which is guaranteed to modify the C locale to match the given C++ locale, as long as the latter has a name. Otherwise, its effect on the C locale, if any, is implementation-defined.
- Calling the std::setlocale() function of <clocale>. This does not affect the C++ global locale in any way.
In other words, when using standard locales, a C++ program should simply call std::locale::global(). To write portable code when combining multiple locales, however, you have to call both the C++ and the C function because not all implementations set the C locale as expected when changing the global() C++ locale to a combined locale. This is done as follows:
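A minimal sketch of this two-step approach, assuming a combined locale that takes only its monetary category from a second locale. The helper name set_global_locales() and its parameters are hypothetical:

```cpp
#include <cassert>
#include <clocale>
#include <locale>
#include <string>

// Hypothetical helper: makes a combined locale global for C++ and mirrors
// the same combination in the C locale, category by category.
// Returns false if the C library rejects one of the names.
bool set_global_locales(const std::string& base_name,
                        const std::string& monetary_name) {
    // C++ side: everything from base_name, the monetary category from
    // monetary_name. Throws std::runtime_error for unsupported names.
    const std::locale combined(std::locale(base_name), monetary_name,
                               std::locale::monetary);
    std::locale::global(combined);

    // C side: do not rely on global() having set this for a combined locale.
    return std::setlocale(LC_ALL, base_name.c_str()) != nullptr
        && std::setlocale(LC_MONETARY, monetary_name.c_str()) != nullptr;
}
```

Afterward, std::setlocale(LC_ALL, nullptr) can be used to query the name of the resulting C locale without modifying it.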
The setlocale() function takes a single category number (not a bitmask type; supported values include at least LC_ALL, LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, and LC_TIME) and a locale name, all analogous to their C++ equivalents. It returns the name of the active C locale upon success as a char* pointer into a reused, global buffer, or nullptr upon failure. If nullptr is passed for the locale name, the C locale is not modified.
Unfortunately, the C locale functionality is far less powerful than the C++ one: customized facets or selecting individual facets for combining is not possible, making the use of such advanced locales impossible with portable code in general.
The <clocale> header has one more function: std::localeconv(). It returns a pointer to a global std::lconv struct with public members equivalent to the functions of the std::numpunct (decimal_point, thousands_sep, grouping) and std::moneypunct facets (mon_decimal_point, mon_thousands_sep, mon_grouping, positive_sign, negative_sign, currency_symbol, frac_digits, and so on). These values should be treated as read-only: writing to them results in undefined behavior.
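For instance, the active C locale's settings can be read as follows. The helper names c_decimal_point() and c_currency_symbol() are hypothetical:

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Decimal point of the active C locale, copied out of the global struct.
std::string c_decimal_point() {
    const std::lconv* conventions = std::localeconv();
    return conventions->decimal_point;
}

// Local currency symbol of the active C locale (empty in the "C" locale).
std::string c_currency_symbol() {
    return std::localeconv()->currency_symbol;
}
```

Copying the values into std::string objects is deliberate: the struct returned by localeconv() may be overwritten by later calls to localeconv() or setlocale().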
Regular Expressions <regex>
A regular expression is a textual representation of a pattern or patterns to be matched against a target sequence of characters. The regular expression ab*a, for instance, matches any target sequence starting with the character a, followed by zero or more bs, and ending again with an a. Regular expressions can be used to search for or replace particular patterns in the target, or to verify that it matches a desired pattern. You see how to perform these operations using the <regex> library later; first we introduce how to form and create regular expressions.
The ECMAScript Regular Expression Grammar
The syntax used to express patterns in textual form is defined by a grammar. By default, <regex> uses a modified version of the grammar used by the ECMAScript scripting language (best known for its widely used dialects JavaScript, JScript, and ActionScript). What follows is a concise, comprehensive reference for this grammar.
A regular expression pattern is a disjunction of sequences of terms, with each term either an atom, an assertion, or a quantified atom. Supported atoms and assertions are listed in Table 6-5 and Table 6-6, and Table 6-7 shows how atoms are quantified to express repetitive patterns. These terms are concatenated without separators and then optionally combined into disjunctions using the | operator. Empty disjuncts are allowed, with pattern| matching either the given pattern or the empty sequence. Some examples should clarify:
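As a small illustration in code, the following sketch checks whole target sequences against a pattern using the default ECMAScript grammar. The helper name matches_fully() is hypothetical; std::regex and std::regex_match() are part of the <regex> library discussed later:

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if the entire target matches the pattern.
bool matches_fully(const std::string& target, const std::string& pattern) {
    return std::regex_match(target, std::regex(pattern));
}
```

With this helper, the pattern ab*a matches aa and abbba but not abc, because the quantified atom b* stands for zero or more bs. The disjunction a|bc matches a and bc but not ab, because each disjunct is a complete alternative.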
Table 6-5. All Atoms with a Special Meaning in the ECMAScript Grammar
Atom | Matches |
---|---|
. | Any single character except line terminators. |