2
Working with Strings and String Views

DYNAMIC STRINGS

Strings in languages that have supported them as first-class objects tend to have a number of attractive features, such as being able to expand to any size, or to have sub-strings extracted or replaced. In other languages, such as C, strings were almost an afterthought; there was no really good “string” data type, just fixed arrays of bytes. The “string library” was nothing more than a collection of rather primitive functions without even bounds checking. C++ provides a string type as a first-class data type.

C-Style Strings

In the C language, strings are represented as an array of characters. The last character of a string is a null character ('\0') so that code operating on the string can determine where it ends. This null character is officially known as NUL, spelled with one L, not two. NUL is not the same as the NULL pointer. Even though C++ provides a better string abstraction, it is important to understand the C technique for strings because they still arise in C++ programming. One of the most common situations is where a C++ program has to call a C-based interface in some third-party library or as part of interfacing to the operating system.

By far, the most common mistake that programmers make with C strings is that they forget to allocate space for the '\0' character. For example, the string “hello” appears to be five characters long, but six characters worth of space are needed in memory to store the value, as shown in Figure 2-1.

FIGURE 2-1: The string “hello” occupies six characters in memory: 'h', 'e', 'l', 'l', 'o', and the terminating '\0'.

C++ contains several functions from the C language that operate on strings. These functions are defined in the <cstring> header. As a general rule of thumb, these functions do not handle memory allocation. For example, the strcpy() function takes two strings as parameters. It copies the second string onto the first, whether it fits or not. The following code attempts to build a wrapper around strcpy() that allocates the correct amount of memory and returns the result, instead of taking in an already allocated string. It uses the strlen() function to obtain the length of the string. The caller is responsible for freeing the memory allocated by copyString().

char* copyString(const char* str)
{
    char* result = new char[strlen(str)];  // BUG! Off by one!
    strcpy(result, str);
    return result;
}

The copyString() function as written is incorrect. The strlen() function returns the length of the string, not the amount of memory needed to hold it. For the string “hello”, strlen() returns 5, not 6. The proper way to allocate memory for a string is to add 1 to the amount of space needed for the actual characters. It seems a bit unnatural to have +1 all over the place. Unfortunately, that’s how it works, so keep this in mind when you work with C-style strings. The correct implementation is as follows:

char* copyString(const char* str)
{
    char* result = new char[strlen(str) + 1];
    strcpy(result, str);
    return result;
}

One way to remember that strlen() returns only the number of actual characters in the string is to consider what would happen if you were allocating space for a string made up of several other strings. For example, if your function took in three strings and returned a string that was the concatenation of all three, how big would it be? To hold exactly enough space, it would be the length of all three strings, added together, plus one space for the trailing '\0' character. If strlen() included the '\0' in the length of the string, the allocated memory would be too big. The following code uses the strcpy() and strcat() functions to perform this operation. The cat in strcat() stands for concatenate.

char* appendStrings(const char* str1, const char* str2, const char* str3)
{
    char* result = new char[strlen(str1) + strlen(str2) + strlen(str3) + 1];
    strcpy(result, str1);
    strcat(result, str2);
    strcat(result, str3);
    return result;
}
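As a usage sketch for the function above, keep in mind that the caller owns the returned buffer. Because it was allocated with new[], it must be released with delete[]. The joinedLength() helper below is hypothetical and exists only to illustrate that ownership rule.

```cpp
#include <cstddef>
#include <cstring>

// Same logic as appendStrings() in the text, with std:: qualification.
char* appendStrings(const char* str1, const char* str2, const char* str3)
{
    char* result = new char[std::strlen(str1) + std::strlen(str2)
                            + std::strlen(str3) + 1];
    std::strcpy(result, str1);
    std::strcat(result, str2);
    std::strcat(result, str3);
    return result;
}

// Hypothetical caller: concatenates, measures, and frees the buffer.
std::size_t joinedLength(const char* a, const char* b, const char* c)
{
    char* joined = appendStrings(a, b, c);
    std::size_t length = std::strlen(joined);  // Excludes the trailing '\0'.
    delete[] joined;                           // Matches the new[] above.
    return length;
}
```

Forgetting the delete[] leaks the buffer, and using delete without the brackets is undefined behavior, which is one more reason to prefer std::string.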

The sizeof() operator in C and C++ can be used to get the size of a certain data type or variable. For example, sizeof(char) returns 1 because a char has a size of 1 byte. However, in the context of C-style strings, sizeof() is not the same as strlen(). You should never use sizeof() to try to get the size of a string. It returns different sizes depending on how the C-style string is stored. If it is stored as a char[], then sizeof() returns the actual memory used by the string, including the '\0' character, as in this example:

char text1[] = "abcdef";
size_t s1 = sizeof(text1);  // is 7
size_t s2 = strlen(text1);  // is 6

However, if the C-style string is stored as a char*, then sizeof() returns the size of a pointer!

const char* text2 = "abcdef";
size_t s3 = sizeof(text2);  // is platform-dependent
size_t s4 = strlen(text2);  // is 6

Here, s3 is typically 4 when compiled in 32-bit mode and 8 when compiled in 64-bit mode, because sizeof() is returning the size of a const char*, which is a pointer.

A complete list of C functions to operate on strings can be found in the <cstring> header file.

String Literals

You’ve probably seen strings written in a C++ program with quotes around them. For example, the following code outputs the string hello by including the string itself, not a variable that contains it:

cout << "hello" << endl;

In the preceding line, “hello” is a string literal because it is written as a value, not a variable. String literals are actually stored in a read-only part of memory. This allows the compiler to optimize memory usage by reusing references to equivalent string literals. That is, even if your program uses the string literal “hello” 500 times, the compiler is allowed to create just one instance of hello in memory. This is called literal pooling.

String literals can be assigned to variables, but because string literals are in a read-only part of memory and because of the possibility of literal pooling, assigning them to variables can be risky. The C++ standard officially says that string literals are of type “array of n const char”; however, for backward compatibility with older non-const-aware code, many compilers do not force you to assign a string literal to a variable of type const char*. They let you assign a string literal to a char* without const, often with nothing more than a warning, even though the standard has not permitted this conversion since C++11. The program will work fine unless you attempt to change the string. Generally, the behavior of modifying string literals is undefined. It could, for example, cause a crash, or it could keep working with seemingly inexplicable side effects, or the modification could silently be ignored, or it could just work; it all depends on your compiler. For example, the following code exhibits undefined behavior:

char* ptr = "hello";       // Assign the string literal to a variable.
ptr[1] = 'a';              // Undefined behavior!

A much safer way to code is to use a pointer to const characters when referring to string literals. The following code contains the same bug, but because it assigned the literal to a const char*, the compiler catches the attempt to write to read-only memory:

const char* ptr = "hello"; // Assign the string literal to a variable.
ptr[1] = 'a';              // Error! Attempts to write to read-only memory

You can also use a string literal as an initial value for a character array (char[]). In this case, the compiler creates an array that is big enough to hold the string and copies the string to this array. The compiler does not put the literal in read-only memory and does not do any literal pooling.

char arr[] = "hello"; // Compiler takes care of creating appropriate sized
                      // character array arr.
arr[1] = 'a';         // The contents can be modified.

Raw String Literals

Raw string literals are string literals that can span across multiple lines of code, that don’t require escaping of embedded double quotes, and where escape sequences like \t and \n are processed as normal text and not as escape sequences. Escape sequences are discussed in Chapter 1. For example, if you write the following with a normal string literal, you will get a compilation error because the string contains non-escaped double quotes:

const char* str = "Hello "World"!";    // Error!

Normally you have to escape the double quotes as follows:

const char* str = "Hello \"World\"!";

With a raw string literal, you can avoid the need to escape the quotes. The raw string literal starts with R"( and ends with )".

const char* str = R"(Hello "World"!)";

If you need a string consisting of multiple lines, without raw string literals you need to embed escape sequences in your string where you want to start a new line. For example:

const char* str = "Line 1\nLine 2";

If you output this string to the console, you get the following:

Line 1
Line 2

With a raw string literal, instead of using \n escape sequences to start new lines, you can simply press Enter to start real physical new lines in your source code as follows. The output is the same as the previous code snippet using the embedded \n.

const char* str = R"(Line 1
Line 2)";

Escape sequences are ignored in raw string literals. For example, in the following raw string literal, the \t escape sequence is not replaced with a tab character, but is kept as the sequence of a backslash followed by the letter t:

const char* str = R"(Is the following a tab character? \t)";

So, if you output this string to the console, you get:

Is the following a tab character? \t

Because a raw string literal ends with )" you cannot embed a )" in your string using this syntax. For example, the following string is not valid because it contains the )" sequence in the middle of the string:

const char* str = R"(Embedded )" characters)";    // Error!

If you need embedded )" characters, you need to use the extended raw string literal syntax, which is as follows:

R"d-char-sequence(r-char-sequence)d-char-sequence"

The r-char-sequence is the actual raw string. The d-char-sequence is an optional delimiter sequence, which should be the same at the beginning and at the end of the raw string literal. This delimiter sequence can have at most 16 characters. You should choose this delimiter sequence as a sequence that will not appear in the middle of your raw string literal.

The previous example can be rewritten using a unique delimiter sequence as follows:

const char* str = R"-(Embedded )" characters)-";

Raw string literals make it easier to work with database querying strings, regular expressions, file paths, and so on. Regular expressions are discussed in Chapter 19.
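To make the comparison concrete, here is a small sketch that writes the same text once as a raw string literal and once as an ordinary literal with escapes. The function names and the path are made up for illustration.

```cpp
#include <string>

// A Windows-style path as a raw string literal: backslashes are plain text.
std::string rawPath()
{
    return R"(c:\temp\new file.txt)";
}

// The same path as an ordinary literal: every backslash must be doubled.
std::string escapedPath()
{
    return "c:\\temp\\new file.txt";
}

// The extended syntax with '-' as delimiter allows an embedded )" sequence.
std::string rawWithDelimiter()
{
    return R"-(Embedded )" characters)-";
}
```

Both path functions return identical strings. Without the raw string literal, a single forgotten backslash would silently turn \t and \n into a tab and a newline.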

The C++ std::string Class

C++ provides a much-improved implementation of the concept of a string as part of the Standard Library. In C++, std::string is a class (actually an instantiation of the basic_string class template) that supports many of the same functionalities as the <cstring> functions, but that takes care of memory allocation for you. The string class is defined in the <string> header in the std namespace, and has already been introduced in the previous chapter. Now it’s time to take a deeper look at it.

What Is Wrong with C-Style Strings?

To understand the necessity of the C++ string class, consider the advantages and disadvantages of C-style strings.

Advantages:
  • They are simple, making use of the underlying basic character type and array structure.
  • They are lightweight, taking up only the memory that they need if used properly.
  • They are low level, so you can easily manipulate and copy them as raw memory.
  • They are well understood by C programmers—why learn something new?
Disadvantages:
  • They require incredible efforts to simulate a first-class string data type.
  • They are unforgiving and susceptible to difficult-to-find memory bugs.
  • They don’t leverage the object-oriented nature of C++.
  • They require knowledge of their underlying representation on the part of the programmer.

The preceding lists were carefully constructed to make you think that perhaps there is a better way. As you’ll learn, C++ strings solve all the problems of C strings and render most of the arguments about the advantages of C strings over a first-class data type irrelevant.

Using the string Class

Even though string is a class, you can almost always treat it as if it were a built-in type. In fact, the more you think of it that way, the better off you are. Through the magic of operator overloading, C++ strings are much easier to use than C-style strings. For example, the + operator is redefined for strings to mean “string concatenation.” The following code produces 1234:

string A("12");
string B("34");
string C;
C = A + B;    // C is "1234"

The += operator is also overloaded to allow you to easily append a string:

string A("12");
string B("34");
A += B;    // A is "1234"

Another problem with C strings is that you cannot use == to compare them. Suppose you have the following two strings:

const char* a = "12";
char b[] = "12";

Writing a comparison as follows always returns false, because it compares the pointer values, not the contents of the strings:

if (a == b)

Note that C arrays and pointers are related. You can think of C arrays, like the b array in the example, as pointers to the first element in the array. Chapter 7 goes deeper into the array-pointer duality.

To compare C strings, you have to write something as follows:

if (strcmp(a, b) == 0)

Furthermore, there is no way to use <, <=, >=, or > to compare C strings, so strcmp() performs a three-way comparison, returning a value less than 0, 0, or a value greater than 0, depending on the lexicographic relationship of the strings. This results in very clumsy and hard-to-read code, which is also error-prone.

With C++ strings, operator==, operator!=, operator<, and so on are all overloaded to work on the actual string characters. Individual characters can still be accessed with operator[].
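The difference between the two styles can be sketched side by side. The three helper names below are hypothetical; they only wrap the comparisons discussed above.

```cpp
#include <cstring>
#include <string>

// C-style: compare contents with strcmp(); a result of 0 means "equal".
bool sameCString(const char* a, const char* b)
{
    return std::strcmp(a, b) == 0;
}

// C++-style: operator== compares the characters, not the addresses.
bool sameString(const std::string& a, const std::string& b)
{
    return a == b;
}

// C++-style: operator< performs a lexicographic comparison.
bool comesBefore(const std::string& a, const std::string& b)
{
    return a < b;
}
```

With the a and b variables from the earlier example, sameCString(a, b) and sameString(a, b) both report equality, whereas a == b on the raw pointers does not.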

As the following code shows, when string operations require extending the string, the memory requirements are automatically handled by the string class, so memory overruns are a thing of the past:

string myString = "hello";
myString += ", there";
string myOtherString = myString;
if (myString == myOtherString) {
    myOtherString[0] = 'H';
}
cout << myString << endl;
cout << myOtherString << endl;

The output of this code is

hello, there
Hello, there

There are several things to note in this example. One point is that there are no memory leaks even though strings are allocated and resized in a few places. All of these string objects are created as stack variables. While the string class certainly has a bunch of allocating and resizing to do, the string destructors clean up this memory when string objects go out of scope.

Another point to note is that the operators work the way you want them to. For example, the = operator copies the strings, which is most likely what you want. If you are used to working with array-based strings, this will either be refreshingly liberating for you or somewhat confusing. Don’t worry—once you learn to trust the string class to do the right thing, life gets so much easier.

For compatibility, you can use the c_str() method on a string to get a const character pointer, representing a C-style string. However, the returned const pointer becomes invalid whenever the string has to perform any memory reallocation, or when the string object is destroyed. You should call the method just before using the result so that it accurately reflects the current contents of the string, and you must never return the result of c_str() called on a stack-based string object from a function.
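The following sketch illustrates those two rules. The legacyLength() function is a made-up stand-in for a C-based interface; measure() and makeGreeting() are likewise hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for a C-based interface that expects a NUL-terminated string.
std::size_t legacyLength(const char* text)
{
    return std::strlen(text);
}

// Call c_str() right at the point of use, so the pointer is consumed
// before the string can be modified or destroyed.
std::size_t measure(const std::string& s)
{
    return legacyLength(s.c_str());
}

// Return the string object itself rather than a pointer into a local.
// Returning greeting.c_str() as a const char* would leave the caller
// with a dangling pointer once greeting is destroyed.
std::string makeGreeting(const std::string& name)
{
    std::string greeting = "Hello, " + name;
    return greeting;
}
```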

There is also a data() method which, up until C++14, always returned a const char* just as c_str(). Starting with C++17, however, data() returns a char* when called on a non-const string.

Consult a Standard Library Reference (see Appendix B) for a complete list of all supported operations that you can perform on string objects.

std::string Literals

A string literal in source code is usually interpreted as a const char*. You can use the standard user-defined literal “s” to interpret a string literal as an std::string instead.

auto string1 = "Hello World";    // string1 is a const char*
auto string2 = "Hello World"s;   // string2 is an std::string

The standard user-defined literal “s” requires a using namespace std::string_literals; or using namespace std;.
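One practical consequence, sketched below with a hypothetical greet() function, is that operator+ becomes available as soon as one operand is a std::string. Two plain string literals cannot be added, because both are character arrays.

```cpp
#include <string>

using namespace std::string_literals;

std::string greet()
{
    // "Hello"s is a std::string, so the rest of the expression
    // is ordinary string concatenation.
    return "Hello"s + ", " + "World";
    // return "Hello" + ", " + "World";  // Would not compile.
}
```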

High-Level Numeric Conversions

The std namespace includes a number of helper functions that make it easy to convert numerical values into strings or strings into numerical values. The following functions are available to convert numerical values into strings. All these functions take care of memory allocations. A new string object is created and returned from them.

  • string to_string(int val);
  • string to_string(unsigned val);
  • string to_string(long val);
  • string to_string(unsigned long val);
  • string to_string(long long val);
  • string to_string(unsigned long long val);
  • string to_string(float val);
  • string to_string(double val);
  • string to_string(long double val);

These functions are pretty straightforward to use. For example, the following code converts a long double value into a string:

long double d = 3.14L;
string s = to_string(d);

Converting in the other direction is done by the following set of functions, also defined in the std namespace. In these prototypes, str is the string that you want to convert, idx is a pointer that receives the index of the first non-converted character, and base is the mathematical base that should be used during conversion. The idx pointer can be a null pointer, in which case it will be ignored. These functions ignore leading whitespace, throw invalid_argument if no conversion could be performed, and throw out_of_range if the converted value is outside the range of the return type.

  • int stoi(const string& str, size_t *idx=0, int base=10);
  • long stol(const string& str, size_t *idx=0, int base=10);
  • unsigned long stoul(const string& str, size_t *idx=0, int base=10);
  • long long stoll(const string& str, size_t *idx=0, int base=10);
  • unsigned long long stoull(const string& str, size_t *idx=0, int base=10);
  • float stof(const string& str, size_t *idx=0);
  • double stod(const string& str, size_t *idx=0);
  • long double stold(const string& str, size_t *idx=0);

Here is an example:

const string toParse = "   123USD";
size_t index = 0;
int value = stoi(toParse, &index);
cout << "Parsed value: " << value << endl;
cout << "First non-parsed character: '" << toParse[index] << "'" << endl;

The output is as follows:

Parsed value: 123
First non-parsed character: 'U'
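Because these functions report failures by throwing, calling code usually needs a try/catch block. Here is a minimal sketch; the tryParseInt() helper is hypothetical and uses the C++17 std::optional class template to signal failure to the caller.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the parsed value, or an empty optional
// if stoi() throws.
std::optional<int> tryParseInt(const std::string& text)
{
    try {
        return std::stoi(text);
    } catch (const std::invalid_argument&) {
        return std::nullopt;   // No conversion could be performed.
    } catch (const std::out_of_range&) {
        return std::nullopt;   // The value does not fit in an int.
    }
}
```

Note that a string such as "   123USD" still parses successfully as 123, because stoi() stops at the first character it cannot convert. Check the idx output parameter if trailing characters should count as an error.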

Low-Level Numeric Conversions

The C++17 standard also provides a number of lower-level numerical conversion functions, all defined in the <charconv> header. These functions do not perform any memory allocations, but instead use buffers allocated by the caller. Additionally, they are tuned for high performance and are locale-independent (see Chapter 19 for details on localization). The end result is that these functions can be orders of magnitude faster than other higher-level numerical conversion functions. You should use these functions if you want high-performance, locale-independent conversions, for example, to serialize/deserialize numerical data to/from human-readable formats such as JSON, XML, and so on.

For converting integers to characters, the following set of functions is available:

to_chars_result to_chars(char* first, char* last, IntegerT value, int base = 10);

Here, IntegerT can be any signed or unsigned integer type or char. The result is of type to_chars_result, a type defined as follows:

struct to_chars_result {
    char* ptr;
    errc ec;
};

The ptr member is either equal to the one-past-the-end pointer of the written characters if the conversion was successful, or it is equal to last if the conversion failed (in which case, ec == errc::value_too_large).

Here is an example of its use:

std::string out(10, ' ');
auto result = std::to_chars(out.data(), out.data() + out.size(), 12345);
if (result.ec == std::errc()) { /* Conversion successful. */ }

Using C++17 structured bindings introduced in Chapter 1, you can write it as follows:

std::string out(10, ' ');
auto [ptr, ec] = std::to_chars(out.data(), out.data() + out.size(), 12345);
if (ec == std::errc()) { /* Conversion successful. */ }
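Keep in mind that to_chars() does not write a terminating NUL, so the returned ptr is what tells you where the converted text ends. The following sketch uses it to build a string of exactly the right length. The intToString() name and the buffer size of 24 are assumptions chosen for illustration.

```cpp
#include <charconv>
#include <string>
#include <system_error>

// Hypothetical helper: converts an int to its base-10 text.
std::string intToString(int value)
{
    char buffer[24];  // Assumed large enough for any int, including a sign.
    auto [ptr, ec] = std::to_chars(buffer, buffer + sizeof(buffer), value);
    if (ec != std::errc()) {
        return {};    // Conversion failed: the buffer was too small.
    }
    // The characters [buffer, ptr) are the result; nothing follows them.
    return std::string(buffer, ptr);
}
```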

Similarly, the following set of conversion functions is available for floating point types:

to_chars_result to_chars(char* first, char* last, FloatT value);
to_chars_result to_chars(char* first, char* last, FloatT value,
                         chars_format format);
to_chars_result to_chars(char* first, char* last, FloatT value,
                         chars_format format, int precision);

Here, FloatT can be float, double, or long double. Formatting can be specified with a combination of chars_format flags:

enum class chars_format {
    scientific,                  // Style: (-)d.ddde±dd
    fixed,                       // Style: (-)ddd.ddd
    hex,                         // Style: (-)h.hhhp±d (Note: no 0x!)
    general = fixed | scientific // See next paragraph
};

The default format is chars_format::general, which causes to_chars() to convert the floating point value to a decimal notation in the style of (-)ddd.ddd, or to a decimal exponent notation in the style of (-)d.ddde±dd, whichever results in the shortest representation with at least one digit before the decimal point (if present). If a format is specified but no precision, the precision is automatically determined to result in the shortest possible representation for the given format, with a maximum precision of 6 digits.

For the opposite conversion—that is, converting character sequences into numerical values—the following set of functions is available:

from_chars_result from_chars(const char* first, const char* last,
                             IntegerT& value, int base = 10);
from_chars_result from_chars(const char* first, const char* last,
                             FloatT& value,
                             chars_format format = chars_format::general);

Here, from_chars_result is a type defined as follows:

struct from_chars_result {
    const char* ptr;
    errc ec;
};

The ptr member of the result type is a pointer to the first character that was not converted, or it equals last if all characters were successfully converted. If none of the characters could be converted, ptr equals first, and the value of the error code will be errc::invalid_argument. If the parsed value is too large to be representable by the given type, the value of the error code will be errc::result_out_of_range. Note that from_chars() does not skip any leading whitespace.
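As a sketch of how ptr and ec are typically checked together, the hypothetical parseWholeInt() helper below accepts a string only if every character was consumed. It again uses std::optional to report failure.

```cpp
#include <charconv>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical helper: succeeds only if the entire text is a base-10 int.
std::optional<int> parseWholeInt(const std::string& text)
{
    int value = 0;
    const char* first = text.data();
    const char* last = text.data() + text.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc() || ptr != last) {
        return std::nullopt;  // Error, or unconverted characters remain.
    }
    return value;
}
```

Unlike stoi(), this rejects "   123" because from_chars() does not skip leading whitespace, and it rejects "123USD" because of the ptr != last check.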

The std::string_view Class

Before C++17, there was always a dilemma of choosing the parameter type for a function that accepted a read-only string. Should it be a const char*? In that case, if a client had an std::string available, they had to call c_str() or data() on it to get a const char*. Even worse, the function would lose the nice object-oriented aspects of the std::string and all its nice helper methods. Maybe the parameter could instead be a const std::string&? In that case, you always needed an std::string. If you passed a string literal, for example, the compiler silently created a temporary string object that contained a copy of your string literal and passed that object to your function, so there was a bit of overhead. Sometimes people would write multiple overloads of the same function—one that accepted a const char*, and another that accepted a const string&—but that was obviously a less-than-elegant solution.

With C++17, all those problems are solved with the introduction of the std::string_view class, which is an instantiation of the std::basic_string_view class template, and is defined in the <string_view> header. A string_view is basically a drop-in replacement for const string&, but without the overhead. It never copies strings! A string_view supports an interface similar to std::string. One exception is the absence of c_str(), but data() is available. On the other hand, string_view does add the methods remove_prefix(size_t) and remove_suffix(size_t), which shrink the string by advancing the starting pointer by a given offset, or by moving the end pointer backward by a given offset.
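As an illustration of remove_prefix() and remove_suffix(), here is a minimal sketch of a trimming function. The trimSpaces() name is hypothetical, and only the space character is treated as whitespace.

```cpp
#include <string_view>

// Hypothetical helper: returns a view without leading and trailing spaces.
std::string_view trimSpaces(std::string_view text)
{
    while (!text.empty() && text.front() == ' ') {
        text.remove_prefix(1);   // Advance the start of the view.
    }
    while (!text.empty() && text.back() == ' ') {
        text.remove_suffix(1);   // Move the end of the view backward.
    }
    return text;
}
```

No characters are copied or modified; only the view shrinks. The returned string_view still refers to the caller's original characters, so those must outlive it.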

Note that you cannot concatenate a string and a string_view. The following code does not compile:

string str = "Hello";
string_view sv = " world";
auto result = str + sv;

To make it compile, you need to replace the last line with:

auto result = str + sv.data();
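Be aware that sv.data() is treated as a C-style string in that expression, so it relies on the viewed characters being followed by a NUL. That holds for a view of a whole string literal, but not necessarily for a view into the middle of a larger buffer. A sketch of an alternative that respects the length of the view is shown below; the concat() name is hypothetical.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: appends exactly sv.size() characters to a copy of str.
std::string concat(const std::string& str, std::string_view sv)
{
    std::string result = str;
    result.append(sv.data(), sv.size());  // Pointer plus explicit length.
    return result;
}
```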

If you know how to use std::string, then using a string_view is very straightforward, as the following example code demonstrates. The extractExtension() function extracts and returns the extension of a given filename. Note that string_views are usually passed by value because they are extremely cheap to copy. They just contain a pointer to, and the length of, a string.

string_view extractExtension(string_view fileName)
{
    return fileName.substr(fileName.rfind('.'));
}

This function can be used with all kinds of different strings:

string fileName = R"(c:\temp\my file.ext)";
cout << "C++ string: " << extractExtension(fileName) << endl;

const char* cString = R"(c:\temp\my file.ext)";
cout << "C string: " << extractExtension(cString) << endl;

cout << "Literal: " << extractExtension(R"(c:\temp\my file.ext)") << endl;

There is not a single copy being made in all these calls to extractExtension(). The fileName parameter of the extractExtension() function is just a pointer and a length, and so is the return type of the function. This is all very efficient.
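One caveat: extractExtension() passes the result of rfind() straight to substr(). If the filename contains no '.', rfind() returns npos, and substr() then throws std::out_of_range. A possible variant that returns an empty view instead is sketched here; the extractExtensionOrEmpty() name is hypothetical.

```cpp
#include <string_view>

// Hypothetical variant: returns an empty view when there is no extension.
std::string_view extractExtensionOrEmpty(std::string_view fileName)
{
    const auto pos = fileName.rfind('.');
    if (pos == std::string_view::npos) {
        return {};                 // No '.' found.
    }
    return fileName.substr(pos);   // Includes the '.' itself.
}
```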

There is also a string_view constructor that accepts any raw buffer and a length. This can be used to construct a string_view out of a string buffer that is not NUL terminated. It is also useful when you do have a NUL-terminated string buffer, but you already know the length of the string, so the constructor does not need to count the number of characters again.

const char* raw = /* … */;
size_t length = /* … */;
cout << "Raw: " << extractExtension(string_view(raw, length)) << endl;

You cannot implicitly construct a string from a string_view. Either you use an explicit string constructor, or you use the string_view::data() member. For example, suppose you have the following function that accepts a const string&:

void handleExtension(const string& extension) { /* … */ }

Calling this function as follows does not work:

handleExtension(extractExtension("my file.ext"));

The following are two possible options you can use:

handleExtension(extractExtension("my file.ext").data());  // data() method
handleExtension(string(extractExtension("my file.ext"))); // explicit ctor

std::string_view Literals

You can use the standard user-defined literal “sv” to interpret a string literal as an std::string_view. For example:

auto sv = "My string_view"sv;

The standard user-defined literal “sv” requires a using namespace std::string_view_literals; or using namespace std;.

Nonstandard Strings

There are several reasons why many C++ programmers don’t use C++-style strings. Some programmers simply aren’t aware of the string type because it was not always part of the C++ specification. Others have discovered over the years that the C++ string doesn’t provide the behavior they need, and so have developed their own string type. Perhaps the most common reason is that development frameworks and operating systems tend to have their own way of representing strings, such as the CString class in the Microsoft MFC. Often, this is for backward compatibility or to address legacy issues. When starting a project in C++, it is very important to decide ahead of time how your group will represent strings. Some things are for sure:

  • You should not pick the C-style string representation.
  • You can standardize on the string functionality available in the framework you are using, such as the built-in string features of MFC, Qt, and so on.
  • If you use std::string for your strings, then use std::string_view to pass read-only strings as parameters to functions; otherwise, see if your framework has support for something similar to string_view.

SUMMARY

This chapter discussed the C++ string and string_view classes and what their benefits are compared to plain old C-style character arrays. It also explained how a number of helper functions make it easier to convert numerical values into strings and vice versa, and it introduced the concept of raw string literals.

The next chapter discusses guidelines for good coding style, including code documentation, decomposition, naming, code formatting, and other tips.
