25
Writing Efficient C++

OVERVIEW OF PERFORMANCE AND EFFICIENCY

Before delving further into the details, it’s helpful to define the terms performance and efficiency, as used in this book. The performance of a program can refer to several areas, such as speed, memory usage, disk access, and network use. This chapter focuses on speed performance. The term efficiency, when applied to programs, means running without wasted effort. An efficient program completes its tasks as quickly as possible within the given circumstances. A program can be efficient without being fast, if the application domain is inherently prohibitive to quick execution.

Note that the title of this chapter, “Writing Efficient C++,” means writing programs that run efficiently, not efficiently writing programs. That is, the time you learn to save by reading this chapter will be your users’, not your own!

Two Approaches to Efficiency

Language-level efficiency involves using the language as efficiently as possible, for example, passing objects by reference instead of by value. However, this will only get you so far. Much more important is design-level efficiency, which includes choosing efficient algorithms, avoiding unnecessary steps and computations, and selecting appropriate design optimizations. More often than not, optimizing existing code involves replacing a bad algorithm or data structure with a better, more efficient one.

Two Kinds of Programs

As I’ve noted, efficiency is important for all application domains. Additionally, there is a small subset of programs, such as system-level software, embedded systems, intensive computational applications, and real-time games, that require extremely high levels of efficiency. Most programs don’t. Unless you write those types of high-performance applications, you probably don’t need to worry about squeezing every ounce of speed out of your C++ code. Think of it as the difference between building normal family cars and building sports cars. Every car must be reasonably efficient, but sports cars require extremely high performance. You wouldn’t want to waste your time optimizing family cars for speed when they’ll never go faster than 70 miles per hour.

Is C++ an Inefficient Language?

C programmers often resist using C++ for high-performance applications. They claim that the language is inherently less efficient than C or a similar procedural language because C++ includes high-level concepts, such as exceptions and virtual methods. However, there are problems with this argument.

When discussing the efficiency of a language, you must separate the performance capabilities of the language itself from the effectiveness of its compilers at optimizing it. Recall that the C or C++ code you write is not the code that the computer executes. A compiler first translates that code into machine language, applying optimizations in the process. This means that you can’t simply run benchmarks of C and C++ programs and compare the results: you’re really comparing the compiler optimizations, not the languages themselves. C++ compilers can optimize away many of the high-level constructs in the language to generate machine code similar to, or even better than, the machine code generated from a comparable C program. These days, much more research and development is poured into C++ compilers than into C compilers, so C++ code might actually be better optimized and run faster than C code.

Critics, however, still maintain that some features of C++ cannot be optimized away. For example, as Chapter 10 explains, virtual methods require the existence of a vtable and an additional level of indirection at run time, possibly making them slower than regular non-virtual function calls. However, when you really think about it, this argument is unconvincing. Virtual method calls provide more than just a function call: they also give you a run-time choice of which function to call. A comparable non-virtual function call would need a conditional statement to decide which function to call. If you don’t need those extra semantics, you can use a non-virtual function. A general design rule in the C++ language is that “if you don’t use a feature, you don’t need to pay for it.” If you don’t use virtual methods, you pay no performance penalty for the fact that you could use them. Thus, non-virtual function calls in C++ are identical to function calls in C in terms of performance. However, because virtual function calls have such a tiny overhead, I recommend making all your class methods, including destructors but not constructors, virtual for all your non-final classes.

Far more important, the high-level constructs of C++ enable you to write cleaner programs that are more efficient at the design level, are more readable, more easily maintained, and avoid accumulating unnecessary and dead code.

I believe that you will be better served in your development, performance, and maintenance by choosing C++ instead of a procedural language such as C.

There are also other higher-level object-oriented languages, such as C# and Java, both of which run on top of a virtual machine. Compiled C++ code is executed directly by the CPU; there is no virtual machine running your code. C++ is closer to the hardware, which means that in most cases it runs faster than languages such as C# and Java.

LANGUAGE-LEVEL EFFICIENCY

Many books, articles, and programmers spend a lot of time trying to convince you to apply language-level optimizations to your code. These tips and tricks are important, and can speed up your programs in some cases. However, they are far less important than the overall design and algorithm choices in your program. You can pass-by-reference all you want, but it won’t make your program fast if you perform twice as many disk writes as you need to. It’s easy to get bogged down in references and pointers and forget about the big picture.

Furthermore, some of these language-level tricks can be performed automatically by good optimizing compilers. You should never spend time optimizing a particular area unless a profiler, discussed later in this chapter, identifies that area as a bottleneck.

That being said, using certain language-level optimizations, such as pass-by-reference, is just considered good coding style.

In this book, I’ve tried to present a balance of strategies, so I’ve included here what I feel are the most useful language-level optimizations. This list is not comprehensive, but it is a good start toward writing optimized code. However, make sure to read, and practice, the design-level efficiency advice that I offer later in this chapter as well.

Handle Objects Efficiently

C++ does a lot of work for you behind the scenes, particularly with regard to objects. You should always be aware of the performance impact of the code you write. If you follow a few simple guidelines, your code will become more efficient. Note that these guidelines are only relevant for objects, and not for primitive types such as bool, int, float, and so on.

Pass-by-Reference

The pass-by-reference rule is discussed elsewhere in this book, but it’s worth repeating here.

When you pass an object of a derived class by value as an argument for a function parameter typed as one of its base classes, the derived object is sliced to fit the base class type. This causes information to be lost; see Chapter 10 for details.

Pass-by-value also incurs copying costs that are avoided with pass-by-reference. One reason why this rule can be difficult to remember is that on the surface there doesn’t appear to be any problem when you pass-by-value. Consider a class to represent a person that looks like this:

class Person
{
    public:
        Person() = default;
        Person(std::string_view firstName, std::string_view lastName, int age);
        virtual ~Person() = default;

        std::string_view getFirstName() const { return mFirstName; }
        std::string_view getLastName() const { return mLastName; }
        int getAge() const { return mAge; }

    private:
        std::string mFirstName, mLastName;
        int mAge = 0;
};

You could write a function that takes a Person object as follows:

void processPerson(Person p)
{
    // Process the person.
}

You can call this function like this:

Person me("Marc", "Gregoire", 38); 
processPerson(me);

On the surface, there doesn’t appear to be any more code here than if you write the function like this instead:

void processPerson(const Person& p)
{
    // Process the person.
}

The call to the function remains the same. However, consider what happens when you pass-by-value in the first version of the function. In order to initialize the p parameter of processPerson(), me must be copied with a call to its copy constructor. Even though you didn’t write a copy constructor for the Person class, the compiler generates one that copies each of the data members. That still doesn’t look so bad: there are only three data members. However, two of them are strings, which are themselves objects with copy constructors. So, each of their copy constructors will be called as well. The version of processPerson() that takes p by reference incurs no such copying costs. Thus, pass-by-reference in this example avoids three copy constructor calls when the code enters the function.

And you’re still not done. Remember that p in the first version of processPerson() is a local variable to the processPerson() function, and so must be destroyed when the function exits. This destruction requires a call to the Person destructor, which will call the destructor of all of the data members. strings have destructors, so exiting this function (if you passed by value) incurs calls to three destructors. None of those calls are needed if the Person object is passed by reference.

Return-by-Reference

Just as you should pass objects by reference to functions, you should also return them by reference from functions in order to avoid copying the objects unnecessarily. Unfortunately, it is sometimes impossible to return objects by reference, such as when you write overloaded operator+ and other similar operators. And, you should never return a reference or a pointer to a local object that will be destroyed when the function exits!

Since C++11, the language has support for move semantics, which allows you to efficiently return objects by value, instead of using reference semantics.

Catch Exceptions by Reference

As noted in Chapter 14, you should catch exceptions by reference in order to avoid slicing and unnecessary copying. Throwing exceptions is heavy in terms of performance, so any little thing you can do to improve their efficiency will help.

Use Move Semantics

You should make sure your classes have a move constructor and move assignment operator to allow the C++ compiler to use move semantics with objects of those classes. According to the rule of zero (see Chapter 9), you should try to design your classes such that the compiler-generated copy and move constructors and copy and move assignment operators are sufficient. If the compiler cannot implicitly define these for a class, try to explicitly default them if that works for your class. If that is also not an option, implement them yourself. With move semantics for your objects, returning them by value from a function is efficient and does not incur large copying costs. Consult Chapter 9 for details on move semantics.

Avoid Creating Temporary Objects

The compiler creates temporary, unnamed objects in several circumstances. Chapter 9 explains that after writing a global operator+ for a class, you can add objects of that class to other types, as long as those types can be converted to objects of that class. For example, the SpreadsheetCell class definition looks in part like this:

class SpreadsheetCell
{
    public:
        // Other constructors omitted for brevity
        SpreadsheetCell(double initialValue);
        // Remainder omitted for brevity
};

SpreadsheetCell operator+(const SpreadsheetCell& lhs,
    const SpreadsheetCell& rhs);

The constructor that takes a double allows you to write code like this:

SpreadsheetCell myCell(4), aThirdCell;
aThirdCell = myCell + 5.6;
aThirdCell = myCell + 4;

The second line constructs a temporary SpreadsheetCell object from the 5.6 argument; it then calls the operator+ with myCell and this temporary object as arguments. The result is stored in aThirdCell. The third line does the same thing, except that 4 must be coerced to a double in order to call the double constructor of the SpreadsheetCell.

The important point in this example is that the compiler generates code to create an extra, unnamed SpreadsheetCell object for both addition operations in this example. That object must be constructed and destructed with calls to its constructor and destructor. If you’re still skeptical, try inserting cout statements in your constructor and destructor, and watch the printout.

In general, the compiler constructs a temporary object whenever your code converts a variable of one type to another type for use in a larger expression. This rule applies mostly to function calls. For example, suppose that you write a function with the following prototype:

void doSomething(const SpreadsheetCell& s);

You can call this function like this:

doSomething(5.56);

The compiler constructs a temporary SpreadsheetCell object from 5.56 using the double constructor. This temporary object is then passed to doSomething(). Note that if you remove the const from the s parameter, you can no longer call doSomething() with a constant; you must pass a variable.

You should generally attempt to avoid cases in which the compiler is forced to construct temporary objects. Although it is impossible to avoid in some situations, you should at least be aware of the existence of this “feature” so you aren’t surprised by performance and profiling results.

Move semantics is used by the compiler to make working with temporary objects more efficient. That’s another reason to make sure your classes support move semantics. See Chapter 9 for details.

The Return-Value Optimization

A function that returns an object by value can cause the creation of a temporary object. Continuing with the Person example from earlier, consider this function:

Person createPerson()
{
    Person newP("Marc", "Gregoire", 38);
    return newP;
}

Suppose that you call it like this (assuming that operator<< is implemented for the Person class):

cout << createPerson();

Even though this call does not store the result of createPerson() anywhere, the result must be stored somewhere in order to pass it to operator<<. In order to generate code for this behavior, the compiler is allowed to create a temporary variable in which to store the Person object returned from createPerson().

Even if the result of the function is not used anywhere, the compiler might still generate code to create the temporary object. For example, suppose that you have this code:

createPerson();

The compiler might generate code to create a temporary object for the return value, even though it is not used.

However, you usually don’t need to worry about this issue because in most cases, the compiler optimizes away the temporary variable to avoid all copying and moving. For the createPerson() example, this optimization is called named return value optimization (NRVO) because the return statement returns a named variable. When the argument to the return statement is a nameless temporary, the optimization is called return value optimization (RVO). These kinds of optimizations are usually enabled only for release builds. For NRVO to work, the argument to the return statement must be a single local variable. For example, in the following case, the compiler cannot perform NRVO:

Person createPerson()
{
    Person person1;
    Person person2;
    return getRandomBool() ? person1 : person2;
}

If NRVO and RVO are not applicable, then copying or moving will happen. If the object you want to return from a function supports move semantics, then it is moved out of the function instead of copied.

Pre-allocate Memory

One of the main advantages of using containers such as those from the C++ Standard Library is that they handle all memory management for you. The containers grow automatically when you add more elements to them. However, sometimes this causes a performance penalty. For example, an std::vector container stores its elements contiguously in memory. If it needs to grow in size, it needs to allocate a new block of memory, and then move (or copy) all elements to this new memory. This has serious performance implications, for example, if you use push_back() in a loop to add millions of elements to a vector.

If you know in advance how many elements you are going to add to a vector, or if you have a rough estimate, you should pre-allocate enough memory before starting to add your elements. A vector has a capacity, the number of elements it can hold without reallocating, and a size, the actual number of elements in the container. You can pre-allocate memory by changing the capacity with the reserve() method, or by resizing the vector with resize(). See Chapter 17 for details.

Use Inline Methods and Functions

As described in Chapter 9, the code for an inline method or function is inserted directly into the code where it is called, avoiding the overhead of a function call. You should mark as inline all functions and methods that you think can qualify for this optimization. However, do not overuse this feature, because it basically throws away a fundamental design principle, which states that the interface and the implementation should be separated, such that the implementation can evolve without any changes to the interface. Consider using this feature only for often-used, fundamental classes. Also, remember that inlining requests by the programmer are only a recommendation to the compiler, which is allowed to refuse them.

On the other hand, some compilers inline appropriate functions and methods during their optimization steps, even if those functions aren’t marked with the inline keyword, and even if those functions are implemented in a source file instead of a header file. Thus, you should read your compiler documentation before wasting a lot of effort deciding which functions to inline.

DESIGN-LEVEL EFFICIENCY

The design choices in your program affect its performance far more than do language details such as pass-by-reference. For example, if you choose an algorithm for a fundamental task in your application that runs in O(n²) time instead of a simpler one that runs in O(n) time, you could potentially perform the square of the number of operations that you really need. To put numbers on that, a task that uses an O(n²) algorithm and performs one million operations would perform only one thousand with an O(n) algorithm. Even if that operation is optimized beyond recognition at the language level, the simple fact that you perform one million operations when a better algorithm would use only one thousand will make your program very inefficient. Always choose your algorithms carefully. Refer to Part II, specifically Chapter 4, of this book for a detailed discussion of algorithm design choices and big-O notation.

In addition to your choice of algorithms, design-level efficiency includes specific tips and tricks. Instead of writing your own data structures and algorithms, you should use existing ones, such as those from the C++ Standard Library, the Boost libraries, or other libraries, as much as possible because they are written by experts. These libraries have been, and are being, used a lot, so you can expect most bugs to have been discovered and fixed. You should also think about incorporating multithreading in your design to take full advantage of all processing power available on the machine. See Chapter 23 for more details. The remainder of this section presents two more design techniques for optimizing your program: caching and using object pools.

Cache Where Necessary

Caching means storing items for future use in order to avoid retrieving or recalculating them. You might be familiar with the principle from its use in computer hardware. Modern computer processors are built with memory caches that store recently and frequently accessed memory values in a location that is quicker to access than main memory. Most memory locations that are accessed at all are accessed more than once in a short time period, so caching at the hardware level can significantly speed up computations.

Caching in software follows the same approach. If a task or computation is particularly slow, you should make sure that you are not performing it more than necessary. Store the results in memory the first time you perform the task so that they are available for future needs. Here is a list of tasks that are usually slow.

  • Disk access: You should avoid opening and reading the same file more than once in your program. If memory is available, save the file contents in RAM if you need to access it frequently.
  • Network communication: Whenever you need to communicate over a network, your program is subject to the vagaries of the network load. Treat network accesses like file accesses, and cache as much static information as possible.
  • Mathematical computations: If you need the result of a very complex computation in more than one place, perform the calculation once and share the result. However, if it’s not very complex, then it’s probably faster to just calculate it instead of retrieving it from a cache. Use a profiler to be sure.
  • Object allocation: If you need to create and use a large number of short-lived objects in your program, consider using an object pool, described later in this chapter.
  • Thread creation: Creating threads is slow. You can “cache” threads in a thread-pool, similar to caching objects in an object-pool.

One common problem with caching is that the data you store often comprises only copies of the underlying information. The original data might change during the lifetime of the cache. For example, you might want to cache the values in a configuration file so that you don’t need to read it repeatedly. However, the user might be allowed to change the configuration file while your program is running, which would make your cached version of the information obsolete. In cases like this, you need a mechanism for cache invalidation: when the underlying data changes, you must either stop using your cached information or repopulate your cache.

One technique for cache invalidation is to request that the entity managing the underlying data notify your program of any change. It could do this through a callback that your program registers with the manager. Alternatively, your program could poll for certain events that would trigger it to repopulate the cache automatically. Regardless of your specific cache-invalidation technique, make sure that you think about these issues before relying on a cache in your program.

Use Object Pools

There are different kinds of object pools. One kind allocates a big chunk of memory at once, in which it creates smaller objects in place. These objects can be handed out to clients and reused when the clients are done with them, without incurring any additional calls to the memory manager to allocate or deallocate memory for individual objects.

This section describes another kind of object pool. If your program needs a large number of short-lived objects of the same type that have an expensive constructor (for example, a constructor creating a large, pre-sized vector for storing data), and a profiler confirms that allocating and deallocating these objects is a bottleneck, then you can create a pool, or cache, of those objects. Whenever you need an object in your code, you ask the pool for one. When you are done with the object, you return it to the pool. The object pool creates the objects only once, so their constructor is called only once, not each time they are used. Thus, object pools are appropriate when the constructor performs some setup actions that apply to many uses of the object, and when you can set instance-specific parameters on the object through non-constructor method calls.

An Object Pool Implementation

This section provides an implementation of an object pool class template that you can use in your programs. The pool hands out objects via the acquireObject() method. If acquireObject() is called but there are no free objects, then the pool allocates another instance of the object. acquireObject() returns an Object that is an std::shared_ptr with a custom deleter. The custom deleter doesn’t actually delete the memory; it simply puts the object back on the list of free objects.

The most difficult aspect of an object pool implementation is keeping track of which objects are free and which are in use. This implementation takes the approach of storing free objects in a queue. Each time a client requests an object, the pool gives that client the top object from the queue. The code uses the std::queue class from the Standard Library, as discussed in Chapter 17. Because this standard data structure is used, this implementation is not thread safe. One way to make it thread safe is to use a lock-free concurrent queue. However, the Standard Library does not provide any concurrent data structures, so you’ll have to use third-party libraries.

Here is the class definition, with comments that explain the details. The class template is parameterized on the class type from which the objects in the pool are to be constructed.

#include <queue>
#include <memory>

// Provides an object pool that can be used with any class that provides a
// default constructor.
//
// acquireObject() returns an object from the list of free objects. If
// there are no more free objects, acquireObject() creates a new instance.
// The pool only grows: objects are never removed from the pool (freed),
// until the pool is destroyed.
// acquireObject() returns an Object which is an std::shared_ptr with a
// custom deleter that automatically puts the object back into the object
// pool when the shared_ptr is destroyed and its reference count reaches 0.
//
// The constructor and destructor on each object in the pool will be called
// only once each for the lifetime of the object pool, not once per
// acquisition and release.
//
// The primary use of an object pool is to avoid creating and deleting
// objects repeatedly. This object pool is most suited to applications that
// use large numbers of objects with expensive constructors for short
// periods of time, and if a profiler tells you that allocating and
// deallocating these objects is a bottleneck.
template <typename T>
class ObjectPool
{
    public:
        ObjectPool() = default;
        virtual ~ObjectPool() = default;

        // Prevent assignment and pass-by-value
        ObjectPool(const ObjectPool<T>& src) = delete;
        ObjectPool<T>& operator=(const ObjectPool<T>& rhs) = delete;

        // The type of smart pointer returned by acquireObject().
        using Object = std::shared_ptr<T>;

        // Reserves and returns an object for use.
        Object acquireObject();
    private:
        // Stores the objects that are not currently in use by clients.
        std::queue<std::unique_ptr<T>> mFreeList;
};

When using this object pool, you have to make sure that the object pool itself outlives all the objects handed out by the pool. A user of the object pool specifies through the template parameter the name of the class from which objects can be created.

acquireObject() returns the top object from the free list, first allocating a new object if there are no free objects:

template <typename T>
typename ObjectPool<T>::Object ObjectPool<T>::acquireObject()
{
    if (mFreeList.empty()) {
        mFreeList.emplace(std::make_unique<T>());
    }

    // Move next free object from the queue to a local unique_ptr.
    std::unique_ptr<T> obj(std::move(mFreeList.front()));
    mFreeList.pop();

    // Convert the object pointer to an Object (a shared_ptr with
    // a custom deleter).
    Object smartObject(obj.release(), [this](T* t){
        // The custom deleter doesn't actually deallocate the
        // memory, but simply puts the object back on the free list.
        mFreeList.emplace(t);
    });

    // Return the Object.
    return smartObject;
}

Using the Object Pool

Consider an application that uses a lot of short-lived objects with an expensive constructor. Let’s assume the ExpensiveObject class definition looks as follows:

class ExpensiveObject
{
    public:
        ExpensiveObject() { /* Expensive construction … */ }
        virtual ~ExpensiveObject() = default;
        // Methods to populate the object with specific information.
        // Methods to retrieve the object data.
        // (not shown)
    private:
        // Data members (not shown)
};

Instead of creating and deleting large numbers of such objects throughout the lifetime of your program, you can use the object pool developed in the previous section. Your program structure could be something like this:

ObjectPool<ExpensiveObject>::Object
    getExpensiveObject(ObjectPool<ExpensiveObject>& pool)
{
    // Obtain an ExpensiveObject object from the pool.
    auto object = pool.acquireObject();

    // Populate the object. (not shown)

    return object;
}

void processExpensiveObject(ObjectPool<ExpensiveObject>::Object& object)
{
    // Process the object. (not shown)
}

int main()
{
    ObjectPool<ExpensiveObject> requestPool;

    {
        std::vector<ObjectPool<ExpensiveObject>::Object> objects;
        for (size_t i = 0; i < 10; ++i) {
            objects.push_back(getExpensiveObject(requestPool));
        }
    }

    for (size_t i = 0; i < 100; ++i) {
        auto req = getExpensiveObject(requestPool);
        processExpensiveObject(req);
    }
    return 0;
}

The first part of this main() function contains an inner code block which creates ten expensive objects and stores them in the objects container. Because all created Objects are stored in a vector and thus kept alive, the object pool is forced to create ten ExpensiveObject instances. At the closing brace of the inner code block, the vector goes out of scope, and all Objects contained in it are automatically released back to the object pool.

In the second for loop, the Objects (= shared_ptrs) returned by getExpensiveObject() go out of scope at the end of each iteration of the for loop, and so are automatically released back to the pool. If you add an output statement to the constructor of the ExpensiveObject class, you’ll see that the constructor is only called ten times during the entire program, even though the second for loop in main() loops a hundred times.

PROFILING

It is good to think about efficiency as you design and code. There is no point in writing obviously inefficient programs if this can be avoided with some common sense, or experience-based intuition. However, I urge you not to get too obsessed with performance during the design and coding phases. It’s best to first make a clean, well-structured design and implementation, then use a profiler, and only optimize parts that are flagged by the profiler as being performance bottlenecks. Remember the “90/10” rule, introduced in Chapter 4, which states that 90 percent of the running time of most programs is spent in only 10 percent of the code (Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition, Morgan Kaufmann, 2006). This means that you could optimize 90 percent of your code, but still only improve the running time of the program by 10 percent. Obviously, you want to optimize the parts of the code that are exercised the most for the specific workload that you expect the program to run.

Consequently, it is often helpful to profile your program to determine which parts of the code require optimization. There are many profiling tools available that analyze programs as they run in order to generate data about their performance. Most profiling tools provide analysis at the function level by specifying the amount of time (or percent of total execution time) spent in each function in the program. After running a profiler on your program, you can usually tell immediately which parts of the program need optimization. Profiling before and after optimizing is essential to prove that your optimizations had an effect.

If you are using Microsoft Visual C++ 2017, you already have a great built-in profiler, which is discussed later in this chapter. If you are not using Visual C++, Microsoft has a Community edition available which is free of charge for students, open-source developers, and individual developers to create both free and paid applications. It’s also free of charge for up to five users in small organizations. Another great profiling tool is Rational PurifyPlus from IBM. There are also a number of smaller free profiling tools available: Very Sleepy and Luke Stackwalker are popular profilers for Windows, Valgrind and gprof (GNU profiler) are well-known profilers for Unix/Linux systems, and there are plenty of other choices.

Profiling Example with gprof

The power of profiling can best be seen with a real coding example. As a disclaimer, the performance bugs in the first implementation shown are not subtle! Real efficiency issues would probably be more complex, but a program long enough to demonstrate them would be too lengthy for this book.

Suppose that you work for the United States Social Security Administration. Every year the administration puts up a website that allows users to look up the popularity of new baby names from the previous year. Your job is to write the back-end program that looks up names for users. Your input is a file containing the name of every new baby. This file will obviously contain redundant names. For example, in the file for boys for 2003, the name Jacob was the most popular, showing up 29,195 times. Your program must read the file to construct an in-memory database. A user may then request the absolute number of babies with a given name, or the rank of that name among all the babies.

First Design Attempt

A logical design for this program consists of a NameDB class with the following public methods:

#include <string_view>

class NameDB
{
    public:
        // Reads list of baby names in nameFile to populate the database.
        // Throws invalid_argument if nameFile cannot be opened or read.
        NameDB(std::string_view nameFile);

        // Returns the rank of the name (1st, 2nd, etc).
        // Returns -1 if the name is not found.
        int getNameRank(std::string_view name) const;

        // Returns the number of babies with this name.
        // Returns -1 if the name is not found.
        int getAbsoluteNumber(std::string_view name) const;

        // Protected and private members and methods not shown
};

The hard part is choosing a good data structure for the in-memory database. My first attempt is a vector of name/count pairs. Each entry in the vector stores one of the names, along with a count of the number of times that name shows up in the raw data file. Here is the complete class definition with this design:

#include <string_view>
#include <string>
#include <vector>
#include <utility>

class NameDB
{
    public:
        NameDB(std::string_view nameFile);
        int getNameRank(std::string_view name) const;
        int getAbsoluteNumber(std::string_view name) const;
    private:
        std::vector<std::pair<std::string, int>> mNames;

        // Helper methods
        bool nameExists(std::string_view name) const;
        void incrementNameCount(std::string_view name);
        void addNewName(std::string_view name);
};

Note the use of the Standard Library vector and pair classes, both of which are discussed in Chapter 17. A pair is a utility class that combines two values of possibly different types.

Here are the implementations of the constructor and the helper methods nameExists(), incrementNameCount(), and addNewName(). The loops in nameExists() and incrementNameCount() iterate over all the elements of the vector.

#include "NameDB.h"
#include <fstream>
#include <stdexcept>
using namespace std;

// Reads the names from the file and populates the database.
// The database is a vector of name/count pairs, storing the
// number of times each name shows up in the raw data.
NameDB::NameDB(string_view nameFile)
{
    // Open the file and check for errors.
    ifstream inputFile(nameFile.data());
    if (!inputFile) {
        throw invalid_argument("Unable to open file");
    }

    // Read the names one at a time.
    string name;
    while (inputFile >> name) {
        // Look up the name in the database so far.
        if (nameExists(name)) {
            // If the name exists in the database, just increment the count.
            incrementNameCount(name);
        } else {
            // If the name doesn't yet exist, add it with a count of 1.
            addNewName(name);
        }
    }
}

// Returns true if the name exists in the database, false otherwise.
bool NameDB::nameExists(string_view name) const
{
    // Iterate through the vector of names looking for the name.
    for (auto& entry : mNames) {
        if (entry.first == name) {
            return true;
        }
    }
    return false;
}

// Precondition: name exists in the vector of names.
// Postcondition: the count associated with name is incremented.
void NameDB::incrementNameCount(string_view name)
{
    for (auto& entry : mNames) {
        if (entry.first == name) {
            entry.second++;
            return;
        }
    }
}

// Adds a new name to the database.
void NameDB::addNewName(string_view name)
{
    mNames.push_back(make_pair(name.data(), 1));
}

Note that in the preceding example, you could use an algorithm like find_if() to accomplish the same thing as the loops in nameExists() and incrementNameCount(). The loops are shown explicitly in order to emphasize the performance problems.

You might have noticed some performance problems already. What if there are hundreds of thousands of names? The many linear searches involved in populating the database will become slow.

In order to complete the example, here are the implementations of the two public methods:

// Returns the rank of the name.
// First looks up the name to obtain the number of babies with that name.
// Then iterates through all the names, counting all the names with a higher
// count than the specified name. Returns that count as the rank.
int NameDB::getNameRank(string_view name) const
{
    // Make use of the getAbsoluteNumber() method.
    int num = getAbsoluteNumber(name);

    // Check if we found the name.
    if (num == -1) {
        return -1;
    }

    // Now count all the names in the vector that have a
    // count higher than this one. If no name has a higher count,
    // this name is rank number 1. Every name with a higher count
    // decreases the rank of this name by 1.
    int rank = 1;
    for (auto& entry : mNames) {
        if (entry.second > num) {
            rank++;
        }
    }
    return rank;
}

// Returns the count associated with this name.
int NameDB::getAbsoluteNumber(string_view name) const
{
    for (auto& entry : mNames) {
        if (entry.first == name) {
            return entry.second;
        }
    }
    return -1;
}

Profiling the First Design Attempt

In order to test the program, you need a main() function:

#include "NameDB.h"
#include <iostream>
using namespace std;

int main()
{
    NameDB boys("boys_long.txt");
    cout << boys.getNameRank("Daniel") << endl;
    cout << boys.getNameRank("Jacob") << endl;
    cout << boys.getNameRank("William") << endl;
    return 0;
}

This main() function creates one NameDB database called boys, telling it to populate itself with the file boys_long.txt, which contains 500,500 names.

There are three steps to using gprof:

  1. Compile your program with a special flag that causes it to log raw execution information when it is run. When using GCC as your compiler, the flag is -pg, as in this example:
    > g++ -std=c++17 -pg -o namedb NameDB.cpp NameDBTest.cpp
  2. Run your program. This should generate a file called gmon.out in the working directory. Be patient when you run the program because this first version is slow.
  3. Run the gprof command. This final step enables you to analyze the gmon.out profiling information and produce a (somewhat) readable report. gprof outputs to standard out, so you should redirect the output to a file:
> gprof namedb gmon.out > gprof_analysis.out

Now you can analyze the data. Unfortunately, the output file is somewhat cryptic and intimidating, so it takes a little while to learn how to interpret it. gprof provides two separate sets of information. The first set summarizes the amount of time spent executing each function in the program. The second and more useful set summarizes the amount of time spent executing each function and its descendants; this set is also called a call graph. Here is some of the output from the gprof_analysis.out file, edited to make it more readable. Note that the numbers will be different on your machine.

index  %time    self  children    called     name 
[1]    100.0    0.00   14.06                 main [1]
                0.00   14.00       1/1           NameDB::NameDB [2]
                0.00    0.04       3/3           NameDB::getNameRank [25]
                0.00    0.01       1/1           NameDB::~NameDB [28] 

The following list explains the different columns.

  • index: an index to be able to refer to this entry in the call graph.
  • %time: the percentage of the total execution time of the program required by this function and its descendants.
  • self: how many seconds the function itself was executing.
  • children: how many seconds the descendants of this function were executing.
  • called: how often this function was called.
  • name: the name of the function. If the name of the function is followed by a number between square brackets, that number refers to another index in the call graph.

The preceding extract tells you that main() and its descendants took 100 percent of the total execution time of the program, for a total of 14.06 seconds. The second line shows that the NameDB constructor took 14.00 of those 14.06 seconds. So, it’s immediately clear where the performance problem lies. To track down which part of the constructor is taking so long, you need to jump to the call graph entry with index 2, because that’s the number shown in square brackets after the name in the last column. The call graph entry with index 2 is as follows on my test system:

[2] 99.6    0.00   14.00       1         NameDB::NameDB [2]
            1.20    6.14  500500/500500      NameDB::nameExists [3]
            1.24    5.24  499500/499500      NameDB::incrementNameCount [4]
            0.00    0.18    1000/1000        NameDB::addNewName [19]
            0.00    0.00       1/1           vector::vector [69]

The nested entries below NameDB::NameDB show which of its descendants took the most time. Here you can see that nameExists() took 6.14 seconds, and incrementNameCount() took 5.24 seconds. These times are the sums of all the calls to the functions. The fourth column in those lines shows the number of calls to the function (500,500 to nameExists() and 499,500 to incrementNameCount()). No other function took a significant amount of time.

Without going any further in this analysis, two things should jump out at you:

  1. Taking 14 seconds to populate the database of approximately 500,000 names is slow. Perhaps you need a better data structure.
  2. nameExists() and incrementNameCount() take an almost identical amount of time, and are called almost the same number of times. If you think about the application domain, that makes sense: most names in the input text file are duplicates, so the vast majority of the calls to nameExists() are followed by a call to incrementNameCount(). If you look back at the code, you can see that these functions are almost identical; they could probably be combined. In addition, most of what they are doing is searching the vector. It would probably be better to use a sorted data structure to reduce the searching time.

Second Design Attempt

With these two observations from the gprof output, it’s time to redesign the program. The new design uses a map instead of a vector. Chapter 17 explains that the Standard Library map keeps its entries sorted and provides O(log n) lookup, instead of the O(n) searches in a vector. A good exercise for you to try would be to use a std::unordered_map, which has expected O(1) lookups, and to use a profiler to see whether that is faster than std::map for this application.

The new version of the program also combines nameExists() and incrementNameCount() into one nameExistsAndIncrement() method.

Here is the new class definition:

#include <string_view>
#include <string>
#include <map>

class NameDB
{
    public:
        NameDB(std::string_view nameFile);
        int getNameRank(std::string_view name) const;
        int getAbsoluteNumber(std::string_view name) const;
    private:
        std::map<std::string, int> mNames;
        bool nameExistsAndIncrement(std::string_view name);
        void addNewName(std::string_view name);
};

Here are the new method implementations:

#include "NameDB.h"
#include <fstream>
#include <stdexcept>
using namespace std;

// Reads the names from the file and populates the database.
// The database is a map associating names with their frequency.
NameDB::NameDB(string_view nameFile)
{
    // Open the file and check for errors.
    ifstream inputFile(nameFile.data());
    if (!inputFile) {
        throw invalid_argument("Unable to open file");
    }

    // Read the names one at a time.
    string name;
    while (inputFile >> name) {
        // If the name is already in the database,
        // nameExistsAndIncrement() increments its count and
        // returns true. Otherwise, add the name with a count of 1.
        if (!nameExistsAndIncrement(name)) {
            addNewName(name);
        }
    }
}

// Returns true if the name exists in the database, false
// otherwise. If it finds it, it increments it.
bool NameDB::nameExistsAndIncrement(string_view name)
{
    // Find the name in the map.
    auto res = mNames.find(name.data());
    if (res != end(mNames)) {
        res->second++;
        return true;
    }
    return false;
}

// Adds a new name to the database.
void NameDB::addNewName(string_view name)
{
    mNames[name.data()] = 1;
}

int NameDB::getNameRank(string_view name) const
{
    // Implementation omitted, same as before.
}

// Returns the count associated with this name.
int NameDB::getAbsoluteNumber(string_view name) const
{
    auto res = mNames.find(name.data());
    if (res != end(mNames)) {
        return res->second;
    }
    return -1;
}

Profiling the Second Design Attempt

By following the same steps shown earlier, you can obtain the gprof performance data on the new version of the program. The data is quite encouraging:

index %time  self  children    called        name 
[1]   100.0  0.00    0.21                    main [1]
             0.02    0.18      1/1           NameDB::NameDB [2]
             0.00    0.01      1/1           NameDB::~NameDB [13]
             0.00    0.00      3/3           NameDB::getNameRank [28] 
[2]    95.2  0.02    0.18      1             NameDB::NameDB [2]
             0.02    0.16 500500/500500      NameDB::nameExistsAndIncrement [3]
             0.00    0.00   1000/1000        NameDB::addNewName [24]
             0.00    0.00      1/1           map::map [87]

If you run this on your machine, the output will be different. It’s even possible that you will not see the data for NameDB methods in your output. Because of the efficiency of this second attempt, the timings are getting so small that you might see more map methods in the output than NameDB methods.

On my test system, main() now takes only 0.21 seconds—a 67-fold improvement! There are certainly further improvements that you could make on this program. For example, the current constructor performs a lookup to see if the name is already in the map, and if not, adds it to the map. You could combine these two operations simply with the following single line:

++mNames[name];

If the name is already in the map, this statement just increments its counter. If the name is not yet in the map, this statement first adds an entry to the map with the given name as key and a zero-initialized value, and then increments the value, resulting in a counter of 1.

With this improvement, you can remove the nameExistsAndIncrement() and addNewName() methods, and change the constructor as follows:

NameDB::NameDB(string_view nameFile)
{
    // Open the file and check for errors
    ifstream inputFile(nameFile.data());
    if (!inputFile) {
        throw invalid_argument("Unable to open file");
    }

    // Read the names one at a time.
    string name;
    while (inputFile >> name) {
        ++mNames[name];
    }
}

getNameRank() still uses a loop that iterates over all elements in the map. A good exercise for you to try is to come up with another data structure so that the linear iteration in getNameRank() can be avoided.

Profiling Example with Visual C++ 2017

Most editions of Microsoft Visual C++ 2017 come with a great built-in profiler, which is briefly discussed in this section. The VC++ profiler has a complete graphical user interface. This book does not recommend one profiler over another, but it is always good to have an idea of what a command line–based profiler like gprof can provide in comparison with a GUI-based profiler like the one included with VC++.

To start profiling an application in Visual C++ 2017, you first need to open the project in Visual Studio. This example uses the same NameDB code as in the first inefficient design attempt from earlier in the chapter. That code is not repeated here. Once your project is opened in Visual Studio, click on the Analyze menu, and then choose Performance Profiler. A new window appears, as shown in Figure 25-1.

In this new window, enable the “Performance Wizard” option and click the Start button. This starts a wizard, as shown in Figure 25-2.

c25-fig-0001

FIGURE 25-1

c25-fig-0002

FIGURE 25-2

Depending on your version of VC++ 2017, there are a number of different profiling methods available. The following non-exhaustive list explains three of them.

  • CPU Sampling: This method is used to monitor applications with low overhead. This means that the act of profiling the application will not have a big performance impact on the target application.
  • Instrumentation: This method adds extra code to the application to be able to accurately count the number of function calls and to time individual function calls. However, this method has a much bigger performance impact on the application. It is recommended to use the CPU Sampling method first to get an idea about the bottlenecks in your application. If that method does not give you enough information, you can try the Instrumentation method.
  • Resource contention data (concurrency): This method allows you to graphically monitor multithreaded applications. It allows you to see which threads are running, which threads are waiting for something, and so on.

For this profiling example, leave the default CPU sampling method selected and click the Next button. The next page of the wizard asks you to select the application that you want to profile. Here you should select your NameDB project and click the Next button. On the last page of the wizard, you can enable the “Launch profiling after the wizard finishes” option and then click the Finish button. You may get a message saying that you don’t have the right credentials for profiling, and asking whether you would like to upgrade your credentials. If you get this message, you should allow VC++ to upgrade your credentials; otherwise, the profiler will not work.

When the program execution is finished, Visual Studio automatically opens the profiling report. Figure 25-3 shows what this report might look like when profiling the first design attempt of the NameDB application.

From this report, you can immediately see the hot path. Just as with gprof, it shows that the NameDB constructor takes up most of the running time of the program, and that incrementNameCount() and nameExists() each take almost the same amount of time. The Visual Studio profiling report is interactive. For example, you can drill down into the NameDB constructor by clicking on it. This results in a drill-down report for that function, as shown in Figure 25-4.

This drill-down view shows a graphical breakdown at the top, and the actual code of the method at the bottom. The code view shows the percentage of the running time spent on each line. The lines consuming the most time are shown in red. When you are interactively browsing the profiling report, you can always go back by using the back arrow in the top-left corner of the report.

c25-fig-0003

FIGURE 25-3

At the top of the report there is also a drop-down menu, which you can use to quickly jump to certain summary or details pages.

If you go back to the “Summary” page of the profiling report, you can see that there is a “Show Trimmed Call Tree” link on the right. Clicking that link displays a trimmed call tree showing you an alternative view of the hot path in your code, as shown in Figure 25-5.

c25-fig-0004

FIGURE 25-4

c25-fig-0005

FIGURE 25-5

Also in this view, you immediately see that main() is calling the NameDB constructor, which is using up most of the running time.

SUMMARY

This chapter discussed the key aspects of efficiency and performance in C++ programs, and provided several specific tips and techniques for designing and writing more efficient applications. Hopefully you gained an appreciation for the importance of performance and for the power of profiling tools.

There are two important things to remember from this chapter. The first thing is that you should not get too obsessed with performance while designing and coding. It’s recommended to first make a correct, well-structured design and implementation, then use a profiler, and only optimize those parts that are flagged by a profiler as being a performance bottleneck.

The second thing to remember is that design-level efficiency is far more important than language-level efficiency. For example, you shouldn’t use algorithms or data structures with bad complexity if there are better ones available.
