How it works...

One of the best ways to learn how to optimize your C++ code is to learn how to analyze the resulting assembly code that the compiler generates after compilation. In this recipe, we will learn how this analysis is done by looking at two different examples: loop unrolling and pass-by-reference parameters.

Before we look at these examples, let's look at a simple example:

int main(void)
{ }

In the preceding example, we have nothing more than a main() function. We haven't included any C or C++ libraries, and the main() function itself is empty. If we compile this example, we will see that the resulting binary is still fairly large; in this case, it is 22 kB in size. To see the assembly that the compiler generated for this code, we can do the following:

> objdump -d recipe02_example01

The resulting output of the preceding command may be surprising, as there is a lot of code for an application that does absolutely nothing.

To get a better feel for how much code there really is, we can refine the output by using grep, a tool that lets us filter text from any command. Let's look at all of the functions in the code:

As we can see, there are several functions the compiler automatically adds to the code for you. This includes the _init(), _fini(), and _start() functions. We can also look at a specific function, such as our main function, as follows:

Here, we search the output of objdump for main>: and RETQ. All function labels in the output end with >:, and the last instruction of each function is typically RETQ on an Intel 64-bit system.

The following is the resulting assembly: 

  401106: push %rbp
  401107: mov  %rsp,%rbp

First, the function saves the current stack frame pointer (rbp) on the stack and then loads it with the current value of the stack pointer (rsp) for the main() function.

This sequence can be seen at the start of every function and is called the function's prologue. The only code that main() executes is return 0;, which the compiler added automatically:

  40110a: mov $0x0,%eax

Finally, the last instructions in this function make up the function's epilogue, which restores the stack frame pointer and returns:


  40110f: pop %rbp
  401110: retq

Now that we have a better understanding of how to obtain and read the assembly for compiled C++, let's look at an example of loop unrolling, which is the process of replacing a loop with an equivalent sequence of instructions that avoids the loop. To do this, ensure that the examples are compiled in release mode (that is, with compiler optimizations enabled) by configuring them using the following commands:

> cmake -DCMAKE_BUILD_TYPE=Release .
> make

To understand loop unrolling, let's look at the following code:

volatile int data[1000];

int main(void)
{
    for (auto i = 0U; i < 1000; i++) {
        data[i] = 42;
    }
}

When the compiler encounters this loop, it generates assembly that implements each part of the for statement. Let's break this down:

  401020: xor  %eax,%eax
  401022: nopw 0x0(%rax,%rax,1)

The first two instructions implement the for (auto i = 0U; portion of the code. Here, the i variable lives in the EAX register and is set to 0 using the XOR instruction (XORing a register with itself is a smaller, and on Intel at least as fast, way to zero a register than a MOV of 0). The NOPW instruction is multi-byte padding that aligns the top of the loop and can safely be ignored.

The next couple of instructions are interleaved, as follows:

  401028: mov  %eax,%edx
  40102a: add  $0x1,%eax
  40102d: movl $0x2a,0x404040(,%rdx,4)

These instructions implement the i++ and data[i] = 42; statements. The first instruction copies the current value of i into EDX, the second increments i by one, and the third stores 42 into the memory address indexed by RDX. Conveniently, this assembly reveals a possible optimization opportunity, as the compiler could have achieved the same result with the following:

  movl $0x2a,0x404040(,%rax,4)
  add  $0x1,%eax

The preceding code stores the value 42 before executing i++, thus removing the need for the following:

  mov %eax,%edx

A number of methods exist to realize this potential optimization, including using a different compiler or handwriting the assembly. The next set of instructions execute the i < 1000; portion of our for loop:

  401038: cmp $0x3e8,%eax
  40103d: jne 401028 <main+0x8>

The CMP instruction compares the i variable with 1000 (0x3e8) and, if the values are not equal, the JNE instruction jumps back to the top of the loop to continue iterating. Otherwise, the remaining code executes, which returns 0 from main() by zeroing EAX:

  40103f: xor  %eax,%eax
  401041: retq

To see how loop unrolling works, let's change the number of iterations the loop takes from 1000 to 4, as follows:

volatile int data[4];

int main(void)
{
    for (auto i = 0U; i < 4; i++) {
        data[i] = 42;
    }
}

As we can see, the code is identical except for the number of iterations the loop takes. In the resulting assembly, however, the CMP and JNE instructions are missing: the compiler has unrolled the loop entirely. In other words, the following code:

    for (auto i = 0U; i < 4; i++) {
        data[i] = 42;
    }

is compiled into the equivalent of the following code:

    data[0] = 42;
    data[1] = 42;
    data[2] = 42;
    data[3] = 42;

Interestingly, return 0; shows up in the assembly in between the assignments. This is allowed because the return value of the function is independent of the assignments (the assignment instructions never touch RAX), which provides the CPU with an additional optimization opportunity (it can execute return 0; in parallel, though out-of-order execution is a topic that is out of the scope of this book). It should be noted that loop unrolling doesn't require a small number of loop iterations. Some compilers will partially unroll a loop (for example, executing the loop body in groups of four instead of one at a time) to achieve the same optimization.

Our last example will look at pass-by-reference instead of pass-by-value. To start, recompile the code in debug mode:

> cmake -DCMAKE_BUILD_TYPE=Debug .
> make

Let's look at the following example:

struct mydata {
    int data[100];
};

void foo(mydata d)
{
    (void) d;
}

int main(void)
{
    mydata d;
    foo(d);
}

In this example, we've created a large structure and passed it by value to a function named foo() from our main() function. The important instructions in the resulting assembly for main() are as follows:

  401137: rep movsq %ds:(%rsi),%es:(%rdi)
  40113a: callq 401106 <_Z3foo6mydata>

The preceding instructions copy the large structure onto the stack and then call our foo() function. The copy occurs because the structure is passed by value, which means the compiler must duplicate it for the callee. As a side note, if you would like to see function names in a readable format rather than their mangled form, add -C to the objdump options.

Finally, let's pass-by-reference to see the resulting improvement:

struct mydata {
    int data[100];
};

void foo(mydata &d)
{
    (void) d;
}

int main(void)
{
    mydata d;
    foo(d);
}

As we can see, we now pass the structure by reference instead of by value, so the expensive copy (the rep movsq instruction) is no longer present in the resulting assembly.

Here, there is far less code, resulting in a faster executable. As we have learned, examining the assembly the compiler produces can be an effective way to understand what your code actually does, as it provides insight into changes you can make to write more efficient C++.
