As engineers, we are constantly striving to perfect our craft: aspiring to architect flawless systems, diligently pursuing bug-free deliverables, and zealously sharpening our skillset on the proverbial whetstone of experience. As a product of these joint experiences, the relatively young discipline of software engineering continues to rapidly mature in fields such as validation and verification, software prototyping, and process modeling. However, one important aspect of software engineering is often overlooked: software optimization, that is, the craft of tuning software to efficiently utilize resources in such a way as to impact the user’s experience.

This definition yields three important insights. Firstly, because software optimization is the craft of tuning software, it is an ongoing data-driven process that requires constant feedback in the form of performance measurements and profiling. Secondly, all computing resources are limited and of varying scarcity, and thus their allocation must be carefully planned. Thirdly, optimizations must have an impact on the user experience. This impact can manifest itself in many different ways, such as improved battery life, the ability to handle increased datasets, or improved responsiveness.

Unfortunately, some developers look at performance analysis and optimization as arcane and unnecessary; a fossil of computing antiquity predating optimizing compilers and multi-core systems. So are these engineers correct in their assessment? Is performance a dated and irrelevant topic? The answer is a resounding, “No!” Before we dive into the methodologies and background for performance analysis, let’s first discuss why performance analysis is crucial to modern software engineering.

Performance Apologetic

There are two primary reasons why software optimization is necessary. The first reason is that software that performs well is often power-efficient software. This obviously is important to developers whose deliverables run on mobile devices, such as laptops, tablets, and phones, but it also is important to engineers of desktop and server products, whose power issues manifest themselves differently, typically in the form of thermal issues and high operating costs. For instance, data centers often, as part of their arrangements with the utility company, have minimum and maximum daily power consumption limits. This, combined with the fact that few engineers have the luxury of knowing or controlling where their users will install their software, means that every software engineer should care about writing power-efficient software. The second reason is that performance, and the related costs, have a powerful effect on usage patterns.

Performance Is Power Efficiency

It’s natural to think about power efficiency and performance as diametrically opposed concepts. For example, consumers tend to understand that while our server-class Intel® Xeon® processor line provides stellar performance, it won’t be as power-efficient in a tablet as our mobile Intel® Atom™ platform. Many users are familiar with the tunables, exposed by the drivers of hardware such as wireless and graphics cards, asking users to choose between power-efficient operation, maximum-performance operation, or a balance between the two. However, when dealing with software, this distinction doesn’t exist.

In the next few chapters, we’ll dive deeper into the details of the power management features available in Intel® processors. For now, it is enough to understand that there are two general ways the CPU saves power dynamically at runtime. The first technique, known as frequency scaling, temporarily reduces the operating frequency and voltage of the processor. For example, a CPU might have a maximum operating frequency of 1.6 GHz, and the ability to scale the frequency as low as 600 MHz during light loads.

The second technique, known as deep sleep, is where the processor is halted, that is, the CPU stops executing instructions. The amount of power saved depends on how deep a sleep the processor can enter, because in deeper sleep states various resources, such as the caches, can be powered off for additional savings. For instance, as the author sits here typing this chapter, the processor can enter a shallow sleep state in between keystrokes, waking up to process each keystroke interrupt, and can enter a deeper sleep state while the author is proofreading what he just typed.

In general, while frequency scaling can provide some power-savings, the largest savings come from the deep sleep states. Because of this, writing power-efficient software revolves around keeping the CPU asleep, and therefore not executing instructions, for as long as possible. Therefore if work needs to be done, it needs to be done as quickly as possible, so the CPU can return to its slumber. This concept is known as “race to idle.” Harking back to the earlier definition of software optimization where it was asserted that tuning is a data-driven process requiring constant feedback, it is illuminating to note that while “race to idle” is a good rule of thumb for the majority of cases, there may be cases where it provides suboptimal power-savings. This is why measurement is so important.
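The arithmetic behind “race to idle” can be sketched with a deliberately simplified energy model. The C fragment below is only an illustration: the function names are hypothetical, every wattage and duration is a made-up figure rather than a measurement of any real processor, and the model assumes constant power in each phase and no cost for entering or leaving sleep.

```c
#include <assert.h>

/*
 * Energy, in joules, consumed over a fixed window in which the CPU is
 * active for active_s seconds and asleep for the rest of the window.
 * Simplified model (an assumption): constant power in each phase and
 * no cost for entering or leaving the sleep state.
 */
static double window_energy_joules(double active_watts, double active_s,
                                   double sleep_watts, double window_s)
{
    assert(active_s >= 0.0 && active_s <= window_s);
    return active_watts * active_s + sleep_watts * (window_s - active_s);
}

/* Hypothetical scenario: the same work completed within a 10 s window. */
static double race_to_idle_joules(void)
{
    /* Full speed: 10 W for 2 s, then 8 s of deep sleep at 0.5 W. */
    return window_energy_joules(10.0, 2.0, 0.5, 10.0);
}

static double slow_and_steady_joules(void)
{
    /* Reduced frequency: 4 W for 6 s, then 4 s of deep sleep at 0.5 W. */
    return window_energy_joules(4.0, 6.0, 0.5, 10.0);
}
```

With these invented figures the fast run costs 24 J and the slow run 26 J, so racing to idle wins. Lowering the slow run's active power to 3 W would bring it down to 20 J and reverse the outcome, which is exactly why the rule of thumb has to be checked by measurement on the real platform.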

Both of these power-saving techniques, frequency scaling and deep sleep, are handled transparently by the operating system, so user space applications don’t need to explicitly worry about requesting them. Implicitly however, applications do need to be conscious of them, so they can align their behavior in such a way as to allow the operating system to fully utilize them. For instance, each deep sleep state has an entry and exit latency, with the deeper sleep states having higher latencies than the lighter sleep states. The operating system has to deduce how long the processor can afford to sleep before its next scheduled task, and then pick the deepest sleep state that meets that deadline.
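The selection step described above can be sketched in a few lines of C. This is not how any particular operating system implements it; the structure, the function name, and the latency values are all hypothetical, and the sketch assumes that a state “meets the deadline” when its combined entry and exit latency fits inside the predicted idle time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one sleep state. */
struct idle_state {
    const char *name;
    unsigned entry_latency_us; /* time needed to enter the state */
    unsigned exit_latency_us;  /* time needed to wake up again */
};

/* Illustrative values only, ordered from shallowest to deepest. */
static const struct idle_state example_states[] = {
    { "shallow",   1,   1 },
    { "medium",   20,  50 },
    { "deep",    200, 300 },
};

/*
 * Return the index of the deepest state whose entry plus exit latency
 * fits within the time left before the next scheduled task. If no state
 * fits, fall back to the shallowest one (index 0).
 */
static size_t pick_idle_state(const struct idle_state *states, size_t count,
                              unsigned idle_us)
{
    assert(states != NULL && count > 0);

    size_t best = 0;
    for (size_t i = 0; i < count; i++) {
        unsigned cost = states[i].entry_latency_us + states[i].exit_latency_us;
        if (cost <= idle_us)
            best = i; /* deeper states come later, so keep the last fit */
    }
    return best;
}
```

With the example table, a predicted idle period of 100 microseconds selects the medium state, while 1000 microseconds is long enough for the deep state. An application that wakes the processor frequently shrinks the predicted idle period and, in this model, keeps the selection stuck at the shallow end.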

From this it follows that the hardware can only be as power-efficient as its least efficient software component. The harsh reality is that a power-efficient hardware platform and a finely optimized software stack are for naught if the user downloads a poorly, or maliciously, written application that prevents the hardware from taking advantage of its power-saving features.

Performance and Usage Patterns

The ultimate purpose of computing is to increase the productivity of its user, and thus users, consciously or unconsciously, evaluate software features on whether the benefits to their productivity outweigh the associated costs.

To illustrate this point, consider the task of searching for an email in a mail client. Often the user is presented with two methods to complete this task; either manually searching through a list sorted by some criteria, or utilizing a keyword search capable of querying metadata. So how does a user decide which method to use for a specific query? Avoiding a discussion of human-computer interaction (HCI) and cognitive psychology, the author submits that the user compares two perceptions.

The first perception is an estimate of the duration required for performing the manual search. For instance, perhaps the emails are sorted by the date received, and the user remembers that the email was received very recently, and thus perceives a short duration, since the email should be near the top of the list. Or instead, perhaps the user was asked to scrounge up a long forgotten email from months ago, and the user has no idea where in the list this email would appear, and thus perceives a long duration. The second perception is an estimate of the duration required for utilizing the search function. This second perception might consist of prior search performance and accuracy, as well as other considerations, such as the user-interface performance and design.

The user will compare these two perceptions and choose the one with the shorter perceived duration. Therefore, as the perceived duration for the keyword search increases, its perceived value to the user approaches zero. On the other hand, a streamlined and heavily optimized search might completely displace the manual search.

This concept scales from individual software features all the way up to entire computing devices. For instance, consider a user fleetingly interested in researching a piece of trivia. If the user’s computer is powered off and takes ten minutes to boot, the user will probably not bother utilizing it for such a small benefit. On the other hand, if the user’s computer boots in five seconds, the user will be more likely to use it.

It is important to note that these decisions are based on the user’s perceptions. For instance, a mail client might have a heavily optimized search algorithm, but the user still might perceive the search as slow if the user-interface toolkit is slow to update with the query results.

A Word on Premature Optimization

So now that we’ve discussed why performance is important, let’s address a common question about when performance analysis and optimization is appropriate. If a software engineer spends enough time writing software with others, the engineer will eventually hear, or perhaps remark to a colleague, the widely misused quote from Donald Knuth, “premature optimization is the root of all evil” (Knuth, 1974).

In order to understand this quote, it is necessary to first establish its context. The original source is a 1974 publication entitled “Structured Programming with Goto Statements” from the ACM’s Computing Surveys journal. In one of the code examples, Knuth notes a performance optimization, of which he says, “In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal …” (Knuth, 1974). So clearly Knuth didn’t mean, as some have misconstrued him to, that all optimizations are bad, only premature optimizations. This then raises the question of what constitutes a premature optimization. Knuth continues, “It is often a mistake to make a priori judgements about what parts of a program are really critical …” (Knuth, 1974). In other words, a premature optimization is one made without first analyzing the program to determine what optimizations are impactful.

These words might ring truer today than they did back in 1974. Modern systems are significantly more complex, with advances in optimizing compilers, stacked caches, hardware prefetching, out-of-order execution, and countless other technological innovations widening the gap between the code we write and how that code executes and, in turn, performs. As such, we as engineers must heed Knuth’s warning and ensure that our optimizations target the critical code.

Since an optimization requires analysis, does that mean that all optimization should wait until the software is completely written? Obviously not, because by that time it will be too late to undo any inefficiencies designed into the core architecture.

In the early stages of architecture and design, performance considerations should revolve around the data structures utilized and the subsequent algorithms that accompany them. This is an area where Big-O analysis is important, and there should be a general understanding of what algorithms will be computationally challenging. For those not familiar with Big-O analysis, it is a method for categorizing, and comparing, algorithms based on the order of growth of their computational costs in relation to their best, average, and worst cases of input. Early performance measurements can also be obtained from models and prototypes.

As architecture manifests as code, frequent performance profiling should occur to verify that these models are accurate. Big-O analysis alone is not sufficient, because it hides constant factors and hardware effects such as cache behavior. Consider the canonical case of quicksort versus mergesort and heapsort. Although mergesort and heapsort are linearithmic, O(n log n), even in the worst case, they are often outperformed in practice by quicksort, whose worst case is quadratic, O(n²).
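Verifying a model against real code starts with a way to time it. The following C fragment is a minimal sketch of such a harness, not a substitute for a real profiler: the names `sort_fn`, `sort_with_qsort`, and `time_sort` are hypothetical, the standard `clock()` function measures CPU time at coarse resolution, and the C library's `qsort` (whose underlying algorithm is implementation-defined) stands in as the routine under test.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any sorting routine with this shape can be plugged into the harness. */
typedef void (*sort_fn)(int *data, size_t n);

static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y); /* avoids the overflow of x - y */
}

/* Example subject: the C library's qsort. */
static void sort_with_qsort(int *data, size_t n)
{
    qsort(data, n, sizeof(data[0]), compare_ints);
}

static int is_sorted(const int *data, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (data[i - 1] > data[i])
            return 0;
    return 1;
}

/*
 * Run `sort` on a private copy of `input` and return the CPU time it
 * consumed, in seconds. Returns a negative value if the copy cannot be
 * allocated or the result is not sorted, so that a broken routine is
 * never reported as a fast one.
 */
static double time_sort(sort_fn sort, const int *input, size_t n)
{
    assert(sort != NULL && input != NULL && n > 0);

    int *copy = malloc(n * sizeof(copy[0]));
    if (copy == NULL)
        return -1.0;
    memcpy(copy, input, n * sizeof(copy[0]));

    clock_t start = clock();
    sort(copy, n);
    clock_t end = clock();

    int ok = is_sorted(copy, n);
    free(copy);
    return ok ? (double)(end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Passing a quicksort, mergesort, and heapsort implementation through `time_sort` with the same inputs, at realistic sizes and over repeated runs, is one way to see whether the Big-O prediction matches the behavior on a given machine.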

The Roadmap

So hopefully, you now have a basic understanding of why performance analysis and optimization is important, and when it is appropriate. The rest of this book focuses on the how and the where. The following content is divided into three parts. The first part provides the necessary background information to get you started, focusing on Intel® Architecture and the interactions between the hardware and the Linux software stack. The second part begins by covering performance analysis methodologies. Then it provides instructions on utilizing the most popular performance profiling tools, such as Intel® VTune™ Amplifier XE and perf. The third part outlines some of the various common performance problems, identified by the tools in Part 2, and then provides details on how to correct them. Since each performance situation is different, this part is far from comprehensive, but is designed to give the reader a good starting point.

Throughout this book, AT&T syntax is used for assembly instructions. Hexadecimal numbers are prefixed with 0x, binary numbers are written with a subscript two, such as 0101₂, and all other numbers are in base ten.

