Summarizing performance

We have defined some of the main concepts around performance, namely latency and throughput, but we still lack a concrete way to quantify our measurements. To continue with our example of cars driving down a highway, we want to find a way to answer the question, "How long should I expect a drive from point A to point B to take?" The first step is to measure our trip time on multiple occasions to collect empirical data.

The following table catalogs our observations. We still need a way to interpret these data points and summarize our measurements to give an answer:

Observed trip    Travel time in minutes
Trip 1           28
Trip 2           37
Trip 3           17
Trip 4           38
Trip 5           18

The problem with averages

A common mistake is to rely on averages to measure the performance of a system. An arithmetic average is fairly easy to calculate: it is the sum of all collected values divided by the number of values. Using the previous sample of data points, we can infer that on average we should expect a drive of approximately 27.6 minutes. With this simple example, it is easy to see what makes the average such a poor choice. Out of our five observations, only Trip 1 is close to our average, while all the other trips are quite different. The fundamental problem with averages is that the average is a lossy summary statistic. Information is lost when moving from a series of observations to the average because it is impossible to retain all the characteristics of the original observations in a single data point.
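The calculation above can be sketched in a few lines; this is a minimal illustration using the trip times from the table, not code from the book:

```python
# Trip times in minutes, taken from the table of observations above.
trip_times = [28, 37, 17, 38, 18]

# The arithmetic average: sum of all values divided by the number of values.
average = sum(trip_times) / len(trip_times)
print(average)  # 27.6 -- yet four of the five trips are nowhere near this value
```

A single number summarizes five very different trips, which is exactly the information loss described above.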

To illustrate how an average loses information, consider the three following datasets that represent the measured latency required to process a request to a web service:

(Figure: latencies of the first dataset)

In the first dataset, there are four requests that take between 280 ms and 305 ms to complete. Compare these latencies with the latencies in the second dataset, as follows:

(Figure: latencies of the second dataset)

The second dataset shows a more volatile mixture of latencies. Would you prefer to deploy the first or the second service into your production environment? To add more variety into the mix, a third dataset is shown, as follows:

(Figure: latencies of the third dataset)

Although each of these datasets has a vastly different distribution, the averages are all the same: 292 ms! Imagine having to maintain the web service that is represented by dataset 3 with the goal of ensuring that 75% of clients receive a response in less than 300 ms. Calculating the average of dataset 3 will give you the impression that you are meeting your objective, while in reality only half of your clients actually experience a fast enough response (the requests with IDs 1 and 2).
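The figures are not reproduced here, so the following sketch uses hypothetical latency values invented only to match the stated facts: four requests per dataset, with all three datasets averaging 292 ms:

```python
from statistics import mean

# Hypothetical datasets (the book's figures are not reproduced here).
# The values are invented so that each set of four latencies averages 292 ms.
dataset_1 = [280, 290, 293, 305]   # tightly clustered latencies
dataset_2 = [10, 200, 358, 600]    # a volatile mixture
dataset_3 = [180, 190, 394, 404]   # half fast, half slow

for dataset in (dataset_1, dataset_2, dataset_3):
    print(mean(dataset))  # 292 for every dataset, despite the different shapes
```

Three very different distributions collapse into the same average, which is why the average alone cannot tell you whether your latency objective is being met.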

Percentiles to the rescue

The key term in the previous discussion is "distribution." Measuring the distribution of a system's performance is the most robust way to ensure that you understand the behavior of the system. Since an average fails to capture the distribution of our measurements, we need to find a different tool. In the field of statistics, a percentile meets our criteria for interpreting the distribution of observations. A percentile is a measurement indicating the value at or below which a given percentage of the observations in a group of observations falls. Let's make this definition more concrete with an example. Going back to our web service example, imagine that we observe the following latencies:

Request       Latency in milliseconds
Request 1     10
Request 2     12
Request 3     13
Request 4     13
Request 5     9
Request 6     27
Request 7     12
Request 8     7
Request 9     75
Request 10    80

The 20th percentile is defined as the observed value at or below which 20% of all the observations fall. As there are ten observed values, we want to find the value that covers two observations. In this example, the 20th percentile latency is 9 ms because two values, 7 ms and 9 ms (that is, 20% of the total observations), are less than or equal to 9 ms. Contrast this latency with the 90th percentile, the value covering 90% of the observations: 75 ms, as nine observations out of ten are less than or equal to 75 ms.
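The computation described above can be sketched with the nearest-rank method; the helper name `percentile` is ours, not something defined in the text:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p% of the observations are less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted data
    return ordered[rank - 1]

# Latencies in milliseconds, taken from the table of requests above.
latencies = [10, 12, 13, 13, 9, 27, 12, 7, 75, 80]

print(percentile(latencies, 20))  # 9  -> two observations (7, 9) fall at or below it
print(percentile(latencies, 90))  # 75 -> nine of the ten observations fall at or below it
```

Sorting the observations first makes the definition mechanical: the pth percentile is simply the value sitting at the rank that covers p% of the sorted data.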

Where the average hides the distribution of our measurements, the percentile provides us with deeper insight and highlights that tail-end observations (the observations near the 100th percentile) experience extreme latencies.

If you remember the beginning of this section, we were trying to answer the question, "How long should I expect a drive from point A to point B to take?" After spending some time exploring the tools available, we realized that the original question is not the one we are actually interested in. A more pragmatic question is, "How long do 90% of the cars take to go from point A to point B?"
