Chapter 3. Hypothesize about Steady State

For any complex system, there are going to be many moving parts, many signals, and many forms of output. We need to distinguish in a very general way between systemic behaviors that are acceptable and behaviors that are undesirable. We can refer to the normal operation of the system as its steady state.

If you are developing or operating a software service, how do you know if it is working? How do you recognize its steady state? Where would you look to answer that question?

If your service is young, perhaps the only way you know that everything is working is if you try to use it yourself. If your service is accessible through a website, you might check by browsing to the site and trying to perform a task or transaction.

This approach to checking system health quickly reveals itself to be suboptimal: it’s labor-intensive, which means we’re less likely to do it. We can automate these kinds of tests, but that’s not enough. What if the test we’ve automated doesn’t reveal the problem we’re looking for?

A better approach is to collect data that provide information about the health of the system. If you’re reading this book, we suspect you’ve already instrumented your service with some kind of metrics collection system. There are a slew of both open-source and commercial tools that can collect all sorts of data on different aspects of your system: CPU load, memory utilization, network I/O, and all kinds of timing information, such as how long it takes to service web requests, or how much time is spent in various database queries.
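
If you haven’t instrumented anything yet, the following is a minimal sketch of what timing instrumentation can look like. The `timed` decorator and the in-memory `_timings` store are illustrative stand-ins, not a real metrics library; in practice you would report these values to whatever collection system you use.

```python
import time
from collections import defaultdict

# In-memory store standing in for whatever metrics backend you actually use.
_timings = defaultdict(list)

def timed(metric_name):
    """Record how long the wrapped function takes, in milliseconds."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                _timings[metric_name].append((time.monotonic() - start) * 1000)
        return wrapper
    return decorator

@timed("db.user_lookup_ms")
def lookup_user(user_id):
    time.sleep(0.01)  # stand-in for a real database query
    return {"id": user_id}

lookup_user(42)
print(_timings["db.user_lookup_ms"])  # e.g. [10.3]
```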

System metrics can be useful to help troubleshoot performance problems and, in some cases, functional bugs. Contrast that with business metrics. It’s the business metrics that allow us to answer questions like:

  • Are we losing customers?

  • Are the customers able to perform critical site functions like checking out or adding to their cart on an e-commerce site?

  • Are the customers experiencing so much latency that they will give up and stop using the service?

For some organizations, there are clear real-time metrics that are tied directly to revenue. For example, companies like Amazon and eBay can track sales, and companies like Google and Facebook can track ad impressions.

Because Netflix uses a monthly subscription model, we don’t have these kinds of metrics. We do measure the rate of signups, which is an important metric, but signup rate alone isn’t a great indicator of overall system health.

What we really want is a metric that captures satisfaction of currently active users, since satisfied users are more likely to maintain their subscriptions. If people who are currently interacting with the Netflix service are satisfied, then we have confidence that the system is healthy.

Unfortunately, we don’t have a direct, real-time measure of customer satisfaction. We do track the volume of calls to customer service, which is a good proxy of customer dissatisfaction, but for operational purposes, we want faster and more fine-grained feedback than that. A good real-time proxy for customer satisfaction at Netflix is the rate at which customers hit the play button on their video streaming device. We call this metric video-stream starts per second, or SPS for short.

SPS is straightforward to measure and is strongly correlated with user satisfaction, since ostensibly watching video is the reason why people pay to subscribe to the service. For example, the metric is predictably higher on the East Coast at 6pm than it is at 6am. We can therefore define the steady state of our system in terms of this metric.
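
Conceptually, a metric like SPS is just an event count bucketed by time. The sketch below is purely illustrative (the event shape and names are invented, not Netflix’s actual pipeline): it aggregates raw play-start events into a per-second series.

```python
from collections import Counter

# Invented play-start events: (epoch_second, device_id) pairs emitted each
# time a customer hits the play button.
play_events = [
    (1_700_000_000, "tv-123"),
    (1_700_000_000, "phone-456"),
    (1_700_000_001, "tv-789"),
]

def starts_per_second(events):
    """Aggregate raw play-start events into an SPS-style time series."""
    counts = Counter(second for second, _device in events)
    return dict(sorted(counts.items()))

print(starts_per_second(play_events))  # {1700000000: 2, 1700000001: 1}
```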

Netflix site reliability engineers (SREs) are more interested in a drop in SPS than an increase in CPU utilization in any particular service: it’s the SPS drop that will trigger the alert that pages them. The CPU utilization spike might be important, or it might not. Business metrics like SPS describe the boundary of the system. This is where we care about verification, as opposed to the internals like CPU utilization.

It’s typically more difficult to instrument your system to capture business metrics than to capture system metrics, since many existing data collection frameworks already collect a large number of system metrics out of the box. However, it’s worth putting in the effort to capture the business metrics, since they are the best proxies for the true health of your system.

You also want these metrics to be relatively low latency: a business metric that is only computed at the end of the month tells you nothing about the health of your system today.

For any metric you choose, you’ll need to balance:

  • the relationship between the metric and the underlying construct;

  • the engineering effort required to collect the data; and

  • the latency between the metric and the ongoing behavior of the system.

If you don’t have access to a metric directly tied to the business, you can take advantage of system metrics, such as system throughput, error rate, or 99th percentile latency. The stronger the relationship between the metric you choose and the business outcome you care about, the stronger the signal you have for making actionable decisions. Think of metrics as the vital signs of your system. Note also that client-side verification of a service, which produces alerts from the client’s perspective, can complement server-side metrics and give a more accurate picture of the user experience at a given time.
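
These system-level proxies are cheap to compute from raw request data. Here is a small, self-contained sketch; the sample latencies and status codes are invented for illustration.

```python
import math

def error_rate(status_codes):
    """Fraction of requests that returned a server error (5xx)."""
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)

def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=99 for 99th percentile latency."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)) - 1, 0)
    return ordered[rank]

latencies_ms = [12, 15, 14, 250, 13, 16, 15, 900, 14, 13]
status_codes = [200, 200, 500, 200, 200, 200, 503, 200, 200, 200]

print(f"error rate: {error_rate(status_codes):.0%}")      # 20%
print(f"p99 latency: {percentile(latencies_ms, 99)} ms")  # 900 ms
```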

Characterizing Steady State

As with human vital signs, you need to know what range of values is “healthy.” For example, we know that a thermometer reading of 98.6 degrees Fahrenheit is a healthy value for human body temperature.

Remember our goal: to develop a model that characterizes the steady state of the system based on expected values of the business metrics.

Unfortunately, most business metrics aren’t as stable as human body temperature; instead, they may fluctuate significantly. To take another example from medicine, an electrocardiogram (ECG) measures voltage differences on the surface of a human body near the heart. The purpose of this signal is to observe the behavior of the heart.

Because the signal captured by an ECG varies as the heart beats, a doctor cannot compare the ECG to a single threshold to determine if a patient is healthy. Instead, the doctor must determine whether the signal is varying over time in a pattern that is consistent with a healthy patient.

At Netflix, SPS is not a stable metric like human body temperature. Instead, it varies over time. Figure 3-1 shows a plot of SPS versus time. Note how the metric is periodic: it increases and decreases over time, but in a consistent way. This is because people tend to prefer watching television shows and movies around dinner time.

Because SPS varies predictably with time, we can look at the SPS metric from a week ago as a model of steady state behavior. And, indeed, when SREs at Netflix look at SPS plots, they invariably plot last week’s data on top of the current data so they can spot discrepancies. The plot shown in Figure 3-1 shows the current week in red and the previous week in black.

Figure 3-1. SPS varies regularly over time
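
One simple way to encode this week-over-week model is to treat the value observed at the same time one week earlier as the expected steady state value. The sketch below assumes an invented per-minute SPS series; the timestamps and numbers are illustrative only.

```python
from datetime import datetime, timedelta

# Invented per-minute SPS samples keyed by timestamp.
sps = {
    datetime(2023, 11, 6, 18, 0): 1002,   # last Monday, 6pm
    datetime(2023, 11, 13, 18, 0): 985,   # this Monday, 6pm
}

def baseline_for(timestamp, series, offset=timedelta(days=7)):
    """Use the value observed one week earlier as the steady state model."""
    return series.get(timestamp - offset)

now = datetime(2023, 11, 13, 18, 0)
print(sps[now], "vs. expected", baseline_for(now, sps))  # 985 vs. expected 1002
```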

Depending on your domain, your metrics might vary less predictably with time. For example, if you run a news website, the traffic may be punctuated by spikes when a news event of great general public interest occurs. In some cases, the spike may be predictable (e.g., election, sporting event), and in others it may be impossible to predict in advance. In these types of cases, characterizing the steady state behavior of the system will be more complex. Either way, characterizing your steady state behavior is a necessary precondition of creating a meaningful hypothesis about it.

Forming Hypotheses

Whenever you run a chaos experiment, you should have a hypothesis in mind about what you believe the outcome of the experiment will be. It can be tempting to subject your system to different events (for example, increasing amounts of traffic) just to “see what happens.” However, without a prior hypothesis in mind, it can be difficult to draw conclusions from the data, because you won’t know what to look for.

Once you have your metrics and an understanding of their steady state behavior, you can use them to define the hypotheses for your experiment. Think about how the steady state behavior will change when you inject different types of events into your system. If you add requests to a mid-tier service, will the steady state be disrupted or stay the same? If disrupted, do you expect the system output to increase or decrease?

At Netflix, we apply Chaos Engineering to improve system resiliency. Therefore, the hypotheses in our experiments are usually in the form of “the events we are injecting into the system will not cause the system’s behavior to change from steady state.”

For example, we do resiliency experiments where we deliberately cause a noncritical service to fail in order to verify that the system degrades gracefully. We might fail a service that generates the personalized list of movies that are shown to the user, which is determined based on their viewing history. When this service fails, the system should return a default (i.e., nonpersonalized) list of movies.
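
The graceful-degradation pattern being verified looks roughly like the following sketch. The function names and the default list are invented for illustration; Netflix’s actual fallback logic lives in its services, not in a snippet like this.

```python
DEFAULT_MOVIE_LIST = ["Popular Title A", "Popular Title B", "Popular Title C"]

def personalized_list(user_id):
    """Stand-in for the personalization service call; here it always fails."""
    raise TimeoutError("personalization service unavailable")

def movie_list_for(user_id):
    """Degrade gracefully: fall back to a nonpersonalized list on failure."""
    try:
        return personalized_list(user_id)
    except Exception:
        return DEFAULT_MOVIE_LIST

print(movie_list_for(42))  # -> the default list, not an error
```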

Whenever we perform experiments in which we fail a noncritical service, our hypothesis is that the injected failure will have no impact on SPS. In other words, our hypothesis is that the experimental treatment will not cause the system behavior to deviate from steady state.

We also regularly run exercises where we redirect incoming traffic from one AWS geographical region to two of the other regions where we run our services. The purpose of these exercises is to verify that SPS behavior does not deviate from steady state when we perform a failover. This gives us confidence that our failover mechanism is working correctly, should we need to perform a failover due to a regional outage.

Finally, think about how you will measure the change in steady state behavior. Even when you have your model of steady state behavior, you need to define how you are going to measure deviations from this model. Identifying reasonable thresholds for deviation from normal can be challenging, as you probably know if you’ve ever spent time tuning an alerting system. Think about how much deviation you would consider “normal” so that you have a well-defined test for your hypothesis.
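
A hypothesis of “no deviation from steady state” can be made concrete as a tolerance band around the baseline. The sketch below uses an invented 5% band; the right threshold depends entirely on the normal variation of your own metric.

```python
def hypothesis_holds(observed, baseline, tolerance=0.05):
    """True if every observed value stays within `tolerance` of the baseline.

    The 5% band is purely illustrative; pick a threshold that matches the
    normal variation of your own metric.
    """
    return all(abs(o - b) / b <= tolerance for o, b in zip(observed, baseline))

# SPS measured during the experiment vs. the steady state model
# (e.g., last week's values for the same time window).
print(hypothesis_holds([990, 1003, 997], [1000, 1000, 1000]))  # True
print(hypothesis_holds([990, 700, 997], [1000, 1000, 1000]))   # False
```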
