Chapter 5. Run Experiments in Production

In our field, the idea of doing software verification in a production environment is generally met with derision. “We’ll test it in prod” is a form of gallows humor, which translates to “we aren’t going to bother verifying this code properly before we deploy it.”

A commonly held tenet of classical testing is that it’s better to identify bugs as far away from production as possible. For example, it’s better to identify a bug in a unit test than in an integration test. The reasoning is that the farther away you are from a full deployment in the production environment, the easier it will be to identify the reason for the bug and fix it. If you’ve ever had to debug a failed unit test, a failed integration test, and a bug that manifested only in production, the wisdom in this approach is self-evident.

When it comes to Chaos Engineering, the strategy is reversed: you want to run your experiments as close to the production environment as possible. The ideal implementation runs all experiments directly in the production environment.

When we do traditional software testing, we’re verifying code correctness. We have a good sense of how functions and methods are supposed to behave, and we write tests to verify the behavior of these components.

When we run Chaos Engineering experiments, we are interested in the behavior of the overall system. The code is an important part of the system, but there’s a lot more to our system than just code. In particular, state, input, and other people’s systems lead to all sorts of system behaviors that are difficult to foresee.

The following sections explore why running in the production environment is vital to Chaos Engineering. We want to build confidence in the system in production, and to do that we need to explore the systemic effects there. Otherwise, we are just building confidence in a system other than the one we care about, which diminishes the value of the exercise.

State and Services

Previously, we spoke about “system state.” In this section, we’ll delve into stateful services. Software engineering would be a lot simpler if our systems did not have to maintain any state. Alas, state is a fact of life in the kinds of systems that we build.

In a microservices architecture, when we talk about “state” we are often referring to “stateful services,” such as databases. Systems where the database contains only test fixtures don’t behave the same as systems where the database contains the complete set of production data. Other stateful services include caches, object storage, and durable message queues.

Configuration data is another form of state that affects the way the system behaves. Whether you use static configuration files, a dynamic configuration service like etcd, or (as we do at Netflix) a combination of the two, this configuration information is a form of state that can significantly affect your system behavior.
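
To make the point concrete, here is a rough sketch in Python of configuration as state: dynamic overrides (standing in for a configuration service) layered over static defaults (standing in for a checked-in file). The names DEFAULTS, apply_override, and get_property are hypothetical, not any particular library’s API.

    import threading

    # Static defaults, analogous to a checked-in configuration file.
    DEFAULTS = {
        "retry.max_attempts": 3,
        "timeout.read_ms": 500,
    }

    # Dynamic overrides, analogous to properties pushed at runtime by a
    # configuration service; they can change while the process is running.
    _dynamic_overrides = {}
    _lock = threading.Lock()

    def apply_override(key, value):
        # Simulate a runtime configuration push.
        with _lock:
            _dynamic_overrides[key] = value

    def get_property(key):
        # A dynamic value, if present, wins over the static default.
        with _lock:
            return _dynamic_overrides.get(key, DEFAULTS[key])

    if __name__ == "__main__":
        print(get_property("timeout.read_ms"))   # 500, from the static default
        apply_override("timeout.read_ms", 2000)  # state changes at runtime
        print(get_property("timeout.read_ms"))   # 2000, behavior now differs

The same key can return different values at different points in the process’s lifetime, so the system’s behavior depends on configuration state that a static test fixture won’t capture.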

Even in “stateless” services, there is still state in the form of in-memory data structures that persist across requests and can therefore affect the behavior of subsequent requests.
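
Here is a minimal, hypothetical Python handler that illustrates the point; it is not tied to any framework, but it shows how “stateless” request handling still carries state from one request to the next.

    # A module-level cache that survives between requests for the life of the process.
    _cache = {}

    def expensive_lookup(user_id):
        # Stand-in for a call to a downstream service or database.
        return {"user": user_id, "recommendations": ["title-a", "title-b"]}

    def handle_request(user_id):
        # The first request for a user misses the cache and takes the slow path;
        # later requests are served from in-process state and behave differently
        # (faster, and potentially stale).
        if user_id not in _cache:
            _cache[user_id] = expensive_lookup(user_id)
        return _cache[user_id]

    if __name__ == "__main__":
        handle_request(42)  # slow path, populates the cache
        handle_request(42)  # served entirely from in-memory state

The first and second requests for the same user take different code paths, so the service’s behavior depends on which requests it has already seen.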

There’s plenty of state lurking in other places. When running in a cloud environment, the number of virtual machines or containers in an autoscaling group is a form of system state that changes over time as the external demand for different clusters changes. Network hardware such as switches and routers also contain state.

Eventually, some unexpected state is going to bite you. If you’re in the target audience for this book, you probably already have a few scars. In order to catch the threats to resiliency that Chaos Engineering is interested in, you need to expose experiments to the same state problems that exist in the production environment.

Input in Production

One of the most difficult lessons for a software engineer to learn is that the users of a system never seem to interact with it in the way that you expect them to. This lesson is typically learned in the context of designing the user interface of a system, but it’s an important lesson to keep in mind for designing Chaos Engineering experiments.

Assuming your system implements a service, it will receive many different types of requests from your users. You can try to build a synthetic model of user input, but because users never behave the way we expect, the production system will invariably be subject to inputs that don’t show up in your tests with synthetic data.

The only way to truly build confidence in the system at hand is to experiment with the actual input received by the production environment.

Other People’s Systems

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.

Leslie Lamport

Even if we could foresee all of the states of the system under our own control, we invariably depend on external systems whose behavior we cannot possibly hope to know. Christmas Eve 2012 is seared into the collective memory of Netflix: problems with Amazon’s Elastic Load Balancing (ELB) service in a single geographic region resulted in a significant outage of the Netflix service.

If your system is deployed in a cloud environment such as AWS or Azure, the presence of external systems that you depend on but cannot fully understand is obvious. But even if you run your system in your own datacenter, there are still external services that you will find yourself depending on in production, such as DNS, SMTP, and NTP. Even if you deploy these services yourself, they often must communicate with external services over which you have no control.

If you provide a web interface to your service, then your users’ web browsers form part of the system that you have no control over. Even if you have full control over the client, such as an IoT device, you are still ultimately at the mercy of the networking environment that your user is connecting from.

The behavior of other people’s systems will always differ between production and synthetic environments. This reinforces the fact that you want to run experiments in production, the only place where you will have an authentic interaction with those other systems.

Agents Making Changes

At Netflix, our system is constantly undergoing change. Every day, engineers and automated scripts are modifying the system in different ways, such as pushing out new code, changing dynamic properties, and adding new persistent data.

If we extend our concept of the system to include these agents making modifications, it becomes apparent how difficult it is to replicate the system behavior in a test environment.

External Validity

When social scientists like psychologists or education researchers run experiments, one of their primary concerns is “external validity”: do the results of this experiment generalize to the phenomenon that we are really interested in, or is the effect that we measured a product of an environment specific to the experiment?

When you don’t run chaos experiments directly in production, the issues that we have discussed in this chapter (state, input, other people’s systems, agents making changes) are all potential threats to the external validity of the chaos experiment.

Poor Excuses for Not Practicing Chaos

We recognize that in some environments it may be difficult or even impossible to run experiments directly in a production environment. We don’t expect engineers to inject noise into the sensors of self-driving cars containing unsuspecting passengers! However, we suspect most users are not working on these kinds of safety-critical systems.

I’m pretty sure it will break!

If you are reluctant to run chaos experiments in production because you do not have high confidence that your system is resilient to the events you are going to inject, then that’s a big warning sign that your system isn’t mature enough to do Chaos Engineering experiments at all. You should go into an experiment with a reasonably high level of confidence that your system is resilient. One of the main purposes of Chaos Engineering is to identify weaknesses in your system. If you already believe the weaknesses are present, then you should be focusing on improving your system resiliency. Once you believe that the system is resilient, go ahead with chaos experiments.

If it does break, we’re in big trouble!

Even if you have high confidence that your system is resilient, you might be hesitant to run a Chaos Engineering experiment out of fear that the experiment will do too much harm if it does reveal a weakness.

This is a legitimate concern, and it is something we wrestle with as well. Our approach is to minimize the potential harm in two ways:

  • Make it easy to abort the experiment

  • Minimize the blast radius of the experiment

When running any Chaos Engineering experiment, you should have the equivalent of a big red button (ours is a literal, albeit virtual, big red button) that aborts the experiment. Even better is to augment this with an automated system that aborts the experiment when it detects a potentially harmful deviation from steady state.
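
As a minimal sketch of such an automated abort, assuming a hypothetical steady-state metric and a hypothetical stop_experiment hook (neither is a real tool’s API), a monitor loop might look something like this:

    import random
    import time

    BASELINE = 1000.0    # expected steady-state value, e.g., requests per second
    TOLERANCE = 0.05     # abort when the metric deviates more than 5% from baseline
    CHECK_INTERVAL = 10  # seconds between checks

    def get_steady_state_metric():
        # Placeholder: in a real setup this would query your monitoring system.
        # Here we simulate readings near the baseline so the sketch is runnable.
        return random.gauss(BASELINE, BASELINE * 0.02)

    def stop_experiment(reason):
        # Placeholder for the "big red button" hook that halts fault injection.
        print(f"Aborting experiment: {reason}")

    def monitor(max_checks=100):
        for _ in range(max_checks):
            observed = get_steady_state_metric()
            deviation = abs(observed - BASELINE) / BASELINE
            if deviation > TOLERANCE:
                stop_experiment(
                    f"steady-state deviation {deviation:.1%} exceeds {TOLERANCE:.0%}"
                )
                return
            time.sleep(CHECK_INTERVAL)

    if __name__ == "__main__":
        monitor()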

The second strategy involves designing the experiment to limit the potential harm to be as small as possible while still obtaining meaningful results. This is covered in more detail in Chapter 7.
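
One common way to bound the blast radius is to route only a small, deterministic slice of traffic into the experiment. The sketch below is illustrative only; the percentage, the bucketing scheme, and the handler names are assumptions rather than any specific tool’s behavior.

    import hashlib

    EXPERIMENT_PERCENT = 1.0  # expose roughly 1% of traffic to the injected fault

    def in_experiment(customer_id: str) -> bool:
        # Hash the customer ID so the same small set of customers stays in the
        # experiment group for its entire duration.
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        bucket = int(digest, 16) % 10000          # buckets 0..9999
        return bucket < EXPERIMENT_PERCENT * 100  # 1% -> buckets 0..99

    def handle_request(customer_id: str) -> str:
        if in_experiment(customer_id):
            return handle_with_injected_fault(customer_id)
        return handle_normally(customer_id)

    def handle_with_injected_fault(customer_id: str) -> str:
        return f"{customer_id}: treatment path (fault injected)"

    def handle_normally(customer_id: str) -> str:
        return f"{customer_id}: normal path"

    if __name__ == "__main__":
        in_group = sum(in_experiment(f"customer-{i}") for i in range(100_000))
        print(f"{in_group} of 100000 synthetic IDs fall in the experiment group")

Hashing the customer ID keeps the affected population small and stable, so any observed deviation can be attributed to the experiment and the potential harm stays contained.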

Get as Close as You Can

Even if you cannot run directly in production, the closer your experimental environment is to production, the fewer threats to external validity your experiment will have, and the more confidence you can have in the results.

Remember: it’s better to risk a small amount of harm in order to protect the system from suffering a significant outage in the future.
