Chapter 7. Minimize Blast Radius

On April 26, 1986, one of the worst nuclear accidents in human history occurred at the Chernobyl nuclear power plant in Ukraine. Ironically, the disaster was triggered by a resiliency exercise: an experimental attempt to verify a redundant power source for coolant pumps. While most of us don’t work on projects where the stakes are as high as a nuclear power plant coolant system, each chaos experiment has the potential to cause a production outage. The professional responsibility of the chaos engineer is to understand and mitigate production risks. A well-designed system for experimentation will prevent big production outages by causing only a few customers a small amount of pain.

Unfortunately, we have run experiments that were supposed to impact only a small percentage of users, but cascading failures spread the impact to far more users than intended. In these cases, we had to perform an emergency stop of the experiment. While we never want to cause such a situation, the ability to contain and halt the experiment prevented a much larger crisis. In many ways, our experiments are looking for the unknown and unforeseen repercussions of failure, so the trick is to shed light on these vulnerabilities without accidentally blowing everything up. We call this “minimizing the blast radius.”

The most confidence-inspiring and risky experiments impact all production traffic. Chaos experiments should take careful, measured risks that build upon each other. This escalation of scope ratchets up confidence in the system without causing unnecessary customer harm.

The lowest-risk experiments involve few users. To accomplish this, we inject failures that verify client-device functionality for a subset or small group of devices. These experiments are limited in the number of devices and user flows that they can affect. They are not a surrogate for production traffic, but they are a good early indicator. For example, if the website tests can’t succeed, it doesn’t make sense to impact real users.
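
To make this concrete, here is a minimal sketch of how fault injection might be gated to a small allowlist of test devices; the TEST_DEVICE_IDS set and the should_inject_fault hook are hypothetical names for this illustration, not part of any particular chaos tool.

```python
# A minimal sketch, assuming each incoming request carries a device identifier
# and that the fault-injection layer consults a predicate before acting.
# All names here are illustrative.

TEST_DEVICE_IDS = {"qa-device-001", "qa-device-002"}  # small, known set of test devices

def should_inject_fault(device_id: str) -> bool:
    """Only inject failures for requests from the designated test devices."""
    return device_id in TEST_DEVICE_IDS
```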

When the automated experiments succeed (or if the functionality under test is not covered by the device tests), the next step is to run small-scale diffuse experiments. An experiment of this style impacts a small percentage of traffic and allows the traffic to follow normal routing rules, so it ends up evenly distributed throughout the production servers. For this style of experiment, you need to filter your customer success metrics down to the users impacted by the experiment so that its effect is not lost in the noise. The advantage of a small-scale diffuse experiment is that it should not cross the thresholds that would open circuits, so you can verify your single-request fallbacks and timeouts. This verifies the system’s resiliency to transient errors.
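
As an illustration of that filtering step, the sketch below compares the success metric of the experiment cohort against a control cohort rather than looking at the global rate; the record format and the field names (cohort, success) are assumptions made for the example.

```python
# A minimal sketch, assuming requests are already tagged with an experiment
# cohort ("experiment" or "control") and that each record carries a success flag.

from collections import defaultdict

def success_rate_by_cohort(records):
    """Compute the customer success metric per cohort, so a small diffuse
    experiment is not lost in the overall noise."""
    totals = defaultdict(lambda: [0, 0])  # cohort -> [successes, requests]
    for r in records:
        totals[r["cohort"]][0] += 1 if r["success"] else 0
        totals[r["cohort"]][1] += 1
    return {cohort: successes / requests
            for cohort, (successes, requests) in totals.items() if requests}

# Example: compare the two cohorts side by side rather than the global rate.
rates = success_rate_by_cohort([
    {"cohort": "experiment", "success": True},
    {"cohort": "experiment", "success": False},
    {"cohort": "control", "success": True},
])
print(rates)  # {'experiment': 0.5, 'control': 1.0}
```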

The next step is to run small-scale concentrated experiments, overriding the routing of requests for all users in the experiment so that their traffic is directed to specific boxes. These boxes then experience a high concentration of failures and/or latency, allowing circuits to open and exposing hidden resource constraints. If there is a bad fallback or unexpected lock contention that causes the system to stop servicing requests, then only the users in the experiment will suffer. This simulates a large-scale failure in production while minimizing the negative impact, and it should provide a high level of confidence.
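
One way such a routing override could look is sketched below, assuming a stable user-to-cohort assignment and a fixed set of designated experiment instances; EXPERIMENT_INSTANCES, in_experiment, and pick_normally are hypothetical names rather than any real load balancer API.

```python
# A minimal sketch of a concentrated-routing override. All names are illustrative.

import hashlib

EXPERIMENT_INSTANCES = ["i-exp-01", "i-exp-02"]  # the boxes that absorb the failures

def _bucket(user_id: str, buckets: int) -> int:
    """Deterministic hash bucket so a user stays in the same cohort for the whole experiment."""
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % buckets

def in_experiment(user_id: str, percent: int = 1) -> bool:
    """Assign a small, fixed percentage of users to the experiment group."""
    return _bucket(user_id, 100) < percent

def pick_normally(healthy_instances):
    # Stand-in for the real load balancer's normal selection logic.
    return healthy_instances[0]

def route(user_id: str, healthy_instances):
    """Concentrate all experiment-group traffic onto the designated boxes;
    everyone else follows the normal routing rules."""
    if in_experiment(user_id):
        return EXPERIMENT_INSTANCES[_bucket(user_id, len(EXPERIMENT_INSTANCES))]
    return pick_normally(healthy_instances)
```

Because the cohort assignment is a deterministic hash, the same users stay in the experiment for its entire duration, which keeps the concentrated impact contained to one small group.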

The most risky and accurate experiment is large-scale without custom routing. At this level, the experiment will show up in your main dashboards and has the potential to impact users not in the experiment group through circuit breakers and shared resource constraints. However, nothing provides more certainty that your system can withstand a given failure scenario than subjecting all of your users to it in production.

In addition to ratcheting up the scope of the experiment, it is imperative to be able to abort in-process experiments when they cause too much pain. Sometimes the system is designed to fall back to a degraded mode that is only a minor annoyance for your customers, and that is fine; but when the system becomes unavailable or unusable for your customers, the experiment should be terminated immediately. This can be handled with the “big red button” as discussed earlier.
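
A minimal sketch of that kill-switch check follows, assuming the flag lives in some shared store that every fault-injection point consults before acting; the flag_store object, its get() method, and the flag name are illustrative assumptions.

```python
# A minimal sketch of a "big red button" check. Names are illustrative.

def maybe_inject_fault(flag_store, inject):
    """Check the kill switch before every injection; flipping the flag halts
    the experiment immediately across the fleet."""
    if flag_store.get("chaos.kill_switch", False):
        return  # experiment has been aborted centrally; do nothing
    inject()
```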

Automated termination is highly recommended, particularly if experiments are running continuously in accordance with the other advanced principles. Figuring out how to build a system that can monitor the metric of interest and unwind a chaos experiment in real time is entirely dependent on the unique construction of the system at hand, and is left as an exercise for the reader.
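
For what such a watchdog might look like in the simplest case, here is one possible shape, offered purely as an illustration; read_experiment_metric, read_control_metric, and abort_experiment are hypothetical stand-ins for your own monitoring and orchestration hooks.

```python
# One possible shape for an automated abort loop; the real mechanism depends
# entirely on how your system is built. All hooks passed in are hypothetical.

import time

def watch_and_abort(read_experiment_metric, read_control_metric,
                    abort_experiment, tolerance=0.05, interval_seconds=30):
    """Poll the customer success metric for the experiment cohort and unwind the
    experiment as soon as it falls too far below the control cohort."""
    while True:
        experiment, control = read_experiment_metric(), read_control_metric()
        if control > 0 and experiment < control * (1 - tolerance):
            abort_experiment()  # flip the big red button and stop injecting faults
            break
        time.sleep(interval_seconds)
```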

To increase our ability to respond when experiments run amok, we avoid running experiments at risky times. For example, we only run experiments during business hours, when everyone is at the office and ready to work.

The entire purpose of Chaos Engineering is undermined if the tooling and instrumentation of the experiment itself causes undue impact on the metric of interest. We want to build confidence in the resilience of the system, one small and contained failure at a time.
