Part II. The Principles of Chaos

The performance of complex systems is typically optimized at the edge of chaos, just before system behavior will become unrecognizably turbulent.

Sidney Dekker, Drift Into Failure

The term “chaos” evokes a sense of randomness and disorder. However, that doesn’t mean Chaos Engineering is something that you do randomly or haphazardly. Nor does it mean that the job of a chaos engineer is to induce chaos. On the contrary: we view Chaos Engineering as a discipline. In particular, we view Chaos Engineering as an experimental discipline.

In the quote above, Dekker was making an observation about the overall behavior of distributed systems. He advocated for embracing a holistic view of how complex systems fail. Rather than looking for the “broken part,” we should try to understand how emergent behavior from component interactions could result in a system drifting into an unsafe, chaotic state.

You can think of Chaos Engineering as an empirical approach to addressing the question: “How close is our system to the edge of chaos?” Another way to think about this is: “How would our system fare if we injected chaos into it?”

In this chapter, we walk through the design of basic chaos experiments. We then delve deeper into advanced principles, which build on real-world applications of Chaos Engineering to systems at scale. Not all of the advanced principles are necessary in a chaos experiment, but we find that the more principles you can apply, the more confidence you’ll have in your system’s resiliency.

Experimentation

In college, electrical engineering majors are required to take a course called “Signals and Systems,” where they learn how to use mathematical models to reason about the behavior of electrical systems. One technique they learn is known as the Laplace transform. Using the Laplace transform, you can describe the entire behavior of an electrical circuit using a mathematical function called the transfer function. The transfer function describes how the system would respond if you subjected it to an impulse, an input signal that contains the sum of all possible input frequencies. Once you derive the transfer function of a circuit, you can predict how it will respond to any possible input signal.
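For readers who have not taken that course, the idea can be written out in the conventional textbook notation. The symbols below are the standard ones for a linear time-invariant circuit with zero initial conditions; they are an illustration and are not taken from this book. Here x(t) is the input signal, y(t) is the output, and H(s) is the transfer function.

```latex
% Transfer function: ratio of output to input in the Laplace domain
H(s) = \frac{Y(s)}{X(s)}
\quad\Longleftrightarrow\quad
Y(s) = H(s)\,X(s)

% An impulse has Laplace transform 1, so the impulse response is H(s) itself
\mathcal{L}\{\delta(t)\} = 1
\quad\Longrightarrow\quad
Y(s) = H(s), \qquad h(t) = \mathcal{L}^{-1}\{H(s)\}

% Response to any causal input: convolve it with the impulse response
y(t) = (h * x)(t) = \int_{0}^{t} h(\tau)\, x(t - \tau)\, d\tau
```

The last line is what makes the transfer function predictive: once h(t) is known, the output for any input follows by convolution.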

There is no analog to the transfer function for a software system. Like all complex systems, software systems exhibit behavior for which we cannot build predictive models. It would be wonderful if we could use such models to reason about the impact of, say, a sudden increase in network latency, or a change in a dynamic configuration parameter. Unfortunately, no such models appear on the horizon.

Because we lack theoretical predictive models, we must use an empirical approach to understand how our system will behave under a given set of conditions. We come to understand how the system will react under different circumstances by running experiments on it. We push and poke on our system and observe what happens.

However, we don’t randomly subject our system to different inputs. We use a systematic approach in order to maximize the information we can obtain from each experiment. Just as scientists use experiments to study natural phenomena, we use experiments to reveal system behavior.

FIT: Failure Injection Testing

Experience with distributed systems shows that many systemic issues are caused by unpredictable or poor latency. In early 2014, Netflix developed a tool called FIT, which stands for Failure Injection Testing. This tool allows an engineer to add a failure scenario to the request header of a class of requests at the edge of our service. As those requests propagate through the system, injection points between microservices check for the failure scenario and take some action based on it.

For example, suppose we want to test our service’s resilience to an outage of the microservice that stores customer data. We expect that some services will not function normally, but perhaps certain fundamental features like playback should still work for customers who are already logged in. Using FIT, we specify that 5% of all requests coming into the service should carry a customer data failure scenario in the request header. As those requests propagate through the system, any call they make to the customer data microservice automatically returns a failure.
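The flow described above can be sketched in a few lines of Python. This is a minimal illustration of header-based failure injection, not FIT's actual implementation: the header name, the scenario string, the sampling scheme, and every function name here are hypothetical.

```python
import hashlib
from typing import Dict

# Hypothetical header name; the text does not say what FIT actually uses.
FAILURE_HEADER = "x-failure-scenario"


def in_sample(request_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent`% of request IDs in the sample."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def tag_at_edge(request_id: str, headers: Dict[str, str],
                scenario: str, percent: float) -> Dict[str, str]:
    """Edge step: attach the failure scenario to a sampled share of requests."""
    tagged = dict(headers)
    if in_sample(request_id, percent):
        tagged[FAILURE_HEADER] = scenario
    return tagged


class InjectedFailure(Exception):
    """Raised by an injection point instead of making the real call."""


def call_customer_data(headers: Dict[str, str], customer_id: str) -> dict:
    """Injection point in front of a stand-in customer data service."""
    if headers.get(FAILURE_HEADER) == "customer-data-failure":
        raise InjectedFailure("customer data call failed by injection")
    return {"customer_id": customer_id, "plan": "standard"}  # stand-in response


def start_playback(headers: Dict[str, str], customer_id: str) -> str:
    """A caller that degrades gracefully when customer data is unavailable."""
    try:
        profile = call_customer_data(headers, customer_id)
        return f"playing with profile for {profile['customer_id']}"
    except InjectedFailure:
        return "playing with default settings"
```

Calling `tag_at_edge(request_id, headers, "customer-data-failure", 5)` at the edge and passing the returned headers along with each downstream call reproduces the 5% scenario. Requests without the header take the normal path, so only the sampled requests ever see the injected failure.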

Advanced Principles

As you develop your Chaos Engineering experiments, keep the following principles in mind, as they will help guide your experimental design. In the following chapters, we delve deeper into each principle:

Hypothesize about steady state.
Vary real-world events.
Run experiments in production.
Automate experiments to run continuously.
Minimize blast radius.

¹ Preetha Appan, Indeed.com, “I’m Putting Sloths on the Map”, presented at SRECon17 Americas, San Francisco, California, on March 13, 2017.
