Chapter 6. Automate Experiments to Run Continuously

Automation is the longest lever. In the practice of Chaos Engineering, we automate the execution of experiments and the analysis of experimental results, and we aspire to automate the creation of new experiments.

Automatically Executing Experiments

Doing things manually and performing one-off experiments are great first steps. As we conjure up new ways to search the failure space we frequently begin with a manual approach, handling everything with kid gloves to gain confidence in both the experiment and the system. All stakeholders are gathered, and a heads-up is broadcast to CORE¹ that a new kind of experiment is going to begin.

This apprehension and extreme level of care is appropriate to establish that (a) the experiment runs correctly and (b) the experiment has a minimal blast radius. Once we have successfully conducted the experiment, the next step is to automate the experiment to run continuously.

If experimentation is not automated, it is obsolescent.

The intractable complexity of modern systems means that we cannot know a priori which changes to the production environment will alter the results of a chaos experiment. Since we can’t know which changes can impact our experiments, we have to assume they all do. Through shared state, caching, dynamic configuration management, continuous delivery, autoscaling, and time-aware code, production is in a perpetual state of change. As a result, the confidence in a result decays with time.

Ideally, experiments would run with each change, kind of like a Chaos Canary. When a new risk is discovered, the operator can choose whether or not they should block the roll out of the change and prioritize a fix, being reasonably sure that the rolled out change is the cause. This approach provides insight into the onset and duration of availability risks in production. At the other extreme, annual exercises lead to more difficult investigations that essentially start from scratch and don’t provide easy insight into how long the potential issue has been in production.

If experimentation is not automated, it won’t happen.

At Netflix, each team owns the availability of the services they author and maintain. Our Chaos Engineering team helps service owners increase their availability through education, tools, encouragement, and peer pressure. We cannot, and should not, ask engineers to sacrifice development velocity to spend time manually running through chaos experiments on a regular basis. Instead, we invest in creating tools and platforms for chaos experimentation that continually lower the barriers to creating new chaos experiments and running them automatically.

Chaos Automation Platform (ChAP)

Our Chaos Engineering team spent the better part of 2015 running chaos exercises with critical microservices on a consulting basis. This was necessary to really understand the power and limitations of FIT, but we knew that hands-on consulting would not scale. We needed a mechanism to scale the practice across the organization.

By early 2016, we had the seeds for a plan to bring the Principles of Chaos Engineering to the microservices layer. We noted several issues with FIT that discouraged automation and widespread adoption. Some of these could be fixed in FIT, and some would require a larger engineering effort beyond the request header manipulation and IPC injection points that FIT provides.

The Chaos Automation Platform, called ChAP for short, was launched in late 2016 to address these deficiencies.

Most of the issues with FIT revolved around a lack of automation. The human involvement of setting up a failure scenario and then watching key metrics while it runs proved to be an obstacle to adoption. We chose to lean on the existing canary analysis (see “Canary Analysis”) to automatically judge whether an exercise was performing within acceptable boundaries.

Then we automated a template for true experimentation. In the FIT example above, we affected 5% of incoming traffic and looked for an impact in SPS. If we didn’t see any impact, we would crank the affected traffic up to 25%. Any impact could still be lost in the noise for the SPS metric. Affecting large swaths of incoming traffic like this is risky. It also gave us low confidence that we could isolate small effects, and it prevented multiple failure scenarios from running simultaneously.

In order to minimize the blast radius, ChAP launches a new experiment with both a control and an experiment cluster for each microservice examined. If we are testing the customer data microservice as in the example above, ChAP will interrogate our continuous delivery tool, Spinnaker, about that cluster. Using that information, it will launch two identical nodes of the same service: one as the control and the other as the experiment. It will then redirect a fraction of a percentage of traffic and split it evenly between the control and experiment nodes. The failure scenario will be applied only to the experiment node. As the requests propagate through the system, we can directly compare success rates and operational concerns between traffic that went to the control and to the experiment.

With this automation of the experiment, we have high confidence that we can detect even small effects with a one-to-one comparison between the control and the experiment. We are affecting a minimal amount of incoming traffic, and have isolated the experiment so we can run a very large number of experiments in parallel.
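The control-versus-experiment comparison described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not ChAP's implementation: the names `NodeStats`, `route`, `run_experiment`, and `within_bounds`, the failure probabilities, and the `max_drop` threshold are all hypothetical, and a simple difference in success rates stands in for the real canary analysis.

```python
import random
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Request counters for one node (control or experiment)."""
    requests: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        # Treat a node that saw no traffic as fully successful.
        return self.successes / self.requests if self.requests else 1.0


def route(rng: random.Random, sample_fraction: float) -> str:
    """Divert a small fraction of traffic and split it evenly
    between the control node and the experiment node."""
    if rng.random() >= sample_fraction:
        return "production"
    return "control" if rng.random() < 0.5 else "experiment"


def run_experiment(total_requests: int, sample_fraction: float,
                   base_failure: float, injected_failure: float,
                   seed: int = 0) -> dict:
    """Simulate requests; the fault is applied only to the experiment node."""
    rng = random.Random(seed)
    stats = {"control": NodeStats(), "experiment": NodeStats()}
    for _ in range(total_requests):
        dest = route(rng, sample_fraction)
        if dest == "production":
            continue  # untouched traffic is outside the blast radius
        failure_prob = base_failure
        if dest == "experiment":
            failure_prob += injected_failure  # hypothetical injected fault
        node = stats[dest]
        node.requests += 1
        if rng.random() >= failure_prob:
            node.successes += 1
    return stats


def within_bounds(stats: dict, max_drop: float) -> bool:
    """Hypothetical stand-in for canary analysis: accept the run if the
    experiment's success rate is at most max_drop below the control's."""
    drop = stats["control"].success_rate - stats["experiment"].success_rate
    return drop <= max_drop


if __name__ == "__main__":
    result = run_experiment(100_000, 0.01, base_failure=0.001,
                            injected_failure=0.05)
    print(result["control"].success_rate, result["experiment"].success_rate)
    print("within bounds:", within_bounds(result, max_drop=0.01))
```

Because both nodes receive the same kind of traffic at the same time, any difference between their success rates can be attributed to the injected fault rather than to background noise in a global metric.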

In late 2016, we integrated ChAP with the continuous delivery tool, Spinnaker, so that microservices can run chaos experiments every time they deploy a new code base. This new functionality is similar to a canary, but in this case we want it to be nonblocking, because we are uncovering potential future systemic effects, not something that will immediately degrade the service. By providing the microservice owners with context about these resilience vulnerabilities, we give them the opportunity to prevent a service degradation before it occurs.

Automatically Creating Experiments

If you can set up experiments that run automatically, on a regular basis, you’re in great shape. However, there’s another level of automation we can aspire to: automating the design of the experiments.

The challenge of designing Chaos Engineering experiments is not identifying what causes production to break, since the data in our incident tracker has that information. What we really want to do is identify the events that shouldn’t cause production to break, and that have never before caused production to break, and continuously design experiments that verify that this is still the case.

Unfortunately, this is a difficult challenge to address. The space of possible perturbations to the system is enormous, and we simply don’t have the time or resources to do a brute-force search across all possible combinations of events that might lead to a problem.
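To get a feel for how quickly a brute-force search grows, the short calculation below counts the distinct fault combinations for a given number of injection points. The figure of 100 injection points is a hypothetical number chosen for illustration, not a measurement of any real system, and `count_fault_combinations` is an illustrative helper.

```python
from math import comb


def count_fault_combinations(injection_points: int, max_simultaneous: int) -> int:
    """Count the non-empty sets of at most max_simultaneous faults that can
    be chosen from the given number of injection points."""
    return sum(comb(injection_points, k)
               for k in range(1, max_simultaneous + 1))


if __name__ == "__main__":
    # Hypothetical system with 100 places where a fault can be injected.
    for k in (1, 2, 3):
        print(k, count_fault_combinations(100, k))
```

With these assumed numbers, allowing up to three simultaneous faults already yields 166,750 combinations, before accounting for fault types, timing, or ordering.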

Lineage-Driven Fault Injection (LDFI)

One notable example of research on automatically creating experiments is a technique called Lineage-Driven Fault Injection (LDFI). Developed by Prof. Peter Alvaro of the University of California, Santa Cruz, LDFI can identify combinations of injected faults that can induce failures in distributed systems. LDFI works by reasoning about the system behavior of successful requests in order to identify candidate faults to inject.
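The idea of reasoning backward from successful requests can be sketched in a few lines. The example below is a simplified stand-in, not the published LDFI algorithm: it assumes we already know each redundant set of components that was sufficient for a request to succeed (here called `supports`), and it brute-forces the smallest fault sets that break every one of them. The function name and the component names are hypothetical.

```python
from itertools import combinations


def candidate_fault_sets(supports: list, max_faults: int) -> list:
    """Return the minimal sets of components whose failure would break
    every known way the request succeeded.

    supports: list of sets; each set names the components that together
              were sufficient for one successful execution path.
    """
    components = sorted(set().union(*supports))
    found = []
    for size in range(1, max_faults + 1):
        for combo in combinations(components, size):
            faults = set(combo)
            # Skip supersets of a smaller candidate: they are not minimal.
            if any(prev <= faults for prev in found):
                continue
            # A candidate must knock out at least one component per path.
            if all(faults & path for path in supports):
                found.append(faults)
    return found


if __name__ == "__main__":
    # Hypothetical request that can be served from a cache or either replica.
    supports = [
        {"api", "cache"},
        {"api", "db-replica-a"},
        {"api", "db-replica-b"},
    ]
    for faults in candidate_fault_sets(supports, max_faults=3):
        print(sorted(faults))
```

In this toy example, the candidates are the single shared component and the combination of all three redundant backends. Each candidate is a hypothesis to test by injection, which is far fewer experiments than trying every possible combination of faults.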

In 2015, Peter Alvaro worked in collaboration with Netflix engineers to determine if LDFI could be implemented on our systems. They successfully implemented a version of LDFI on top of Netflix’s FIT framework, which was able to identify combinations of faults that could lead to critical failures.

For more information on how this work was applied at Netflix, see the paper “Automating Failure Testing Research at Internet Scale” published in the Proceedings of the Seventh ACM Symposium on Cloud Computing (SoCC ’16).

¹ CORE, which stands for Critical Operations Response Engineering, is the name of the SRE team at Netflix.
