Chapter 8. Designing Experiments

Now that we’ve covered the principles, let’s talk about the nitty-gritty of designing your Chaos Engineering experiments. Here’s an overview of the process:

  1. Pick a hypothesis

  2. Choose the scope of the experiment

  3. Identify the metrics you’re going to watch

  4. Notify the organization

  5. Run the experiment

  6. Analyze the results

  7. Increase the scope

  8. Automate

1. Pick a Hypothesis

The first thing you need to do is decide what hypothesis you’re going to test, a topic we covered in Chapter 4. Perhaps you recently had an outage that was triggered by timeouts when accessing one of your Redis caches, and you want to ensure that your system isn’t vulnerable to timeouts in any of its other caches. Or perhaps you’d like to verify that your active-passive database configuration fails over cleanly when the primary database server encounters a problem.

Don’t forget that your system includes the humans who are involved in maintaining it. Human behavior is critical in mitigating outages. Consider an organization that uses a messaging app such as Slack or HipChat to communicate during an incident. The organization may have a contingency plan for when the messaging app itself is down during an incident, but how well do the on-call engineers know that plan? Running a chaos experiment is a great way to find out.

2. Choose the Scope of the Experiment

Once you’ve chosen what hypothesis you want to test, the next thing you need to decide is the scope of the experiment. Two principles apply here: “run experiments in production” and “minimize blast radius.” The closer your test is to production, the more you’ll learn from the results. That being said, there’s always a risk of doing harm to the system and causing customer pain.

Because we want to minimize the amount of customer pain as much as possible, we should start with the smallest possible test to get a signal and then ratchet up the impact until we achieve the most accurate simulation of the biggest impact we expect our systems to handle.

Therefore, as described in Chapter 7, we advocate running the first experiment with as narrow a scope as possible. You’ll almost certainly want to start out in your test environment to do a dry run before you move into production. Once you do move to production, you’ll want to start out with experiments that impact the minimal amount of customer traffic. For example, if you’re investigating what happens when your cache times out, you could start by calling into your production system using a test client, and just inducing the timeouts for that client.
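One way to confine the cache-timeout example to a test client is to inject the fault only for requests that carry an opt-in marker. The sketch below assumes a hypothetical header name (X-Chaos-Experiment), a hypothetical cache_get wrapper, and a dict-like cache backend; none of these come from a particular library.

```python
import time

# Hypothetical header that only the test client sets to opt in to the fault.
CHAOS_HEADER = "X-Chaos-Experiment"


class CacheTimeout(Exception):
    """Raised to simulate a cache call that timed out."""


def cache_get(key, headers, backend, injected_delay_s=0.0):
    """Read from the cache, simulating a timeout only for opted-in requests."""
    if headers.get(CHAOS_HEADER) == "cache-timeout":
        time.sleep(injected_delay_s)  # stands in for the slow cache call
        raise CacheTimeout(f"injected timeout for key {key!r}")
    # Ordinary customer traffic never carries the header and is unaffected.
    return backend.get(key)
```

Because the decision is made per request, regular customer traffic takes the normal path while the test client exercises the timeout handling in production.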

3. Identify the Metrics You’re Going to Watch

Once you know the hypothesis and scope, it’s time to select which metrics you are going to use to evaluate the outcome of the experiment, a topic we covered in Chapter 3. Try to operationalize your hypothesis using your metrics as much as possible. If your hypothesis is “if we fail the primary database, then everything should be ok,” you’ll want to have a crisp definition of “ok” before you run the experiment. If you have a clear business metric like “orders per second,” or lower-level metrics like response latency and response error rate, be explicit about which range of values is within tolerance before you run the experiment.
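A crisp definition of “ok” can be written down as explicit tolerance ranges before the experiment starts. The following is a minimal sketch; the metric names and the numeric ranges are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tolerance:
    """Acceptable range for one metric, decided before the experiment runs."""
    metric: str
    low: float
    high: float

    def within(self, value: float) -> bool:
        return self.low <= value <= self.high


# Illustrative numbers only; derive real ranges from your own steady state.
TOLERANCES = [
    Tolerance("orders_per_second", low=90.0, high=float("inf")),
    Tolerance("p99_latency_ms", low=0.0, high=500.0),
    Tolerance("error_rate", low=0.0, high=0.01),
]


def violations(observed: dict) -> list:
    """Return the names of metrics that are missing or out of tolerance."""
    return [
        t.metric
        for t in TOLERANCES
        if t.metric not in observed or not t.within(observed[t.metric])
    ]
```

Treating a missing metric as a violation is a deliberate choice in this sketch: if you cannot observe a metric, you cannot claim it stayed within tolerance.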

If the experiment has a more serious impact than you expected, you should be prepared to abort early. A firm threshold could look like: 5% or more of the requests are failing to return a response to client devices. This will make it easier for you to know whether you need to hit the big red “stop” button when you’re in the moment.
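The 5% threshold can be encoded as a simple check so that the abort decision does not depend on judgment in the moment. This is a sketch under assumptions: the function name and the min_sample guard are hypothetical additions, not part of the process described above.

```python
ABORT_FAILURE_RATIO = 0.05  # the 5% threshold from the text


def should_abort(failed_requests: int, total_requests: int,
                 min_sample: int = 100) -> bool:
    """Return True once 5% or more of the observed requests have failed.

    min_sample is an assumed guard so that a handful of early failures
    does not trip the abort before there is enough traffic to judge.
    """
    if total_requests < min_sample:
        return False
    return failed_requests / total_requests >= ABORT_FAILURE_RATIO
```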

4. Notify the Organization

When you first start off running chaos experiments in the production environment, you’ll want to inform members of your organization about what you’re doing, why you’re doing it, and (only initially) when you’re doing it.

For the initial run, you might need to coordinate with multiple teams who are interested in the outcome and are nervous about the impact of the experiment. As you gain confidence by doing more experiments and your organization gains confidence in the approach, there will be less of a need to explicitly send out notifications about what is happening.

5. Run the Experiment

Now that you’ve done all of the preparation work, it’s time to perform the chaos experiment! Watch those metrics in case you need to abort. Being able to halt an experiment is especially important if you are running directly in production and potentially causing too much harm to your systems, or worse, your external customers. For example, if you are an e-commerce site, you might be keeping a watchful eye on your customers’ ability to check out or add to their cart. Ensure that you have proper alerting in place in case these critical metrics dip below a certain threshold.
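The watch-and-halt behavior can be sketched as a loop that polls one critical metric and always removes the fault on the way out. All of the callables here (start_fault, stop_fault, read_metric) are hypothetical hooks you would supply; the clock and sleep parameters exist only so the loop can be exercised without waiting in real time.

```python
import time


def run_experiment(start_fault, stop_fault, read_metric, threshold,
                   duration_s, poll_interval_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Inject a fault, watch one metric, and halt early if it dips too low.

    Returns "completed" or "aborted". The fault is always removed on exit.
    """
    start_fault()
    try:
        deadline = clock() + duration_s
        while clock() < deadline:
            if read_metric() < threshold:
                return "aborted"  # the big red "stop" button
            sleep(poll_interval_s)
        return "completed"
    finally:
        stop_fault()  # runs on normal exit, on abort, and on exceptions
```

Putting stop_fault in a finally clause means the fault is removed even if reading the metric itself raises an error, which is the situation in which you are least able to judge the impact.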

6. Analyze the Results

After the experiment is done, use the metrics you’ve collected to test if your hypothesis is correct. Was your system resilient to the real-world events you injected? Did anything happen that you didn’t expect?

Many issues exposed by Chaos Engineering experiments will involve interactions among multiple services. Make sure that you feed back the outcome of the experiment to all of the relevant teams so they can mitigate any weaknesses.

7. Increase the Scope

As described in Chapter 7, once you’ve gained some confidence from running smaller-scale experiments, you can ratchet up the scope of the experiment. Increasing the scope of an experiment can reveal systemic effects that aren’t noticeable with smaller-scale experiments. For example, a microservice might handle a small number of downstream requests timing out, but it might fall over if a significant fraction start timing out.
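Ratcheting up the scope can be expressed as a staged ramp that stops at the first stage where the system leaves tolerance. In this sketch, ramp_scope and run_stage are hypothetical names; run_stage is assumed to run the experiment against the given fraction of traffic and report whether the metrics stayed within tolerance.

```python
def ramp_scope(stages, run_stage):
    """Run the experiment at each traffic fraction in turn.

    stages:    increasing fractions of traffic to affect, e.g. [0.01, 0.05].
    run_stage: callable taking a fraction and returning True if the system
               stayed within tolerance at that scope.
    Returns the largest fraction that passed, or None if the first failed.
    """
    largest_passed = None
    for fraction in stages:
        if not run_stage(fraction):
            break  # do not widen the blast radius after a failure
        largest_passed = fraction
    return largest_passed
```

For example, a service that tolerates timeouts on a small share of requests but falls over at a quarter of them would pass the early stages and stop the ramp at the 25% stage.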

8. Automate

As described in Chapter 6, once you have confidence in running your chaos exercises manually, you’ll get more value out of them if you automate them so they run regularly.
