Adopting Your Own Monkey

When Chaos Monkey launched, most developers were surprised by how many vulnerabilities it uncovered. Even services that had been in production for ages turned out to have subtle configuration problems. Some of them had cluster membership rosters that grew without bounds. Old IP addresses would stay on the list, even though the owner would never be seen again. (Or worse, if that IP came back it was as a different service!)

Prerequisites

First of all, your chaos engineering efforts can’t kill your company or your customers.

In a sense, Netflix had it easy. Customers are familiar with pressing the play button again if it doesn’t work the first time. They’ll forgive just about anything except cutting off the end of Stranger Things. If every single request in your system is irreplaceably valuable, then chaos engineering is not the right approach for you. The whole point of chaos engineering is to disrupt things in order to learn how the system breaks. You must be able to break the system without breaking the bank!

You also want a way to limit the exposure of a chaos test. Some people talk about the “blast radius,” meaning the magnitude of bad experiences both in terms of the sheer number of customers affected and the degree to which they’re disrupted. To keep the blast radius under control, you often want to pick “victims” based on a set of criteria. It may be as simple as “every 10,000th request will fail” when you get started, but you’ll soon need more sophisticated selections and controls.
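As a rough sketch of that kind of control (not any particular tool’s implementation), a victim selector might hash the request ID so that roughly one request in N is chosen, with a hard cap on how many injections are allowed per minute. The names and constants here are invented for illustration:

    import hashlib
    import time

    SAMPLE_RATE = 10_000          # roughly one request in 10,000 becomes a victim
    MAX_VICTIMS_PER_MINUTE = 50   # hard cap on the blast radius

    _recent_victims = []          # timestamps of recent injections

    def should_inject(request_id: str) -> bool:
        """Deterministically pick about 1/SAMPLE_RATE requests, bounded by a rate cap."""
        digest = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
        if digest % SAMPLE_RATE != 0:
            return False
        now = time.time()
        _recent_victims[:] = [t for t in _recent_victims if now - t < 60]
        if len(_recent_victims) >= MAX_VICTIMS_PER_MINUTE:
            return False
        _recent_victims.append(now)
        return True

Hashing the ID rather than rolling a die also means a given request is either a victim everywhere or nowhere, which makes the resulting traces easier to interpret.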

You’ll need a way to track a user and a request through the tiers of your system, and a way to tell if the whole request was ultimately successful or not. That trace serves two purposes. If the request succeeds, then you’ve uncovered some redundancy or robustness in the system. The trace will tell you where the redundancy saves the request. If the request fails, the trace will show you where that happened, too.
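In its simplest form, that trace can be a single correlation ID passed through every tier, with each hop logging its outcome under that ID. The traced_call wrapper below is a hypothetical sketch, not a real tracing library:

    import logging
    import uuid

    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def traced_call(trace_id: str, tier: str, fn, *args, **kwargs):
        """Run one hop of the request and record its outcome under the trace ID."""
        try:
            result = fn(*args, **kwargs)
            logging.info("trace=%s tier=%s outcome=success", trace_id, tier)
            return result
        except Exception as exc:
            logging.info("trace=%s tier=%s outcome=failure error=%s", trace_id, tier, exc)
            raise

    # The same trace ID follows the request through every tier it touches.
    trace_id = str(uuid.uuid4())
    traced_call(trace_id, "edge", lambda: "ok")

In practice you’d reach for a distributed tracing system rather than hand-rolled logging, but the essential requirement is the same: one ID that ties the hops together and a record of where each hop succeeded or failed.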

You also have to know what “healthy” looks like, and from what perspective. Is your monitoring good enough to tell when failure rates go from 0.01 percent to 0.02 percent for users in Europe but not in South America? Be wary that measurements may fail when things get weird, especially if monitoring shares the same network infrastructure as production traffic. Also, as Charity Majors, CEO of Honeycomb.io, says, “If you have a wall full of green dashboards, that means your monitoring tools aren’t good enough.” There’s always something weird going on.

Finally, make sure you have a recovery plan. The system may not automatically return to a healthy state when you turn off the chaos. So you will need to know what to restart, disconnect, or otherwise clean up when the test is done.

Designing the Experiment

Let’s say you’ve got great measurements in place. Your A/B testing system can tag a request as part of a control group or a test group. It’s not quite time to randomly kill some boxes yet. First you need to design the experiment, beginning with a hypothesis.

The hypothesis behind Chaos Monkey was, “Clustered services should be unaffected by instance failures.” Observations quickly invalidated that hypothesis. Another hypothesis might be, “The application is responsive even under high latency conditions.”

As you form the hypothesis, think about it in terms of invariants that you expect the system to uphold even under turbulent conditions. Focus on externally observable behavior, not internals. There should be some healthy steady state that the system maintains as a whole.

Once you have a hypothesis, check to see if you can even tell if the steady state holds now. You might need to go back and tweak measurements. Look for blind spots like a hidden delay in network switches or a lost trace between legacy applications.

Now think about what evidence would cause you to reject the hypothesis. Is a non-zero failure rate on a request type sufficient? Maybe not. If that request starts outside your organization, you probably have some failures due to external network conditions (aborted connections on mobile devices, for example). You might have to dust off those statistics textbooks to see how large a change constitutes sufficient evidence.
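As a rough illustration of what “sufficient evidence” can mean (this isn’t prescribed by any particular chaos tool), a two-proportion z-test comparing the control group to the chaos group shows whether an observed bump in the failure rate stands out from background noise. The counts below are made up:

    from math import sqrt

    def two_proportion_z(failures_a, total_a, failures_b, total_b):
        """z statistic for the difference between two failure rates."""
        p_a = failures_a / total_a
        p_b = failures_b / total_b
        pooled = (failures_a + failures_b) / (total_a + total_b)
        se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
        return (p_b - p_a) / se

    # Control group: 100 failures in 1,000,000 requests (0.01 percent).
    # Chaos group:   200 failures in 1,000,000 requests (0.02 percent).
    z = two_proportion_z(100, 1_000_000, 200, 1_000_000)
    print(f"z = {z:.2f}")   # |z| greater than about 1.96 suggests a real difference

With samples that large, doubling the failure rate is unambiguous; with smaller samples, the same doubling may be indistinguishable from noise.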

Injecting Chaos

The next step is to apply your knowledge of the system to inject chaos. You know the structure of the system well enough to guess where you can kill an instance, add some latency, or make a service call fail. These are all “injections.” Chaos Monkey does one kind of injection: it kills instances.

Killing instances is the most basic and crude kind of injection. It will absolutely find weaknesses in your system, but it’s not the end of the story.

Latency Monkey adds latency to calls. This strategy finds two additional kinds of weaknesses. First, some services just time out and report errors when they should have a useful fallback. Second, some services have undetected race conditions that only become apparent when responses arrive in a different order than usual.
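A latency injection can be as simple as a wrapper around outbound calls that occasionally sleeps before proceeding. Real tools usually do this at the proxy or network layer, so treat this sketch (with invented names and constants) as an illustration of the idea only:

    import random
    import time

    LATENCY_PROBABILITY = 0.01    # delay roughly 1 percent of calls
    ADDED_LATENCY_SECONDS = 3.0   # long enough to trip most client timeouts

    def with_latency(call, *args, **kwargs):
        """Occasionally delay an outbound call before letting it proceed."""
        if random.random() < LATENCY_PROBABILITY:
            time.sleep(ADDED_LATENCY_SECONDS)
        return call(*args, **kwargs)

    # Some responses now arrive late and out of their usual order.
    result = with_latency(lambda: "response from service H")

Delayed responses are exactly what shakes out the missing fallbacks and the order-dependent race conditions.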

When you have deep trees of service calls, your system may be vulnerable to loss of a whole service. Netflix uses failure injection testing (FIT) to inject more subtle failures.[96] (Note that this is not the same “FIT” as the “framework for integrated testing” in Nonbreaking API Changes.) FIT can tag a request at the inbound edge (at an API gateway, for example) with a cookie that says, “Down the line, this request is going to fail when service G calls service H.” Then at the call site where G would issue the request to H, it looks at the cookie, sees that this call is marked as a failure, and reports it as failed, without even making the request. (Netflix uses a common framework for all its outbound service calls, so it has a way to propagate this cookie and treat it uniformly.)
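FIT itself is internal to Netflix, but the shape of the idea can be sketched with hypothetical names: the edge tags the request with a header naming the call that should fail, the header rides along with the request, and the common outbound-call wrapper checks it before making each call.

    class FaultInjectedError(Exception):
        pass

    def call_service(caller, callee, request_headers, do_call):
        """Outbound-call wrapper: fail the call if the request is tagged for it."""
        # The edge gateway might set: x-fault-injection: "fail G->H"
        tag = request_headers.get("x-fault-injection", "")
        if tag == f"fail {caller}->{callee}":
            # Report a failure without ever making the real request.
            raise FaultInjectedError(f"injected failure on {caller}->{callee}")
        return do_call()

    # A request tagged at the edge fails precisely when G calls H.
    headers = {"x-fault-injection": "fail G->H"}
    try:
        call_service("G", "H", headers, lambda: "real response")
    except FaultInjectedError as exc:
        print(exc)

The crucial enabler is the common framework for outbound calls: without a single choke point, there’s nowhere to put that check.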

Now we have three injections that can be applied in various places. We can kill an instance of any autoscaled cluster. We can add latency to any network connection. And we can cause any service-to-service call to fail. But which instances, connections, and calls are interesting enough to inject a fault? And where should we inject that fault?

Introducing Chaos to Your Neighbors
by Nora Jones, Senior Software Engineer and Coauthor of Chaos Engineering (O'Reilly, 2017)

I was hired as the first and only person working on internal tools and developer productivity at a brand-new e-commerce startup during a pivotal time. We had just launched the site, we were releasing code multiple times a day, and our marketing team was crushing it, so we already had several customers expecting solid performance and availability from the site from day one.

The lightning feature development speed led to a lack of tests and general caution, which ultimately led to precarious situations that were not ideal (read: being paged at 4 a.m. on a Saturday). About two weeks into my role at this company, my manager asked me if we could start experimenting with chaos engineering to help detect some of these issues before they became major outages.

Given that I was new to the company and didn’t know all my colleagues yet, I started this effort by sending an email to all the developers and business owners informing them that we were beginning implementation of chaos engineering in QA and that if they considered their services “unsafe to chaos” they could let me know and opt out of the first round. I didn’t get much response. After a couple of weeks of waiting and nagging, I assumed the silence implied consent and unleashed my armies of chaos. We ended up taking QA down for a week, and I pretty much ended up meeting everyone who worked at the company.

Moral of the story: chaos engineering is a quick way to meet your new colleagues, but it’s not a great way. Proceed with caution and control your failures delicately, especially when it’s the first time you’re enabling chaos.

Targeting Chaos

You could certainly use randomness. This is how Chaos Monkey works. It picks a cluster at random, picks an instance at random, and kills it. If you’re just getting started with chaos engineering, then random selection is as good a process as any. Most software has so many problems that shooting at random targets will uncover something alarming.
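In sketch form, the random strategy is barely more than this (the inventory and the terminate call stand in for whatever your platform’s API actually provides):

    import random

    # Hypothetical inventory; in practice this comes from your cloud provider's API.
    clusters = {
        "recommendations": ["i-0a1", "i-0a2", "i-0a3"],
        "checkout":        ["i-0b1", "i-0b2"],
    }

    def terminate(instance_id: str) -> None:
        print(f"terminating {instance_id}")   # stand-in for the real API call

    def random_kill():
        """Chaos Monkey's basic move: random cluster, random instance, kill it."""
        cluster = random.choice(list(clusters))
        victim = random.choice(clusters[cluster])
        terminate(victim)

    random_kill()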

Once the easy stuff is fixed, you’ll start to see that this is a search problem. You’re looking for faults that lead to failures. Many faults won’t cause failures. In fact, on any given day, most faults don’t result in failures. (More about that later in this chapter.) When you inject faults into service-to-service calls, you’re searching for the crucial calls. As with any search problem, we have to confront the challenge of dimensionality.

Suppose there’s a partner data load process that runs every Tuesday. A fault during one part of that process causes bad data in the database. Later, when using that data to present an API response, a service throws an exception and returns a 500 response code. How likely are you to find that problem via random search? Not very likely.

Randomness works well at the beginning because the search space for faults is densely populated. As you progress, the search space becomes more sparse, but not uniform. Some services, some network segments, and some combinations of state and request will still have latent killer bugs. But imagine trying to exhaustively search an n-dimensional space, where n is the number of calls from service to service. In the worst case, if you have x services, there could be x² possible faults to inject!

At some point, we can’t rely just on randomness. We need a way to devise more targeted injections. Humans can do that by thinking about how a successful request works. A top-level request generates a whole tree of calls that support it. Kick out one of the supports, and the request may succeed or it may fail. Either way we learn something. This is why it’s important to study all the times when faults happen without failures. The system did something to keep that fault from becoming a failure. We should learn from those happy outcomes, just as we learn from the negative ones.

As humans, we apply our knowledge of the system together with abductive reasoning and pattern matching. Computers aren’t great at that, so we still have an edge when picking targets for chaos. (But see Cunning Malevolent Intelligence for some developing work.)


Automate and Repeat

So far, this sounds like an engineering lab course. Shouldn’t something called “chaos” be fun and exciting? No! In the best case, it’s totally boring because the system just keeps running as usual.

Assuming we did find a vulnerability, things probably got at least a little exciting in the recovery stages. You’ll want to do two things once you find a weakness. First, you need to fix that specific instance of weakness. Second, you want to see what other parts of your system are vulnerable to the same class of problem.

With a known class of vulnerability, it’s time to find a way to automate testing. Along with automation comes moderation. There’s such a thing as too much chaos. If the new injection kills instances, it probably shouldn’t kill the last instance in a cluster. If the injection simulates a request failure from service G to service H, then it isn’t meaningful to simultaneously fail requests from G to every fallback it uses when H isn’t working!
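A guardrail of the first kind might look roughly like this sketch, with a made-up inventory and a minimum cluster size chosen for illustration:

    import random

    MIN_HEALTHY_INSTANCES = 2   # never shrink a cluster below this

    def safe_kill(cluster_instances):
        """Kill one instance only if the cluster can spare it."""
        if len(cluster_instances) <= MIN_HEALTHY_INSTANCES:
            return None                       # too much chaos; skip this round
        victim = random.choice(cluster_instances)
        cluster_instances.remove(victim)      # stand-in for terminating the instance
        return victim

    print(safe_kill(["i-0a1", "i-0a2", "i-0a3"]))   # kills one instance
    print(safe_kill(["i-0b1", "i-0b2"]))            # refuses: would drop below the minimum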

Companies with dedicated chaos engineering teams are all building platforms that let them decide how much chaos to apply, when, to whom, and which services are off-limits. These make sure that one poor customer doesn’t get flagged for all the experiments at once! For example, Netflix calls its platform the “Chaos Automation Platform” (ChAP).[97]

The platform makes decisions about what injections to apply and when, but it usually leaves the “how” up to some existing tool. Ansible is a popular choice, since it doesn’t require a special agent on the targeted nodes. The platform also needs to report its tests to monitoring systems, so you can correlate the test events with changes in production behavior.
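At a minimum, that reporting can be a structured event emitted when each injection starts and stops, so you can overlay experiments on dashboards and alert timelines. The event shape here is invented for illustration:

    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def emit_chaos_event(experiment_id: str, action: str, target: str, phase: str):
        """Write a structured event that monitoring can correlate with metric changes."""
        event = {
            "type": "chaos-experiment",
            "experiment_id": experiment_id,
            "action": action,        # e.g., "kill-instance" or "add-latency"
            "target": target,
            "phase": phase,          # "start" or "stop"
            "timestamp": time.time(),
        }
        logging.info(json.dumps(event))

    emit_chaos_event("exp-42", "add-latency", "service-G->service-H", "start")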
