Adopting Your Own Monkey
When Chaos Monkey launched, most developers were surprised by how many
vulnerabilities it uncovered. Even services that had been in production
for ages turned out to have subtle configuration problems. Some of them
had cluster membership rosters that grew without bounds. Old IP addresses
would stay on the list, even though the owner would never be seen
again. (Or worse, if that IP came back it was as a different service!)
Prerequisites
First of all, your chaos engineering efforts can’t kill your company or
your customers.
In a sense, Netflix had it easy. Customers are familiar with pressing the
play button again if it doesn’t work the first time. They’ll forgive just
about anything except cutting off the end of Stranger Things. If
every single request in your system is irreplaceably valuable, then chaos
engineering is not the right approach for you. The whole point of chaos
engineering is to disrupt things in order to learn how the system
breaks. You must be able to break the system without breaking the bank!
You also want a way to limit the exposure of a chaos test. Some people
talk about the “blast radius,” meaning the magnitude of bad
experiences both in terms of the sheer number of customers affected and the
degree to which they’re disrupted. To keep the blast radius under
control, you often want to pick “victims” based on a set of
criteria. It may be as simple as “every 10,000th request will fail”
when you get started, but you’ll soon need more sophisticated selections
and controls.
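Here is a minimal sketch (in Python) of what that kind of victim selection might look like. The failure rate, the opt-out set, and the request shape are all invented for illustration; a real platform would pull these from configuration.
    import random

    class VictimSelector:
        """Decides whether a given request joins the chaos experiment."""

        def __init__(self, failure_rate=1 / 10_000, exempt_services=frozenset()):
            self.failure_rate = failure_rate        # start with "every 10,000th request"
            self.exempt_services = exempt_services  # services that opted out

        def should_inject(self, request):
            # Never target requests headed for an exempt service.
            if request.get("service") in self.exempt_services:
                return False
            # Probabilistic selection keeps the blast radius small and predictable.
            return random.random() < self.failure_rate

    selector = VictimSelector(exempt_services=frozenset({"billing"}))
    if selector.should_inject({"service": "recommendations", "user": "u-123"}):
        print("mark this request as a chaos victim")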
You’ll need a way to track a user and a request through the tiers of your
system, and a way to tell if the whole request was ultimately successful or
not. That trace serves two purposes. If the request succeeds, then you’ve
uncovered some redundancy or robustness in the system. The trace will
tell you where the redundancy saves the request. If the request fails,
the trace will show you where that happened, too.
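One way to get that trace, sketched here with hypothetical names: tag each request with a correlation ID at the edge, have every tier record its own outcome against that ID, and judge the request as a whole afterward.
    import uuid

    trace_log = []  # stand-in for a real tracing backend

    def start_trace(headers):
        # Attach a correlation ID at the inbound edge so every tier can report on it.
        headers.setdefault("X-Correlation-Id", str(uuid.uuid4()))
        return headers["X-Correlation-Id"]

    def record_hop(correlation_id, service, succeeded):
        # Each tier records whether its piece of the work succeeded.
        trace_log.append({"id": correlation_id, "service": service, "ok": succeeded})

    def request_succeeded(correlation_id):
        # The request counts as a success only if every hop reported success.
        hops = [h for h in trace_log if h["id"] == correlation_id]
        return bool(hops) and all(h["ok"] for h in hops)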
You also have to know what “healthy” looks like, and from what
perspective. Is your monitoring good enough to tell when failure rates go
from 0.01 percent to 0.02 percent for users in Europe but not in South America? Be aware
that measurements may fail when things get weird, especially if
monitoring shares the same network infrastructure as production
traffic. Also, as Charity Majors, CEO of Honeycomb.io, says, “If you have
a wall full of green dashboards, that means your monitoring tools aren’t
good enough.” There’s always something weird going on.
Finally, make sure you have a recovery plan. The system may not
automatically return to a healthy state when you turn off the chaos. So
you will need to know what to restart, disconnect, or otherwise clean up
when the test is done.
Designing the Experiment
Let’s say you’ve got great measurements in place. Your A/B testing system
can tag a request as part of a control group or a test group. It’s not
quite time to randomly kill some boxes yet. First you need to design the
experiment, beginning with a hypothesis.
The hypothesis behind Chaos Monkey was, “Clustered services should be
unaffected by instance failures.” Observations quickly invalidated that
hypothesis. Another hypothesis might be, “The application is responsive
even under high latency conditions.”
As you form the hypothesis, think about it in terms of invariants that
you expect the system to uphold even under turbulent conditions. Focus on
externally observable behavior, not internals. There should be some
healthy steady state that the system maintains as a whole.
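It can help to write the hypothesis down as data rather than prose. This sketch uses invented field names; the point is that the steady-state definition and the injection travel together.
    from dataclasses import dataclass

    @dataclass
    class ChaosExperiment:
        hypothesis: str           # the invariant you expect to hold
        steady_state_metric: str  # externally observable measurement
        acceptable_range: tuple   # values that still count as "healthy"
        injection: str            # the fault you plan to introduce

    experiment = ChaosExperiment(
        hypothesis="Clustered services are unaffected by instance failures",
        steady_state_metric="successful_playback_starts_per_second",
        acceptable_range=(0.99, 1.01),  # fraction of the pre-test baseline
        injection="terminate one instance in one autoscaled cluster",
    )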
Once you have a hypothesis, check to see if you can even tell if the
steady state holds now. You might need to go back and tweak
measurements. Look for blind spots like a hidden delay in network switches
or a lost trace between legacy applications.
Now think about what evidence would cause you to reject the
hypothesis. Is a non-zero failure rate on a request type sufficient?
Maybe not. If that request starts outside your organization, you probably
have some failures due to external network conditions (aborted
connections on mobile devices, for example). You might have to dust off
those statistics textbooks to see how large a change constitutes
sufficient evidence.
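For example, a two-proportion z-test is one textbook way to decide whether the change is big enough to reject the hypothesis. The request counts below are made up; the point is that doubling a tiny failure rate may still fall short of statistical significance at that sample size.
    from math import sqrt

    def two_proportion_z(failures_a, total_a, failures_b, total_b):
        """Z statistic for the difference between two failure rates."""
        p_a = failures_a / total_a
        p_b = failures_b / total_b
        pooled = (failures_a + failures_b) / (total_a + total_b)
        se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
        return (p_b - p_a) / se

    # Control: 10 failures in 100,000 requests (0.01 percent).
    # Test:    20 failures in 100,000 requests (0.02 percent).
    z = two_proportion_z(10, 100_000, 20, 100_000)
    print(f"z = {z:.2f}")  # about 1.83, short of the usual 1.96 cutoff for 95 percent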
Injecting Chaos
The next step is to apply your knowledge of the system to inject
chaos. You know the structure of the system well enough to guess where
you can kill an instance, add some latency, or make a service call
fail. These are all “injections.” Chaos Monkey does one kind of
injection: it kills instances.
Killing instances is the most basic and crude kind of injection. It will
absolutely find weaknesses in your system, but it’s not the end of the
story.
Latency Monkey adds latency to calls. This strategy finds two additional kinds of
weaknesses. First, some services just time out and report errors when they
should have a useful fallback. Second, some services have undetected race
conditions that only become apparent when responses arrive in a different
order than usual.
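A latency injection can be as simple as a wrapper around an outbound call. This sketch delays a small fraction of calls; the probability and delay are arbitrary examples.
    import random
    import time

    def with_injected_latency(call, probability=0.01, delay_seconds=2.0):
        """Wrap an outbound service call so a fraction of calls are delayed."""
        def wrapped(*args, **kwargs):
            if random.random() < probability:
                # Delayed responses surface missing timeouts and hidden race conditions.
                time.sleep(delay_seconds)
            return call(*args, **kwargs)
        return wrapped

    # Usage (hypothetical): fetch_profile = with_injected_latency(fetch_profile, 0.05)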
When you have deep trees of service calls, your system may be vulnerable
to loss of a whole service. Netflix uses failure injection testing (FIT)
to inject more subtle failures. (Note that this is not the same “FIT” as
the “framework for integrated testing” in Nonbreaking API Changes.) FIT
can tag a request
at the inbound edge (at an API gateway, for example) with a cookie that
says, “Down the line, this request is going to fail when service G calls
service H.” Then at the call site where G would issue the request to H,
it looks at the cookie, sees that this call is marked as a failure, and
reports it as failed, without even making the request. (Netflix uses a
common framework for all its outbound service calls, so it has a way to
propagate this cookie and treat it uniformly.)
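This isn’t Netflix’s code, but the mechanism looks roughly like the following sketch: the edge attaches a chaos tag to the request context, and the shared call-site wrapper consults that tag before making any outbound call. All the names here are illustrative.
    class FailureInjectedError(Exception):
        """Raised in place of an outbound call that was marked to fail."""

    def call_service(target, request_context, do_call):
        # The edge gateway may have tagged the request with something like:
        #   request_context["chaos"] = {"fail_call": ("service-g", "service-h")}
        chaos = request_context.get("chaos", {})
        caller = request_context.get("current_service")
        if chaos.get("fail_call") == (caller, target):
            # Report the call as failed without ever sending it downstream.
            raise FailureInjectedError(f"injected failure: {caller} -> {target}")
        return do_call()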
Now we have three injections that can be applied in various places. We
can kill an instance of any autoscaled cluster. We can add latency to any
network connection. And we can cause any service-to-service call to
fail. But which instances, connections, and calls are
interesting enough to be worth injecting a fault into? And where should we
inject that fault?
Introducing Chaos to Your Neighbors
by Nora Jones, Senior Software Engineer and Coauthor of Chaos Engineering (O'Reilly, 2017)
I was hired as the first and only person working on internal tools and
developer productivity at a brand new e-commerce startup during a pivotal
time. We had just launched the site, we were releasing code multiple times
a day, and our marketing team was crushing it, so we already
had several customers expecting solid performance and availability from the
site from day one.
The lightning feature development speed led to a lack of tests and general
caution, which at times put us in precarious situations that were
not ideal (read: being paged at 4 a.m. on a Saturday). About two weeks into
my role at this company, my manager asked me if we could start
experimenting with chaos engineering to help detect some of these issues
before they became major outages. Given that I was new to the company and didn’t
know all my colleagues yet, I started this effort by sending an email to
all the developers and business owners, informing them that we were beginning
to implement chaos engineering in QA and that if they considered their
services “unsafe to chaos” they should let me know so they could opt out of
the first round. I didn’t get much response. After a couple of weeks of
waiting and nagging, I assumed the silence implied consent and unleashed my
armies of chaos. We ended up taking QA down for a week, and I pretty much
met everyone who worked at the company. Moral of the story: chaos
engineering is a quick way to meet your new colleagues, but it’s not a
great way. Proceed with caution and control your failures delicately,
especially when it’s the first time you’re enabling chaos.
Targeting Chaos
You could certainly use randomness. This is how Chaos Monkey works. It
picks a cluster at random, picks an instance at random, and kills it. If
you’re just getting started with chaos engineering, then random selection
is as good a process as any. Most software has so many problems that
shooting at random targets will uncover something alarming.
Once the easy stuff is fixed, you’ll start to see that this is a search
problem. You’re looking for faults that lead to failures. Many faults
won’t cause failures. In fact, on any given day, most faults don’t result
in failures. (More about that later in this chapter.) When you inject faults into
service-to-service calls, you’re searching for the crucial calls. As with
any search problem, we have to confront the challenge of dimensionality.
Suppose there’s a partner data load process that runs every Tuesday. A
fault during one part of that process causes bad data in the
database. Later, when using that data to present an API response, a
service throws an exception and returns a 500 response code. How likely
are you to find that problem via random search? Not very likely.
Randomness works well at the beginning because the search space for
faults is densely populated. As you progress, the search space becomes
more sparse, but not uniform. Some services, some network segments, and some
combinations of state and request will still have latent killer bugs. But
imagine trying to exhaustively search an n-dimensional space,
where n is the number of calls from service to service. In the worst
case, if you have x services, there could be x² possible faults to
inject!
At some point, we can’t rely just on randomness. We need a way to devise
more targeted injections. Humans can do that by thinking about how a
successful request works. A top-level request generates a whole tree of
calls that support it. Kick out one of the supports, and the request may
succeed or it may fail. Either way we learn something. This is why it’s
important to study all the times when faults happen without failures. The
system did something to keep that fault from becoming a failure. We
should learn from those happy outcomes, just as we learn from the
negative ones.
As humans, we apply our knowledge of the system together with abductive
reasoning and pattern matching. Computers aren’t great at that, so we
still have an edge when picking targets for chaos. (But see Cunning Malevolent Intelligence for some developing work.)
Automate and Repeat
So far, this sounds like an engineering lab course. Shouldn’t something
called “chaos” be fun and exciting? No! In the best case, it’s totally
boring because the system just keeps running as usual.
Assuming we did find a vulnerability, things probably got at least a
little exciting in the recovery stages. You’ll want to do two things once
you find a weakness. First, you need to fix that specific instance of
weakness. Second, you want to see what other parts of your system are
vulnerable to the same class of problem.
With a known class of vulnerability, it’s time to find a way to automate
testing. Along with automation comes moderation. There’s such a thing as
too much chaos. If the new injection kills instances, it probably
shouldn’t kill the last instance in a cluster. If the injection
simulates a request failure from service G to service H, then it isn’t
meaningful to simultaneously fail requests from G to every fallback it
uses when H isn’t working!
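Those moderation rules are easy to encode as guard functions the automation checks before it acts. A rough sketch, with invented signatures:
    def safe_to_kill(running_instances, minimum=2):
        # Never take out the last healthy instances in a cluster.
        return len(running_instances) > minimum

    def safe_to_fail_call(planned_failures, caller, target, fallbacks):
        # Don't fail a call if every fallback the caller would use is already failing.
        already_failing = {t for c, t in planned_failures if c == caller}
        surviving_fallbacks = fallbacks - already_failing
        return target not in already_failing and bool(surviving_fallbacks)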
Companies with dedicated chaos engineering teams are all building
platforms that let them decide how much chaos to apply, when, to whom,
and which services are off-limits. These make sure that one poor customer
doesn’t get flagged for all the experiments at once! For example, Netflix
calls its platform the “Chaos Automation Platform” (ChAP).
The platform makes decisions about what injections to apply and when, but
it usually leaves the “how” up to some existing tool. Ansible is a
popular choice, since it doesn’t require a special agent on the targeted
nodes. The platform also needs to report its tests to monitoring systems, so
you can correlate the test events with changes in production behavior.
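Reporting can be as lightweight as emitting a marker event for every injection, so dashboards can overlay chaos activity on production metrics. A sketch, where emit stands in for whatever your metrics pipeline accepts:
    import json
    import time

    def report_chaos_event(emit, experiment_id, injection, target):
        """Send a marker event so test activity can be correlated with metrics."""
        emit(json.dumps({
            "type": "chaos.injection",
            "experiment": experiment_id,
            "injection": injection,   # e.g., "kill-instance", "add-latency", "fail-call"
            "target": target,         # cluster, connection, or call site
            "timestamp": time.time(),
        }))

    # Usage: report_chaos_event(print, "exp-42", "add-latency", "service-g->service-h")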