Governor

In Force Multiplier, we looked into an outage that Reddit.com suffered. As a quick reminder, Reddit’s configuration management system restarted the autoscaler, the part of its infrastructure management that scales server instances up and down. This happened in the middle of a ZooKeeper migration, so the autoscaler read a partial configuration and decided to shut down nearly every machine instance at Reddit.

The flip side of that coin is a job scheduler that spins up too many compute instances in order to process a queue before a deadline. The work still can’t get done fast enough, and, to add insult to injury, the cloud provider’s invoice that month is written in scientific notation.

Automation has no judgment. When it goes wrong, it tends to go wrong really quickly. By the time a human perceives the problem, it’s a question of recovery rather than intervention. How can we allow human intervention without putting a human in the loop for everything? We should use automation for things humans are bad at: repetitive tasks and fast response. We should use humans for what automation is bad at: perceiving the whole situation at a higher level.

Believe it or not, we can look to eighteenth-century technology for an answer. Before the era of steam engines, power came from muscles (human or animal). Steam engineers quickly discovered that it is possible to run machines so fast that the metal breaks. Parts fly apart from tension or they seize up under compression. Bad things happen to the machines and to anyone nearby. The solution was the governor. A governor limits the speed of an engine. Even if the source of power could drive it faster, the governor prevents it from running at unsafe RPMs.

We can create governors to slow the rate of actions. Reddit did this with its autoscaler by adding logic that says it can only shut down a certain percentage of instances at a time.
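As a minimal sketch of that kind of limit, the function below caps how many shutdowns may proceed in a single step. The name `cap_shutdowns` and the 10 percent default are assumptions made for illustration; the source does not describe Reddit's actual implementation or threshold.

```python
def cap_shutdowns(requested: int, running: int, max_percent: int = 10) -> int:
    """Return how many of the requested shutdowns may proceed in this step.

    max_percent is a hypothetical limit on the share of running instances
    that one step is allowed to terminate.
    """
    if requested <= 0 or running <= 0:
        return 0
    # Integer arithmetic avoids floating-point rounding surprises.
    allowed = running * max_percent // 100
    return min(requested, allowed)


if __name__ == "__main__":
    # An autoscaler misled by a partial configuration asks to stop 950 of
    # 1000 instances; the cap lets only a small fraction through.
    print(cap_shutdowns(requested=950, running=1000))
```

A request that stays under the cap passes through unchanged, so routine scale-downs are not affected.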

A governor is stateful and time-aware. It knows what actions have been taken over a period of time. It should also be asymmetric. Most actions have a “safe” direction and an “unsafe” one. Shutting down instances is unsafe. Deleting data is unsafe. Blocking client IP addresses is unsafe.
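These three properties can be sketched in a few lines. The `Governor` class below is hypothetical: it remembers unsafe actions in a sliding time window (stateful and time-aware) and lets safe actions through untouched (asymmetric). The window size, the limit, and the injectable `clock` parameter are all assumptions chosen to keep the example small and testable.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class Governor:
    """Limits unsafe actions to max_unsafe per sliding window of time."""

    def __init__(self, max_unsafe: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_unsafe = max_unsafe
        self.window_seconds = window_seconds
        self._clock = clock
        # Each entry is (timestamp, number of unsafe actions granted).
        self._history: Deque[Tuple[float, int]] = deque()

    def _used(self) -> int:
        """Count unsafe actions still inside the window, dropping old ones."""
        cutoff = self._clock() - self.window_seconds
        while self._history and self._history[0][0] <= cutoff:
            self._history.popleft()
        return sum(count for _, count in self._history)

    def request(self, count: int, unsafe: bool) -> int:
        """Return how many of the requested actions may proceed right now."""
        if count <= 0:
            return 0
        if not unsafe:
            return count  # Asymmetric: the safe direction is not throttled.
        granted = min(count, max(0, self.max_unsafe - self._used()))
        if granted:
            self._history.append((self._clock(), granted))
        return granted
```

With `Governor(max_unsafe=10, window_seconds=60)`, a burst of shutdown requests is granted at most ten terminations per minute, while requests to start instances pass straight through. Which direction counts as unsafe is a decision for the caller.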

You will often find a tension between definitions of “safe.” Shutting down instances is unsafe for availability, while spinning up instances is unsafe for cost. These forces don’t cancel each other out. Instead, they define a U-shaped curve where going too far in either direction is bad. That means actions may also be safe within a defined range but unsafe outside the range. Your AWS budget may allow for a thousand EC2 instances, but if the autoscaler starts heading toward two thousand, then it needs to slow down. You can think about this U-shaped curve as defining the response curve for the governor. Inside the safe zone, the actions are fast. Outside the range, the governor applies increasing resistance.
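One way to picture the response curve is as a delay that is zero inside the safe range and grows the further a target moves outside it. The sketch below assumes a quadratic curve with a ceiling; the function name, the shape of the curve, and every number in it are illustrative assumptions, not something the text prescribes.

```python
def governor_delay(target: int, safe_min: int, safe_max: int,
                   base_delay: float = 1.0, max_delay: float = 300.0) -> float:
    """Return seconds to wait before moving the fleet to `target` instances.

    Inside [safe_min, safe_max] there is no delay. Outside, the delay grows
    with the square of the distance from the nearest edge of the range,
    capped at max_delay so the governor slows actions without blocking them
    forever.
    """
    if safe_min <= target <= safe_max:
        return 0.0
    if target > safe_max:
        distance = target - safe_max  # Unsafe for cost.
    else:
        distance = safe_min - target  # Unsafe for availability.
    return min(max_delay, base_delay * distance ** 2)


if __name__ == "__main__":
    # A budget that allows for a thousand instances, with a hypothetical
    # floor of one hundred below which availability is at risk.
    for target in (500, 1005, 1010, 2000, 90):
        print(target, governor_delay(target, safe_min=100, safe_max=1000))
```

A quadratic curve is only one option. A linear or exponential curve would serve equally well, as long as the resistance keeps increasing with distance from the safe zone.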

The whole point of a governor is to slow things down enough for humans to get involved. Naturally that means connecting to monitoring both to alert humans that there’s a situation and to give them enough visibility to understand what’s happening.
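A small sketch of that connection, using Python's standard `logging` module as a stand-in for a real monitoring or paging system: whenever the governor grants less than was requested, it emits a warning with enough context for a human to see what the automation was trying to do. The function name, logger name, and message format are assumptions for illustration.

```python
import logging

logger = logging.getLogger("governor")


def report_throttle(action: str, requested: int, granted: int) -> bool:
    """Log a warning when the governor held back part of a request.

    Returns True if the request was throttled, False otherwise.
    """
    throttled = granted < requested
    if throttled:
        logger.warning(
            "governor throttled %s: requested=%d granted=%d held_back=%d",
            action, requested, granted, requested - granted,
        )
    return throttled
```

In practice the warning would feed whatever alerting pipeline the team already uses. The point is that a throttled action is a signal worth a human's attention, not something to swallow silently.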

Remember This

Slow things down to allow intervention.

When things are about to go off the rails, we often find automation tools pushing the throttle to its limit. Humans are better at situational thinking, so we need to create opportunities for us to intervene.

Apply resistance in the unsafe direction.

Some actions are inherently unsafe. Shutting down, deleting, and blocking are all likely to interrupt service. Automation will make them go fast, so you should apply a Governor to give humans time to intervene.

Consider a response curve.

Actions may be safe within a defined range. Outside that range, they should encounter increasing “resistance”: the governor slows the rate at which they can occur.
