Cascading Failures

System failures start with a crack. That crack comes from some fundamental problem. Maybe there’s a latent bug that some environmental factor triggers. Or there could be a memory leak, or some component just gets overloaded. Mechanisms to slow or stop the crack are the topic of the next chapter. Absent those mechanisms, the crack can progress and even be amplified by some structural problems. A cascading failure occurs when a crack in one layer triggers a crack in a calling layer.

An obvious example is a database failure. If an entire database cluster goes dark, then any application that calls the database is going to experience problems of some kind. What happens next depends on how the caller is written. If the caller handles it badly, then the caller will also start to fail, resulting in a cascading failure. (Just like we draw trees upside-down with their roots pointing to the sky, our problems cascade upward through the layers.)

Pretty much every enterprise or web system looks like a set of services grouped into distinct farms or clusters, arranged in layers. Outbound calls from one service funnel through a load balancer to reach the provider. Time was, we talked about “three-tier” systems: web server, app server, and database server. Sometimes search servers were off to the side. Now we’ve got dozens or hundreds of interlinked services, each with its own database. Each service is its own little stack of layers, connected in turn into further layers of dependencies. Every dependency is a chance for a failure to cascade.

Crucial services with a high fan-in—meaning ones with many callers—spread their problems widely, so they are worth extra scrutiny.

Cascading failures require some mechanism to transmit the failure from one layer to another. The failure “jumps the gap” when bad behavior in the calling layer gets triggered by the failure condition in the provider.

Cascading failures often result from resource pools that get drained because of a failure in a lower layer. Integration points without timeouts are a surefire way to create them.
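As a minimal sketch of the alternative (the class name, URL, and durations here are invented for illustration), here is what an outbound call with explicit timeouts can look like using Java’s standard HttpClient. Both the connection attempt and the whole exchange are bounded, so a hung provider can’t hold the calling thread forever:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class InventoryClient {
        // Bound connection establishment so a dead host can't hang us.
        private final HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        public String checkStock(String sku) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://inventory.example.com/stock/" + sku))
                    .timeout(Duration.ofSeconds(5)) // bounds the whole exchange, not just the connect
                    .build();
            // Throws HttpTimeoutException instead of blocking indefinitely,
            // so the calling thread goes back to its pool either way.
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        }
    }

Without those two timeouts, every request to a wedged provider parks a thread, and the caller’s pool drains one hung call at a time.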

The layer-jumping mechanism often takes the form of blocked threads, but I’ve also seen the reverse—an overly aggressive thread. In one case, the lower layer had once suffered from a race condition that would kick out an error every so often for no good reason. The upstream developer, treating it as an irreproducible, transient glitch, decided to retry the call whenever it appeared. Unfortunately, the lower layer didn’t provide enough detail to distinguish that transient error from a more serious one. As a result, once the lower layer started to have some real problems (losing packets from the database because of a failed switch), the caller started to pound it more and more. The more the lower layer whined and cried, the more the upper layer yelled, “I’ll give you something to cry about!” and hammered it even harder. Ultimately, the calling layer was using 100 percent of its CPU making calls to the lower layer and logging their failures. A Circuit Breaker would really have helped here.
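Condensed to its essentials, the caller in that story looked something like this sketch (the names are invented, but the shape is the point): every error is assumed to be the familiar transient one, so a genuinely sick provider gets retried in a tight loop.

    import java.rmi.RemoteException;

    public class NaiveCaller {
        interface LowerLayer { String getOrder(String id) throws RemoteException; }

        private final LowerLayer lowerLayer;
        NaiveCaller(LowerLayer lowerLayer) { this.lowerLayer = lowerLayer; }

        // ANTIPATTERN: assumes every error is the familiar transient glitch.
        // When the provider is genuinely sick, this loop does nothing but
        // make calls and log their failures, at 100 percent CPU.
        String fetchOrder(String id) {
            while (true) {
                try {
                    return lowerLayer.getOrder(id);
                } catch (RemoteException e) {
                    System.err.println("transient error (?), retrying: " + e);
                }
            }
        }
    }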

Speculative retries also allow failures to jump the gap. A slowdown in the provider will cause the caller to fire more speculative retry requests, tying up even more threads in the caller at a time when the provider is already responding slowly.
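To make that mechanism concrete, here is a hedged-request sketch (the pool size and hedge delay are invented): when the provider slows past the hedge delay, nearly every call fires a duplicate, consuming two pooled threads apiece exactly when capacity matters most.

    import java.util.concurrent.Callable;
    import java.util.concurrent.CompletionService;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class HedgedClient {
        private final ExecutorService pool = Executors.newFixedThreadPool(50);

        public String call(Callable<String> request, long hedgeAfterMillis) throws Exception {
            CompletionService<String> race = new ExecutorCompletionService<>(pool);
            race.submit(request);
            Future<String> done = race.poll(hedgeAfterMillis, TimeUnit.MILLISECONDS);
            if (done != null) {
                return done.get();    // fast answer, no speculation needed
            }
            race.submit(request);     // speculative duplicate: a second thread now waits too
            return race.take().get(); // first of the two to finish wins;
                                      // the loser keeps holding its thread
        }
    }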

Just as integration points are the number-one source of cracks, cascading failures are the number-one crack accelerator. Preventing cascading failures is the very key to resilience. The most effective patterns to combat cascading failures are Circuit Breaker and Timeouts.
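A deliberately minimal circuit breaker sketch follows. The thresholds and naming are mine, and a production breaker would avoid holding a lock during the remote call; the essential behavior is that after enough consecutive failures the circuit opens and calls fail immediately, and after a cooling-off period a single trial call is let through.

    import java.util.concurrent.Callable;

    public class CircuitBreaker {
        private final int threshold;         // consecutive failures before opening
        private final long retryAfterMillis; // cooling-off period while open
        private int failures = 0;
        private long openedAt = 0;

        public CircuitBreaker(int threshold, long retryAfterMillis) {
            this.threshold = threshold;
            this.retryAfterMillis = retryAfterMillis;
        }

        public synchronized <T> T call(Callable<T> protectedCall) throws Exception {
            if (failures >= threshold
                    && System.currentTimeMillis() - openedAt < retryAfterMillis) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            try {
                T result = protectedCall.call(); // closed, or half-open trial
                failures = 0;                    // success closes the circuit
                return result;
            } catch (Exception e) {
                if (++failures >= threshold) {
                    openedAt = System.currentTimeMillis(); // open (or re-open)
                }
                throw e;
            }
        }
    }

Failing fast here pays off twice: the troubled provider gets breathing room, and the caller’s threads come back immediately instead of piling up behind a sick dependency.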

Remember This

Stop cracks from jumping the gap.

A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. Your system surely calls out to other enterprise systems; make sure you can stay up when they go down.

Scrutinize resource pools.

A cascading failure often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return. The threads that get the connections block forever; all other threads get blocked waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource; a minimal sketch of such a bounded checkout follows the last item below.

Defend with Timeouts and Circuit Breaker.

A cascading failure happens after something else has already gone wrong. Circuit Breaker protects your system by avoiding calls out to the troubled integration point. Using Timeouts ensures that you can come back from a call out to the troubled point.
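Returning to the resource-pool point above, here is a minimal sketch of a bounded checkout. The class is invented for illustration; most production connection pools expose an equivalent checkout-timeout setting. The only essential feature is that waiting for a resource is bounded:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class BoundedPool<T> {
        private final BlockingQueue<T> idle = new LinkedBlockingQueue<>();

        public BoundedPool(Iterable<T> resources) {
            resources.forEach(idle::add);
        }

        public T checkOut(long timeout, TimeUnit unit) throws InterruptedException {
            T resource = idle.poll(timeout, unit); // bounded wait, never forever
            if (resource == null) {
                // A fast, explicit failure beats a thread blocked for eternity.
                throw new IllegalStateException("pool exhausted: checkout timed out");
            }
            return resource;
        }

        public void checkIn(T resource) {
            idle.add(resource);
        }
    }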
