Cascading failures

Most software systems start simple. They are built as a single monolithic application, with modules packaging the various code components, and all packages are linked together into one big binary. A typical early-version system of this kind is depicted in the following diagram:

It takes requests and performs something of value using three components (modules/building blocks). These interactions are shown in the following diagram; the numbers describe the sequence of things that happen to fulfill a request.

However, as the system evolves and features are added, there comes a time when it needs to make calls to an external service (a dependency). This external service can fail for many reasons that are outside our control, and when it does, our application's requests fail too:

But consider what happens if the external service is merely slow to respond. The client and all the resources held by the original service wait for the request to complete, and this affects new requests, including ones that may not even need the slow service. The effect is most evident in thread-per-request runtimes, such as Java applications running on Tomcat, where each request effectively has a thread allocated. If the client times out and retries the slow request, we can very quickly degenerate to a situation such as this:
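The thread starvation described here can be reproduced with a small simulation. The following is a minimal sketch, not a model of any particular server. It assumes a fixed pool of four worker threads, and the handler names and timing constants are hypothetical, chosen only for illustration. It measures how long a request that never touches the external service takes, first on an idle pool and then when every worker is blocked on a slow dependency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4            # assumed number of request-handling threads
SLOW_CALL_SECONDS = 0.3  # assumed latency of the slow external service


def handle_slow_request():
    """A request that blocks its thread while waiting on the external service."""
    time.sleep(SLOW_CALL_SECONDS)  # stands in for the slow remote call
    return "slow"


def handle_fast_request():
    """A request that does not need the external service at all."""
    return "fast"


def timed_fast_request(pool):
    """Submit one fast request and return its end-to-end latency in seconds."""
    start = time.monotonic()
    pool.submit(handle_fast_request).result()
    return time.monotonic() - start


def measure_fast_latency():
    """Return (latency on an idle pool, latency on a saturated pool)."""
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        idle = timed_fast_request(pool)

        # Occupy every worker thread with a request stuck on the slow service.
        slow = [pool.submit(handle_slow_request) for _ in range(POOL_SIZE)]

        # The fast request now queues behind the slow ones, even though
        # it has nothing to do with the external service.
        saturated = timed_fast_request(pool)

        for future in slow:
            future.result()
    return idle, saturated


if __name__ == "__main__":
    idle, saturated = measure_fast_latency()
    print(f"fast request, idle pool:      {idle:.3f}s")
    print(f"fast request, saturated pool: {saturated:.3f}s")
```

In this sketch the fast request on the saturated pool takes roughly as long as one slow external call, because it cannot start until a worker thread frees up. Client retries would add further slow requests to the queue and make the wait longer.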

With increasing complexity and feature requests, the team decides to decompose the monolith into microservices. But this amplifies the problem! See the following diagram:

Today's systems are interconnected like never before, and with microservices, new services crop up at regular intervals. This means that the overall system is always evolving; it is in a state of continuous change. In addition, in today's fast-paced development cycles, features are added every day and deployments happen multiple times a day. This velocity, however, brings a greater risk of things going wrong: a fault in a specific service can cascade up to the services that depend on it and bring multiple other parts of the system down.
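To make the cascade concrete, the following sketch walks a service dependency graph and lists every service that can be affected when one service fails. The service names and the graph are hypothetical. The sketch also assumes the worst case, in which no caller has any safeguard (such as a timeout or a fallback), so a failure always propagates to every direct and indirect caller.

```python
from collections import deque

# Hypothetical service graph: each service maps to the services it calls.
DEPENDS_ON = {
    "web": ["orders", "search"],
    "orders": ["payments", "inventory"],
    "search": ["catalog"],
    "payments": [],
    "inventory": [],
    "catalog": [],
}


def impacted_by(failed, depends_on):
    """Return the set of services that directly or indirectly call `failed`."""
    # Invert the edges so we can walk from a service to its callers.
    callers = {service: [] for service in depends_on}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            callers.setdefault(dependency, []).append(service)

    # Breadth-first walk up the graph, starting from the failed service.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


if __name__ == "__main__":
    for service in ("payments", "catalog", "web"):
        print(service, "->", sorted(impacted_by(service, DEPENDS_ON)))
```

In this hypothetical graph, a fault in `payments` reaches `orders` and then `web`, even though `web` never calls `payments` directly. The deeper a service sits in the graph, the more of the system its failure can take down.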

To guard against this catastrophe, and to build anti-fragile systems, the architect needs to apply specific design patterns when engineering such systems. The following section goes into detail on achieving resilience in such distributed systems.
