High availability

Would you go live with your service running on a single machine? Of course not! The machine going down, or a disk failing on the server, will bring down the entire service and affect customers. The machine becomes a single point of failure (SPOF):

(Figure: a single point of failure. Source: http://timkellogg.me/blog/2013/06/09/dist-sys-antipatterns)

Single points of failure can be removed by engineering redundancy, which means having multiple instances of the service/resource. Redundancy can be architected in two modes:

  • Active Mode: If, as described in the section on service-level reliability engineering, the service is stateless, redundancy is easily achieved by running multiple instances. If one fails, its load/traffic can be diverted to another healthy instance. We will see how this is done in the Routing and health section.
  • Standby Mode: For stateful resources (such as databases), just having multiple instances is not sufficient. In this mode, when a resource fails, functionality is recovered on a secondary instance using a process called failover. This process typically takes some time, since the backup instance needs to gain state/context, and during this window there will be unavailability. The time can be minimized by keeping the secondary resource pre-launched but dormant, and by sharing state/context between the active and standby instances (a minimal failover sketch follows this list).
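
To make standby mode concrete, here is a minimal Go sketch of a failover monitor: it periodically probes the primary's health endpoint and promotes the standby after a few consecutive failures. The primaryHealthURL constant and the promoteStandby function are hypothetical placeholders for whatever health endpoint and promotion mechanism (for example, promoting a database replica) your stack actually provides.

package main

import (
	"log"
	"net/http"
	"time"
)

// primaryHealthURL is a hypothetical health endpoint on the active instance.
const primaryHealthURL = "http://primary:8080/health"

// promoteStandby stands in for whatever promotion mechanism your stack has
// (for example, promoting a database replica and repointing clients).
func promoteStandby() {
	log.Println("failing over: promoting standby to primary")
}

func main() {
	failures := 0
	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(5 * time.Second) {
		resp, err := client.Get(primaryHealthURL)
		healthy := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		if healthy {
			failures = 0
			continue
		}
		// Require a few consecutive failures to avoid flapping on a
		// single dropped packet.
		if failures++; failures >= 3 {
			promoteStandby()
			return
		}
	}
}

Requiring consecutive failures before promoting is one simple way to trade detection speed against the risk of an unnecessary (and expensive) failover.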

A system is said to be highly available when it can withstand the failure of individual components (servers, disks, network links). Running multiple instances alone is not enough to build fault tolerance; the key to high availability is that failures of individual instances don't bring down the whole system. This mandates reliable routing of requests to healthy instances, so that unhealthy instances don't receive production traffic and compromise the health of the service as a whole.

To detect faults, you first need a measure of health. Health is relevant both at the host (or instance) level and at the overall service level. A service is typically deployed as multiple instances behind a virtual IP (VIP) fronted by a load balancer (LB). The LB should route requests only to those service instances that are healthy, but how does the LB know about instance health? Generally, it periodically pings a designated health-check URL on the service (/health). If the instance responds normally, we know it's healthy; otherwise, it should be booted out of the pool the LB maintains for that VIP. These checks run periodically in the background.
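
As a rough illustration of what the LB does behind the scenes, here is a Go sketch of a background health checker that marks instances in or out of the pool. Real load balancers (NGINX, HAProxy, cloud LBs) implement this for you; the pool type, the instance addresses, and the interval and timeout values here are illustrative assumptions.

package main

import (
	"net/http"
	"sync"
	"time"
)

// pool is a toy stand-in for the set of backends an LB tracks per VIP.
type pool struct {
	mu      sync.Mutex
	healthy map[string]bool // instance base URL -> last known health
}

// checkAll probes each instance's /health endpoint and records the result;
// instances marked false are effectively booted out of rotation.
func (p *pool) checkAll(client *http.Client) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for instance := range p.healthy {
		resp, err := client.Get(instance + "/health")
		ok := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		p.healthy[instance] = ok
	}
}

func main() {
	p := &pool{healthy: map[string]bool{
		"http://10.0.0.1:8080": true, // illustrative instance addresses
		"http://10.0.0.2:8080": true,
	}}
	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(10 * time.Second) { // periodic background checks
		p.checkAll(client)
	}
}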

Many developers do engineer a /health URL but hardcode a 200 OK response. This isn't a great idea. Ideally, the service instance should collect metrics about the various operations and errors in the service, and the health response handler should analyze these metrics to derive a measure of the instance's health.
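
A minimal Go sketch of such a handler follows, assuming the service increments the (hypothetically named) requests and errCount counters from its own handlers or middleware; the 10% error-rate rule is an illustrative threshold, not a prescription.

package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// requests and errCount would be incremented by the service's own
// handlers/middleware; they are hypothetical names for this sketch.
var requests, errCount atomic.Int64

// healthHandler derives health from observed metrics instead of
// hardcoding a 200 OK response.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	total, failed := requests.Load(), errCount.Load()
	// Illustrative rule: unhealthy if over 10% of requests have errored.
	if total > 0 && float64(failed)/float64(total) > 0.10 {
		http.Error(w, "unhealthy: elevated error rate", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	http.HandleFunc("/health", healthHandler)
	http.ListenAndServe(":8080", nil)
}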

Network health is usually monitored, and made resilient, by networking protocols such as IP and TCP, which figure out optimal routes across redundant links and handle faults such as dropped, out-of-order, or duplicate packets.

This section assumes server-side discovery of instances. As we saw in Chapter 5, Going Distributed, client-side service discovery is also possible. It comes with its own high-availability solutions, but these are outside the scope of this book.