As a quick recap from Chapter 5, Going Distributed, we saw that microservices interact with one another over the network using either APIs or Messaging. The basic idea is that, using a specific protocol, microservices exchange data in a standardized format over the network to produce the system's macro-behavior and fulfill the requirements. There are multiple places where things can go wrong here, as shown in the following diagram:
The preceding diagram is described as follows:
- A service may go down either while servicing a request from the client, or when it's idle. The service may go down because the machine went down (hardware/hypervisor errors) or because there was an uncaught exception in the code.
- A database hosting persistent data may go down. The durable storage might get corrupted. The DB can crash in the middle of a transaction!
- A service may spawn an in-memory job, respond with OK to the client, and then go down, removing any reference to the job.
- A service may consume a message from the broker but may crash just before acting on it.
- The network link between two services may go down or be slow.
- A dependent external service may start acting slow or start throwing errors.
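To make the last two failure modes concrete, here is a minimal sketch of one common defense against a slow link or a slow dependent service: bounding every outbound call with a deadline. The function `callDependency`, the helper names, and the latency constants are hypothetical stand-ins for a real network call; only the `context` and `time` packages from the standard library are assumed.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical latencies and timeouts, chosen only for illustration.
const (
	slowLatency  = 500 * time.Millisecond
	fastLatency  = 10 * time.Millisecond
	shortTimeout = 50 * time.Millisecond
	longTimeout  = 200 * time.Millisecond
)

// callDependency simulates a dependent service that takes `latency`
// to respond. It returns early if the caller's context is done.
func callDependency(ctx context.Context, latency time.Duration) (string, error) {
	select {
	case <-time.After(latency):
		return "OK", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// callWithTimeout bounds the call with a deadline so that a slow
// dependency cannot hold the caller indefinitely.
func callWithTimeout(latency, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // release the timer as soon as the call returns
	return callDependency(ctx, latency)
}

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	_, err := callWithTimeout(slowLatency, shortTimeout)
	fmt.Println("slow dependency timed out:", timedOut(err))

	resp, err := callWithTimeout(fastLatency, longTimeout)
	fmt.Println("fast dependency:", resp, err)
}
```

A timeout does not make the dependency healthy; it only lets the caller fail fast and decide what to do next, such as retrying, falling back, or returning an error to its own client.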
Reliability in a system is engineered at multiple levels:
- Individual services are built to specification and work correctly.
- Services are deployed in a high-availability setup so that a backup/alternate instance can take the place of an unhealthy one.
- The architecture allows the composition of individual services to be fault-tolerant and rugged.
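As a rough illustration of the second level, the following sketch picks a healthy instance from a pool, so that a backup takes over when the primary is marked unhealthy. The `Instance` type, its `Healthy` flag, and `pickHealthy` are assumptions made for this example; in a real deployment, health would typically be determined by a load balancer or service registry rather than by a field set by hand.

```go
package main

import (
	"errors"
	"fmt"
)

// Instance is a hypothetical record of one deployed copy of a service.
type Instance struct {
	Name    string
	Healthy bool
}

// ErrNoHealthyInstance is returned when every instance in the pool is down.
var ErrNoHealthyInstance = errors.New("no healthy instance available")

// pickHealthy returns the first healthy instance in the pool, in order,
// so a backup is used only when the instances before it are unhealthy.
func pickHealthy(pool []Instance) (Instance, error) {
	for _, in := range pool {
		if in.Healthy {
			return in, nil
		}
	}
	return Instance{}, ErrNoHealthyInstance
}

func main() {
	pool := []Instance{
		{Name: "primary", Healthy: false}, // simulated failure
		{Name: "backup", Healthy: true},
	}

	in, err := pickHealthy(pool)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("routing request to:", in.Name)
}
```

With the primary marked unhealthy, the request is routed to the backup; if every instance is down, the caller gets an explicit error instead of a hung request.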
We will look at dependency management in the Dependencies and Dependency resilience sections. For the rest, we will cover engineering reliability in the following subsections.