The asynchronous computation pattern

Consider a typical API flow:

  1. The client calls the service: POST /dosomework.
  2. Service spawns a goroutine to handle the API request.
  3. The API processing is involved and takes some time. The handler also needs to call an external dependency (DependencyI) to get the work done.
  4. The client waits for the service to finish the work.

What can go wrong here? Well, multiple things! Consider the following:

  • The network between the client and the service might drop. The client's socket will be closed, and the client will most likely retry. This is especially common when the communication happens over the internet/WAN.
  • If the client retries, the service might already have made progress in handling /dosomework. Database entries might have been created, hotels booked, and so on. The service needs to ensure that such handling is idempotent!
  • DependencyI might be down or, worse, take a long time to respond. In this case, if the client retries, DependencyI will also need to be idempotent.
  • Since /dosomework takes some time and the client is waiting for the response, the web service must dedicate resources (such as the open connection and the handling goroutine) to that request for as long as the operation is in progress.
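One common way to make retries safe is to have the client send a unique key with each logical request and to have the service remember the result per key. The following is a minimal sketch of that idea, not a prescribed implementation: the names IdempotentStore, Do, and Result are hypothetical, and an in-memory map stands in for the durable storage a real service would need.

```go
package main

import (
	"fmt"
	"sync"
)

// Result is a hypothetical outcome of handling /dosomework.
type Result struct {
	BookingID string
}

// IdempotentStore remembers the result of each request by its idempotency key.
// Assumption: an in-memory map is used for illustration only; a real service
// would keep this in durable storage so that it survives restarts.
type IdempotentStore struct {
	mu      sync.Mutex
	results map[string]Result
}

func NewIdempotentStore() *IdempotentStore {
	return &IdempotentStore{results: make(map[string]Result)}
}

// Do runs work at most once per key. A retry with the same key gets the
// stored result back instead of triggering the side effects again.
// Holding the lock while work runs serializes all requests, which keeps the
// sketch short but would be too coarse for production use.
func (s *IdempotentStore) Do(key string, work func() Result) Result {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r, ok := s.results[key]; ok {
		return r
	}
	r := work()
	s.results[key] = r
	return r
}

func main() {
	store := NewIdempotentStore()
	calls := 0
	book := func() Result {
		calls++ // counts how many times the side effect really happened
		return Result{BookingID: fmt.Sprintf("booking-%d", calls)}
	}

	first := store.Do("req-123", book)
	retry := store.Do("req-123", book) // client retry with the same key

	fmt.Println(first.BookingID, retry.BookingID, calls) // booking-1 booking-1 1
}
```

The same approach applies to the call to DependencyI: if the service forwards the key, the dependency can deduplicate in the same way.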

Machines and networks go down regularly. It is important that the software architecture is resilient to such failures and provides efficiency and consistency guarantees. One way to solve this issue is to have an async architecture. The service can simply log (in durable storage) that a given client requires /dosomework and respond with a job ID. A pool of background workers can then pick up this job and fulfill it. The client can track the progress of the job through a separate URL. This architecture pattern is depicted here:

Messaging systems such as Kafka (covered in detail in Chapter 6, Messaging) lend themselves well to performing this log-a-job pattern.
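The log-a-job pattern can be sketched in a few lines of Go. This is an illustration under stated assumptions rather than a production design: JobService, Submit, Status, and Wait are hypothetical names, a buffered channel stands in for the durable log or message queue, and a map stands in for the job-status store behind the separate status URL.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is the externally visible state of a job.
type Status string

const (
	Pending Status = "pending"
	Done    Status = "done"
)

// JobService accepts jobs, hands them to background workers, and reports status.
// Assumption: the channel and map are in-memory stand-ins for a durable queue
// and a status database.
type JobService struct {
	mu     sync.Mutex
	nextID int
	status map[string]Status
	queue  chan string
	wg     sync.WaitGroup
}

// NewJobService starts the given number of background workers.
func NewJobService(workers int, work func(jobID string)) *JobService {
	s := &JobService{status: make(map[string]Status), queue: make(chan string, 16)}
	for i := 0; i < workers; i++ {
		go func() {
			for id := range s.queue {
				work(id) // the long-running part happens off the request path
				s.mu.Lock()
				s.status[id] = Done
				s.mu.Unlock()
				s.wg.Done()
			}
		}()
	}
	return s
}

// Submit records the job and returns its ID without waiting for the work.
// It blocks only if the queue buffer is full.
func (s *JobService) Submit() string {
	s.mu.Lock()
	s.nextID++
	id := fmt.Sprintf("job-%d", s.nextID)
	s.status[id] = Pending
	s.mu.Unlock()

	s.wg.Add(1)
	s.queue <- id
	return id
}

// Status is what a handler for the separate status URL would call.
func (s *JobService) Status(id string) (Status, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.status[id]
	return st, ok
}

// Wait blocks until all submitted jobs have finished (used by the demo only).
func (s *JobService) Wait() { s.wg.Wait() }

func main() {
	svc := NewJobService(2, func(id string) { /* long-running work goes here */ })

	id := svc.Submit() // the client gets the job ID back immediately
	svc.Wait()

	st, _ := svc.Status(id)
	fmt.Println(id, st) // job-1 done
}
```

Because the client only holds a job ID, a dropped connection costs nothing: the client reconnects and asks for the status instead of resubmitting the work.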

A well-documented example of this architecture is the grep-the-web sample architecture for AWS, as described by Jeff Barr in his whitepaper (https://aws.amazon.com/blogs/aws/white-paper-on/):

(Source: https://aws.amazon.com/blogs/aws/white-paper-on/)

The problem is to build a solution that runs a regular expression against millions of documents from the web and returns the results that match the query. The operation is meant to be long-running and to involve multiple stages. Also, the deployment is assumed to be an elastic one in the cloud, where machines (virtual machines (VMs)) can go down without being noticed. As shown in the preceding diagram, the solution architecture consists of the following:

  • Launch Controller: This service takes a grep job and spawns/monitors the rest of the services in the pipeline. The actual grep is done by MapReduce jobs using Hadoop.
  • Monitoring Controller: It monitors the MapReduce job, updates the status in the Status DB, and writes the final output.
  • Status DB: All services update the current stage, status, and metrics of the pipeline for each job in this DB.
  • Billing Controller: Once a job is scheduled, it's also provisioned for billing through the Billing Queue and the Billing Controller. This service has all the knowledge of how to bill the customer for each job.
  • Shutdown Controller: Once a job is finished, the Monitoring Controller enqueues a message in the Shutdown Queue, and this triggers the Shutdown Controller to clean up after the job is done.

Here are some salient features of the architecture:

  • The architecture follows the async design pattern.
  • The system is tolerant of machine failures. A job resumes from the stage at which it failed.
  • There is no coupling between the services (controllers). If needed, behavior can be extended by plugging in new queues and controllers, with a high level of confidence that the current processing won't break.
  • Each stage (controller) of the job can be independently scaled.