Many monolithic applications rely on transactions to guarantee consistency and isolation when changing application state. Obtaining these properties is straightforward: an application typically interacts with a single database, with strong consistency guarantees, using frameworks that provide support for starting, committing, or rolling back transactional operations. Each logical transaction might involve several distinct entities; for example, placing an order will update transactions, reserve stock positions, and charge fees.
You’re not so lucky in a microservice application. As you learned earlier, each independent service is responsible for a specific capability. Data ownership is decentralized, ensuring a single owner for each “source of truth.” This level of decoupling helps you gain autonomy, but you sacrifice some of the safety you were previously afforded, making consistency an application-level problem. Decentralized data ownership also makes retrieving data more complex. Queries that previously used database-level joins now require calls to multiple services. This is acceptable for some use cases but painful for large data sets.
Availability also impacts your application design. Interactions between services might fail, causing business processes to halt, leaving your system in an inconsistent state.
In this chapter, you’ll learn how to use sagas to coordinate complex transactions across multiple services and explore best practices for efficiently querying data. Along the way, we’ll examine different types of event-based architectures, such as event sourcing, and their applicability to microservice applications.
Imagine you’re a customer at SimpleBank and you want to sell some stock. If you recall from chapter 2, this involves several operations (figure 5.1):
From your perspective as a customer, this operation appears to be atomic: charging a fee, reserving stock, and creating an order happen at the same time, and you can’t sell stock that you don’t have or sell a stock you do have more than once.
In many monolithic applications,1 those requirements are easy to meet: you can wrap your database operations in an ACID transaction and rest easy in the knowledge that errors will cause an invalid state to be rolled back.
By contrast, in your microservice application, each of the actions in figure 5.1 is performed by a distinct service responsible for a subset of application state. Decentralized data ownership helps ensure services are independent and loosely coupled, but it forces you to build application-level mechanisms to maintain overall data consistency.
Let’s say an orders service is responsible for coordinating the process of selling a stock. It calls the account transactions service to reserve stock and then the fees service to charge the customer — but the fee charge fails. (See figure 5.2.)
At this stage, your system is in an inconsistent state: stock is reserved, an order is created, but you haven’t charged the customer. You can’t leave it like this — so the implementation of orders needs to initiate corrective action, instructing the account transactions service to compensate and remove the stock reservation. This might look simple, but it becomes increasingly complex when many services are involved, transactions are long-running, or an action triggers further interleaved downstream transactions.
Faced with this problem, your first impulse might be to design a system that achieves transactional guarantees across multiple services. A common approach is to use the two-phase commit (2PC) protocol.2 In this approach, you use a transaction manager to split operations across multiple resources into two phases: prepare and commit (figure 5.3).
This sounds great — like what you’re used to. Unfortunately, this approach is flawed. First, 2PC implies synchronous communication between the transaction manager and resources. If a resource is unavailable, the transaction can’t be committed and must roll back. This in turn increases the volume of retries and decreases the availability of the overall system. To support asynchronous service interactions, you’d need 2PC support in both your services and the messaging layer between them, limiting your technical choices.
Handing off significant orchestration responsibility to a transaction manager also violates one of the core principles of microservices: service autonomy. At worst, you’d end up with dumb services representing CRUD operations against data, with transaction managers wholly encapsulating the interesting behavior of your system.
Finally, a distributed transaction places a lock on the resources under transaction to ensure isolation. This makes it inappropriate for long-running operations, as it increases the risk of contention and deadlock. What should you do instead?
Earlier in this book, we discussed using events emitted by services as a communication mechanism. Asynchronous events aid in decoupling services from each other and increase overall system availability, but they also encourage service authors to think in terms of eventual consistency. In an eventually consistent system, you design complex outcomes to result from several independent local transactions over time, which leads you to explicitly design underlying resources to represent tentative states. From the perspective of Eric Brewer’s CAP theorem,3 this design approach prioritizes the availability of underlying data.
To illustrate the difference between a synchronous and an asynchronous approach, let’s return to the sell order example. In a synchronous approach (figure 5.4), the orders service orchestrates the behavior of other services, invoking a sequence of steps until the order is placed to the market. If any steps fail, the orders service is responsible for initiating rollback action with other services, such as reversing the charge.
In this approach, the orders service takes on substantial responsibility:
Although this type of interaction is easy to reason through — as the call graph is logical and sequential — this level of responsibility tightly couples the orders service to other services, limiting its independence and increasing the difficulty of making future changes.
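To make this concrete, here’s a minimal sketch of the synchronous, orchestrated approach. The service clients, method names, and error handling are illustrative — they aren’t SimpleBank’s real API:

```python
# Synchronous orchestration sketch: the orders service calls its
# collaborators in sequence and initiates rollback on failure.
# All names here are illustrative, not SimpleBank's actual API.

class OrderError(Exception):
    pass

class OrdersService:
    def __init__(self, transactions, fees, market):
        self.transactions = transactions
        self.fees = fees
        self.market = market

    def sell_stock(self, account_id, stock, quantity):
        reservation = self.transactions.reserve(account_id, stock, quantity)
        try:
            charge = self.fees.charge(account_id, quantity)
        except Exception:
            # Rollback: reverse the stock reservation if charging fails
            self.transactions.cancel_reservation(reservation)
            raise OrderError("fee charge failed; reservation reversed")
        try:
            return self.market.place_order(account_id, stock, quantity)
        except Exception:
            # Rollback: reverse both prior steps
            self.fees.refund(charge)
            self.transactions.cancel_reservation(reservation)
            raise OrderError("market placement failed; order rolled back")

# In-memory stubs demonstrating the rollback path when charging fails:
calls = []

class StubTransactions:
    def reserve(self, account_id, stock, quantity):
        calls.append("reserve")
        return "reservation-1"
    def cancel_reservation(self, reservation):
        calls.append("cancel_reservation")

class StubFees:
    def charge(self, account_id, quantity):
        raise RuntimeError("insufficient funds")
    def refund(self, charge):
        calls.append("refund")

class StubMarket:
    def place_order(self, account_id, stock, quantity):
        calls.append("place_order")

orders = OrdersService(StubTransactions(), StubFees(), StubMarket())
try:
    orders.sell_stock("acc-1", "AAPL", 10)
except OrderError:
    calls.append("order_error")
```

Note how every failure path has to be handled explicitly by the orders service — which is exactly the responsibility that makes this design tightly coupled.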
You can redesign this scenario to use events (figure 5.5). Each service subscribes to events that interest it to know when it must perform some work:
For example, services might consume the OrderRequested and OrderCreated events to know when to reserve stock or charge fees.

Events allow you to take an optimistic approach to availability. For example, if the fees service were down, the orders service would still be able to create orders. When the fees service came back online, it could continue processing a backlog of events. You can extend this to rollback: if the fees service fails to charge because of insufficient funds, it could emit a ChargeFailed event, which other services would then consume to cancel order placement.
This interaction is choreographed: each service reacts to events, acting independently without knowledge of the overall outcome of the process. These services are like dancers: they know the steps and what to do in each section of a musical piece, and they react accordingly without you needing to explicitly invoke or command them. In turn, this design decouples services from each other, increasing their independence and making it easier to deploy changes independently.
The choreographed approach is a basic example of the saga pattern. A saga is a coordinated series of local transactions; a previous step triggers each step in the saga.
The concept itself significantly predates the microservice approach. Hector Garcia-Molina and Kenneth Salem originally described sagas in a 1987 paper4 as an approach toward long-lived transactions in database systems. As with distributed transactions, locking in long-lived transactions reduces availability — a saga solves this as a sequence of interleaved, individual transactions.
As each local transaction is atomic — but not the saga as a whole — a developer must write their code to ensure that the system ultimately reaches a consistent state, even if individual transactions fail. Pat Helland’s famous paper, “Life Beyond Distributed Transactions,”5 suggests that you can think of this as uncertainty — an interaction across multiple services may not have a guaranteed outcome. In a distributed transaction, you manage uncertainty using locks on data; without transactions, you manage uncertainty through semantically appropriate workflows that confirm, cancel, or compensate for actions as they occur.
Before we talk about sell orders and services, let’s look at a simple real-world saga: purchasing a cup of coffee.6 Typically, this might involve four steps: ordering, payment, preparation, and delivery (figure 5.6). In the normal outcome, the customer pays for and receives the coffee they ordered.
This can go wrong! The coffee shop machine might break; the barista might make a cappuccino, but I wanted a flat white; they might give my coffee to the wrong customer; and so on. If one of these events occurs, the barista will naturally compensate: they might make my coffee again or refund my payment (figure 5.7). In most cases, I’ll eventually get my coffee.
You use compensating actions in sagas to undo previous operations and return your system to a more consistent state. The system isn’t guaranteed to be returned to the original state; the appropriate actions depend on business semantics. This design approach makes writing business logic more complex — because you need to consider a wide range of potential scenarios — but is a great tool for building reliable interactions between distributed services.
Let’s return to the earlier example — sell orders — to better understand how you can apply the saga pattern to your microservices. The actions in this saga are choreographed: each action, Tn, is performed in response to another, but without an overall conductor or orchestrator. You can break this task into five subtasks:
Figure 5.8 illustrates the optimistic — most likely — path of this interaction.
Let’s explain the five steps of this process:
Each of these tasks might fail — in which case, your application should roll back to a sane, consistent state. Each of your tasks has a compensating action:
What triggers these actions? You guessed it — events! For example, imagine that placing the order to market fails. The market service will cancel the order by emitting an event — OrderFailed — that each other service involved in this saga consumes. When receiving the event, each service will act appropriately: the orders service will cancel the customer’s order; the transaction service will cancel the stock reservation; and the fees service will reverse the fee charged, executing actions C1, C2, and C3, respectively. This is shown in figure 5.9.
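A minimal sketch of this choreographed compensation might look like the following. The in-memory event bus and handler wiring are illustrative; a real system would use a message broker:

```python
# Choreographed compensation: each service independently consumes the
# OrderFailed event and runs its own compensating action. The in-memory
# bus below stands in for a real message broker.

class EventBus:
    def __init__(self):
        self.subscribers = {}

    def subscribe(self, event_type, handler):
        self.subscribers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, payload):
        for handler in self.subscribers.get(event_type, []):
            handler(payload)

log = []
bus = EventBus()

# C1: the orders service cancels the customer's order
bus.subscribe("OrderFailed",
              lambda e: log.append("orders: cancel order " + e["order_id"]))
# C2: the transaction service removes the stock reservation
bus.subscribe("OrderFailed",
              lambda e: log.append("transactions: release reservation " + e["order_id"]))
# C3: the fees service reverses the charge
bus.subscribe("OrderFailed",
              lambda e: log.append("fees: refund charge " + e["order_id"]))

# The market service emits OrderFailed; every subscriber compensates.
bus.publish("OrderFailed", {"order_id": "o-42"})
```

Each handler acts without knowing about the others — no single component coordinates the rollback.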
This form of rollback is intended to make the system semantically, not mathematically, consistent. On rollback, your system may not be able to return to exactly the same initial state. Imagine that one of the tasks performed when calculating fees was sending an email: you can’t unsend an email, so you’d instead send another one acknowledging the error and stating that the amount the fees service had charged was deposited back to the account.
Every action involved in a process might have one or more appropriate compensating actions. This approach adds to system complexity — both in anticipating scenarios and in coding for them and testing them — especially because the more services involved in an interaction, the greater the possible intricacy of rolling back.
Anticipating failure scenarios is a crucial part of building services that reflect real-world circumstance, rather than operating in isolation. When designing microservices, you need to take compensation into account to ensure that the wider application is resilient.
The choreographed style of interaction is helpful because participating services don’t need to explicitly know about each other, which ensures they’re loosely coupled. In turn, this increases the autonomy of each service. Unfortunately, it’s not perfect.
No single piece of your code knows how to execute a sell order. This can make validation challenging, spreading those rules across multiple distinct services. It also increases the complexity of state management: each service needs to reflect distinct states in the processing of an order. For example, the orders service must track whether an order has been created, placed, canceled, rejected, and so on. This additional complexity increases the difficulty of reasoning about your system.
Choreography also introduces cyclic dependencies between services: the orders service emits events that the market service consumes, but, in turn, it also consumes events that the market service emits. These types of dependencies can lead to release-time coupling between services.
Generally, when opting for an asynchronous communication style, you must invest in monitoring and tracing to be able to follow the execution flow of your system. If an error occurs, or you need to debug the distributed system, these capabilities act as a flight recorder: everything that happens is stored so you can later investigate individual events and make sense of what happened across many systems. This capability is crucial for choreographed interactions.
A choreographed approach makes it difficult to know how far along a process is. Likewise, the order of rollback might be important; this isn’t guaranteed by choreography, which has looser time guarantees than an orchestrated or synchronous approach. For simple, near-instant workflows, knowing where you’re at is often irrelevant, but many business processes aren’t instant — they might take multiple days and involve disparate systems, people, and organizations.
Instead of choreography, you can use orchestration to implement sagas. In an orchestrated saga, a service takes on the role of orchestrator (or coordinator): a process that executes and tracks the outcome of a saga across multiple services. An orchestrator might be an independent service — recall the verb-oriented services from chapter 4 — or a capability of an existing service.
The sole responsibility of the orchestrator is to manage the execution of the saga. It may interact with participants in the saga via asynchronous events or request/response messages. Most importantly, it should track the state of execution for each stage in the process; this is sometimes called the saga log.
Let’s make the orders service a saga coordinator. Figure 5.10 illustrates the happy path where a customer places an order successfully.
You’ll quickly see the key difference between this and the choreographed example from figure 5.8: the orders service tracks the execution of each substep in the process of placing an order. It’s useful to think of the coordinator as a state machine: a series of states and transitions between those states. Each response from a collaborator triggers a state change, moving the orchestrator toward the saga outcome.
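The coordinator’s state machine could be sketched like this — the state names, responses, and transitions are illustrative:

```python
class OrderSaga:
    """The coordinator as a state machine: each collaborator response
    triggers a transition. States and responses are illustrative."""

    TRANSITIONS = {
        ("created", "StockReserved"): "stock_reserved",
        ("stock_reserved", "FeeCharged"): "fee_charged",
        ("fee_charged", "OrderPlaced"): "completed",
        # Any failure response moves the saga into compensation
        ("created", "Failed"): "compensating",
        ("stock_reserved", "Failed"): "compensating",
        ("fee_charged", "Failed"): "compensating",
    }

    def __init__(self):
        self.state = "created"
        self.log = []  # the saga log: a record of every transition

    def handle(self, response):
        key = (self.state, response)
        if key not in self.TRANSITIONS:
            raise ValueError(f"unexpected {response} in state {self.state}")
        self.state = self.TRANSITIONS[key]
        self.log.append((response, self.state))

saga = OrderSaga()
for response in ["StockReserved", "FeeCharged", "OrderPlaced"]:
    saga.handle(response)
```

The saga log gives you exactly what choreography lacks: a single record of how far the process has progressed.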
As you know, a saga won’t always be successful. In an orchestrated saga, the coordinator is responsible for initiating appropriate reconciliation actions to return the entities affected by the failed transaction to a valid, consistent state.
As you did earlier, imagine the market service can’t place the order to market. The orchestrating service will initiate compensating actions:
In turn, the orchestrator also could track the outcome of actions 1 and 2. Figure 5.11 illustrates this failure scenario.
If the desired actions can fail, the compensating actions — or the orchestrator itself — can fail too. You should design compensating actions to be safe to retry without unintended side effects (for example, double refunds). At worst, repeated failure during rollback might require manual intervention. Thorough error monitoring should catch these scenarios.
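For example, a refund that’s safe to retry might be sketched like this — the service shape and field names are illustrative:

```python
class FeesService:
    """Compensation must be safe to retry: refunds are recorded by
    charge ID, so repeated delivery of the same refund request has
    no additional effect (no double refunds)."""

    def __init__(self):
        self.refunded = set()
        self.balance_adjustments = []

    def refund(self, charge_id, amount):
        if charge_id in self.refunded:
            return "already-refunded"  # idempotent: ignore duplicates
        self.refunded.add(charge_id)
        self.balance_adjustments.append(amount)
        return "refunded"

fees = FeesService()
first = fees.refund("charge-7", 10.0)
second = fees.refund("charge-7", 10.0)  # retried delivery: no double refund
```

Keying the refund on the charge ID makes the operation idempotent, so the orchestrator can retry it blindly.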
Centralizing the saga’s sequencing logic in a single service makes it significantly easier to reason about the outcome and progress of that saga, as well as change the sequencing in one place. In turn, this can simplify individual services, reducing the complexity of states they need to manage, because that logic moves to the coordinator.
This approach does run the risk of moving too much logic to the coordinator. At worst, this makes the other services anemic wrappers for data storage, rather than autonomous and independently responsible business capabilities.
Many microservice practitioners advocate peer-to-peer choreography over orchestration, because they see choreography as reflecting the “smart endpoints, dumb pipes” aim of microservice architecture, in contrast to the heavyweight workflow tools (such as WS-BPEL) people often used in enterprise SOA. But orchestrated approaches are becoming increasingly popular in the community, especially for building long-running interactions, as shown by the popularity of projects like Netflix Conductor and AWS Step Functions.
Unlike ACID transactions, sagas aren’t isolated. The result of each local transaction is immediately visible to other transactions affecting that entity. This visibility means that a given entity might get simultaneously involved in multiple, concurrent sagas. As such, you need to design your business logic to expect and handle intermediate states. The complexity of the interleaving required primarily depends on the nature of the underlying business logic.
For now, imagine that a customer placed an order by accident and wanted to cancel it. If they issued their request before the order was placed to market, the order placement saga would still be in progress, and this new instruction would potentially need to interrupt it (figure 5.12).
Three common strategies for handling interwoven sagas are available: short-circuiting, locking, and interruption.
You could prevent the new saga from being initiated while the order is still within another saga. For example, the customer couldn’t cancel the order until after the market service attempted to place it to the market. This isn’t great for a user but is probably the easiest strategy!
You could use locks to control access to an entity. Different sagas that want to change the state of the entity would wait to obtain the lock. You’ve already seen an example of this in action: you place a reservation — or lock — on a stock balance to ensure that a customer can’t sell a holding twice if it’s involved in an active order.
This can lead to deadlocks if multiple sagas block each other trying to access the lock, requiring you to implement deadlock monitoring and timeouts to make sure the system doesn’t grind to a halt.
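An in-process sketch of per-entity locking with a timeout might look like this. A real microservice system would need a distributed lock — for example, held in a shared database — but the timeout idea is the same:

```python
import threading

class EntityLock:
    """Per-entity locking with a timeout, so a saga that can't obtain
    the lock fails fast instead of deadlocking. This is an in-process
    sketch; a real microservice system would need a distributed lock."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, entity_id):
        with self._guard:
            return self._locks.setdefault(entity_id, threading.Lock())

    def acquire(self, entity_id, timeout=1.0):
        # Returns False if the lock isn't obtained within the timeout
        return self._lock_for(entity_id).acquire(timeout=timeout)

    def release(self, entity_id):
        self._lock_for(entity_id).release()

locks = EntityLock()
got = locks.acquire("order-1")                    # first saga gets the lock
blocked = locks.acquire("order-1", timeout=0.05)  # second saga times out
locks.release("order-1")
```

The timeout turns a potential deadlock into an explicit failure the saga can handle.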
Lastly, you could choose to interrupt the actions taking place. For example, you could update the order status to “failed.” When receiving a message to send an order to market, the market gateway could revalidate the latest order status to ensure the order was still valid to send, and in this case it would see a “failed” status. This approach increases the complexity of business logic but avoids the risk of deadlocks.
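A sketch of this revalidation step — the order statuses and store shape are illustrative:

```python
class MarketGateway:
    """Interruption strategy: before sending an order to market, the
    gateway re-reads the latest order status and drops orders that were
    cancelled or failed mid-saga. Statuses here are illustrative."""

    def __init__(self, order_store):
        self.order_store = order_store  # maps order_id -> current status

    def place_to_market(self, order_id):
        status = self.order_store.get(order_id)
        if status != "pending":
            return f"skipped: order is {status}"
        return "placed"

order_statuses = {"o-1": "pending", "o-2": "failed"}
gateway = MarketGateway(order_statuses)
result_ok = gateway.place_to_market("o-1")
result_interrupted = gateway.place_to_market("o-2")  # interrupted mid-saga
```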
Although sagas rely heavily on compensating actions, they’re not the only approach you might use to achieve consistency in service interactions. So far, we’ve encountered two patterns for dealing with failure: compensating actions (refund my coffee payment) and retries (try to make the coffee again). Table 5.1 outlines other strategies.
Table 5.1 Failure-handling strategies

| # | Name | Strategy |
| --- | --- | --- |
| 1 | Compensating action | Perform an action that undoes prior action(s) |
| 2 | Retry | Retry until success or timeout |
| 3 | Ignore | Do nothing in the event of errors |
| 4 | Restart | Reset to the original state and start again |
| 5 | Tentative operation | Perform a tentative operation and confirm (or cancel) it later |
The use of these strategies will depend on the business semantics of your service interaction. For example, when processing a large data set, it might make sense to ignore individual failures (applying strategy #3), because the cost of processing the overall data set is large. When interacting with a warehouse — for example, to fulfill orders — it’d be reasonable to place a tentative hold (strategy #5) on a stock item in a customer’s basket to reduce the possibility of overselling.
So far, we’ve assumed that entity state and events are distinct: the former is stored in an appropriate transactional store, whereas the latter are published independently (figure 5.13).
An alternative to this approach is the event sourcing pattern: rather than publishing events about entity state, you represent state entirely as a sequence of events that have happened to an object. To get the state of an entity at a specific point in time, you aggregate the events that occurred before that point. For example, imagine your orders service:
Figure 5.14 illustrates the event sourcing approach for tracking an order’s history.
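A minimal sketch of deriving an order’s current state by replaying its events — the event names and fields are illustrative:

```python
from functools import reduce

# Event sourcing sketch: an order's current state is derived by
# replaying the events recorded against it.
def apply_event(order, event):
    kind = event["type"]
    if kind == "OrderCreated":
        return {"id": event["order_id"], "status": "created",
                "quantity": event["quantity"]}
    if kind == "OrderPlaced":
        return {**order, "status": "placed"}
    if kind == "OrderCancelled":
        return {**order, "status": "cancelled"}
    return order  # unknown events leave state unchanged

def current_state(events):
    return reduce(apply_event, events, None)

history = [
    {"type": "OrderCreated", "order_id": "o-9", "quantity": 100},
    {"type": "OrderPlaced", "order_id": "o-9"},
]
order = current_state(history)
```

Because the history is the source of truth, you can also replay a prefix of the events to answer “what did this order look like yesterday?”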
This architecture solves a common problem in enterprise applications: understanding how you reached your current state. It removes the division between state and events; you don’t need to stick events on top of your business logic, because your business logic inherently generates and manipulates events. On the other hand, it makes complex queries more difficult: you’d need to materialize views to perform joins or filter by field values, as your event storage format would only support retrieving entities by their primary key.
Event sourcing isn’t a requirement for a microservice application, but using events to store application state can be a particularly elegant tool, especially for applications involving complex sagas where tracking the history of state transitions is vital. If you’re interested in learning more about event sourcing, Nick Chamberlain’s awesome-ddd list (https://github.com/heynickc/awesome-ddd) has a great collection of resources and further reading.
Decentralized data ownership also makes retrieving data more challenging, as it’s no longer possible to aggregate related data at, or close to, the database level — for example, through joins. Presenting data from disparate services is often necessary at the UI layer of an application.
For example, imagine you’re building an administrative UI that shows a list of customers, together with their current open orders. In a SQL database, you’d join these two tables in a single query, returning one dataset. In a microservice application, this composition would typically take place at the API level: a service or an API gateway could perform this (figure 5.15). Correlation IDs — roughly analogous to foreign keys in a relational database — identify relationships between data that each service owns; for example, each order would record the associated customer ID.
The two-step approach in figure 5.15 works well for single entities or small datasets but scales poorly for bulk requests. If the first query returns N customers, the second query will be performed N times — the classic N+1 query problem — which could quickly get out of hand. Against a SQL database, this would be trivial to solve with a join, but because your data is spread across multiple data stores, a join isn’t possible.
We could improve this query by introducing bulk request endpoints and paging, as in listing 5.1. Rather than getting every customer, you’d get the first page; rather than retrieving customer orders one-by-one, you could retrieve them with a list of IDs. You should note, though, that if each customer had thousands of orders, having to page those as well would add substantial overhead.
Listing 5.1 Different endpoints for data retrieval
/customers?page=1&size=20 ① Retrieves the first page of 20 customers
/orders?customerIds=4,5,10,20 ② Retrieves orders for multiple customers in a single request
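A client composing these two bulk endpoints might look like the following sketch — the API client classes and response shapes are illustrative:

```python
# API composition sketch using bulk endpoints: one page of customers,
# then a single bulk call for their orders, joined in memory by
# correlation ID (customer_id). Client classes are illustrative.

def compose_customer_orders(customer_api, order_api, page=1, size=20):
    customers = customer_api.get_customers(page=page, size=size)
    ids = [c["id"] for c in customers]
    orders = order_api.get_orders(customer_ids=ids)  # one request, not N
    by_customer = {}
    for order in orders:
        by_customer.setdefault(order["customer_id"], []).append(order)
    return [{**c, "orders": by_customer.get(c["id"], [])} for c in customers]

# In-memory stubs standing in for the two services:
class StubCustomerApi:
    def get_customers(self, page, size):
        return [{"id": 4, "name": "Ada"}, {"id": 5, "name": "Grace"}]

class StubOrderApi:
    def get_orders(self, customer_ids):
        data = [{"id": "o-1", "customer_id": 4},
                {"id": "o-2", "customer_id": 4}]
        return [o for o in data if o["customer_id"] in customer_ids]

result = compose_customer_orders(StubCustomerApi(), StubOrderApi())
```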
API composition is simple and intuitive, and for many use cases, such as individual aggregates or small enumerables, the performance of this approach will be acceptable. For others, such as the following, performance will be inefficient and far from ideal:
Lastly, API composition is impacted by availability. Composition requires synchronous calls to underlying services, so the total availability of a query path is the product of the availabilities of all services involved in that path. For example, if the two services and the API gateway in figure 5.15 each have an availability of 99%, their combined availability would be 0.99³, or roughly 97.03%. Over the next three sections, we’ll discuss how you also can use events to build efficient queries in microservice applications.
You can elect to have services store or cache data that they receive from other services via events. For example, in figure 5.16, when the fees service receives an OrderCreated
message, it might elect to store additional detail about the order, beyond the correlation ID. This service can now handle queries like “What was the value of this order?” without needing to retrieve that data with an additional call to the orders service.
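A sketch of such an event handler — the event fields are illustrative:

```python
class FeesEventHandler:
    """The fees service keeps a local copy of order details it receives
    via OrderCreated events, so later queries don't need a call back to
    the orders service. Field names are illustrative."""

    def __init__(self):
        self.order_cache = {}

    def on_order_created(self, event):
        # Store more than just the correlation ID
        self.order_cache[event["order_id"]] = {
            "value": event["value"],
            "stock": event["stock"],
        }

    def order_value(self, order_id):
        cached = self.order_cache.get(order_id)
        # A missing or stale entry is always possible with async events
        return cached["value"] if cached else None

handler = FeesEventHandler()
handler.on_order_created({"order_id": "o-3", "value": 1500.0, "stock": "AAPL"})
```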
This technique can be quite useful but risky:
By maintaining canonical data in multiple locations — updated via asynchronous events, which could be delayed, fail, or be delivered multiple times — you have to cope with eventual consistency and the chance that the copies of data you retrieve have become stale.
Whether it’s fine for data to be stale sometimes is down to the business semantics of the particular feature. But it’s a hard tradeoff. The CAP theorem8 says that you can’t have things both ways: you need to choose between availability — returning a successful result, without a guarantee that data is fresh — and consistency — returning the most recent state, or an error.
Guaranteeing consistency tends to result in increased coordination between systems — such as distributed locks — which hampers transaction speed. In contrast, a system that maximizes availability ultimately relies on compensating actions and retries — a lot like sagas. From an architectural perspective, availability is usually easier to achieve and, because of the reduced coordination cost, more amenable to building scalable applications.
You can generalize the previous approach — using events to build views — further. In many systems, queries are substantially different from writes: whereas writes affect singular, highly normalized entities, queries often retrieve denormalized data from a range of sources. Some query patterns might benefit from completely different data stores than writes; for example, you might use PostgreSQL as a persistent transactional store but Elasticsearch for indexing search queries. The command-query responsibility segregation pattern (CQRS) is a general model for managing these scenarios by explicitly separating reads (queries) from writes (commands) within your system.9
Let’s sketch out this architecture. In figure 5.17, you can see that CQRS partitions commands and queries:
You can apply this pattern both within services and across your whole application — using events to build dedicated query services that own and maintain complex views of application data. For example, imagine you wanted to aggregate order fees across your entire customer base, potentially slicing them by multiple attributes (for example, type of order, asset categories, payment method). This wouldn’t be possible at a service level, because neither the fees, orders, nor customers service has all the data needed to filter those attributes.
Instead, as figure 5.18 illustrates, you could build a query service, CustomerOrders, to construct appropriate views. A query service is a good way to handle views that don’t clearly belong to any other services, ensuring a reasonable separation of concerns.
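A sketch of such a query-side projection — the event shapes and the aggregation are illustrative:

```python
class CustomerOrdersView:
    """Query-side projection: consumes events from the orders and fees
    services to maintain a denormalized, filterable view. Event shapes
    are illustrative."""

    def __init__(self):
        self.rows = {}

    def on_order_created(self, event):
        self.rows[event["order_id"]] = {
            "customer_id": event["customer_id"],
            "order_type": event["order_type"],
            "fee": None,
        }

    def on_fee_charged(self, event):
        if event["order_id"] in self.rows:
            self.rows[event["order_id"]]["fee"] = event["amount"]

    def total_fees(self, order_type):
        # A query no single owning service could answer on its own
        return sum(r["fee"] for r in self.rows.values()
                   if r["order_type"] == order_type and r["fee"] is not None)

view = CustomerOrdersView()
view.on_order_created({"order_id": "o-1", "customer_id": "c-1",
                       "order_type": "sell"})
view.on_fee_charged({"order_id": "o-1", "amount": 10.0})
```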
So far, this all sounds great! In a microservices application, CQRS offers two key benefits:
But it’s not without drawbacks. Let’s explore those now.
Like the data caching example, CQRS requires you to consider eventual consistency because of replication lag: inherently, the command state of a service will be updated before the query state. Because events update query models, someone querying that data might receive an out of date view. This might be a frustrating user experience (figure 5.19). Imagine you update the value of an order, but on clicking Confirm, you see the details of the original order! Web UIs that use a POST/redirect/GET10 pattern will often suffer from this problem.
In some systems, this might not matter. For example, delayed updates are common for activity feeds11 — if I post an update on Twitter, it doesn’t matter if my followers don’t all receive it at the same time. In fact, attempting to achieve greater consistency can lead to substantial scalability challenges that might not be worth it.
In other systems, it’ll be important to ensure you don’t query invalid state. You can apply three strategies (figure 5.20) in these scenarios: optimistic updates, polling, or publish-subscribe.
You could update the UI optimistically, based on the expected result of a command. If the command fails, you can roll back the UI state. For example, imagine you like a post on Instagram. The app will show a red heart before the Instagram backend saves that change. If that save fails, Instagram will roll back the optimistic UI change, and you’ll have to like it again for it to show a red heart.
This approach relies on having — or being able to derive — all the information you need to update the UI from the command input, so it works best when working with simple entities.
The UI could poll the query API until an expected change has occurred. When initiating a command, the client would set a version, such as a timestamp. For subsequent queries, the client would continue to poll until the returned version was equal to or greater than the version it specified, indicating that the query model had been updated to reflect the new state.
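A client-side sketch of this polling strategy — the version scheme and response shape are illustrative:

```python
def poll_until_version(query, expected_version, max_attempts=10):
    """Client-side polling sketch: keep querying the read model until
    its version catches up with the version the command produced."""
    for _ in range(max_attempts):
        result = query()
        if result["version"] >= expected_version:
            return result
    raise TimeoutError("query model did not catch up in time")

# Simulated read model that lags behind the command by two poll cycles:
responses = iter([
    {"version": 4},
    {"version": 4},
    {"version": 5, "status": "updated"},
])
result = poll_until_version(lambda: next(responses), expected_version=5)
```

In practice you’d add a delay between attempts; the version comparison is the important part.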
Instead of polling for changes, a UI could subscribe to events on a query model — for example, over a web socket channel. In this case, the UI would only update when the read model published an “updated” event.
As you can see, it’s challenging to reason through CQRS, and it requires a different mindset from what you’d have when dealing with normal CRUD APIs. But it can be useful in a microservice application. Done right, CQRS helps to ensure performance and availability in queries, even as you distribute data and responsibility across multiple distinct services and data stores.
You can generalize the CQRS technique to other use cases, such as analytics and reporting. You can transform a stream of microservice events and store them in a data warehouse, such as Amazon Redshift or Google BigQuery (figure 5.21). A transformation stage may involve mapping events to the semantics and data model of the target warehouse or combining events with data from other microservices. If you don’t yet know how you want to treat or query events, you can store them in commodity storage, such as Amazon S3, for later querying or reprocessing with big data tools such as Apache Spark or Presto.
We’ve covered a lot of ground in this chapter, but some topics, like sagas, event sourcing, and CQRS, can each fill entire books. In case you’re interested in knowing more about those topics, we recommend the following books: