Chapter 3. Messages

This chapter covers

  • Using messages as the key to designing microservice architectures
  • Deciding when to go synchronous, and when to go asynchronous
  • Pattern matching and transport independence
  • Examining patterns for microservice message interactions
  • Understanding failure modes of microservice interactions

The term microservices invites you to think in terms of services. You’ll naturally be drawn to ask, “What are the microservices in this system?” Resist that temptation. Microservice systems are powerful because they allow you to think in terms of messages. If you take a messages-first approach to your system design, you free yourself from premature implementation decisions. The intended behavior of the system can be described in terms of a language of messages, independent of the underlying microservices that generate and react to those messages.

3.1. Messages are first-class citizens

A messages-first approach is more useful than a services-first approach to system design. Messages are a direct expression of intent. Business requirements are expressed as the activities that should happen in the system. Strangely, we’re traditionally taught to extract the nouns from business requirements, so that we can build objects. Are business requirements that obscure? Perhaps we should pay more attention to the activities they describe.

If you break business requirements down into activities, they naturally suggest the messages within the system; see table 3.1. Messages represent actions, not things. Take, for example, a typical e-commerce website. You put some items in your shopping cart, and then you proceed to the checkout. On checkout, the system records the purchase, sends you an email confirmation, and sends the warehouse delivery instructions, among other things. Doesn’t that naturally break down into some obvious activities?

Table 3.1. Messages representing activities

Activity description                         Message name      Message data
------------------------------------------   ---------------   ------------------------------------------------------------
Checking out                                 checkout          Cart items and prices
Recording the purchase                       record-purchase   Cart items and prices; sales tax and total; customer details
Sending an email confirming the purchase     checkout-email    Recipient; cart summary; template identifier
Delivering the goods                         deliver-cart      Cart items; customer address

How about this: checking out, recording the purchase, sending an email confirming the purchase, delivering the goods. You’re one step away from the messages that contain the data that describe these activities. Write down the activities in a table, write down a code name for the activity (which gives you a name for the message), and write down the data contents of the message. By doing this, you can see that there’s no need at this level of analysis to think about which services handle these messages.

Analytical thinking, using messages as a primitive element, can scale, in the sense that you can develop a good understanding of the system design at this level.[1] Designing the system using services as the primitive element doesn’t scale in the same way. Although there are fewer kinds of services than messages, in practice it’s hard not to think of them from the perspective of network architecture. You have to decide which services talk to which other services, which services handle which messages, and whether services observe or consume messages. You end up having to think about many more different kinds of things. Even worse, you lock yourself into architectural decisions. And your thinking is static from the start, rather than allowing you to adapt message behavior as requirements change and emerge.

1

The quality of good in this sense means your understanding generates “true” statements: that is, statements the business stakeholders agree with.

Messages are just one kind of thing. They’re homogeneous entities. Thinking about them is easy. You can list them, and you can list the messages that each message generates, so you have a natural description of causality. You can group and regroup messages to organize them conceptually, which may suggest appropriate services to generate or handle those messages. Messages provide a conceptual level that bridges informal business requirements and formal, technical, system specifications. In particular, messages make services and their communication arrangements of less interest—these are implementation details.

The three-step analytical strategy discussed in this book (requirements to messages to services) is shown in figure 3.1.

Figure 3.1. Conceptual levels of system design

3.1.1. Synchronous and asynchronous

Microservice messages fall into two broad types: synchronous and asynchronous. A synchronous message is composed of two parts: a request and a response. The request is the primary aspect of the message, and the response is secondary. Synchronous messages are often implemented as HTTP calls, because the HTTP protocol is such a neat fit to the request/response model. An asynchronous message doesn’t have a directly associated response. It’s emitted by a service and observed or consumed later by other services.

The essential difference between synchronous and asynchronous messages isn’t in the nature of the message transfer protocol; it’s in the intent of the originating microservice. The originator of a synchronous message needs an immediate response to continue its work and is blocked until it gets a response. The originator of an asynchronous message isn’t blocked, is prepared to wait for results, and can handle scenarios where there are no results:[2]

2

Microservices don’t need to wait on human time scales. The waiting period could be on the order of milliseconds.

  • Synchronous— A shopping-cart service, when adding a product to a cart, needs the sales-tax service to calculate the updated sales tax before providing a new total to the user. The scenario is a natural fit for a synchronous message.
  • Asynchronous— Alternatively, to display a list of recommended products below the current shopping cart, the recommendation service first issues a message asking for recommendations. Call this a need message. Multiple recommendation services, using different algorithms, may respond with messages containing recommendations. The recommendation service needs to collect all of these recommendations, so call these collect messages. The recommendation service can aggregate all the collect messages it receives to generate the list of recommendations. It may receive none, in which case it falls back to some default behavior (say, offering a random list of recommendations, after a timeout).
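The two intents above can be sketched with a toy in-memory transport. The function names (send, publish, subscribe) and the fallback list are invented for illustration; real systems would put a network between the parties:

```javascript
// A toy in-memory transport contrasting the two intents. The names
// (send, publish, subscribe) are illustrative, not a real messaging API.
const listeners = [];

// Synchronous intent: the originator needs the response to continue.
function send(msg, handler) {
  return handler(msg); // caller is blocked until a reply comes back
}

// Asynchronous intent: the originator emits and moves on; zero or more
// services may observe the message later.
function publish(msg) {
  listeners.forEach((l) => l(msg)); // no reply expected
}

function subscribe(listener) {
  listeners.push(listener);
}

// Synchronous: the cart needs the updated tax before showing a new total.
const taxHandler = (msg) => ({ gross: msg.net * 1.1 });
const reply = send({ label: 'sales-tax', net: 1.0 }, taxHandler);

// Asynchronous: collect recommendations; fall back to a default if none arrive.
const collected = [];
subscribe((msg) => { if (msg.label === 'need') collected.push('best-sellers'); });
publish({ label: 'need', cart: ['book'] });
const recommendations = collected.length ? collected : ['random-pick'];
```

The synchronous caller cannot proceed without `reply`; the asynchronous publisher is indifferent to whether anyone answered at all.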

Synchronous and asynchronous messages are convertible

Workflows that use synchronous messages can always be converted to workflows that use asynchronous messages. Any synchronous message can be broken into explicit request and response messages, which are asynchronous. Any asynchronous message that triggers messages in reaction can be aggregated into a single synchronous message that blocks, waiting for the first response.

Beware: Moving between these message workflows requires refactoring of your microservices. The decision to make a given message type synchronous or asynchronous is a core design decision and an expression of your intent as an architect.

3.1.2. When to go synchronous

The synchronous style is well suited to the command pattern, where messages are commands to do something. The response is an acknowledgment that the thing to be done was indeed done, with such-and-such results. Activities that fit this model include data-storage and data-manipulation operations, instructions to and from external services, serial workflows, control instructions, and, perhaps most commonly, instructions from user interfaces.

The user-interface question

Should user interfaces be built using the microservice architecture? At the time of this writing, this is an open question, because there are no viable examples in the wild, especially at scale. User interface implementations are monolithic in nature as a matter of practicality. This isn’t to say that the flexibility of the microservice approach as an inspiration for a user interface component model isn’t credible, nor does it preclude a message-oriented approach. Nonetheless, the microservice architecture is driven by a very different set of needs and requirements: those relating to building scalable systems at an accelerated pace of development.

This book takes an open-minded position on this question. As the philosopher Ludwig Wittgenstein said, “Whereof one cannot speak, thereof one must be silent.”[3]

3

From Ludwig Wittgenstein’s Tractatus Logico-Philosophicus (1922), one of the most hard-line works of early twentieth century positivist philosophy. This is a dangerous quote, because it may be a fair assessment of this entire book!

Synchronous messages are a naturalistic style and can often be the first design that comes to mind. In many cases, a serial workflow of synchronous messages can be unwound into a parallel set of asynchronous messages. For example, when building a complete content page of a website, each content unit is a mostly independent rectangular area. In a traditional monolith, such pages are often built with simple linear code that blocks, waiting for each item of content to be retrieved. It’s more effort to put in place infrastructure code to parallelize the page construction. In a microservice context, a first cut of the page-construction service might work the same way by issuing content-retrieval messages in series, following the linear mental model of the monolith. But because the microservice context offers an asynchronous model as well, and because the page-construction service is isolated from the rest of the system, it’s far less effort to rewrite the service to use a parallel approach.[4] Thus, it shouldn’t be a matter of much anxiety if you find yourself designing a system that uses synchronous messages to a significant degree.
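The serial-to-parallel rewrite can be sketched as follows. The content units, delays, and function names are invented for illustration; the point is that only the page-construction service changes:

```javascript
// Hypothetical content retrieval: each call stands in for a synchronous
// message to another microservice, resolving after a short delay.
const fetchUnit = (name, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(`<div>${name}</div>`), ms));

// First cut: serial, following the linear mental model of the monolith.
async function buildPageSerial(units) {
  const parts = [];
  for (const [name, ms] of units) {
    parts.push(await fetchUnit(name, ms)); // blocks on each unit in turn
  }
  return parts.join('');
}

// Rewrite: issue all content-retrieval messages at once, await them together.
async function buildPageParallel(units) {
  const parts = await Promise.all(units.map(([name, ms]) => fetchUnit(name, ms)));
  return parts.join('');
}
```

Total wait drops from the sum of the delays to roughly the maximum of them, and no other service in the system is touched by the rewrite.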

4

This is a good example of the ease of refactoring when you use microservices; a similar refactoring inside a monolith would involve much code munging (to use the technical term). In the microservice case, refactoring is mostly reconfiguration of message interactions.

Business requirements that are expressed as explicit workflows tend to need synchronous messages as part of their implementation. You’ve already seen such a workflow in the example of the e-commerce checkout process. Such workflows contain gates that prevent further work unless specific operations have completed, and this maps well to the request/response mental model. Traditionally, heavyweight solutions are often used to define such workflows; but in the microservice world, the correct approach is to encode the workflow directly in a special-purpose orchestrating microservice. In practice, this orchestration typically involves both synchronous and asynchronous elements.

Synchronous messages do have drawbacks. They create stronger coupling between services. They can often be seen as remote procedure calls, and adopting this mindset leads to the distributed monolith anti-pattern. A bias toward synchronous messages can lead to deep service dependency trees, where an original inbound message triggers a cascade of multilevel synchronous submessages. This is inherently fragile. Finally, synchronous messages block code execution while waiting for a response. In thread-based language platforms,[5] this can lead to complete failure if all message-handling threads become blocked. In event-based platforms,[6] the problem is less severe but is still a waste of compute resources.

5

Such as the Java JVM or .NET CLR.

6

Node.js is the prime example.

3.1.3. When to go asynchronous

The asynchronous style takes a little getting used to. You have to stop thinking in terms of the programming model of function calls that return results, and start thinking in a more event-driven style. The payoff is much greater flexibility. It’s easier to add new microservices to an asynchronous interaction. The trade-off is that interactions are no longer linear chains of causality.

The asynchronous approach is particularly strong when you need to extend the system to handle new business requirements. By announcing key pieces of information, you allow other microservices to react appropriately without needing any knowledge of those services. Returning to the earlier example, a shopping cart service can announce the fact that a checkout has occurred by publishing a checkout message. Microservices that store a record of the purchase, send out confirmation emails, and perform delivery can each listen for this message independently. There’s no need for specific command messages to trigger these activities, nor is there a need for the shopping-cart service to know about these other services. This makes it easier to add new activities, such as a microservice to add loyalty points, because no changes are required to existing production services.
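The checkout announcement can be sketched as a simple publish/subscribe interaction. The in-memory bus, the message labels, and the handler bodies are illustrative assumptions:

```javascript
// A minimal in-memory bus: services subscribe to a label; the publisher
// knows nothing about who is listening.
const subscribers = {};

function subscribe(label, handler) {
  (subscribers[label] = subscribers[label] || []).push(handler);
}

function publish(msg) {
  (subscribers[msg.label] || []).forEach((h) => h(msg));
}

const log = [];

// Each downstream activity listens independently of the others.
subscribe('checkout', () => log.push('record-purchase'));
subscribe('checkout', () => log.push('checkout-email'));
subscribe('checkout', () => log.push('deliver-cart'));

// Adding loyalty points later touches no existing service:
subscribe('checkout', () => log.push('loyalty-points'));

// The cart announces the fact of the checkout, and nothing more.
publish({ label: 'checkout', items: ['book'], total: 11.0 });
```

Adding or removing a subscriber never changes the publisher, which is exactly the decoupling the text describes.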

As a general principle, even when you’re using synchronous messages, you should consider also publishing asynchronous information messages. Such messages announce a new fact about the world that others can choose to act on. This gives you an extremely decoupled, extensible architecture.

The drawback of this approach is the implicit nature of the system’s behavior. As the number of messages and microservices grows, understanding of all the system interactions will be lost. The system will begin to exhibit emergent behavior and may develop undesired modes of behavior. These risks can be mitigated by a strictly incremental approach to system change twinned with meaningful measurement of system behavior. Microservices can’t evaporate inherent complexity, but at least they make it more observable.

3.1.4. Thinking distributed from day one

Microservice systems are distributed systems. Distributed systems present difficult problems that are often intractable. When faced with such challenges, there’s a psychological tendency to pretend they don’t exist and to hope that they’ll just go away by themselves. This style of thinking is the reason many distributed computing frameworks try to make remote and local appear the same: remote procedures and remote objects are presented behind facades that make them look local.[7] This approach trades temporary relief for ongoing insanity. Hiding inherent complexity and the fundamental properties of a system makes the architecture fragile and subject to catastrophic failure modes.

7

This is a feature of approaches such as Java RMI and CORBA.

Fallacies of distributed computing

The following fallacies are a warning to programmers everywhere to tread carefully when you take your first steps into the world of distributed computing. They were first outlined informally by L. Peter Deutsch (a Sun Microsystems engineer) in 1994:

  • The network is reliable.
  • Latency is zero.
  • Bandwidth is infinite.
  • The network is secure.
  • Topology doesn’t change.
  • There’s one administrator.
  • Transport cost is zero.
  • The network is homogeneous.

Microservices don’t allow you to escape from these fallacies. It’s a feature of the microservice world view that we embrace the warnings of the fallacies rather than try to solve them.[8]

8

For a wonderfully sardonic discussion of the fallacies, you won’t do better than Arnon Rotem-Gal-Oz’s paper “Fallacies of Distributed Computing Explained,” www.rgoarchitects.com/Files/fallacies.pdf.

No matter how good your code is, distributed computing will always be difficult. This is because there are fundamental logical limits on the ability of distributed systems to communicate consistently. A good illustration is the problem of the Byzantine Generals.[9]

9

Leslie Lamport, Robert Shostak, and Marshall Pease, “The Byzantine Generals Problem,” ACM Transactions on Programming Languages and Systems 4, no. 3 (July 1982), www.cs.cornell.edu/courses/cs614/2004sp/papers/lsp82.pdf. This seminal paper explains the basis of distributed consensus.

Let’s start with two generals of the Byzantine Empire: one is the leader, and the other is a follower. Each general commands an army of 1,000 soldiers on opposite sides of an enemy army of 1,500. If they both attack at the same time, their combined force of 2,000 will be victorious. To attack alone will ensure certain defeat. The leader must choose the time of the attack. To communicate, the generals can send messengers who sneak through enemy lines. Some of these messengers will be captured, but the messages are secured with an unbreakable secret code.

The problem is this: what pattern of messages can be used to guarantee that the two generals attack at the same time? A simple protocol suggests itself immediately: require an acknowledgment of each message. But the acknowledgment requires a messenger, who could be captured. And even if the acknowledgment does arrive, the sender doesn’t know this.

Place yourself in the position of the follower. You receive this message: “Attack at dawn!” You acknowledge the message by sending back a messenger of your own. Dawn arrives—do you attack?

What if your acknowledgment messenger was captured? You have no way of knowing if they arrived safely. And your leader has no way of knowing what you know. You’ll both reach the logical conclusion not to attack, even though all messages have been successfully delivered! More acknowledgments won’t help. The last general to send a message can never be sure of its delivery.

For more fun and games, you can increase the number of generals, question the trustworthiness and sanity of both generals and messengers, allow the enemy to subvert or invent messages, and so on. The general case of the Byzantine Generals problem isn’t so far from the reality of running a large-scale microservices deployment!

Pragmatically, this problem is solved by accepting that certainty isn’t possible and then asking how many messages, of what kind, subject to limits such as timeouts, with what meta-information such as sequence numbers, can be used to establish an acceptable level of probability that agreement has been reached. The TCP/IP protocol is a fine example of such a solution.[10]

10

Transmission Control Protocol/Internet Protocol (TCP/IP) uses algorithms such as slow start (increasing data volume until maximum capacity is found) and exponential backoff (waiting longer and longer when retrying sends) to achieve an acceptable level of transmission reliability.
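Exponential backoff, one of the algorithms footnote 10 mentions, can be sketched as follows. The attempt limit, the delays, and the flaky transport are invented; the waits are recorded in a schedule rather than actually slept, to keep the logic visible:

```javascript
// Retry an unreliable send, doubling the wait after each failure.
// sendFn, maxAttempts, and baseDelayMs are illustrative assumptions.
function retryWithBackoff(sendFn, maxAttempts, baseDelayMs) {
  let delay = baseDelayMs;
  const schedule = []; // the waits we would perform between attempts
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (sendFn()) return { ok: true, attempt, schedule };
    schedule.push(delay); // wait this long before the next try
    delay *= 2;           // exponential backoff
  }
  return { ok: false, attempt: maxAttempts, schedule };
}

// A toy transport that fails twice, then succeeds.
let calls = 0;
const flakySend = () => ++calls >= 3;
const result = retryWithBackoff(flakySend, 5, 100);
```

The scheme never guarantees delivery; it only raises the probability of success to an acceptable level at an acceptable cost, which is the whole point.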

The key point is that you must accept the unreliable nature of message transmission between microservices. Your design thinking can never be based on an assumption that messages will be delivered as you intend them to be delivered. This assumption isn’t safe. Messages aren’t like method calls in a monolithic application, where the metaphor of objects exchanging messages is just that: a metaphor. In a distributed system, message delivery can’t be made reliable.

What you can do is make message delivery—and, indirectly, system behavior—predictable to within acceptable tolerances. You can limit failure by spending money and time. It’s your job, and your responsibility as an engineer, to deliver systems to agreed tolerances.

3.1.5. Tactics to limit failure

Just because failure is inevitable doesn’t mean you can’t do anything about it. As an ethical engineer, you should know and apply reasonable mitigating tactics:

  • Message delivery will be unreliable. Accept that this is a reality and will happen. You can get closer to 100% reliability by spending ever-larger sums of money, but you’ll suffer from diminishing marginal returns, and you’ll never get to 100%. Always ask, “What happens if this message is lost?” Timeouts, duplication, logging, and other mitigations can help, but not if you aren’t asking that question.
  • Latency and throughput are trade-offs. Latency tells you how quickly messages are acted on (most commonly, by measuring the response time of synchronous messages). Low latency is a legitimate goal. Throughput tells you how many messages you can handle. High throughput is also a legitimate goal. To a first approximation, these goals are inversely related. Designing a system for high throughput means you need to put in place scalability housekeeping (such as proxy servers) that increases latency, because there are extra network hops. Conversely, designing for low latency means high throughput, although possible, will be much more expensive (for example, you might be using more-powerful machines).
  • Bandwidth matters. The networked nature of microservice systems means they’re vulnerable to bandwidth limitations. Even if you start out with a plentiful supply, you must adopt a mentality of scarcity. Misbehaving microservices can easily cause an internally generated denial-of-service attack. Keep your messages small and lean. Don’t use them to send much data; send references to bulk data storage, instead.[11] Bandwidth as a resource is becoming cheaper and more plentiful, but we aren’t yet in a post-scarcity world.

    11

    To send an image between services, don’t send the image binary data, send a URL pointing to the image.

  • Security doesn’t stop at the firewall. You may be tempted to engage in security theater, such as encrypting all interservice messages. Or you might be forced into it by your clients. In some ways, this is mostly harmless, although it does drain resources. It’s more effective to adopt the stance, within each microservice, that inbound messages are potentially malign and may come from malicious actors. Semantic attacks[12] are your primary concern. Microservices are naturally more resistant to syntactic attacks because they offer a significantly reduced attack surface—you can only get in via messages. Semantic attacks, in the form of malicious messages generated by improper external interactions, are the principal attack vector. Message schema validation won’t help here, because the dangerous messages are exactly those that work and are, by definition, syntactically correct.

    12

    Explained in detail by Bruce Schneier, a respected security expert, on his blog: “Semantic Attacks: The Third Wave of Network Attacks,” Schneier on Security, October 15, 2000, http://mng.bz/Ai4Q.

  • Avoid local solutions to global problems. Let’s say you’ve decided to use a synchronous approach to microservices, with HTTP REST as the message transport. Every microservice needs to know the network location of every other microservice it wants to talk to, because it has to direct those HTTP requests somewhere. So what do you do? Verschlimmbessern![13] Obviously, a distributed key store running some sort of distributed shared-state consensus algorithm will give you a service-discovery solution, and problem solved! This may well be a legitimate solution; the real question is whether the problem (service discovery) is one you should be solving in the first place.

    13

    One of those wonderful German loan words, perhaps even better than schadenfreude. To verschlimmbessern is to make something much worse by earnestly trying to make it better, all the while blissfully ignoring the real problem.

  • Automate or die. There’s no such thing as a free lunch. For all the benefits of microservices, you can’t escape the fact that you have to manage lots of little things spread over many servers. Traditional tools and approaches won’t work, because they can’t handle the ramp-up in complexity. You’ll need to use automation that scales. More on this in chapter 5.
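The first tactic above, always asking what happens if a message is lost, often reduces in practice to a timeout with a fallback. A minimal sketch, assuming a reply that never arrives and an empty recommendation list as the fallback:

```javascript
// Race the real reply against a timer that resolves with a fallback.
// withTimeout and the fallback value are illustrative, not a library API.
function withTimeout(promise, ms, fallback) {
  const timer = new Promise((resolve) => setTimeout(() => resolve(fallback), ms));
  return Promise.race([promise, timer]);
}

// A reply that never arrives, standing in for a lost message.
const lostReply = new Promise(() => {});

withTimeout(lostReply, 50, { recommendations: [] }).then((reply) => {
  // After 50ms we proceed with the default rather than blocking forever.
  console.log(reply.recommendations);
});
```

The caller degrades gracefully instead of hanging, which is the behavior the question "What happens if this message is lost?" is meant to force you to design.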

3.2. Case study: Sales tax rules

Our case study in this chapter focuses on one part of the work of building a general e-commerce solution: calculating sales tax. This may not seem very challenging, but a full solution has wonderful hidden depths.

To set the scene, imagine a successful company that has grown quickly and offers products in multiple categories and geographies. The company started out selling one type of good in one country; but then things took off, and the company quickly had to adapt to rapid growth. For the sake of argument, let’s imagine this company began by selling books online and then branched into electronics, shoes, and online services.[14]

14

Our example is deliberately implausible. Such a thing would never catch on.

Here’s the business problem when it comes to sales tax: you have to account for differences based on many variables. For example, determining the correct rate might involve the category of good, the country of the seller, the country of the buyer, the location in the country of either, local bylaws, the date and time of the purchase, and so on. You need a way to take in all of these variables and generate the right result.

3.2.1. The wider context

An e-commerce website is a good example of a common problem space for microservices. There’s a user interface that must be low latency, and there’s a backend that has a set of horizontal and vertical functionalities. Horizontal functionalities, such as user account management, transactional email workflows, and data storage, are mostly similar for many different kinds of applications. Vertical functionalities are particular to the business problem at hand; in the e-commerce case, these are functionalities such as the shopping cart, order fulfillment, and the product catalog.

Horizontal functionalities can often be implemented using a standard set of prebuilt microservices. Over time, you should invest in developing such a set of functionalities, because they can be used in many applications, thus kick-starting your development. While working on this project, you’ll almost certainly extend, enhance, and obsolesce these starter microservices, because they won’t be sufficient for full delivery. How to do this in a safe way, with pattern matching on messages, is one of the core techniques we’ll explore in this chapter.

Vertical microservices start as greenfield implementations, which makes them even more vulnerable to obsolescence. You’ll invariably make incorrect assumptions about the business domain. You’ll also misunderstand the depth of the requirements. The requirements will change in any case, because business stakeholders develop a deeper understanding too. To deal with this, you can use the same strategic approach as with horizontals: pattern matching. And for verticals, it’s a vital strategy to avoid running up technical debt.

3.3. Pattern matching

Pattern matching is one of the key strategies for building scalable microservice architectures—not just technically scalable, but also psychologically scalable, so that human software developers can understand the system. A large part of the complexity of microservice systems comes from the question of how to route messages. When a microservice emits a message, where should it go? And when you’re designing the system, how should you specify these routes?

The traditional way to describe network relationships, by defining the dependencies and data flows between services, doesn’t work well for microservices. There are too many services and too many messages. The solution is to turn the problem on its head. Start instead from the properties of the messages, and allow the network architecture to emerge dynamically from there.

In the e-commerce example, some messages will interact with the shopping cart. There’s a subset of messages in the message language that the shopping cart would like to receive: say, add-product, remove-product, and checkout. Ultimately, all messages are collections of data. You can think of them as collections of key-value pairs. Imagine an all-seeing messaging deity that observes all messages and, by identifying the key-value pairs, sends the messages to the correct service.

Message routing can be reduced to the following simple steps:

  1. Represent the message as key-value pairs (regardless of the actual message data).
  2. Map key-value pairs to microservices, using the simplest algorithm you can think of.

The way you represent the message as key-value pairs isn’t important. Key-value pairs are just the lowest-common-denominator data representation that has enough information to be useful. The algorithm isn’t important; but as a practical matter, it should be easy enough for fallible humans to run in their heads.
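The two steps can be sketched as a tiny routing table. A pattern is itself a set of key-value pairs, and it matches when every one of its pairs appears in the message; the route list and service names here are invented:

```javascript
// Step 1: a message is just key-value pairs. Step 2: map pairs to a
// service with the simplest algorithm available. All names are invented.
const routes = [];

function addRoute(pattern, service) {
  routes.push({ pattern, service });
}

// A pattern matches when every one of its pairs appears in the message;
// extra message properties are simply ignored.
function route(msg) {
  const hit = routes.find(({ pattern }) =>
    Object.entries(pattern).every(([k, v]) => msg[k] === v));
  return hit ? hit.service : null;
}

addRoute({ label: 'sales-tax' }, 'sales-tax-service');
addRoute({ label: 'checkout' }, 'cart-service');
```

Understanding the routing now reduces to reading the route list; no service addresses appear anywhere in the sketch.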

This approach—routing based on message properties—has been used many ways; it isn’t a new idea. Any form of intelligent proxying is an example. What’s new is the explicit decision to use it as the common basis for message routing. This means you have a homogeneous solution to the problem and no special cases. Understanding the system reduces to reviewing the list of messages and the patterns used to route them; this can even be done in isolation from the services that will receive them!

The big benefit of this approach is that you no longer need metainformation to route a message; you don’t need a service address. Service addresses come in many flavors—some not so obvious. A domain name is obviously an address, but so is a REST URL. The topic or channel name on a message bus and the location of the bus on the network are also addresses. The port number representing a remote service, exposed by an overlay network, is still an address! Microservices shouldn’t know about other microservices.

To be clear, addressing information has to exist somewhere. It exists in the configuration of the abstraction layer that you use to send and receive messages. Although this requires work to configure, it isn’t the same as having that information embedded in the service. From a microservice perspective, messages arrive and depart, and they can be any messages. If the service knows how to deal with a message, because it recognizes that message in some way, all to the good; but it’s that microservice’s business alone. The mapping from patterns to delivery routes is an implementation detail.

Let’s explore the consequences of this approach. You’ll see how it makes it much easier to focus on solving the business problem, rather than obsessing about accidental technical details.

3.3.1. Sales tax: starting simple

Let’s start with the business requirement for handling sales tax. We’ll focus narrowly on the requirement to recalculate sales tax after an item is added to the cart.

When a user adds a product to their shopping cart, they should see an updated cart listing all the products they previously added together with the new one. The cart should also have an entry for total sales tax due.[15] To model this business requirement, let’s use a synchronous add-product message that responds with an updated cart, and a synchronous calculate-sales-tax message that responds with the gross price. We won’t concern ourselves with the underlying services.

15

For our purposes, consider value-added tax (VAT, as used in Europe) to be calculated in the same way.

The list of messages has only two entries:

  • add-product— Triggers calculate-sales-tax, and contains details of the product and cart
  • calculate-sales-tax— Triggers nothing, and contains the net price and relevant details of the product

Let’s focus on the calculate-sales-tax message. What properties might it have?

  • Net price
  • Product category
  • Customer location
  • Time of purchase
  • Others you haven’t thought of

The microservice architecture allows you to put this question to one side. You don’t need to think about it much, because you’re not trying to solve everything at once. The best practice is to build the simplest possible implementation you can think of: solve the simple, general case first.

Let’s make some simplifying assumptions. You’re selling one type of product in one country, and this fully defines the sales tax rate to apply. You can handle the calculate-sales-tax message by writing a sales-tax microservice that responds synchronously. It applies a fixed rate, hardcoded into the service, to the net price, to generate a gross price: gross = net * rate.

Let’s also return to the useful fiction that every microservice sees every message. How does the sales-tax microservice recognize calculate-sales-tax messages? Which properties are relevant? In fact, there’s nothing special about the properties listed previously. They don’t contain enough information to distinguish this message from others that also contain product details, such as add-product messages. The simple answer is to label the message. This isn’t a trick question: labels are a valid way to namespace the set of messages. Let’s give calculate-sales-tax messages a label with the string value "sales-tax".

The label allows you to perform pattern matching. All messages that match the pattern label:sales-tax go to the sales-tax microservice. You’re careful not to say how that happens in terms of data flowing over the network; nor are you saying where that intelligence lies. You’re only concerned with defining a pattern that can pick out messages you care about.

Here’s an example message:

label: sales-tax
net: 1.00

The sales-tax service, with a hardcoded rate of, say, 10%, responds like this:

gross: 1.10
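Under these simplifying assumptions, the whole service is only a few lines. Here’s a minimal sketch in plain JavaScript (the function name salesTax and the hardcoded rate are illustrative, not part of the message language):

```javascript
// Hardcoded rate: the simplest possible implementation of the service.
const RATE = 0.10;

// Handle a calculate-sales-tax message such as { label: 'sales-tax', net: 1.00 }.
// The pattern match is the label check; everything else is data.
function salesTax(msg) {
  if (msg.label !== 'sales-tax') return null; // not our message
  // gross = net * rate, rounded to cents for the response
  return { gross: Number((msg.net * (1 + RATE)).toFixed(2)) };
}
```

Calling salesTax({ label: 'sales-tax', net: 1.00 }) yields { gross: 1.1 }, matching the exchange above.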

What is the pattern-matching algorithm? There’s no best choice. The best practice is to keep it simple. In this example, you might say, “Match the value of the label property.” As straightforward as that.

3.3.2. Sales tax: handling categories

Different categories of products can have different sales tax rates. Many countries have special, reduced sales tax rates that apply only to certain products or only in certain situations. Luckily, you include a category property in your messages. If you didn’t, you’d have to add one anyway. In general, if you’re missing information in your messages, you update the microservice generating those messages first, so that the new information is present even if nobody is using it. Of course, if you use strict message schemas, this is much more difficult. It’s precisely this type of flexibility that microservices enable; by having a general rule that you ignore new properties you don’t understand, the system can continue functioning.

To support different rates for different product categories, a simple approach is to modify the existing sales-tax microservice so that it has a lookup table for the different rates. You keep the pattern matching the same. Now you can deploy the new version of sales-tax in a systematic fashion, ensuring a smooth transition from the old rules to the new. Depending on your use case, you might tolerate some discrepancies during the transition, or you might use feature flags in your messages to trigger the new rules once everything is in place.

An alternative approach is to write a new sales-tax microservice for each category. This is a better approach in the long term. With the right automation and management in place, the marginal cost of adding new microservices is low. Initially, they’ll be simple—effectively, just duplicates of the single-rate service with a different hardcoded rate.

You might feel uneasy about this suggestion. Surely these are now nanoservices; it feels like the granularity is too fine. Isn’t a lookup table of categories and rates sufficient? But the lookup table is an open door to technical debt. It’s a data structure that models the world and must be maintained. As new complexities arise, it will need to be extended and modified. The alternative—using separate microservices—keeps the code simple and linear.

The statement of the business problem was misleading. Yes, different product categories have different sales tax rates. But if you look at the details, every tax code of every country in the world contains a morass of subdivisions, special cases, and exclusions, and legislatures add more every day. It’s impossible to know what business rules you’ll need to apply using your lookup table data model. And after a few iterations, you’ll be stuck with that model, because it will be internally complex.

Separating the product categories into separate microservices is a good move when faced with this type of business rule instability. At first, it seems like overkill, and you’re definitely breaking the DRY principle, but it quickly pays dividends. The code in each microservice is more concrete, and the algorithmic complexity is lower.

With this approach, the next question is how to route messages to the new microservices. Pattern matching again comes into play. Let’s use a better, but still simple, algorithm. Each sales-tax microservice examines inbound messages and responds to those that have a label property with value sales-tax and a category property whose value matches the category it can handle. For example, the message

label: sales-tax
net: 1.00
category: standard

is handled by the existing sales-tax microservice. But the message

label: sales-tax
net: 1.00
category: reduced

is handled by the sales-tax-reduced microservice. The mapping between patterns and services is listed in table 3.2.

Table 3.2. Pattern to service mapping

Pattern                              Microservice

label:sales-tax                      sales-tax
label:sales-tax,category:standard    sales-tax
label:sales-tax,category:reduced     sales-tax-reduced

Notice that the general-case microservice, sales-tax, handles messages that have no category. The calculation may be incorrect, but you’ll get something back, rather than failure. If you’re going for availability rather than consistency, this is an acceptable trade-off.
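Table 3.2 can be expressed directly as data. The sketch below (plain JavaScript; the routes list and route function are illustrative) checks patterns from most to least specific, so the label-only pattern acts as the general-case fallback described above:

```javascript
// Table 3.2 as data, ordered most specific first; label-only is the fallback.
const routes = [
  { pattern: { label: 'sales-tax', category: 'standard' }, service: 'sales-tax' },
  { pattern: { label: 'sales-tax', category: 'reduced' },  service: 'sales-tax-reduced' },
  { pattern: { label: 'sales-tax' },                       service: 'sales-tax' },
];

// Return the service for the first pattern whose properties all
// match the message under string equality.
function route(msg) {
  for (const { pattern, service } of routes) {
    const match = Object.keys(pattern)
      .every(k => String(msg[k]) === String(pattern[k]));
    if (match) return service;
  }
  return null;
}
```

A message with category: reduced routes to sales-tax-reduced, while a message with no category at all still gets an answer from the general-case sales-tax service.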

It’s the responsibility of the underlying microservice messaging implementation to adhere to these pattern-matching rules. Perhaps you’ll hardcode them into a thin layer over a message queue API. Or maybe you’ll use the patterns to construct message-queue topics. Or you can use a discovery service that responds with a URL to call for any given message. For now, the key idea is this: choose a simple pattern-matching algorithm, and use it to declaratively define which types of messages should be handled by which types of services. You’re thus communicating your intent as architect of the microservice system.

The pattern-matching algorithm

You might ask, “Which pattern-matching algorithm should I use? How complex should it be?” The answer is, as simple as possible. You should be able to scan the list of pattern-to-microservice mappings and manually assign any given message to a microservice based on the content of the message. There’s no magic pattern-matching algorithm that suits all cases. The Seneca framework[16] used for the case study in chapter 9 uses the following algorithm:

16

I’m the maintainer of this open source project. See http://senecajs.org.

  • Each pattern picks out a finite set of top-level properties by name.
  • The value of each property is considered to be a character string.
  • A message matches a pattern if all property values match under string equality.
  • Patterns with more properties have precedence over patterns with fewer.
  • Patterns with the same number of properties use alphabetical precedence.

Given the following pattern-to-microservice mapping:

  • a:0 maps to microservice A.
  • b:1 maps to microservice B.
  • a:0,b:1 maps to microservice A.
  • a:0,c:2 maps to microservice C.

The following messages will be mapped as indicated:

  • Message {a:0, x:9} goes to A.
    • Matches pattern a:0, because property a in the message has value 0.
  • Message {b:1, x:8} goes to B.
    • Matches pattern b:1, because property b in the message has value 1.
  • Message {a:0, b:1, x:7} goes to A.
    • Matches pattern a:0,b:1 rather than a:0 or b:1, because a:0,b:1 has more properties.
  • Message {a:1, b:1, x:6} goes to B.
    • Matches pattern b:1, because property b in the message has value 1, and there’s no pattern for a:1.
  • Message {a:0, c:2, x:5} goes to C.
    • Matches pattern a:0,c:2 by the values of properties a and c.
  • Message {a:0, b:1, c:2, x:9} goes to A.
    • Matches pattern a:0,b:1 rather than a:0,c:2; both match with two properties each, but a:0,b:1 wins because b comes before c alphabetically.

In all of these cases, property x is data and isn’t used for pattern matching. That said, there’s nothing to prevent future patterns from using x, should it become necessary due to emerging requirements.
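As a sketch of these precedence rules (not the actual Seneca implementation, which is backed by the patrun library), the algorithm fits in a few lines of JavaScript:

```javascript
// Match a message against a list of { pattern, service } mappings.
// Rules: string equality on each pattern property; more properties win;
// ties break alphabetically on the pattern's property names.
function findService(mappings, msg) {
  const matches = mappings.filter(m =>
    Object.keys(m.pattern).every(k => String(msg[k]) === String(m.pattern[k]))
  );
  matches.sort((a, b) => {
    const ka = Object.keys(a.pattern).sort();
    const kb = Object.keys(b.pattern).sort();
    if (ka.length !== kb.length) return kb.length - ka.length; // more properties first
    return ka.join(',') < kb.join(',') ? -1 : 1;               // then alphabetical
  });
  return matches.length ? matches[0].service : null;
}

// The pattern-to-microservice mapping from the example above.
const mapping = [
  { pattern: { a: 0 },       service: 'A' },
  { pattern: { b: 1 },       service: 'B' },
  { pattern: { a: 0, b: 1 }, service: 'A' },
  { pattern: { a: 0, c: 2 }, service: 'C' },
];
```

Running the six example messages through findService reproduces the mappings listed above; for instance, {a:0, b:1, c:2, x:9} goes to A because a:0,b:1 sorts before a:0,c:2.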

3.3.3. Sales tax: going global

The business grows, as businesses do. This creates the proverbial nice problem to have: the business wants to expand into international markets. The e-commerce system now needs to handle sales tax in a wide range of countries. How do you do this? It’s a classic case of combinatorial explosion, in that you have to think in terms of per-country, per-category rules.

The traditional strategy is to refactor the data models and try to accommodate the complexity in the models. It’s an empirical observation that this leads to technical debt; the traditional approach leads to traditional problems. Alternatively, if you follow the microservice approach, you end up with per-country, per-category microservices. Doesn’t the number of microservices increase exponentially as you add parameters?

In reality, there’s a natural limit on this combinatorial explosion: the natural complexity size of an individual microservice. We explored this in chapter 2 and agreed that it’s about one week of developer effort, perhaps coarsely graded by skill level. This fact means you’ll tend to build data models up to this level of complexity, but no further. And many of these data models can handle multiple combinations of messages.

Certain countries have similar systems with similar business rules. These can be handled by a microservice with that dreaded lookup table. But exceptions and special cases won’t be handled by extending the model: instead, you’ll write a special-case microservice.

Let’s look at an example. Most European Union countries use the standard and reduced-rates structure. So, let’s build an eu-vat microservice that uses a lookup table keyed by country and category. If this seems heretical, given the existing sales-tax and sales-tax-reduced microservices, you could be right—you may need to embrace the heresy, end-of-life those services, and use a different approach. This isn’t a problem! You can run both models at the same time and use pattern matching to route messages appropriately.

3.3.4. Business requirements change, by definition

Any approach to software development that expects business requirements to remain constant over the course of a project is doomed to deliver weak results. Even projects that have strict, contract-defined specifications suffer from requirement drift. Words on paper are always subject to reinterpretation, so you can take for granted that the initial business requirements will change almost as soon as the ink is dry. They will also change during the project and after the project goes live.

The agile family of software project management methodologies attempts to deal with this reality by using a set of working practices that encourage flexibility. The agile working practices—iterations, unit testing, pair programming, refactoring, and so on—are useful, but they can’t overcome the weight of technical debt that accumulates in monolithic systems. Unit testing, in particular, is meant to enable refactoring, but in practice it doesn’t achieve this effectively.

Why is refactoring a monolithic code base difficult? Because monolithic code bases enable the growth of complex data structures. The ease of access to internal representations of the data model means it’s easy to extend and enhance. Even if the poor project architect initially defines strict API boundaries, there’s no strong defense against sharing data and structures. In the heat of battle, rules are broken to overcome project deadlines.

The microservice architecture is far less vulnerable to this effect, because it’s much harder for one microservice to reach inside another and interfere with its data structures. Messages naturally tend to be small. Large messages aren’t efficient when transported over the network. A natural force makes messages concise, including only the data that’s needed, and in the form of simpler data structures.

Combined, the small size of microservices, the smaller size of messages, and the resulting lower complexity of data structures internal to microservices result in an architecture that’s easier to refactor. This ease derives from the engineering approach, rather than project management discipline, and so is robust in response to variances in team size, ability, and politics.

3.3.5. Pattern matching lowers the cost of refactoring

The pattern-matching tactic has a key role to play in making refactoring easier. When first building microservices, developers have a tendency to apply schema validation to the messages. This is an attempt to enforce a quality of correctness in the system, but it’s misguided.

Consider the scenario where you want to upgrade a particular microservice. For scale, you might have multiple instances of the microservice running. Let’s say this microservice is at version 1.0. You need to deploy version 2.0, which adds new functionality. The new functionality uses new fields in the messages that the microservice sends and receives.

Taking advantage of the deployment flexibility of microservices, you deploy a new instance of v2.0 while leaving the existing v1.0 instances running. This lets you monitor the system to see whether you’ve broken anything. Unfortunately, you have! The strict schema you’re enforcing for v1.0 messages isn’t compatible with the changes in v2.0, and you can’t run both instances of the service at the same time. This negates one of the main reasons for using microservices in the first place.

Alternatively, you can use pattern matching. With pattern matching, as you’ve seen in the sales tax example, it’s easy to modify the messages without breaking anything. Older microservices ignore new fields that they don’t understand. Newer microservices can claim messages with new fields and operate on them correctly. It becomes much easier to refactor.
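This tolerant-reader behavior can be sketched as follows (the checkout messages and the coupon field are invented for illustration). A v1.0 handler that reads only the fields it knows about keeps working when v2.0 messages carry extra fields:

```javascript
// v1.0 handler: reads only the fields it knows; unknown fields pass through
// unnoticed, so newer messages don't break it.
function checkoutV1(msg) {
  return { ok: true, total: msg.cartTotal };
}

// v2.0 adds an optional coupon field; older v1.0 messages still work,
// because the new field simply defaults when absent.
function checkoutV2(msg) {
  const discount = msg.coupon ? 0.1 * msg.cartTotal : 0;
  return { ok: true, total: msg.cartTotal - discount };
}
```

Both handlers can run side by side during the migration: v1.0 instances silently ignore coupon, while v2.0 instances act on it.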

Postel’s law

Jon Postel, who was instrumental in the design of the TCP/IP protocol (among others), espoused this principle: “Be conservative in what you do, be liberal in what you accept from others.” This principle informed the design of TCP/IP. Formally, it’s defined as being contravariant on input (that is, you accept supersets of the protocol specification) and covariant on output (you emit subsets of the protocol specification). This approach is a powerful way to ensure that many independent systems can continue to work well with one another. It’s useful in the microservice world.

This style of refactoring has a natural safety feature. The production system is changed by adding and subtracting microservice instances, but it isn’t changed by modifying code and redeploying. It’s easier to control and measure possible side effects at the granularity level of the microservice—all changes occur at the same level, so every change can be monitored the same way. We’ll discuss techniques for measuring microservice systems in chapter 6—in particular, how measurement can be used to lower deployment risk in a quantifiable way.

3.4. Transport independence

Microservices must communicate with each other, but this doesn’t mean they need to know about each other. When a microservice knows about another microservice, this creates an explicit coupling. One of the strongest criticisms of the microservice architecture is that it’s a more complex, less manageable version of traditional monolithic architectures. In this criticism, messages over the network are nothing more than elaborate remote procedure calls, and the system is a mess of dependencies—the so-called distributed monolith.

This problem arises if you consider the issue of identity to be essential. The naïve model of microservice communication is one where you need to know the identity of the microservice to which you want to send a message: you need the address of the receiving microservice. An architecture that uses direct HTTP calls for request/response messages suffers from this problem—you need a URL endpoint to construct your message-sending calls. Under this model, a message consists not only of the content of the message but also the identity of the recipient.

There’s an alternative. Each microservice views the world as a universe from which it receives messages and to which it emits messages. But it has no knowledge of the senders or receivers. How is this possible? How do you ensure that messages go to the right place? This knowledge must exist somewhere. It does: in the configuration of the transport system for microservice messages. But ultimately, that’s merely an implementation detail. The key idea is that microservices don’t need to know about each other, how messages are transported from one microservice to another, or how many other microservices see the message. This is the idea of transport independence, and using it means your microservices can remain fully decoupled from each other.

3.4.1. A useful fiction: the omnipotent observer

Transport independence means you can defer consideration of practical networking questions. This is extremely powerful from the perspective of the microservice developer. It enables the false, but highly useful, assumption that any microservice can receive and send any message. This is useful because it allows you to separate the question of how messages are transported from the behavior of microservices.

With transport independence, you can map messages from the message language to microservices in a completely flexible way. You’re free to create new microservices that group messages in new ways, without impacting the design or implementation of other microservices.

This is why you were able to work at the design level with the messages describing the e-commerce system, without first determining what services to build. Implicitly, you assumed that any service can see any messages, so you were free to assign messages to services at a late stage in the design process. You get even more confidence that you haven’t painted yourself into a corner when you realize that this implicit assumption remains useful for production systems. Microservices are disposable, so reassigning messages to new services is low cost.

This magical routing of messages is enabled by dropping the idea that individual microservices have identities, and by using pattern matching to define the mapping from message to microservice. You can then fully describe the design of the system by listing the patterns that each microservice recognizes and emits. And as you’ve seen with the sales tax example, this is the place you want to be.

The implementation and physical transport of messages isn’t something to neglect—it’s a real engineering problem that any system, particularly a large system, must deal with. But microservices should be written as if transport is an independent consideration. This means you’re free to use any transport mechanism, from HTTP to message queues, and free to change the transport mechanism at any time.

You also get the freedom to change the way messages are distributed. Messages may be participants in request/response patterns, publish/subscribe patterns, actor patterns, or any other variant, without reference to the microservices. Microservices neither know nor care who else interacts with messages.

In the real world, at the deployment level, you must care. We’ll discuss mechanisms for implementing transport independence in chapter 5.

3.5. Message patterns

The core principles of pattern matching and transport independence allow you to define a set of message patterns, somewhat akin to object-oriented design patterns. We’ll use the conventions introduced in chapter 2 to describe these patterns.

As a reminder, we’re focusing on two aspects of message interaction:

  • Synchronous/asynchronous (solid/dashed line)— The message expects/doesn’t expect a response.
  • Observe/consume (empty/full arrowhead)— The message is either observed (others can see it too) or consumed (others can’t see it).

We’ll consider these interactions in the context of increasing numbers of services. A further reminder: in all cases, unless explicitly noted, we assume a scalable system where each microservice is run as multiple instances. We assume that the deployment infrastructure, or microservice framework, provides capabilities such as load balancing to make this possible.

To formalize the categorization of the patterns, you can think of them in terms of the number of message patterns and the number of microservices (not instances!). In the simplest case, there’s a single message of a particular pattern between two microservices. This is a 1/2 pattern, using the form m/n, where m is the number of message patterns and n is the number of microservices.

3.5.1. Core patterns: one message/two services

In these patterns, you enumerate the four permutations of the synchronous/asynchronous and observe/consume axes. In general, enumerating the permutations of message interactions, especially with higher numbers of microservices, is a great way to discover possibilities for microservice interaction patterns. But we’ll start simple, with the big four.

1/2: Request/response

This pattern, illustrated in figure 3.2, describes the common HTTP message transport model. Messages are synchronous: the initiating microservice expects a response. The listening microservice consumes the message, and nobody else gets to see it. This mode of interaction covers traditional REST-style APIs and a great many first-generation microservice architectures. Considerable tooling is available to make this interaction highly robust.[17]

17

The open source toolset from Netflix is well worth exploring: https://netflix.github.io.

Figure 3.2. Request/response pattern: the calculation of sales tax, where microservice A is the shopping-cart service, microservice B is the sales-tax service, and the message is calculate-sales-tax. The shopping-cart service expects an immediate answer so it can update the cart total.

If the cardinality of the listening microservice is greater than a single instance, then you have an actor-style pattern. Each listener responds in turn, with a load balancer[18] distributing work according to some desired algorithm.

18

The load balancer isn’t necessarily a standalone server. Client-side load balancing has the advantage that it can be achieved with lightweight libraries and removes a server from the deployment configuration of your system.

When you develop a microservice system locally, it’s often convenient to run the system as a set of individual microservices, with a single instance of each type. Message transport is implemented by hardcoding routes as HTTP calls to specific port numbers, and most messages are of the request/response form. It can’t be stressed forcefully enough that the microservices must be sheltered from the details of this local configuration. The benefit is that it’s much easier to develop and verify message interactions in this simplified configuration.
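A thin local transport layer might look like the following sketch (the port numbers and route table are invented). Only this layer knows the hardcoded routes; the microservice code itself never sees them:

```javascript
// Local development transport: message labels hardcoded to local ports.
// Swapping this table out changes the transport without touching services.
const localRoutes = {
  'sales-tax': 'http://localhost:8011/act',
  'checkout':  'http://localhost:8012/act',
};

// Resolve a message to the URL of the service instance that handles it.
function resolve(msg) {
  const url = localRoutes[msg.label];
  if (!url) throw new Error('no local route for label: ' + msg.label);
  return url; // a real layer would now POST the message to this URL
}
```

Replacing this file with one backed by a message queue or a discovery service changes the transport for the whole system, while every microservice stays unchanged.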

1/2: Sidewinder

This pattern may at first seem rather strange (see figure 3.3). It’s a synchronous message that isn’t consumed. So who else observes the message? This is a good example of the level that this model operates at—you aren’t concerned with the details of the network traffic. How others observe this message is a secondary consideration. This pattern communicates intent. The message is observable. Other microservices, such as an auditing service, may have an interest in the message, but there’s still one microservice that will supply the response. This is also a good example of the way a model generates new ways of looking at systems by enumerating the combinations of elements the model provides.

Figure 3.3. Sidewinder pattern: in the e-commerce system, the shopping-cart service may send a checkout message to the delivery service. A recommendation service can observe the checkout messages to understand the purchasing behavior of customers and generate recommendations for other products they might be interested in. But the existence of the recommendation service isn’t known to the shopping-cart and delivery services, which believe they’re participating in a simple request/response interaction.

1/2: Winner-take-all

This is a classic distributed-systems pattern (see figure 3.4). Workers take tasks from a queue and operate on them in parallel. In this world view, asynchronous messages are sent out to instances of listening microservices; one of the services is the winner and acts on the message. The message is asynchronous, so no response is expected.

Figure 3.4. Winner-take-all pattern: in the e-commerce system, you can configure the delivery service to work in this mode. For redundancy and fault tolerance, you run multiple instances of the delivery service. It’s important that the physical goods are delivered only once, so only one instance of the delivery service should act on any given checkout message. The message interaction is asynchronous in this configuration, because shopping-cart doesn’t expect a response (ensuring delivery isn’t its responsibility!). You could use a message bus that provides work queues to implement this mode of interaction.

In reality, a message queue is a good mechanism for implementing this behavior, but it isn’t absolutely necessary—perhaps you’re using a sharding approach and ignoring messages that don’t belong to your shard. As an architect using this pattern, you’re again providing intent, rather than focusing on the network configuration of the microservices.

1/2: Fire-and-forget

This is another classic distributed pattern: publish/subscribe (see figure 3.5). In this case, all the listening microservice instances observe the message and act on it in some way.

Figure 3.5. Fire-and-forget pattern: the general form of this interaction involves a set of different microservices observing an emitted message. The shopping-cart service emits a checkout message, and various other services react to it: delivery, checkout-email, and perhaps audit. This is a common pattern.

From an idealistic perspective, this is the purest form of microservice interaction. Messages are sent out into the world, and whoever cares may act on them. All the other patterns can be interpreted as constraints on this model.

Strictly speaking, figure 3.5 shows a special case: multiple instances of the same microservice will all receive the message. Diagrams of real systems typically include two or more listening services. This special case is sometimes useful because although it performs the same work more than once, you can use it to deliver guaranteed service levels. The catch is that the task must be idempotent.[19]

19

An idempotent task can be performed over and over again and always has the same output: for example, setting a data field to a specific value. No matter how many times the data record is updated, the data field always gets the same value, so the result is always the same.

For example, the e-commerce website will display photos of products. These come in a standard large format, and you need to generate thumbnail images for search result listings. Resizing the large image to a thumbnail always generates the same output, so the operation is idempotent. If you have a catalog of millions of products (remember, your company is quite successful at this stage), then some resizing operations will fail due to disk failures or other random issues. If 2% of resizings fail on average, then performing the resizing twice using different microservice instances means only 0.04% (2% × 2%) of resizings will fail in production. Adding more microservice instances gives you even greater fault tolerance. Of course, you pay the price in redundant work. We’ll examine this trade-off in chapter 5.

3.5.2. Core patterns: two messages/two services

These are the simplest message-interaction patterns; they capture causality between messages. These patterns describe how one type of message generates other types of messages. In one sense, this is an odd perspective: normally, you’d think in terms of the dependency relationships between microservices, rather than causality between messages.

From a messages-first perspective, it makes more sense. The causal relationships between messages are more stable than the relationships between the microservices that support the messages. It’s easier to handle the messages with a different grouping of microservices than it is to change the message language.

2/2: Request/react

This is a classic enterprise pattern (see figure 3.6).[20] The requesting microservice enables the listening microservice to respond asynchronously by accepting a separate message in reaction to the initial request message. The requesting microservice is responsible for correlating the outbound request message and its separate response message. This is a more manual version of traditional request/response.

20

For more, see the excellent book SOA Patterns by Arnon Rotem-Gal-Oz (Manning, 2012), https://www.manning.com/books/soa-patterns.

Figure 3.6. Request/react pattern

The advantage here is that you create temporal decoupling. Resources aren’t consumed on the requesting side while you wait for a response from the listener. This is most effective when you expect the listener to take a nontrivial amount of time to complete the request. For example, generating a secure password hash necessarily requires expending significant CPU time. Doing so on the main line of microservices that respond to user requests would negatively impact response times. Offloading the work to a separate set of worker microservices solves the problem, but you must allow for the fact that the work is still too slow for a normal request/response pattern. In this case, request/react is a much better fit.
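The correlation bookkeeping on the requesting side can be sketched like this (the message names and fields are invented for illustration):

```javascript
// Request/react sketch: the requester tags each outbound message with a
// correlation id and matches the later reaction message back to it.
let nextId = 0;
const pending = new Map();

// Send a request; remember the callback to run when the reaction arrives.
function request(msg, onDone) {
  const id = ++nextId;
  pending.set(id, onDone);
  return { ...msg, correlationId: id }; // this is what goes over the wire
}

// Handle an inbound reaction message by looking up its correlation id.
function onReaction(reaction) {
  const done = pending.get(reaction.correlationId);
  pending.delete(reaction.correlationId);
  if (done) done(reaction);
}
```

Between request and onReaction, the requester is free to serve other traffic; no thread or connection is held open waiting for the slow work to finish.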

2/2: Batch progress reporter

This is a variant of the request/react pattern that gives you a way to bring batch processes into the fold of microservices (see figure 3.7). Batch processes, such as daily data uploads, consistency checks, and date-based business rules, are often written as programs that run separately from the main request-serving system. This arrangement is fragile, because those batch processes don’t fall under the same control and monitoring mechanisms.

Figure 3.7. Batch progress reporter

By turning batch processes into microservices, you can bring them under control in the same way. This is much more efficient from a systems management perspective. And it follows the underlying principle that you should think of your system in microservice terms: small services that respond to messages.

In this case, the message interaction is similar to request/react, but there’s a series of reaction messages announcing the state of the batch process. This allows the triggering microservice, and others, to monitor its progress. To enable this, the reaction messages are observed rather than consumed.
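A batch microservice under this pattern might emit its reaction messages as in the following sketch (the message labels and fields are illustrative):

```javascript
// Batch progress sketch: the batch microservice emits an observable
// reaction message after each item, and a final completion message.
function runBatch(items, emit) {
  items.forEach((item, i) => {
    // ...process item here...
    emit({ label: 'batch-progress', done: i + 1, of: items.length });
  });
  emit({ label: 'batch-complete', of: items.length });
}
```

Because the progress messages are observed rather than consumed, the triggering microservice, a monitoring dashboard, and an audit service can all watch the same stream.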

3.5.3. Core patterns: one message/n services

In these patterns, you begin to see some of the core benefits of using microservices: the ability to deploy code to production in a partial and staged way. In monolithic applications, you’d implement some of these patterns using feature flags or other custom code. That approach is difficult to measure and incurs technical debt. With microservices, you can achieve the same effect under the same simple model as all other deployments. This level of homogeneity makes it easier to manage and measure.

1/n: Orchestra

In this pattern, an orchestrating service coordinates the activities of a set of supporting services (see figure 3.8, which shows that microservice A interacts with B first, then C, in the context of some workflow a). A criticism of the microservice architecture is that it’s difficult to understand microservice interactions, and thus service orchestration must be an important infrastructural support. This function can be performed by orchestration microservices that coordinate workflows directly, removing the need for a specialist network component to perform this role. In most large production microservice systems, you’ll find many microservices performing orchestration roles to varying degrees.

Figure 3.8. Orchestra

1/n: Scatter/gather

This is one of the more important microservice patterns (see figure 3.9). Instead of a deterministic, serial procedure for generating and collecting results, you announce your need and collect results as they come in. Let’s say you want to construct a product page for the e-commerce site. In the old world, you’d gather all the content from its providers and only return the page once everything was available. If one content provider failed, the whole page would fail, unless you’d written specific code to deal with this situation.

Figure 3.9. Scatter/gather

In the microservice world, individual microservices generate the content pieces and can’t affect others if they fail. You must construct the result (the product page) asynchronously, because your content components arrive asynchronously, so you build in fault tolerance by default. The product page can still display if certain elements aren’t ready—these can be injected as they become available. You get this flexibility as part of the basic architecture.

How do you coordinate responses and decide that your work is complete? There are a number of approaches. When you announce your need, you can set a time limit for responses; any response received after the time limit expires isn’t included. This is the approach taken in the case study in chapter 9. You can also return once you get a minimum number of results.
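The time-limit approach can be sketched as follows. This is a minimal illustration in Python, using threads to stand in for asynchronous microservice responses; the provider names and the `scatter_gather` helper are invented for the example, not part of any framework.

```python
import concurrent.futures
import time

def scatter_gather(providers, time_limit=0.3):
    """Announce the need to all providers; keep whatever arrives in time."""
    results = {}
    pool = concurrent.futures.ThreadPoolExecutor()
    futures = {pool.submit(fn): name for name, fn in providers.items()}
    done, _not_done = concurrent.futures.wait(futures, timeout=time_limit)
    for future in done:
        if future.exception() is None:  # a failed provider is simply omitted
            results[futures[future]] = future.result()
    pool.shutdown(wait=False)  # don't wait for stragglers
    return results

providers = {
    "reviews": lambda: "4.5 stars",
    "pricing": lambda: "$19.99",
    "related": lambda: 1 / 0,                         # this provider fails outright
    "recommendations": lambda: time.sleep(1) or "x",  # this one misses the deadline
}
page = scatter_gather(providers)  # the page renders with reviews and pricing only
```

The failed and late providers are excluded without any special error-handling code in the page builder, which is the fault tolerance the pattern gives you by default.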

For an extreme version, consider the sales tax use case. If you’re willing to choose availability over consistency, you can apply sales tax calculations and adapt to changes in rules in a flexible way. You announce that you need a sales tax calculation. All of the sales tax microservices respond, and you rank the results in order of specificity. The more specific the result, the more likely it is to be the correct sales tax calculation, because it takes into account more information about the purchase. Perhaps this seems like a recipe for disaster and inaccurate billing. But consider that pricing and sales tax errors occur every minute of every day, and businesses deal with them by making compensating payments or absorbing the cost of errors. This is normal business practice. Why? Because companies prefer to stay open rather than close their doors—there’s business to be done! We’ve allowed ourselves as software developers to believe that our systems must achieve perfect accuracy, but we should always ask whether the business wants to pay for it.
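Specificity ranking can be as simple as counting how many fields of the purchase each rule matched. A sketch, with invented response shapes (the `matched` field and the rates are illustrative, not from any real tax service):

```python
# Each responding sales tax microservice returns a rate tagged with the
# purchase fields its rule understood; more fields means a more specific rule.
responses = [
    {"rate": 0.20, "matched": {"country": "UK"}},
    {"rate": 0.23, "matched": {"country": "IE", "category": "digital"}},
    {"rate": 0.00, "matched": {}},  # a catch-all fallback rule
]

def most_specific(responses):
    """Rank responses by specificity; the most specific result wins."""
    return max(responses, key=lambda r: len(r["matched"]))

best = most_specific(responses)  # the two-field digital-goods rule wins
```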

1/n: Multiversion deployment

This is the scenario we discussed in the section on the sales tax microservices earlier (see figure 3.10). You want to be able to deploy updated versions of a given microservice as new instances while keeping the old instances running. To do this, you can use the actor-style patterns (winner-take-all, fire-and-forget); but instead of distributing messages to a set of instances, all of which are the same version, you distribute to instances of differing versions. The figure shows some interaction, a, where messages are sent to versions 1.0 and 2.0 of microservice A.

Figure 3.10. Multiversion deployment

That this deployment configuration is an extension of an existing pattern, and is easy to achieve by changing the details of the deployment, again shows the power of microservices. We’ve converted a feature from a custom implementation in code to explicit, declarative configuration.

This pattern becomes even more powerful when you combine it with the deployment and measurement techniques we’ll discuss in chapters 5 and 6. To lower the risk of deploying a new version of a service, you don’t need to serve production results back from that new version. Instead, you can duplicate traffic so that the old versions of a microservice, proven in production, keep everything working correctly as before. But now you have the output from the new microservice, and you can compare that to the output from the old microservice and see whether it’s correct. Any new bugs or errors can be detected using production traffic, but without causing issues. You can iterate deployment of the new version in this traffic-duplication mode until you have a high degree of confidence, using production traffic, that it won’t break things. This is a fantastic way to enable high-velocity continuous delivery.
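The traffic-duplication idea can be sketched in a few lines. This is a toy dispatcher, not a real router: the version functions and the mismatch log are assumptions of the example. The key property is that only the proven version’s result is ever returned to the caller.

```python
def handle_with_shadow(message, live_version, shadow_version, mismatch_log):
    """Serve from the proven version; duplicate the message to the new one.

    The shadow result is compared against the live result, and any
    difference is logged for inspection -- production traffic is unaffected.
    """
    live_result = live_version(message)
    try:
        shadow_result = shadow_version(message)
        if shadow_result != live_result:
            mismatch_log.append(
                {"message": message, "live": live_result, "shadow": shadow_result}
            )
    except Exception as err:  # a crash in the new version is just another log entry
        mismatch_log.append({"message": message, "live": live_result, "error": repr(err)})
    return live_result

mismatches = []
tax_v1 = lambda msg: round(msg["net"] * 0.20, 2)  # proven in production
tax_v2 = lambda msg: round(msg["net"] * 0.23, 2)  # new version under trial

result = handle_with_shadow({"net": 100.0}, tax_v1, tax_v2, mismatches)
```

Here the new version disagrees with the old one, so the discrepancy is recorded, but the caller still receives the old, trusted answer.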

1/n: Multi-implementation deployment

Expanding the possibilities offered by multiversion deployment, you can do multi-implementation deployment (see figure 3.11). This means you can try out different approaches to solve the same problem. In particular, and powerfully, you can do A/B testing. And you can do it without additional infrastructure. A/B testing becomes a capability of your system, without being something you need to build in or integrate. The figure shows microservices A and B both performing the same role in message interaction a.

Figure 3.11. Multi-implementation deployment

You can take this further than user interface A/B testing: you can A/B-test all aspects of your system, trialing new algorithms or performance enhancements, without taking massive deployment risks.

3.5.4. Core patterns: m messages/n services

The possibilities offered by configurations of multiple services and multiple messages expand combinatorially. That said, bear in mind that all message patterns can be decomposed into the four core 1/2 patterns. This is often a good way to understand a large microservice system. There are a few larger-scale patterns that are worth knowing, which we’ll explore here.

3.5.5. m/n: chain

The chain represents a serial workflow. Although this is often implemented using an orchestrating microservice (discussed in section 3.5.3), it can also be implemented in the configuration shown in figure 3.12, where the messages of the a workflow are choreographed;[21] the serial nature of the workflow is an emergent property of the individual microservices following their local rules.

21

Sam Newman, author of Building Microservices (O’Reilly, 2015), popularized the terms orchestration and choreography to describe microservice configurations.

Figure 3.12. Chain

Distributed systems that can perform work in parallel are nonetheless bounded by the work that can’t be parallelized (this is Amdahl’s law). There’s always some work that must be done serially. In an enterprise software context, this is often the case where actions are gated: certain conditions must be met before work can proceed.

3.5.6. m/n: Tree

The tree represents a complex workflow with multiple parallel chains (see figure 3.13, which shows the message flow subsequences). It arises in contexts where triggering actions cause multiple independent workflows. For example, the checkout process of an e-commerce site requires multiple streams of work, from customer communication to fulfillment.

Figure 3.13. Tree

3.5.7. Scaling messages

There can be no question that the microservice architecture introduces additional load on the network and thus reduces performance overall. The foremost way to address this issue is to ask whether you have an actual problem. In many scenarios, the performance cost of microservices is more than offset by their wide-ranging benefits. You’ll gain a deeper understanding of how this trade-off works, and how it can be adjusted, in chapters 5, 6, and 7.

To frame this discussion, it’s important to be specific about terminology:

  • Latency— The amount of time it takes for the system to complete an action. You could measure latency by measuring the average response time of inbound requests, but this isn’t a good measure because it doesn’t capture highly variant response times. It’s better to measure latency using percentiles: a latency of 100 ms at the 90th percentile means 90% of requests responded within 100 ms. This approach captures behavior that spikes unacceptably. Low latency is the desired outcome.
  • Throughput— The amount of load the system can sustain, given as a rate: the number of requests per second. By itself, the throughput rate isn’t very useful. It’s better to quote throughput and latency together: such a rate at such a percentile.
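A tiny nearest-rank percentile calculation shows why percentiles capture what averages hide. The sample latencies are invented for the example:

```python
def percentile(latencies_ms, pct):
    """Nearest-rank percentile: pct% of requests responded within this time."""
    ordered = sorted(latencies_ms)
    rank = max(1, int(len(ordered) * pct / 100))
    return ordered[rank - 1]

# 95 fast requests and 5 nasty spikes.
samples = [15] * 95 + [400] * 5
average = sum(samples) / len(samples)  # 34.25 ms: neither the typical nor the worst case
p90 = percentile(samples, 90)          # 15 ms at the 90th percentile
p99 = percentile(samples, 99)          # 400 ms: the spikes the average smears away
```

The average (34.25 ms) describes no actual request, while the 99th percentile exposes the unacceptable spikes directly.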

High throughput with low latency is the place you want to be, but it isn’t something you can achieve often. Like so much else in systems engineering, you must sacrifice one for the other or spend exponentially large amounts of money to compensate.

Choosing the microservice architecture means making an explicit trade-off: higher throughput, but also higher latency. You get higher throughput because it’s much easier to scale horizontally—just add more service instances. Because different microservices handle different levels of load, you can do this in a precise way, scaling up only those microservices that need to scale and allocating resources far more efficiently. But you also get higher latency, because you have more network traffic and more network hops. Messages need to travel between services, and this takes time.

Lowering latency

When you build a microservices system with a messages-first approach, you find that certain message interactions require lower latency than others: those that must respond to user input, those that must provide data in a timely fashion, and those that use resources under contention. These message interactions, if implemented asynchronously, will have higher latency. To reduce latency in these cases, you’ll have to make another trade-off: increase the complexity of the system. You can do this by introducing synchronous messages, particularly of the request/response variety. You can also combine microservices into single processes that are larger than average, but that increase in size brings with it the associated downsides of monolithic architectures.

These are legitimate trade-offs. As with all performance optimizations, addressing them ahead of time, before you have a proper measurement, is usually wasted effort. The microservice architecture makes your life easier by giving you the measurement aspect for a lower cost, because you monitor message flow rates in any case. Identifying performance bottlenecks is much easier.

It’s worth noting that lower-level optimizations at the code level are almost useless in a microservices context. The latency introduced by network hops swamps everything else by several orders of magnitude.

Increasing throughput

Achieving higher message throughput is easier, because it requires fewer compromises and can be achieved by spending money. You can increase the number of service instances, or use message queues[22] with better performance characteristics, or run on bigger CPUs.

22

Kafka is fast, if that’s what you need: http://kafka.apache.org.

In addition to adding muscle at the system level, you can make your life easier at the architectural level. Messages should be small: resist the urge to use them for transferring large volumes of data or large amounts of binary data, such as images. Instead, messages should contain references to the original data. Microservices can then retrieve the original data directly from the data-storage layer of your system in the most efficient manner.
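As a sketch of the reference-not-payload rule, an image-resize message might look like this. The message fields and the storage path are purely illustrative:

```python
def resize_request(image_ref, width):
    """Build a small resize message carrying a reference, not the image bytes."""
    return {"role": "image", "cmd": "resize", "image_ref": image_ref, "width": width}

# The worker fetches the bytes itself from the data-storage layer.
msg = resize_request("s3://product-images/widget-1234.png", 200)  # path is invented
```

The anti-pattern would be a `"data"` field holding megabytes of binary content, which every transport hop would have to copy.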

Be careful to separate CPU-bound activity into separate processes, so it doesn’t impact throughput. This is a significant danger in event-based platforms such as Node.js, where CPU activity blocks input/output activity. For thread-based platforms, it’s less of an issue, but resources are still consumed and have limits. In the sales tax example, as the complexity of the rules grows, computation time increases; it’s better to perform the calculation in a separate microservice. For highly CPU-intensive activities such as image resizing and password hashing, this is even more important.

3.6. When messages go bad

Microservice systems are complex. They don’t even start out simple, because they require deployment automation from the beginning. There are lots of moving parts, in terms of both the types of messages and microservices and on-the-ground instances of them. Such systems demonstrate emergent behavior—their internal complexity is difficult to comprehend as a whole. Not only that, but with so many interacting parts, subtle feedback loops can develop, leading to behavior that’s almost impossible to explain.

Such systems are discussed, in the general case, by Nassim Nicholas Taleb, author of the books The Black Swan (Penguin, 2008) and Antifragile (Random House, 2014). His conceptual model is a useful framework for understanding microservice architectures. He classifies systems as fragile, robust, and antifragile:

  • Fragile systems degrade when exposed to disorder and pressure. Most software falls into this category, failing to cope with the disorder generated by high loads.
  • Robust systems can resist disorder, up to a point. They’re designed to impose rigid discipline. The problem is that they suffer catastrophic failure modes—they work until they don’t. Most large-scale software is of this variety: a significant investment is made in debugging, testing, and process control, and the system can withstand pressure until it hits a boundary condition, such as a schema change.
  • Antifragile systems benefit from disorder and become stronger. The human immune system is an example. The microservice architecture is weakly antifragile by design and can be made more so by accepting the realities of communication over the network.

The key to antifragility is to accept lots of small failures, in order to avoid large failures. You want failures to be high in frequency but of low consequence. Traditional software quality and deployment practices, geared for the monolith, are biased toward low-frequency, high-consequence failures. It’s much better for individual microservices to fail by design (rebooting themselves as needed) than for the entire system to fail.

The best strategy for microservices, when handling internal errors, is to fail fast and reset. Microservices are crash-first software. Instances are small and plentiful, so there’s always somebody to take up the slack.

Dealing with external failure is more challenging. From the perspective of the individual microservice, external failures appear as misbehaving messages. When messages go bad, it’s easy for them to consume resources and cause the system to degrade. It’s therefore vital that the message-transportation infrastructure and the microservices adopt a stance that expects failure from external actors. No microservice should expect good behavior from the rest of the network. Luckily, many of the failure modes can be dealt with using a common set of strategies.

3.6.1. The common failure scenarios, and what to do about them

In this section, the failure modes are organized under the message interactions that are most susceptible to them. All failure modes can occur with any message interaction, but it’s useful to organize your thinking. All message interactions ultimately break down into the four core 1/2 interactions, so I’ll use the core interactions to catalog the failure modes.

For each failure mode, I’ll present options for mitigation. No failure mode can be eliminated entirely, and the principle of antifragility teaches that this is a bad idea anyway. Instead, you should adopt mitigation strategies that keep the whole healthy, even if a part must be sacrificed.

Note

In the following scenarios, microservice A is the sending service, and microservice B is the listening service.

3.6.2. Failures dominating the request/response interaction

The request/response interaction is vulnerable to high latency, which can cause excessive resource consumption as you wait for responses.

Slow downstream

Microservice A is generating work for microservice B. A might be an orchestrator, for example. B performs resource-intensive work. If B becomes slow for some reason, perhaps because one of its dependencies has become slow, then resources in A will be consumed waiting for responses from B. In an event-based platform, this will consume memory; but in a thread-based platform, it will consume threads. This is much worse, because eventually all threads will block waiting for B, meaning no new work can proceed.

Mitigation: use a circuit breaker. When B slows below a triggering level of throughput, consider B to be dead, and remove it from interactions with A. B either is irretrievably corrupt and will die anyway in due course and be replaced, or is overloaded and will recover once load is reduced. In either case, a healthy B will eventually result. The circuit breaker logic can operate either within A or in an intelligent load balancer in front of B. The net result is that throughput and latency remain healthy, if less performant than usual.
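A minimal circuit breaker can be sketched as below. This toy version trips after a number of consecutive failures rather than on a throughput threshold, which is a common simplification; the class and its parameters are assumptions of the example, not any particular library’s API.

```python
import time

class CircuitBreaker:
    """Treat the downstream as dead after repeated failures; retry after a cool-off."""

    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, downstream, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: downstream considered dead")
            self.opened_at = None  # cool-off elapsed; give B another chance
            self.failures = 0
        try:
            result = downstream(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip: stop sending work to B
            raise
        self.failures = 0  # a success resets the failure count
        return result

breaker = CircuitBreaker(failure_threshold=2, reset_after=60)

def slow_b(msg):
    raise TimeoutError("B timed out")

errors = []
for _ in range(3):
    try:
        breaker.call(slow_b, {"cmd": "work"})
    except Exception as err:
        errors.append(type(err).__name__)
# After two timeouts the breaker trips; the third call fails fast in A,
# without consuming a thread or waiting on B at all.
```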

Upstream overload

Similar to the previous scenario, microservice A is generating work for microservice B. But in this scenario, A isn’t well behaved and doesn’t use a circuit breaker, so B must fend for itself. As the load increases, B is placed under increasing strain, and performance suffers. Even if you assume that the system has implemented some form of automated scaling, load can increase faster than new instances of B can be deployed, and there are always cost limits to adding new instances of B. Blindly increasing the system’s size also creates significant exposure to denial-of-service attacks.

Mitigation: B must selectively drop messages from A that push B beyond its performance limits. This is known as load shedding. Although it may seem aggressive, it’s more important to keep the system up for as many users as possible. Under high-load scenarios, performance tends to degrade across the board for all users. With load shedding, some users will receive no service, but at least those that do will receive normal service levels.
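Load shedding can be sketched with a sliding-window rate limit in B’s inbound path. This is a deliberately simple illustration (the class, its window size, and the fixed timestamps are assumptions of the example):

```python
import collections
import time

class LoadShedder:
    """Drop inbound messages once the recent arrival rate exceeds a limit."""

    def __init__(self, max_per_second=100):
        self.max_per_second = max_per_second
        self.arrivals = collections.deque()

    def accept(self, now=None):
        now = time.monotonic() if now is None else now
        while self.arrivals and now - self.arrivals[0] > 1.0:
            self.arrivals.popleft()  # forget arrivals older than the one-second window
        if len(self.arrivals) >= self.max_per_second:
            return False  # shed: better to refuse some users than degrade all of them
        self.arrivals.append(now)
        return True

shedder = LoadShedder(max_per_second=3)
decisions = [shedder.accept(now=10.0) for _ in range(5)]  # a burst in one instant
```

The first three messages are served at normal quality; the rest are refused outright rather than dragging everyone’s latency down.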

If you allow your system to have supervisory microservices that control scaling in response to load, you can also use signaling from B to mitigate this failure mode. B should announce via asynchronous messages to the system in general that it’s under too much load. The system can then reduce inbound requests further upstream from A. This is known as applying backpressure.

3.6.3. Failures dominating the sidewinder interaction

Failure in this interaction is insidious—you may not notice it for many days, leading to data corruption.

Lost actions

In this scenario, microservices A and B have the primary interaction, but C is also observing, performing its own actions. Any of these microservices can be upgraded independently of the others. If this introduces changes to message content or different message behavior, then either B or C may begin to fail. Because one of the key benefits of microservices is the ability to perform independent updates, and because this enables continuous delivery, this failure mode is almost to be expected. It’s often used as a criticism of the microservice architecture, because the naïve mitigation is to attempt coordinated deployments. This is a losing strategy that’s difficult to do reliably in practice.

Mitigation: measure the system. When A sends the message, it should be received by B and C. Thus, the outbound and inbound flow rates for this message must be in the ratio 1:2. Allowing for transient variations, this ratio can be used as a health indicator. In production systems, there are many instances of A, B, and C. Updates don’t replace all of A, for example; rather, following microservice best practice, A is updated in stages, one instance at a time. This allows you to monitor the message-flow-rate ratio to see whether it maintains its expected value. If not, you can roll back and review. This is discussed more fully in chapter 6.
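The ratio check itself is trivial to compute from outbound and inbound flow rates. A sketch, with an invented tolerance parameter:

```python
def flow_ratio_healthy(outbound, inbound, expected=2.0, tolerance=0.1):
    """Check that the inbound:outbound message ratio matches the expected fan-out.

    One message sent by A should be received by both B and C, so the
    expected ratio is 2:1; transient variation is absorbed by the tolerance.
    """
    if outbound == 0:
        return inbound == 0
    ratio = inbound / outbound
    return abs(ratio - expected) <= expected * tolerance

# A sent 1,000 messages; B and C together logged 1,990 receipts: healthy.
ok = flow_ratio_healthy(outbound=1000, inbound=1990)
# After a bad deploy, C stops receiving, and the ratio collapses toward 1:1.
bad = flow_ratio_healthy(outbound=1000, inbound=1010)
```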

3.6.4. Failures dominating the winner-take-all interaction

These failures can bring your system down and keep it down. Simple restarts won’t fix the problem, because the issue is in the messages themselves.

Poison messages

This is a common failure mode for actor-style distributed systems. Microservice A generates what it thinks is a perfectly acceptable message. But a bug in microservice B means B always crashes on the message. The message goes back onto the queue, and the next instance of B attempts to process it and fails. Eventually, all instances of B are stuck in a crash-reboot cycle, and no new messages are processed.

Mitigation: microservice B needs to keep track of recently received messages and drop duplicates on the floor. This requires messages to have some sort of identifier or signature. To aid debugging, the duplicate should be consumed but not acted on. Instead, it should be sent to a dead-letter queue.[23] This mitigation should happen at the transport layer, because it’s common to most message interactions.

23

The dead-letter queue is the place you send copies of broken messages for later analysis. It can be as simple as a microservice that takes no actions but logs every message sent to it.
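The transport-layer mitigation can be sketched as a deduplicating wrapper around B’s handler. This toy version assumes every message carries an `id` field and keeps an unbounded set of seen identifiers; a production version would bound that memory.

```python
class DedupingTransport:
    """Drop repeats of a recently seen message; shunt them to a dead-letter queue."""

    def __init__(self, handler, dead_letter):
        self.handler = handler
        self.dead_letter = dead_letter
        self.seen = set()  # a production version would bound or expire this set

    def receive(self, message):
        msg_id = message["id"]  # assumes every message carries an identifier
        if msg_id in self.seen:
            self.dead_letter.append(message)  # consumed and logged, but not acted on
            return
        self.seen.add(msg_id)
        self.handler(message)

handled, dead_letters = [], []
transport = DedupingTransport(handled.append, dead_letters)
transport.receive({"id": "m-1", "cmd": "charge"})
transport.receive({"id": "m-1", "cmd": "charge"})  # the poison retry is shunted aside
```

The second delivery never reaches the crashing handler, so B’s instances escape the crash-reboot cycle, and the broken message is preserved for debugging.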

Guaranteed delivery

Asynchronous messages are best delivered using a message queue. Some message-queue solutions claim to guarantee that messages will be delivered to listeners at most once, exactly once, or at least once. None of these guarantees can be given absolutely in practice, because message delivery over an unreliable network suffers from the Two Generals problem: it’s impossible to know for sure whether your message has been delivered.

Mitigation: skew delivery to prefer at-least-once behavior. Then you have to deal with duplicate messages. Making behavior idempotent wherever possible neutralizes duplicates, because repeating an idempotent action has no further effect. As explained earlier in the chapter, idempotency is the property whereby a system can safely perform the same action multiple times and end up in the same state: for example, the sales tax calculation is naturally idempotent because it always returns the same result for the same inputs.
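Both flavors of idempotency can be shown in a few lines. The first handler is naturally idempotent; the second is a state-changing action made idempotent by recording applied message ids (all names and amounts here are invented for the example):

```python
def calculate_sales_tax(message):
    """Naturally idempotent: the same inputs always give the same result."""
    return round(message["net"] * message["rate"], 2)

first = calculate_sales_tax({"net": 100.0, "rate": 0.23})
second = calculate_sales_tax({"net": 100.0, "rate": 0.23})  # duplicate delivery is harmless

# A state-changing action made idempotent by remembering applied message ids.
applied_ids = set()
account = {"balance": 0.0}

def record_payment(message):
    if message["id"] in applied_ids:
        return account["balance"]  # duplicate: same end state, no double charge
    applied_ids.add(message["id"])
    account["balance"] += message["amount"]
    return account["balance"]

record_payment({"id": "pay-7", "amount": 50.0})
total = record_payment({"id": "pay-7", "amount": 50.0})  # delivered at least once
```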

3.6.5. Failures dominating the fire-and-forget interaction

These failure modes are a reminder that the gods will always find our mortal plans amusing. They gave us brains just powerful enough to build machines that we can never fully comprehend.

Emergent behavior

As your system grows, with many microservices interacting, emergent behavior will occur. This may take the form of messages appearing that shouldn’t appear, microservices taking unexpected actions, or unexplained surges in message volume.

Mitigation: a microservice system isn’t a neural network, however appealing the analogy. It’s still a system designed to operate in a specific fashion, with defined interactions. Emergent behavior is difficult to diagnose and resolve, because it arises from the interplay of microservice behaviors rather than from one microservice misbehaving. Thus, you must debug each case individually. To make this possible, use correlation identifiers. Each message should carry metadata that identifies it as it moves between microservices; this allows you to trace the flow of messages. Each message should also carry the identifier of the message that originated it, so you can trace causality and see which messages generated further messages.
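A minimal message envelope carrying both identifiers might look like this. The field names (`msg_id`, `parent_id`, `correlation_id`) are conventions invented for the example:

```python
import uuid

def make_message(role, cmd, data, parent=None):
    """Every message carries its own id plus the id of the message that caused it."""
    return {
        "role": role,
        "cmd": cmd,
        "data": data,
        "msg_id": str(uuid.uuid4()),
        "parent_id": parent["msg_id"] if parent else None,
        # The correlation id is minted once and inherited by every descendant.
        "correlation_id": parent["correlation_id"] if parent else str(uuid.uuid4()),
    }

checkout = make_message("checkout", "begin", {"cart": ["widget"]})
email = make_message("email", "send", {"to": "a@b.com"}, parent=checkout)
# Both messages share one correlation id, so the whole flow can be traced,
# and parent_id records which message caused which.
```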

Catastrophic collapse

Sometimes, emergent behavior suffers from a feedback loop. In this case, the system enters a death spiral caused by ever-increasing numbers of unwanted messages triggering even more messages. Rolling back recent microservice deployments—the standard safety valve—doesn’t work in this scenario, because the system has entered a chaotic state. The triggering microservice may not even be a participant in the problematic behavior.

Mitigation: the last resort is to bring the system down completely and boot it up again, one set of services at a time. This is a disaster scenario, but it can be mitigated. It may be sufficient to shut down some parts of the system and leave others running. This may be enough to cut the feedback loop. All of your microservices should include kill switches that allow you to shut down arbitrary subsets of the system. They can be used in an emergency to progressively bring down functionality until nominal behavior is restored.

3.7. Summary

  • One of the most important strategies for avoiding the curse of the distributed monolith is to put messages at the core of your microservices thinking. With a messages-first approach, the microservice architecture becomes much easier to specify, design, run, and reason about.
  • It’s better to think about your business requirements in terms of the messages that represent them. This is an action-oriented stance rather than a data-oriented one. It’s powerful because it gives you the freedom to define a language of messages without predetermining the microservices that will handle them.
  • The synchronous/asynchronous dichotomy is fundamental to the way messages interact. It’s important to understand the constraints that each of these message-interaction models imposes, as well as the possibilities the models offer.
  • Pattern matching is the principal mechanism for deciding which microservice will act on which message. Using pattern matching, instead of service discovery and addressing, gives you a flexible and understandable model for defining message behavior.
  • Transport independence is the principal mechanism for keeping services fully decoupled from the concrete topology of the network. Microservices can be written in complete isolation, seeing the world only in terms of inbound and outbound messages. Message transport and routing become an implementation and configuration concern.
  • Message interactions can be understood along two axes: synchronous/asynchronous and observed/consumed. This model generates four core message-interaction patterns, which can be used to define interactions between many messages and microservices.
  • The failure modes of message interactions can also be cataloged and understood in the context of this model.